Hi all, I'm working on resizing IO integration tests in Beam and I'd like to ask for the community's opinion.
Right now each IO integration test has a set of four predetermined sizes (1000, 100k, 1M and 100M elements), and for every size there is a precalculated hash used to check read correctness. As things stand, measuring throughput in an IOIT is very costly - touching every PCollection element increases the test's runtime manyfold, which skews the runtime measurements.

My proposed improvements resize the tests, add dataset size reporting to the metrics (so throughput can be calculated at the dashboard level) and change the way test parameters are passed. The changes are in a PR here <https://github.com/apache/beam/pull/9638>. Tests were resized to about 1GB each, and a test configuration would be selected by a single string parameter in pipeline options (e.g. "testConfigName=XML_1GB" instead of "numberOfRecords=1000000").

What do you think about this approach in general? Do you think that 1GB test datasets are reasonable?

Thanks,
Michal

--
Michał Walenia
Polidea <https://www.polidea.com/> | Software Engineer
M: +48 791 432 002
E: michal.wale...@polidea.com
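
P.S. For concreteness, a rough sketch of what the single-parameter configuration could look like - illustrative only, with assumed names and placeholder values, not the code from the PR:

    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class TestConfigExample {

      /** Pipeline options carrying only the config name, e.g. --testConfigName=XML_1GB. */
      public interface IOTestPipelineOptions extends PipelineOptions {
        @Description("Name of the test configuration to run, e.g. XML_1GB")
        String getTestConfigName();

        void setTestConfigName(String value);
      }

      /**
       * Hypothetical mapping from a config name to the dataset parameters it implies
       * (the record count and hash below are placeholders, not the real values).
       */
      public enum TestConfiguration {
        XML_1GB(100_000_000L, "placeholder-hash");

        public final long numberOfRecords;
        public final String expectedHash;

        TestConfiguration(long numberOfRecords, String expectedHash) {
          this.numberOfRecords = numberOfRecords;
          this.expectedHash = expectedHash;
        }
      }

      public static void main(String[] args) {
        IOTestPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(IOTestPipelineOptions.class);

        // Resolve all dataset parameters (size, expected hash, ...) from the single config name.
        TestConfiguration config = TestConfiguration.valueOf(options.getTestConfigName());
        System.out.println(
            config + ": records=" + config.numberOfRecords + ", hash=" + config.expectedHash);
      }
    }

The intent would be that adding a new size or format means adding one enum constant, instead of threading several numeric parameters through each test.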