Hi all,
I'm working on resizing IO integration tests in Beam and I'd like to ask
for the community's opinion.

Right now each IO integration test has a set of four predetermined sizes
(1000, 100k, 1M and 100M elements).
For every size there is a precalculated hash used to check read correctness.
As it is now, measuring throughput in an IOIT is very costly - accessing
memory for each PCollection element increases the runtime of the test
manyfold, which skews the runtime measurements.
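
For context, the read correctness check currently boils down to something
like this (simplified from memory, using the HashingFn combine from
io-common; expectedHash is the value precalculated for the given size):

    PCollection<String> consolidatedHash =
        readElements.apply(
            "Calculate hashcode", Combine.globally(new HashingFn()));
    PAssert.thatSingleton(consolidatedHash).isEqualTo(expectedHash);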

My proposed improvements change the test sizes, add dataset size reporting
to the metrics (so that throughput can be calculated at the dashboard level)
and change the way test parameters are passed.
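
By dataset size reporting I mean roughly the following, using the standard
Beam Metrics API (the class, metric and constant names here are only
illustrative):

    static class ReportDatasetSizeFn extends DoFn<Long, Void> {
      private final Counter sizeBytes =
          Metrics.counter(ReportDatasetSizeFn.class, "dataset_size_bytes");

      @ProcessElement
      public void processElement(@Element Long bytes) {
        sizeBytes.inc(bytes);
      }
    }

    // The size is known for each test configuration, so it can be reported
    // once instead of touching every element:
    pipeline
        .apply("Known dataset size", Create.of(DATASET_SIZE_BYTES))
        .apply("Report dataset size", ParDo.of(new ReportDatasetSizeFn()));

The dashboard can then divide the reported size by the run time to get the
throughput.
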
The changes are in a PR here <https://github.com/apache/beam/pull/9638>.
Tests were resized to about 1GB each.
Test configurations would be set by a single string parameter in pipeline
options (e.g. "testConfigName=XML_1GB" instead of
"numberOfRecords=1000000").

What do you think about this approach in general? Do you think that 1GB
test datasets are reasonable?
Thanks,

Michal

-- 

Michał Walenia
Polidea <https://www.polidea.com/> | Software Engineer

M: +48 791 432 002
E: michal.wale...@polidea.com

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>
