Abacn commented on PR #30253: URL: https://github.com/apache/beam/pull/30253#issuecomment-1942809288
Tested on Dataflow, legacy runner, with Streaming Engine enabled. A data generator publishes 10,000 messages at a time, several times, until a downscale event happens, so the data sent to the queue is always x * 10,000. The pipeline is then drained after the data generator stops.

Test 1: without requireDeduping, jobId `2024-02-13_14_16_07-13831713569820732260`

- Dataflow UI shows
  ```
  Elements added (Approximate)
  70,170
  ```
- BigQuery gets `70,090` records. Checked that there are 90 duplications (elements with 8 occurrences) and no data loss (no element with 6 or fewer occurrences).

Test 2: with requireDeduping, jobId `2024-02-13_14_45_48-4947086524311243096`

- Dataflow UI shows
  ```
  Elements added (Approximate)
  39,942
  ```
  This is an underestimate; BigQuery actually shows `40,006` records. There are 6 duplications (records occurring 5 times) and no data loss (every record occurs at least 4 times).

------

Relevant queries:

```
select count(*) FROM `*.*.jms_sink`;

select seq, count(*) as `num`
FROM `google.com:clouddfe.yathu_test.jms_sink`
group by `seq`
order by `num` desc -- with or without desc
limit 100;
```

Conclusion: dedup is effective. Duplication decreased from ~20 elements per worker that downscaled to fewer than 1 element per worker that downscaled. The con is that this requires Streaming Engine on Dataflow.
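
For reference, a minimal sketch of how the duplicate/loss check could be collapsed into one query, assuming the same `seq` column and table as above and that the expected occurrence count for the run is known (7 for Test 1, 4 for Test 2). This is illustrative only, not the exact query used:

```
-- Illustrative sketch, not the query used above: count seq values written
-- more often (possible duplicates) or less often (possible data loss) than
-- expected. Replace 7 with the expected occurrence count for the run being
-- checked (7 in Test 1, 4 in Test 2).
select
  countif(num > 7) as duplicated_seqs,
  countif(num < 7) as missing_seqs
from (
  select seq, count(*) as num
  FROM `google.com:clouddfe.yathu_test.jms_sink`
  group by seq
);
```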
