Abacn commented on PR #30253:
URL: https://github.com/apache/beam/pull/30253#issuecomment-1942809288

   Tested on Dataflow, legacy runner, with Streaming Engine enabled.
   
   Test setup: a data generator sends 10,000 messages to the queue per run and is run repeatedly until a downscale event happens, so the data sent to the queue is always x * 10,000. The pipeline is then drained after the data generator stops.
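   
   For reference, a minimal sketch of what such a generator could look like, assuming a plain JMS client against an ActiveMQ broker and a queue named `jms_sink_queue` (broker URL and queue name are placeholders, not the actual test harness):
   
   ```
   import javax.jms.Connection;
   import javax.jms.ConnectionFactory;
   import javax.jms.MessageProducer;
   import javax.jms.Session;
   import org.apache.activemq.ActiveMQConnectionFactory;
   
   public class JmsDataGenerator {
     public static void main(String[] args) throws Exception {
       // Hypothetical broker URL and queue name; the real test harness may differ.
       ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
       Connection connection = factory.createConnection();
       connection.start();
       Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
       MessageProducer producer =
           session.createProducer(session.createQueue("jms_sink_queue"));
       // One run: 10,000 messages whose payload is a sequence number, which the
       // BigQuery query further below groups on to count occurrences per element.
       for (int seq = 0; seq < 10_000; seq++) {
         producer.send(session.createTextMessage(Integer.toString(seq)));
       }
       producer.close();
       session.close();
       connection.close();
     }
   }
   ```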
   
   Test 1: without requireDeduping. jobId 
`2024-02-13_14_16_07-13831713569820732260`
   
   - Dataflow UI shows 
   ```
   Elements added (Approximate)
   70,170
   ```
   
   - BigQuery receives `70,090` records. The generator ran 7 times, so each `seq` value should occur 7 times. Checked that there are 90 duplicates (elements with 8 occurrences) and no data loss (no element with 6 or fewer occurrences).
   
   Test 2: with requireDeduping (see the configuration sketch after these results), jobId 
`2024-02-13_14_45_48-4947086524311243096`
   
   - Dataflow UI shows
   ```
   Elements added (Approximate)
   39,942
   ```
   This is an underestimate: BigQuery actually shows 40,006 records. With 4 generator runs, each `seq` value should occur 4 times; there are 6 duplicates (records that occurred 5 times) and no data loss (every record occurred at least 4 times).
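   
   For context, a rough sketch of the reading side of such a test pipeline, assuming Beam's JmsIO against an ActiveMQ broker. The exact dedup option added by this PR may be named differently; the commented-out `withRequireDeduping()` line below is only a placeholder for it, and the BigQuery write is omitted:
   
   ```
   import org.apache.activemq.ActiveMQConnectionFactory;
   import org.apache.beam.sdk.Pipeline;
   import org.apache.beam.sdk.io.jms.JmsIO;
   import org.apache.beam.sdk.io.jms.JmsRecord;
   import org.apache.beam.sdk.options.PipelineOptionsFactory;
   import org.apache.beam.sdk.values.PCollection;
   
   public class JmsDedupReadSketch {
     public static void main(String[] args) {
       Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
   
       // Hypothetical broker URL and queue name, matching the generator sketch above.
       JmsIO.Read<JmsRecord> read =
           JmsIO.read()
               .withConnectionFactory(new ActiveMQConnectionFactory("tcp://broker:61616"))
               .withQueue("jms_sink_queue");
       // Hypothetical: enable the dedup behaviour added by this PR, whatever the
       // actual option is called, e.g. read = read.withRequireDeduping();
   
       PCollection<JmsRecord> records = p.apply(read);
       // ... convert JmsRecord payloads to TableRows and write them to the
       // jms_sink BigQuery table with BigQueryIO (omitted here).
   
       p.run();
     }
   }
   ```
   
   The jobs above ran with Streaming Engine enabled (e.g. via Dataflow's `--enableStreamingEngine` option), since, per the conclusion below, the dedup path requires it.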
   
   ------
   
   Relevant queries:
   
   ```
   SELECT COUNT(*) FROM `*.*.jms_sink`;

   SELECT seq, COUNT(*) AS num
   FROM `google.com:clouddfe.yathu_test.jms_sink`
   GROUP BY seq
   ORDER BY num DESC  -- with or without DESC
   LIMIT 100;
   ```
   
   Conclusion: dedup is effective. Duplication decreased from ~20 elements per downscaled worker to < 1 element per downscaled worker.
   
   The downside is that this requires Streaming Engine on Dataflow.

