Re: [I] [Bug]: Reducing parallelism in FlinkRunner leads to a data loss [beam]

via GitHub Sun, 15 Oct 2023 15:01:44 -0700


psolomin commented on issue #25975:
URL: https://github.com/apache/beam/issues/25975#issuecomment-1763515682


   Hi @je-ik 
   
   >  Does that mean you compare the outputs of the rescaled job with the 
provided inputs
   
   Yes, I publish messages with IDs to Kinesis, then consume them, save in 
files and check those files with other tools (like Spark). Is there a better 
approach to check data integrity when we run parallel consumer instances?
   
   > does this still happen on 2.50.0?
   
   I ran a couple of tests with 2.50.0 too:
   
   1. Set N of Kinesis stream shards to 1, left pipeline parallelism = 3 and 
restarted from a savepoint with parallelism = 2 - this did not gave data loss.
   
   2. Set N of Kinesis stream shards to 3, left pipeline parallelism = 3 and 
restarted from a savepoint with parallelism = 2 - this gave data loss. After 
another restart with parallelism = 3 all data came through.
   
   3. Ran same test in Flink with `KafkaIO` and **was not able to reproduce** 
the issue with a similar setup of 3 partitions and initial parallelism = 3
   
   I will assign the issue to myself and will run more tests and try to narrow 
down the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] [Bug]: Reducing parallelism in FlinkRunner leads to a data loss [beam]

Reply via email to