psolomin commented on issue #25975: URL: https://github.com/apache/beam/issues/25975#issuecomment-1763515682
Hi @je-ik > Does that mean you compare the outputs of the rescaled job with the provided inputs Yes, I publish messages with IDs to Kinesis, then consume them, save in files and check those files with other tools (like Spark). Is there a better approach to check data integrity when we run parallel consumer instances? > does this still happen on 2.50.0? I ran a couple of tests with 2.50.0 too: 1. Set N of Kinesis stream shards to 1, left pipeline parallelism = 3 and restarted from a savepoint with parallelism = 2 - this did not gave data loss. 2. Set N of Kinesis stream shards to 3, left pipeline parallelism = 3 and restarted from a savepoint with parallelism = 2 - this gave data loss. After another restart with parallelism = 3 all data came through. 3. Ran same test in Flink with `KafkaIO` and **was not able to reproduce** the issue with a similar setup of 3 partitions and initial parallelism = 3 I will assign the issue to myself and will run more tests and try to narrow down the issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
