doc-laowu opened a new issue, #34890: URL: https://github.com/apache/beam/issues/34890
### What happened?

The Beam pipeline consumes from Kafka and runs on the Flink runner (FlinkRunner). After the job recovers from a checkpoint, it re-consumes Kafka messages that were already processed.

I investigated the cause: Flink's SourceOperator#open restores the checkpointed FlinkSourceSplits from the readerState, but SourceCoordinator#start then also invokes FlinkSourceSplitEnumerator#start, which recomputes the splits and sends them to the SourceOperator again via AddSplitEvent. They are added to FlinkSourceReaderBase#sourceSplits a second time, so the FlinkUnboundedSourceReader is created repeatedly for the same splits, and the Kafka messages are consumed in duplicate. (A minimal sketch of the restore-aware behavior I would expect is at the end of this issue.)

### Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

### Issue Components

- [ ] Component: Python SDK
- [x] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Infrastructure
- [ ] Component: Spark Runner
- [x] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
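For illustration only, below is a minimal, self-contained Java sketch (not the actual Beam or Flink classes; `Split`, `Enumerator`, and the partition names are hypothetical stand-ins) of the pattern I would expect: an enumerator that knows whether it was restored from a checkpoint and, if so, skips the initial split computation, because the readers already recovered their splits from their own state.

```java
import java.util.ArrayList;
import java.util.List;

public class RestoreAwareEnumeratorSketch {

  /** Stand-in for a source split (e.g. a Kafka topic partition). */
  static class Split {
    final String id;
    Split(String id) { this.id = id; }
    @Override public String toString() { return "Split(" + id + ")"; }
  }

  /** Simplified enumerator: only discovers splits on a fresh start. */
  static class Enumerator {
    private final List<Split> pendingSplits = new ArrayList<>();
    private final boolean restoredFromCheckpoint;

    Enumerator(List<Split> checkpointedSplits) {
      this.restoredFromCheckpoint = checkpointedSplits != null;
      if (restoredFromCheckpoint) {
        // Only splits that were still unassigned at checkpoint time come back
        // here; splits already handed to readers live in the readers' own state.
        pendingSplits.addAll(checkpointedSplits);
      }
    }

    void start() {
      if (restoredFromCheckpoint) {
        // Do NOT recompute the full split set here, otherwise every reader
        // receives its splits a second time via AddSplitEvent and consumes
        // the same Kafka partitions twice.
        return;
      }
      // Fresh start: perform the initial split discovery.
      pendingSplits.add(new Split("kafka-topic-partition-0"));
      pendingSplits.add(new Split("kafka-topic-partition-1"));
    }

    List<Split> pendingSplits() { return pendingSplits; }
  }

  public static void main(String[] args) {
    Enumerator fresh = new Enumerator(null);
    fresh.start();
    System.out.println("Fresh start assigns: " + fresh.pendingSplits());

    // After restore, readers already hold their splits; the enumerator
    // must not enumerate them again.
    Enumerator restored = new Enumerator(new ArrayList<>());
    restored.start();
    System.out.println("Restored start assigns: " + restored.pendingSplits());
  }
}
```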