chenshzh opened a new pull request, #7697: URL: https://github.com/apache/hudi/pull/7697
### Change Logs

To fix Flink `WriteMetadataEvent`s being lost when committing an instant, we add a wait-and-ack mechanism that runs when `StreamWriteOperatorCoordinator` executes `notifyCheckpointComplete`.

1. **Why we need an ack mechanism:** In some extreme cases, the write functions complete their checkpoints and send back their metadata events, but because of network latency the coordinator's `notifyCheckpointComplete` is invoked before `handleEventFromOperator` has handled those events. The coordinator then commits the instant with incomplete metadata events by mistake. The wait-and-ack mechanism verifies that the last metadata event (`lastBatch = true`) from each task has been received before committing, which rescues the commit.
2. **Why we need a dedicated ack thread instead of the common single-thread executor:** If the verification in `notifyCheckpointComplete` ran in the single-thread executor, it would deadlock with `handleEventFromOperator`: the wait would occupy the only thread, so the events it is waiting for could never be handled.

### Impact

none

### Risk level (write none, low, medium or high below)

none

### Documentation Update

1. `write.metadata.event.ack.timeout`: default `10_000L` (10 seconds). Timeout for `StreamWriteOperatorCoordinator`'s `notifyCheckpointComplete` to wait in the ack thread until all metadata events from the tasks are received and handled.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
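
For readers who want a concrete picture of the wait-and-ack idea described under Change Logs, here is a minimal, self-contained sketch. It is not the code in this PR: `AckCoordinatorSketch` and its fields are hypothetical names, events are reduced to a task id plus a `lastBatch` flag, and returning `false` on timeout is an assumption, since the description does not say what the coordinator does when the timeout expires. Only the two method names `handleEventFromOperator` and `notifyCheckpointComplete` are borrowed from the description.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical, simplified stand-in for the coordinator; not Hudi's actual class. */
final class AckCoordinatorSketch {
    private final Object lock = new Object();
    private final boolean[] lastBatchReceived; // one flag per write task, guarded by lock
    private final long ackTimeoutMs;           // plays the role of write.metadata.event.ack.timeout
    // Single thread that handles incoming events, as in the description.
    private final ExecutorService eventExecutor = daemonExecutor("event-handler");
    // Separate thread for waiting; waiting on eventExecutor would block the
    // only thread able to handle the events being waited for.
    private final ExecutorService ackExecutor = daemonExecutor("ack-waiter");

    AckCoordinatorSketch(int parallelism, long ackTimeoutMs) {
        this.lastBatchReceived = new boolean[parallelism];
        this.ackTimeoutMs = ackTimeoutMs;
    }

    private static ExecutorService daemonExecutor(String name) {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, name);
            t.setDaemon(true); // do not keep the JVM alive in this demo
            return t;
        });
    }

    /** Records a metadata event from a task on the event-handling thread. */
    void handleEventFromOperator(int taskId, boolean lastBatch) {
        eventExecutor.execute(() -> {
            synchronized (lock) {
                if (lastBatch) {
                    lastBatchReceived[taskId] = true;
                    lock.notifyAll(); // wake the ack thread so it can re-check
                }
            }
        });
    }

    /** Completes with true once every task's last batch arrived, false on timeout (assumed). */
    CompletableFuture<Boolean> notifyCheckpointComplete() {
        return CompletableFuture.supplyAsync(this::awaitAllLastBatches, ackExecutor);
    }

    private boolean awaitAllLastBatches() {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(ackTimeoutMs);
        synchronized (lock) {
            while (!allReceived()) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) {
                    return false; // timed out: at least one last-batch event is still missing
                }
                try {
                    lock.wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            Arrays.fill(lastBatchReceived, false); // reset for the next checkpoint
            return true; // a real coordinator would commit the instant here
        }
    }

    private boolean allReceived() {
        for (boolean received : lastBatchReceived) {
            if (!received) {
                return false;
            }
        }
        return true;
    }

    void close() {
        eventExecutor.shutdownNow();
        ackExecutor.shutdownNow();
    }

    public static void main(String[] args) {
        AckCoordinatorSketch coordinator = new AckCoordinatorSketch(2, 10_000L);
        coordinator.handleEventFromOperator(0, true);
        CompletableFuture<Boolean> committed = coordinator.notifyCheckpointComplete();
        coordinator.handleEventFromOperator(1, true); // arrives after the checkpoint notification
        System.out.println("committed = " + committed.join());
        coordinator.close();
    }
}
```

In `main`, the event from task 1 is submitted after the checkpoint notification, which mimics the network-latency race from the description. Because the wait runs on its own thread, the event-handling thread stays free to process that late event and release the waiter.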
