rmahindra123 opened a new pull request #4021: URL: https://github.com/apache/hudi/pull/4021
## What is the purpose of the pull request [HUDI-2671] When there are two sink workers running, there can be a case where one participant joins after the coordinator starts a first commit, which needs to be rolled back later since the other participant does not receive the START_COMMIT message for the transaction. In this case, later on in a new commit, `writeRecords()` can miss records because `ongoingTransactionInfo.getLastWrittenKafkaOffset()` is behind the record offsets in the buffer. This causes missing records in the target Hudi table. This PR fixes the issue by not accepting kafka offsets that are above the current expected value, and resets the kafka offset when such situation happens. The corresponding tests are also added. [HUDI-2672] When there is no data in kafka, the coordinator keeps starting a new commit, that leads to the existing commit to be rolled back, creating a bunch of rollback commits, which are also not archived. This PR fixes this issue by resuing existing commit when all write statuses are empty. ## Brief change log - Fixed an issue with missing records due to participant accepting kafka offsets higher than last written offset. - Fixed issue where coordinator keeps creating rollback commits when no data is present in kafka. - ## Verify this pull request - added and improved the unit tests for participant's handling of kafka offset. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
