rmahindra123 opened a new pull request #4021:
URL: https://github.com/apache/hudi/pull/4021


   ## What is the purpose of the pull request
   
   [HUDI-2671] When there are two sink workers running, there can be a case 
where one participant joins after the coordinator starts a first commit, which 
needs to be rolled back later since the other participant does not receive the 
START_COMMIT message for the transaction.  In this case, later on in a new 
commit, `writeRecords()` can miss records because 
`ongoingTransactionInfo.getLastWrittenKafkaOffset()` is behind the record 
offsets in the buffer.  This causes missing records in the target Hudi table.
   
   This PR fixes the issue by not accepting kafka offsets that are above the 
current expected value, and resets the kafka offset when such situation 
happens. The corresponding tests are also added.
   
   [HUDI-2672] When there is no data in kafka, the coordinator keeps starting a 
new commit, that leads to the existing commit to be rolled back, creating a 
bunch of rollback commits, which are also not archived. This PR fixes this 
issue by resuing existing commit when all write statuses are empty.
   
   
   ## Brief change log
   - Fixed an issue with missing records due to participant accepting kafka 
offsets higher than last written offset.
   - Fixed issue where coordinator keeps creating rollback commits when no data 
is present in kafka.
   - 
   ## Verify this pull request
   - added and improved the unit tests for participant's handling of kafka 
offset.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to