codope commented on pull request #4118:
URL: https://github.com/apache/hudi/pull/4118#issuecomment-979383743


   I will close this PR. After discussing with @nsivabalan offline and 
confirming from the code, this change would avoid the duplicate key issue, but 
it would create duplicate data with different file ids, which adversely affects 
data correctness. 
   
   This scenario would happen only when there is no data, or so little data 
that deltastreamer finishes one round very quickly, even before clustering, and 
there is no min sync interval between rounds. I think it's okay to fail the 
clustering due to a duplicate key in this scenario. As a workaround, users could 
enable OCC mode or add a delay between rounds of delta sync. 
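   
   A rough sketch of what the workaround could look like in the writer 
properties, assuming the standard Hudi optimistic concurrency control configs. 
The ZooKeeper-based lock provider is only one possible choice and needs its own 
connection settings, which are not shown here. For the delay between rounds, I 
am assuming the deltastreamer `--min-sync-interval-seconds` option in 
continuous mode. Please verify the names against the Hudi version in use. 
   
   ```properties
   # Option 1 (assumed configs): enable OCC so concurrent writers and table
   # services coordinate through a lock provider.
   hoodie.write.concurrency.mode=optimistic_concurrency_control
   hoodie.cleaner.policy.failed.writes=LAZY
   # Illustrative lock provider; its connection settings are omitted here.
   hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider

   # Option 2 (assumed CLI flag, not a property): pass a minimum sync interval
   # to deltastreamer so that rounds do not run back to back, e.g.
   #   --min-sync-interval-seconds 60
   ```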


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


