codope commented on pull request #4118: URL: https://github.com/apache/hudi/pull/4118#issuecomment-979383743
Closing this PR. After discussing with @nsivabalan offline and confirming from the code, this change would avoid the duplicate key issue but would create duplicate data with different file ids, which adversely affects data correctness. The scenario occurs only when there is no data, or so little data that deltastreamer finishes a round very quickly, even before clustering, and there is no min sync interval between rounds. I think it's okay to fail the clustering due to the duplicate key in this scenario. As a workaround, users could set OCC mode or add a delay between rounds of delta sync.
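
For reference, a rough sketch of what the two workarounds might look like. The property names below are the standard Hudi multi-writer settings as I understand them; the lock provider choice and the interval value are assumptions for illustration, not something decided in this PR, so please check them against the docs for your Hudi version.

```properties
# Workaround 1 (assumed settings): enable optimistic concurrency control (OCC)
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
# OCC needs a lock provider; ZooKeeper is just one option, and its
# connection properties are intentionally omitted here
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider

# Workaround 2 (assumed value): when running deltastreamer in continuous mode,
# pass a minimum interval between rounds on the command line, e.g.
#   --continuous --min-sync-interval-seconds 60
```

Either option alone should be enough to keep a near-empty round from finishing before clustering is scheduled; the 60 second interval is only a placeholder.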
