Re: [PR] feat(storage): add config to allow duplicates while writing to HFiles [hudi]

via GitHub Mon, 26 Jan 2026 14:46:05 -0800


suryaprasanna commented on PR #17495:
URL: https://github.com/apache/hudi/pull/17495#issuecomment-3802119129


   > hey @suryaprasanna : again, trying to gauge the necessity of this feature. 
we can't have duplicates within RLI partition in MDT right. So, how would this 
be helpful. would you mind clarifying the need please.
   
   Consider a scenario where the dataset already contains duplicates. When we 
bootstrap the record_index, the job fails, which also stops ingestion, since 
ingestion requires the record_index to be created. Hudi supports other indexes 
besides the record index, but they are not performant enough. So the idea here 
is to resume ingestion by re-bootstrapping the record_index with duplicates, 
and to run a duplicate-removal tool offline. Since the feature is behind a 
config, existing production setups continue to work as before. Let me know if 
we need a more detailed discussion on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
