suryaprasanna commented on PR #17495: URL: https://github.com/apache/hudi/pull/17495#issuecomment-3802119129
> hey @suryaprasanna : again, trying to gauge the necessity of this feature. we can't have duplicates within RLI partition in MDT right. So, how would this be helpful. would you mind clarifying the need please.

Consider a scenario where the dataset already contains duplicates. When we bootstrap the record_index, the job fails, which also stops ingestion, since ingestion requires the record_index to be created. Hudi supports other indexes besides the record index, but they are not performant enough. So the idea here is to resume ingestion by re-bootstrapping the record_index with duplicates allowed, and then to run the duplicates removal tool offline. Since the feature is behind a config, current production behavior is unaffected. Let me know if we need a more detailed discussion on this.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
