nsivabalan opened a new pull request, #6581: URL: https://github.com/apache/hudi/pull/6581
### Change Logs

Hudi already has a partition-aware clustering strategy and a recent-partitions-based strategy for clustering. These work well when the table is partitioned by date, but not when it is partitioned on some other, arbitrary field. This patch therefore introduces a clustering filter mode that selects file groups based on recently altered files. For example, if a user configures clustering to run every 5 commits, each clustering run considers only the file groups touched in the last 5 commits. This avoids repeatedly clustering file groups that are already clustered, and it keeps clustering fast because only the delta file groups are considered.

Added a new config, `hoodie.clustering.plan.filter.mode`, whose possible values are NONE, RECENTLY_UPDATED_FILES and RECENTLY_INSERTED_FILES.

RECENTLY_INSERTED_FILES also benefits users who use clustering only to sort records by some column. It may not make sense to re-cluster (or re-sort) a file group that is already clustered/sorted. With this filter, each clustering run picks only the file groups that received inserts in the last N commits.

### Impact

**Risk level: low/medium**

This is an enhancement to clustering that could benefit some users, depending on their needs.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
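The filtering idea described in the change log can be sketched roughly as follows. This is a minimal illustration in Python, not the code in this patch: the `Commit` record, the `candidate_file_groups` helper, and the reading of RECENTLY_UPDATED_FILES as "any insert or update" are all assumptions made for the example. Only the three mode names come from the PR description.

```python
from dataclasses import dataclass
from enum import Enum


class FilterMode(Enum):
    # Mode names taken from the PR description.
    NONE = "NONE"
    RECENTLY_UPDATED_FILES = "RECENTLY_UPDATED_FILES"
    RECENTLY_INSERTED_FILES = "RECENTLY_INSERTED_FILES"


@dataclass(frozen=True)
class Commit:
    """Hypothetical, simplified commit record (not a Hudi class)."""
    instant: str
    inserted: frozenset = frozenset()  # file groups that received inserts
    updated: frozenset = frozenset()   # file groups that received updates


def candidate_file_groups(all_groups, commits, mode, last_n):
    """Return the file groups a clustering run would consider.

    Assumption: RECENTLY_UPDATED_FILES keeps any file group written to
    (insert or update) in the last `last_n` commits, while
    RECENTLY_INSERTED_FILES keeps only those that received inserts.
    """
    if mode is FilterMode.NONE:
        return set(all_groups)
    recent = commits[-last_n:] if last_n > 0 else []
    touched = set()
    for commit in recent:
        touched |= commit.inserted
        if mode is FilterMode.RECENTLY_UPDATED_FILES:
            touched |= commit.updated
    # Ignore file groups that no longer exist in the table.
    return touched & set(all_groups)


# Illustrative timeline, oldest commit first.
commits = [
    Commit("001", inserted=frozenset({"fg1"})),
    Commit("002", inserted=frozenset({"fg2"}), updated=frozenset({"fg1"})),
    Commit("003", updated=frozenset({"fg3"})),
]
all_groups = {"fg1", "fg2", "fg3", "fg4"}

recently_updated = candidate_file_groups(
    all_groups, commits, FilterMode.RECENTLY_UPDATED_FILES, last_n=2
)
recently_inserted = candidate_file_groups(
    all_groups, commits, FilterMode.RECENTLY_INSERTED_FILES, last_n=2
)
```

With the last two commits as the window, `fg4` is never a candidate because nothing wrote to it, which is the "only delta file groups" behaviour the change log describes.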
