nsivabalan opened a new pull request, #6581:
URL: https://github.com/apache/hudi/pull/6581

   ### Change Logs
   
   Hudi already has a partition-aware clustering strategy and a 
recent-partitions-based strategy for clustering. These work well if partitioning 
is based on dates. But what if partitioning is based on some other, arbitrary field? 
   
   So, this patch introduces a clustering filter mode to filter based on 
recently altered files. 
   
   For example, if a user configures clustering to run every 5 commits, each 
clustering run will consider only the file groups touched in the last 5 
commits. This also avoids repeatedly clustering file groups that are already 
clustered, and clustering should be much faster since only the file groups 
changed since the previous run are considered. 
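   
   The selection described above can be sketched as follows. This is only an 
illustration, not the code in this patch: the `timeline` structure and the 
function name are hypothetical, and commits are assumed to be listed oldest 
first, each with the file group IDs it wrote. 

```python
from typing import Dict, List, Set


def file_groups_touched_in_last_n_commits(timeline: List[Dict], n: int) -> Set[str]:
    """Return the IDs of file groups written by the n most recent commits.

    `timeline` is a hypothetical, oldest-first list of commits. Each commit is
    a dict with an "instant" string and a "file_groups" list of IDs it wrote.
    """
    if n <= 0:
        # Guard: timeline[-0:] would otherwise return the whole timeline.
        return set()
    touched: Set[str] = set()
    for commit in timeline[-n:]:
        touched.update(commit["file_groups"])
    return touched


# Seven commits; clustering is assumed to run every 5 commits.
timeline = [
    {"instant": "001", "file_groups": ["fg-1", "fg-2"]},
    {"instant": "002", "file_groups": ["fg-3"]},
    {"instant": "003", "file_groups": ["fg-4"]},
    {"instant": "004", "file_groups": ["fg-4", "fg-5"]},
    {"instant": "005", "file_groups": ["fg-6"]},
    {"instant": "006", "file_groups": ["fg-5"]},
    {"instant": "007", "file_groups": ["fg-7"]},
]

candidates = file_groups_touched_in_last_n_commits(timeline, 5)
print(sorted(candidates))
```

   In this made-up timeline, `fg-1`, `fg-2` and `fg-3` were last written before 
the 5-commit window, so they are left out of the clustering candidates. 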
   
   Added a new config, `hoodie.clustering.plan.filter.mode`, whose possible 
values are `NONE`, `RECENTLY_UPDATED_FILES` and `RECENTLY_INSERTED_FILES`. 
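   
   As a rough sketch of how the new config might be set alongside inline 
clustering, the snippet below builds a write-options dictionary. The helper 
`clustering_options` is hypothetical and only validates the mode against the 
three values listed above; pairing the new key with 
`hoodie.clustering.inline` and `hoodie.clustering.inline.max.commits` is an 
assumption about typical usage, not something this patch requires. 

```python
# Values taken from the PR description for hoodie.clustering.plan.filter.mode.
ALLOWED_FILTER_MODES = {"NONE", "RECENTLY_UPDATED_FILES", "RECENTLY_INSERTED_FILES"}


def clustering_options(filter_mode: str, max_commits: int = 5) -> dict:
    """Build a dict of Hudi write options (hypothetical helper)."""
    if filter_mode not in ALLOWED_FILTER_MODES:
        raise ValueError(f"unsupported filter mode: {filter_mode!r}")
    return {
        # Assumed setup: run clustering inline, once every `max_commits` commits.
        "hoodie.clustering.inline": "true",
        "hoodie.clustering.inline.max.commits": str(max_commits),
        # Config introduced by this PR.
        "hoodie.clustering.plan.filter.mode": filter_mode,
    }


options = clustering_options("RECENTLY_UPDATED_FILES")
for key, value in sorted(options.items()):
    print(f"{key}={value}")
```

   With PySpark, a dictionary like this could be passed to a DataFrame writer 
via `.options(**options)` together with the usual Hudi table options. 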
   
   `RECENTLY_INSERTED_FILES` would also benefit users who use clustering only 
to sort records on some column. It may not make sense to re-cluster (or 
re-sort) a file group which is already clustered/sorted. So, with this 
filtering logic, each clustering run can select only those file groups which 
had inserts in the last N commits. 
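   
   A minimal sketch of how the modes might differ is shown below. The exact 
semantics are an assumption here: `RECENTLY_INSERTED_FILES` is taken to select 
file groups that received inserts in the recent commits, 
`RECENTLY_UPDATED_FILES` those that received updates, and `NONE` to apply no 
filtering. The per-commit write stats and the function name are hypothetical. 

```python
from typing import Dict, Iterable, List, Set

# Assumed mapping from filter mode to the write stat that qualifies a file group.
MODE_TO_STAT = {
    "RECENTLY_UPDATED_FILES": "updates",
    "RECENTLY_INSERTED_FILES": "inserts",
}


def select_file_groups(
    all_file_groups: Iterable[str],
    recent_commits: List[Dict[str, Dict[str, int]]],
    mode: str,
) -> Set[str]:
    """Pick clustering candidates from the last N commits for a given mode.

    Each entry of `recent_commits` maps a file group ID to hypothetical write
    stats, e.g. {"inserts": 100, "updates": 0}.
    """
    if mode == "NONE":
        return set(all_file_groups)  # no filtering
    if mode not in MODE_TO_STAT:
        raise ValueError(f"unsupported filter mode: {mode!r}")
    stat = MODE_TO_STAT[mode]
    selected: Set[str] = set()
    for commit in recent_commits:
        for file_group, stats in commit.items():
            if stats.get(stat, 0) > 0:
                selected.add(file_group)
    return selected


all_file_groups = ["fg-1", "fg-2", "fg-3", "fg-4"]
recent_commits = [
    {"fg-1": {"inserts": 100, "updates": 0}, "fg-2": {"inserts": 0, "updates": 7}},
    {"fg-2": {"inserts": 0, "updates": 3}, "fg-3": {"inserts": 50, "updates": 2}},
]

inserted = select_file_groups(all_file_groups, recent_commits, "RECENTLY_INSERTED_FILES")
updated = select_file_groups(all_file_groups, recent_commits, "RECENTLY_UPDATED_FILES")
print(sorted(inserted), sorted(updated))
```

   Under these assumptions, a file group that only received updates (`fg-2`) 
is skipped in `RECENTLY_INSERTED_FILES` mode, so an already sorted file group 
is not re-sorted unless new records landed in it. 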
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: low/medium**
   
   This is a feature or enhancement to clustering which could benefit some 
users based on their need. 
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
