sivabalan narayanan created HUDI-4773:
-----------------------------------------

             Summary: Adding filter mode to Clustering to filter for recent files
                 Key: HUDI-4773
                 URL: https://issues.apache.org/jira/browse/HUDI-4773
             Project: Apache Hudi
          Issue Type: Improvement
          Components: clustering
            Reporter: sivabalan narayanan


We have a partition-aware clustering strategy, as well as a strategy based on 
recent partitions. These work well when partitioning is based on dates, but 
what if the table is partitioned on some other, arbitrary field? 

So, we might need another clustering filter strategy that considers only those 
file groups which were touched in the last N commits. 

For example, if a user configures clustering to run every 5 commits, each 
clustering run will consider only the file groups touched in the last 5 
commits. This avoids repeatedly re-clustering file groups that have already 
been clustered, and clustering will be fast because only the delta file groups 
are considered. 

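The proposed filter could be sketched roughly as follows. This is only an 
illustration of the idea, not Hudi code: RecentFileGroupFilter, CommitInfo and 
fileGroupsTouchedInLastNCommits are hypothetical names, and the timeline is 
assumed to be a plain list of commits ordered oldest to newest, each carrying 
the IDs of the file groups it wrote to.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names exist in Hudi.
final class RecentFileGroupFilter {

    /** Assumed shape of one commit: its instant time and the file groups it touched. */
    static final class CommitInfo {
        final String instantTime;
        final List<String> touchedFileGroupIds;

        CommitInfo(String instantTime, List<String> touchedFileGroupIds) {
            this.instantTime = instantTime;
            this.touchedFileGroupIds = touchedFileGroupIds;
        }
    }

    /**
     * Returns the IDs of file groups touched by the last n commits.
     * The timeline is assumed to be ordered oldest to newest.
     */
    static Set<String> fileGroupsTouchedInLastNCommits(List<CommitInfo> timeline, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        // Clamp so that asking for more commits than exist returns everything.
        int start = Math.max(0, timeline.size() - n);
        Set<String> candidates = new LinkedHashSet<>();
        for (CommitInfo commit : timeline.subList(start, timeline.size())) {
            candidates.addAll(commit.touchedFileGroupIds);
        }
        return candidates;
    }
}
```

With N set to the clustering frequency (5 in the example above), a clustering 
run would plan only over the returned file groups, regardless of which 
partitions they belong to.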
 

--
This message was sent by Atlassian Jira
(v8.20.10#820010)