sivabalan narayanan created HUDI-4773:
-----------------------------------------
Summary: Adding filter mode to Clustering to filter for recent files
Key: HUDI-4773
URL: https://issues.apache.org/jira/browse/HUDI-4773
Project: Apache Hudi
Issue Type: Improvement
Components: clustering
Reporter: sivabalan narayanan
We have a partition-aware clustering strategy, as well as a strategy based on
recent partitions. These work well when partitioning is based on dates. But
what if partitioning is based on some other, arbitrary field? In that case, we
might need another clustering filter strategy that considers only those file
groups that were touched in the last N commits.
For example, if a user configures clustering to run every 5 commits, then each
time clustering runs it will consider only the file groups touched in the last
5 commits. This avoids repeatedly triggering clustering for file groups that
are already clustered, and clustering will be fast because only the delta file
groups are considered.
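
The proposed filter could be sketched roughly as follows. This is only an
illustration of the idea, not Hudi code: the `Commit` record, its fields, and
the function `file_groups_touched_in_last_n_commits` are hypothetical names,
and the timeline is modeled as a plain list of commits with the file group ids
each one wrote to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    # Hypothetical stand-in for a completed commit on the timeline.
    instant_time: str                # sortable timestamp string
    touched_file_groups: frozenset   # ids of file groups written by this commit


def file_groups_touched_in_last_n_commits(timeline, n):
    """Return the ids of file groups written by the n most recent commits."""
    if n <= 0:
        return set()
    # Order by instant time and keep the last n commits (all of them if the
    # timeline holds fewer than n).
    recent = sorted(timeline, key=lambda c: c.instant_time)[-n:]
    touched = set()
    for commit in recent:
        touched |= commit.touched_file_groups
    return touched


# Illustrative data only: three commits, clustering considers the last two.
timeline = [
    Commit("20220901100000", frozenset({"fg-1", "fg-2"})),
    Commit("20220901110000", frozenset({"fg-3"})),
    Commit("20220901120000", frozenset({"fg-2", "fg-4"})),
]
candidates = file_groups_touched_in_last_n_commits(timeline, 2)
```

With N equal to the clustering trigger interval, `fg-1` is skipped here because
no commit in the window wrote to it, which is the behavior the proposal asks
for.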