hudi-bot opened a new issue, #15408: URL: https://github.com/apache/hudi/issues/15408
Clustering already supports a partition-aware strategy and a recent-partitions-based strategy. These work well when the table is partitioned by date, but not when it is partitioned by some other, arbitrary field. We may therefore need another clustering filter strategy that considers only the file groups touched in the last N commits.

For example, if a user configures clustering to run every 5 commits, each clustering run would consider only the file groups touched in the last 5 commits. This avoids repeatedly clustering file groups that are already clustered, and it should keep clustering fast because only the delta file groups are considered.

## JIRA info

- Link: https://issues.apache.org/jira/browse/HUDI-4773
- Type: Improvement
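
To make the proposed filter concrete, here is a rough sketch of the selection logic only. It is not Hudi code: `Commit`, `FileGroupId`, and both function names are hypothetical stand-ins, and it assumes each commit on the timeline can report which file groups it wrote.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple

# (partition path, file group id). Illustrative stand-in, not a Hudi type.
FileGroupId = Tuple[str, str]


@dataclass(frozen=True)
class Commit:
    """Hypothetical view of one completed commit on the timeline."""
    instant_time: str                # sortable commit timestamp
    touched: FrozenSet[FileGroupId]  # file groups written by this commit


def file_groups_touched_in_last_n_commits(
    commits: List[Commit], n: int
) -> Set[FileGroupId]:
    """Union of the file groups written by the n most recent commits."""
    if n <= 0:
        return set()
    # Order by instant time, then keep the last n (or all, if fewer exist).
    latest = sorted(commits, key=lambda c: c.instant_time)[-n:]
    touched: Set[FileGroupId] = set()
    for commit in latest:
        touched |= commit.touched
    return touched


def select_clustering_candidates(
    all_file_groups: List[FileGroupId], commits: List[Commit], n: int
) -> List[FileGroupId]:
    """Keep only file groups touched in the last n commits, preserving order."""
    touched = file_groups_touched_in_last_n_commits(commits, n)
    return [fg for fg in all_file_groups if fg in touched]


# Example: four commits, clustering considers only the last two.
timeline = [
    Commit("001", frozenset({("a", "fg1")})),
    Commit("002", frozenset({("b", "fg2")})),
    Commit("003", frozenset({("a", "fg1"), ("c", "fg3")})),
    Commit("004", frozenset({("c", "fg3")})),
]
all_groups = [("a", "fg1"), ("b", "fg2"), ("c", "fg3")]
candidates = select_clustering_candidates(all_groups, timeline, 2)
```

In this example `("b", "fg2")` is skipped because it was last written in commit `002`, outside the two-commit window. A real strategy would likely also need to ignore the clustering commits themselves so that freshly clustered file groups are not picked up again; that is left out of the sketch.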
