codope commented on issue #4891: URL: https://github.com/apache/hudi/issues/4891#issuecomment-1058149325
> How does normally people perform inline clustering or async clustering on partitions with large amount of data? Do you expect driver memory should be larger than the clustering data size? Currently, [the union](https://github.com/apache/hudi/blob/a4ba0fff07861bdff338074fb5d84726f430883f/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java#L108) will collect hudi records (as per the plan generated for corresponding replacecommit) into the driver. > What are the configurations I should be using to perform clustering on these large tables? > In PROD I will have 1.8 billion records (each record 3-4 KB in memory) on each day, so is it advised to perform clustering frequently (every 10 to 20 commits) or daily? As you said, you could run clustering more frequently. Configs will also depend on clustering strategy. Looking at your configs, you can adjust the `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to a lower value. You could also tune `hoodie.clustering.plan.strategy.small.file.limit` to a value such that even though each round of clustering is picking just a few partitions but it is able to utilize the memory as much as possible. What's the nature of your workload? Is it like append-only, or frequent upserts but to more recent partitions? > Does MOR table supports async clustering with OCC assurance? Yes, if the lock provider is configured. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
