codope commented on a change in pull request #3525: URL: https://github.com/apache/hudi/pull/3525#discussion_r696767387
########## File path: website/blog/2021-08-23-async-clustering.md ##########
@@ -0,0 +1,153 @@

---
title: "Asynchronous Clustering using Hudi"
excerpt: "How to set up Hudi for asynchronous clustering"
author: codope
category: blog
---

In one of the [previous blog posts](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro), we introduced a new
kind of table service called clustering to reorganize data for improved query performance without compromising on
ingestion speed. We learnt how to set up inline clustering. In this post, we will discuss what has changed since then and
see how asynchronous clustering can be set up using the HoodieClusteringJob as well as the DeltaStreamer utility.

## Introduction

At a high level, clustering creates a plan based on a configurable strategy, groups eligible files based on specific
criteria, and then executes the plan. Hudi's [MVCC model](https://hudi.apache.org/docs/concurrency_control) provides
snapshot isolation between multiple table services, which allows writers to continue with ingestion while clustering
runs in the background. For a more detailed overview of the clustering architecture, please check out the previous blog
post.

## Clustering Strategies

As mentioned before, the clustering plan as well as its execution depend on a configurable strategy. These strategies can
be broadly classified into three types: clustering plan strategy, execution strategy, and update strategy.

### Plan Strategy

This strategy comes into play while creating the clustering plan. It helps decide which file groups should be clustered.
Let's look at the different plan strategies that are available with Hudi. Note that these strategies are easily pluggable
using this [config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).

1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
   the [small file limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
   of base files and creates clustering groups up to the max file size allowed per group. The max size can be specified
   using this [config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup).
2. `SparkRecentDaysClusteringPlanStrategy`: It looks back at the previous 'N' days' partitions and creates a plan that
   will cluster the 'small' file slices within those partitions. This is the default strategy.
3. `SparkSelectedPartitionsClusteringPlanStrategy`: In case you want to cluster only specific partitions within a range,
   no matter how old or new those partitions are, this strategy could be useful. To use this strategy, one additionally
   needs to set the two configs below (both begin and end partitions are inclusive):

```
hoodie.clustering.plan.strategy.cluster.begin.partition
hoodie.clustering.plan.strategy.cluster.end.partition
```

**NOTE**: All the strategies are partition-aware, and the latter two are still bound by the size limits of the first
strategy.

### Execution Strategy

After building the clustering groups in the planning phase, Hudi applies the execution strategy to each group, primarily
based on sort columns and size. The strategy can be specified using
this [config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass).

`SparkSortAndSizeExecutionStrategy` is the default strategy. Users can specify the columns to sort the data by when
clustering, using
this [config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysortcolumns). Apart from
that, we can also set the [max file size](https://hudi.apache.org/docs/next/configurations/#hoodieparquetmaxfilesize)

Review comment: Yeah, that's right. Both go hand in hand.
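To make the "hand in hand" relationship concrete, here is an illustrative (not prescriptive) sketch of a write-config properties fragment combining the plan and execution strategy knobs linked above. The dotted key names follow the config anchors referenced in this post; the fully qualified execution strategy class name is my assumption from the Hudi source layout, and all values (byte sizes, sort columns) are hypothetical placeholders, not recommendations:

```
# Plan strategy: size-based grouping (package path mirrors the Hudi source tree)
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
# Illustrative values: slices under ~300 MB are candidates; groups capped at ~2 GB
hoodie.clustering.plan.strategy.small.file.limit=314572800
hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648

# Execution strategy: sort-and-size rewrite (class package is an assumption)
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
# Hypothetical sort columns and a ~128 MB target parquet file size
hoodie.clustering.plan.strategy.sort.columns=ts,uuid
hoodie.parquet.max.file.size=134217728
```

The small file limit and max bytes per group govern which slices get picked and how they are grouped, while the parquet max file size shapes the rewritten output files, which is why the planning and execution settings need to be tuned together.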
We calculate the total size based on the base file size (or the parquet max file size when no base file is present) and then check whether it has crossed the clustering max bytes per group. https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/plan/strategy/SparkSizeBasedClusteringPlanStrategy.java#L76-L85

```
totalSizeSoFar += currentSlice.getBaseFile().isPresent()
    ? currentSlice.getBaseFile().get().getFileSize()
    : getWriteConfig().getParquetMaxFileSize();

// check if max size is reached and create new group, if needed.
if (totalSizeSoFar >= getWriteConfig().getClusteringMaxBytesInGroup() && !currentGroup.isEmpty()) {
  int numOutputGroups = getNumberOfOutputFileGroups(totalSizeSoFar, getWriteConfig().getClusteringTargetFileMaxBytes());
  LOG.info("Adding one clustering group " + totalSizeSoFar + " max bytes: "
      + getWriteConfig().getClusteringMaxBytesInGroup() + " num input slices: " + currentGroup.size()
      + " output groups: " + numOutputGroups);
  fileSliceGroups.add(Pair.of(currentGroup, numOutputGroups));
  currentGroup = new ArrayList<>();
  totalSizeSoFar = 0;
}
```
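For intuition, the grouping loop quoted above can be distilled into a standalone sketch. This is a simplified illustration of the quoted logic, not the actual Hudi class: it only counts slices per group given their sizes and a group size cap, and omits the output-file-group calculation and logging.

```java
import java.util.ArrayList;
import java.util.List;

public class ClusteringGroupSketch {

  /**
   * Mirrors the accumulate-and-cut pattern from SparkSizeBasedClusteringPlanStrategy:
   * keep adding slice sizes to the current group, and cut a new group once the
   * running total reaches maxBytesPerGroup. Returns the slice count of each group.
   */
  static List<Integer> buildGroups(long[] sliceSizes, long maxBytesPerGroup) {
    List<Integer> groupSizes = new ArrayList<>();
    int currentGroupSliceCount = 0;
    long totalSizeSoFar = 0;
    for (long size : sliceSizes) {
      totalSizeSoFar += size;
      currentGroupSliceCount++;
      // Size cap reached: close the current group and start a fresh one.
      if (totalSizeSoFar >= maxBytesPerGroup) {
        groupSizes.add(currentGroupSliceCount);
        currentGroupSliceCount = 0;
        totalSizeSoFar = 0;
      }
    }
    // Remaining slices form a final, smaller group.
    if (currentGroupSliceCount > 0) {
      groupSizes.add(currentGroupSliceCount);
    }
    return groupSizes;
  }

  public static void main(String[] args) {
    // Three 60 MB slices with a 100 MB cap: the first two slices hit the cap
    // and form one group; the third spills into a second group.
    long mb = 1024L * 1024L;
    List<Integer> groups = buildGroups(new long[] {60 * mb, 60 * mb, 60 * mb}, 100 * mb);
    System.out.println(groups); // [2, 1]
  }
}
```

The real strategy additionally computes how many output file groups each input group should produce from the accumulated size and the target file size, which is the `getNumberOfOutputFileGroups` call in the snippet above.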
