codope commented on a change in pull request #3525: URL: https://github.com/apache/hudi/pull/3525#discussion_r696767387
########## File path: website/blog/2021-08-23-async-clustering.md ##########
@@ -0,0 +1,153 @@

---
title: "Asynchronous Clustering using Hudi"
excerpt: "How to set up Hudi for asynchronous clustering"
author: codope
category: blog
---

In one of the [previous blog posts](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro), we introduced a new
kind of table service called clustering to reorganize data for improved query performance without compromising on
ingestion speed. We learnt how to set up inline clustering. In this post, we will discuss what has changed since then and
see how asynchronous clustering can be set up using the HoodieClusteringJob as well as the DeltaStreamer utility.

## Introduction

At a high level, clustering creates a plan based on a configurable strategy, groups eligible files based on specific
criteria, and then executes the plan. Hudi's [MVCC model](https://hudi.apache.org/docs/concurrency_control) provides
snapshot isolation between multiple table services, which allows writers to continue with ingestion while clustering
runs in the background. For a more detailed overview of the clustering architecture, please check out the previous blog
post.

## Clustering Strategies

As mentioned before, the clustering plan as well as its execution depend on a configurable strategy. These strategies can
be broadly classified into three types: clustering plan strategy, execution strategy, and update strategy.

### Plan Strategy

This strategy comes into play while creating the clustering plan. It helps decide which file groups should be clustered.
Let's look at the different plan strategies that are available with Hudi. Note that these strategies are easily pluggable
using this [config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).

1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
   the [small file limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
   of base files and creates clustering groups up to the max file size allowed per group. The max size can be specified
   using this [config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup).
2. `SparkRecentDaysClusteringPlanStrategy`: It looks back at the previous 'N' days' partitions and creates a plan that
   will cluster the 'small' file slices within those partitions. This is the default strategy.
3. `SparkSelectedPartitionsClusteringPlanStrategy`: In case you want to cluster only specific partitions within a range,
   no matter how old or new those partitions are, this strategy could be useful. To use this strategy, one additionally
   needs to set the two configs below (both begin and end partitions are inclusive):

```
hoodie.clustering.plan.strategy.cluster.begin.partition
hoodie.clustering.plan.strategy.cluster.end.partition
```

**NOTE**: All the strategies are partition-aware, and the latter two are still bound by the size limits of the first
strategy.

### Execution Strategy

After building the clustering groups in the planning phase, Hudi applies the execution strategy to each group, primarily
based on sort columns and size. The strategy can be specified using
this [config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass).

`SparkSortAndSizeExecutionStrategy` is the default strategy. Users can specify the columns to sort the data by when
clustering, using
this [config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysortcolumns). Apart from
that, we can also set the [max file size](https://hudi.apache.org/docs/next/configurations/#hoodieparquetmaxfilesize)

Review comment: Yeah, that's right. Both go hand in hand.
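To make the "hand in hand" relationship concrete, here is an illustrative (not prescriptive) sketch of a write-config properties fragment combining the plan and execution strategy knobs linked above. The dotted key names follow the config anchors referenced in this post; the fully qualified execution strategy class name is my assumption from the Hudi source layout, and all values (byte sizes, sort columns) are hypothetical placeholders, not recommendations:

```
# Plan strategy: size-based grouping (package path mirrors the Hudi source tree)
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
# Illustrative values: slices under ~300 MB are candidates; groups capped at ~2 GB
hoodie.clustering.plan.strategy.small.file.limit=314572800
hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648

# Execution strategy: sort-and-size rewrite (class package is an assumption)
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
# Hypothetical sort columns and a ~128 MB target parquet file size
hoodie.clustering.plan.strategy.sort.columns=ts,uuid
hoodie.parquet.max.file.size=134217728
```

The small file limit and max bytes per group govern which slices get picked and how they are grouped, while the parquet max file size shapes the rewritten output files, which is why the planning and execution settings need to be tuned together.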
We calculate the total size based on the base file size (or the parquet max file size when no base file is present) and then check whether it has crossed the clustering max bytes per group. https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/plan/strategy/SparkSizeBasedClusteringPlanStrategy.java#L76-L85

```
totalSizeSoFar += currentSlice.getBaseFile().isPresent()
    ? currentSlice.getBaseFile().get().getFileSize()
    : getWriteConfig().getParquetMaxFileSize();

// check if max size is reached and create new group, if needed.
if (totalSizeSoFar >= getWriteConfig().getClusteringMaxBytesInGroup() && !currentGroup.isEmpty()) {
  int numOutputGroups = getNumberOfOutputFileGroups(totalSizeSoFar, getWriteConfig().getClusteringTargetFileMaxBytes());
  LOG.info("Adding one clustering group " + totalSizeSoFar + " max bytes: "
      + getWriteConfig().getClusteringMaxBytesInGroup() + " num input slices: " + currentGroup.size()
      + " output groups: " + numOutputGroups);
  fileSliceGroups.add(Pair.of(currentGroup, numOutputGroups));
  currentGroup = new ArrayList<>();
  totalSizeSoFar = 0;
}
```
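For intuition, the grouping loop quoted above can be distilled into a standalone sketch. This is a simplified illustration of the quoted logic, not the actual Hudi class: it only counts slices per group given their sizes and a group size cap, and omits the output-file-group calculation and logging.

```java
import java.util.ArrayList;
import java.util.List;

public class ClusteringGroupSketch {

  /**
   * Mirrors the accumulate-and-cut pattern from SparkSizeBasedClusteringPlanStrategy:
   * keep adding slice sizes to the current group, and cut a new group once the
   * running total reaches maxBytesPerGroup. Returns the slice count of each group.
   */
  static List<Integer> buildGroups(long[] sliceSizes, long maxBytesPerGroup) {
    List<Integer> groupSizes = new ArrayList<>();
    int currentGroupSliceCount = 0;
    long totalSizeSoFar = 0;
    for (long size : sliceSizes) {
      totalSizeSoFar += size;
      currentGroupSliceCount++;
      // Size cap reached: close the current group and start a fresh one.
      if (totalSizeSoFar >= maxBytesPerGroup) {
        groupSizes.add(currentGroupSliceCount);
        currentGroupSliceCount = 0;
        totalSizeSoFar = 0;
      }
    }
    // Remaining slices form a final, smaller group.
    if (currentGroupSliceCount > 0) {
      groupSizes.add(currentGroupSliceCount);
    }
    return groupSizes;
  }

  public static void main(String[] args) {
    // Three 60 MB slices with a 100 MB cap: the first two slices hit the cap
    // and form one group; the third spills into a second group.
    long mb = 1024L * 1024L;
    List<Integer> groups = buildGroups(new long[] {60 * mb, 60 * mb, 60 * mb}, 100 * mb);
    System.out.println(groups); // [2, 1]
  }
}
```

The real strategy additionally computes how many output file groups each input group should produce from the accumulated size and the target file size, which is the `getNumberOfOutputFileGroups` call in the snippet above.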
