xuzifu666 commented on a change in pull request #3525:
URL: https://github.com/apache/hudi/pull/3525#discussion_r697994100



##########
File path: website/blog/2021-08-23-async-clustering.md
##########
@@ -0,0 +1,159 @@
+---
+title: "Asynchronous Clustering using Hudi"
+excerpt: "How to setup Hudi for asynchronous clustering"
+author: codope
+category: blog
+---
+
+In a [previous blog post](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro), we introduced a new
+kind of table service called clustering to reorganize data for improved query performance without compromising on
+ingestion speed. We learnt how to set up inline clustering. In this post, we will discuss what has changed since then
+and see how asynchronous clustering can be set up using HoodieClusteringJob as well as the DeltaStreamer utility.
+
+<!--truncate-->
+
+## Introduction
+
+At a high level, clustering creates a plan based on a configurable strategy, groups eligible files based on specific
+criteria, and then executes the plan. Hudi's [MVCC model](https://hudi.apache.org/docs/concurrency_control) provides
+snapshot isolation between multiple table services, which allows writers to continue with ingestion while clustering
+runs in the background. For a more detailed overview of the clustering architecture, please check out the previous
+blog post.
+
+## Clustering Strategies
+
+As mentioned before, both the clustering plan and its execution depend on a configurable strategy. These strategies
+can be broadly classified into three types: clustering plan strategy, execution strategy and update strategy.
+
+### Plan Strategy
+
+This strategy comes into play while creating the clustering plan. It helps decide which file groups should be
+clustered. Let's look at the different plan strategies that are available with Hudi. Note that these strategies are
+easily pluggable using this [config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).
+
+1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
+   the [small file limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+   of base files and creates clustering groups up to the max file size allowed per group. The max size can be
+   specified using this [config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup).
+   This strategy is useful for stitching together medium-sized files into larger ones to reduce the number of files
+   spread across cold partitions.
+2. `SparkRecentDaysClusteringPlanStrategy`: It looks at the partitions from the previous 'N' days and creates a plan
+   that will cluster the 'small' file slices within those partitions. This is the default strategy. It could be useful
+   when the workload is predictable and data is partitioned by timestamp.
+3. `SparkSelectedPartitionsClusteringPlanStrategy`: In case you want to cluster only specific partitions within a
+   range, no matter how old or new those partitions are, then this strategy could be useful. To use this strategy, one
+   needs to additionally set the two configs below (both begin and end partitions are inclusive):
+
+```
+hoodie.clustering.plan.strategy.cluster.begin.partition
+hoodie.clustering.plan.strategy.cluster.end.partition
+```
+
+:::note
+All the strategies are partition-aware and the latter two are still bound by the size limits of the first strategy.
+:::
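+
+To make this concrete, a plan-strategy configuration might look like the sketch below. The fully qualified class name
+and the size values are illustrative assumptions; please verify them against the linked config pages for your Hudi
+version.
+
+```
+# Assumed package path; check the config docs for your release
+hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy
+# Base files under this size (in bytes) are candidates for clustering (illustrative value)
+hoodie.clustering.plan.strategy.small.file.limit=629145600
+# Max total bytes per clustering group (illustrative value)
+hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648
+```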
+
+### Execution Strategy
+
+After building the clustering groups in the planning phase, Hudi applies an execution strategy to each group,
+primarily based on sort columns and size. The strategy can be specified using
+this [config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass).
+
+`SparkSortAndSizeExecutionStrategy` is the default strategy. When clustering, users can specify the columns to sort
+the data by using
+this [config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysortcolumns). Apart from
+that, we can also set the [max file size](https://hudi.apache.org/docs/next/configurations/#hoodieparquetmaxfilesize)
+for the parquet files produced by clustering. The strategy uses bulk insert to write data into new files, in which
+case Hudi implicitly uses a partitioner that sorts based on the specified columns. In this way, the strategy changes
+the data layout in a way that not only improves query performance but also balances rewrite overhead automatically.
+
+This strategy can be executed either as a single Spark job or as multiple jobs, depending on the number of clustering
+groups created in the planning phase. By default, Hudi will submit multiple Spark jobs and union the results. In case
+you want to force Hudi to use a single Spark job, set the execution strategy
+class [config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass)
+to `SingleSparkJobExecutionStrategy`.
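+
+Putting this together, an execution-strategy configuration could look like the sketch below. The fully qualified class
+name, the sort columns and the file size are illustrative assumptions to verify against the linked config pages.
+
+```
+# Assumed package path; check the config docs for your release
+hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
+# Hypothetical columns to sort the rewritten data by
+hoodie.clustering.plan.strategy.sort.columns=event_ts,event_type
+# Target max size (in bytes) for the clustered parquet files (illustrative value)
+hoodie.parquet.max.file.size=1073741824
+```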
+
+### Update Strategy
+
+Currently, clustering can only be scheduled for tables/partitions that are not receiving any concurrent updates. By
+default, the [config for update strategy](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringupdatesstrategy) is
+set to ***SparkRejectUpdateStrategy***. If some file group receives updates during clustering, then the updates are
+rejected and an exception is thrown. However, in some use-cases updates are very sparse and do not touch most file
+groups. The default strategy of simply rejecting updates does not seem fair for such use-cases, and users can instead
+set the config to ***SparkAllowUpdateStrategy***.
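+
+For instance, allowing such sparse updates during clustering could be configured as below; the fully qualified class
+name is an assumption to verify against the config docs.
+
+```
+# Assumed package path; check the config docs for your release
+hoodie.clustering.updates.strategy=org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy
+```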
+
+We have discussed the critical strategy configurations. All other configurations related to clustering are
+listed [here](https://hudi.apache.org/docs/next/configurations/#Clustering-Configs). Out of this list, a few
+configurations that will be very useful are:
+
+| Config key | Remarks | Default |
+| ----------- | ------- | ------- |
+| `hoodie.clustering.async.enabled` | Enable running of the clustering service asynchronously as writes happen on the table. | False |
+| `hoodie.clustering.async.max.commits` | Control the frequency of async clustering by specifying after how many commits clustering should be triggered. | 4 |
+| `hoodie.clustering.preserve.commit.metadata` | When rewriting data, preserve the existing `_hoodie_commit_time`. This means users can run incremental queries on clustered data without any side-effects. | False |
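+
+For example, to trigger async clustering after every 4 commits while preserving commit times for incremental queries,
+one might set the following (values shown are illustrative):
+
+```
+hoodie.clustering.async.enabled=true
+hoodie.clustering.async.max.commits=4
+hoodie.clustering.preserve.commit.metadata=true
+```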
+
+## Setup Asynchronous Clustering
+
+Previously, we have seen how users
+can [set up inline clustering](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro#setting-up-clustering).
+Additionally, users can
+leverage [HoodieClusteringJob](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-SetupforAsyncclusteringJob)
+to set up 2-step asynchronous clustering.
+
+### HoodieClusteringJob
+
+With the release of Hudi version 0.9.0, we can schedule as well as execute clustering in the same step. We just need
+to specify the `--mode` or `-m` option. There are three modes:
+
+1. `schedule`: Make a clustering plan. This gives an instant which can be passed in execute mode.
+2. `execute`: Execute a clustering plan at a given instant, which means `--instant-time` is required here.
+3. `scheduleAndExecute`: Make a clustering plan first and execute that plan immediately. The 2-step flow with the
+   first two modes is sketched below.
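+
+To illustrate the 2-step flow, the pair of invocations might look like the sketch below; the flags are the ones
+described above, the paths are placeholders, and the instant time must be taken from the output of the schedule step.
+
+```bash
+# Step 1: create a clustering plan; note the instant time it reports
+spark-submit \
+--class org.apache.hudi.utilities.HoodieClusteringJob \
+/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar \
+--props /path/to/config/clusteringjob.properties \
+--mode schedule \
+--base-path /path/to/hudi_table/basePath \
+--table-name hudi_table_schedule_clustering \
+--spark-memory 1g
+
+# Step 2: execute the plan created at that instant
+spark-submit \
+--class org.apache.hudi.utilities.HoodieClusteringJob \
+/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar \
+--props /path/to/config/clusteringjob.properties \
+--mode execute \
+--instant-time <instant-from-step-1> \
+--base-path /path/to/hudi_table/basePath \
+--table-name hudi_table_schedule_clustering \
+--spark-memory 1g
+```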
+
+A sample spark-submit command to run HoodieClusteringJob in `scheduleAndExecute` mode is shown below:
+
+```bash
+spark-submit \
+--class org.apache.hudi.utilities.HoodieClusteringJob \
+/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar \
+--props /path/to/config/clusteringjob.properties \
+--mode scheduleAndExecute \
+--base-path /path/to/hudi_table/basePath \
+--table-name hudi_table_schedule_clustering \
+--spark-memory 1g
+```
+
+### HoodieDeltaStreamer

Review comment:
       The content of clusteringjob.properties needs to be shown to the reader; otherwise it is hard for newcomers to use.



