This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 76b212fed0a [DOCS] Update clustering docs (#7985)
76b212fed0a is described below

commit 76b212fed0a766fe0a2edd4c04215bb52e718343
Author: Sagar Sumit <[email protected]>
AuthorDate: Tue Apr 4 08:22:31 2023 +0530

    [DOCS] Update clustering docs (#7985)
---
 website/docs/clustering.md                         | 231 ++++++++++++++-------
 .../assets/images/clustering_small_files.gif       | Bin 0 -> 668806 bytes
 website/static/assets/images/clustering_sort.gif   | Bin 0 -> 735437 bytes
 3 files changed, 159 insertions(+), 72 deletions(-)

diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index 9e157de785b..d2ceb196d02 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -10,6 +10,17 @@ last_modified_at:
 Apache Hudi brings stream processing to big data, providing fresh data while 
being an order of magnitude efficient over traditional batch processing. In a 
data lake/warehouse, one of the key trade-offs is between ingestion speed and 
query performance. Data ingestion typically prefers small files to improve 
parallelism and make data available to queries as soon as possible. However, 
query performance degrades significantly with a lot of small files. Also, during 
ingestion, data is typically co-l [...]
 <!--truncate-->
 
+## How is compaction different from clustering?
+
+Hudi is modeled like a log-structured storage engine with multiple versions of 
the data.
+Particularly, [Merge-On-Read](/docs/table_types#merge-on-read-table)
+tables in Hudi store data using a combination of base files in a columnar format and row-based delta logs that contain
+updates. Compaction is a way to merge the delta logs with base files to 
produce the latest file slices with the most
+recent snapshot of data. Compaction helps to keep the query performance in 
check (larger delta log files would incur
+longer merge times on query side). On the other hand, clustering is a data 
layout optimization technique. One can stitch
+together small files into larger files using clustering. Additionally, data 
can be clustered by sort key so that queries
+can take advantage of data locality.
+
 ## Clustering Architecture
 
 At a high level, Hudi provides different operations such as 
insert/upsert/bulk_insert through its write client API to be able to write 
data to a Hudi table. To be able to choose a trade-off between file size and 
ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be 
able to configure the smallest allowable file size. Users are able to configure 
the small file [soft 
limit](https://hudi.apache.org/docs/configurations/#hoodieparquetsmallfilelimit)
 to `0` to force new [...]
@@ -22,13 +33,13 @@ Clustering table service can run asynchronously or 
synchronously adding a new ac
 
 
 
-### Overall, there are 2 parts to clustering
+### Overall, there are 2 steps to clustering
 
 1.  Scheduling clustering: Create a clustering plan using a pluggable 
clustering strategy.
 2.  Execute clustering: Process the plan using an execution strategy to create 
new files and replace old files.
 
 
-### Scheduling clustering
+### Schedule clustering
 
 The following steps are used to schedule clustering.
 
@@ -37,7 +48,7 @@ Following steps are followed to schedule clustering.
 3.  Finally, the clustering plan is saved to the timeline in an avro [metadata 
format](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieClusteringPlan.avsc).
 
 
-### Running clustering
+### Execute clustering
 
 1.  Read the clustering plan and get the ‘clusteringGroups’ that mark the file 
groups that need to be clustered.
 2.  For each group, we instantiate appropriate strategy class with 
strategyParams (example: sortColumns) and apply that strategy to rewrite the 
data.
@@ -51,8 +62,147 @@ NOTE: Clustering can only be scheduled for tables / 
partitions not receiving any
 ![Clustering 
example](/assets/images/blog/clustering/example_perf_improvement.png)
 _Figure: Illustrating query performance improvements by clustering_
 
-### Setting up clustering
-Inline clustering can be setup easily using spark dataframe options. See 
sample below
+## Clustering Use Cases
+
+### Batching small files
+
+As mentioned in the intro, streaming ingestion generally results in smaller files in your data lake. But having a lot of
+such small files could lead to higher query latency. From our experience supporting community users, quite a
+few users use Hudi just for its small file handling capabilities. So, you could employ clustering to batch a lot
+of such small files into larger ones.
+
+![Batching small files](/assets/images/clustering_small_files.gif)
+
+### Cluster by sort key
+
+Another classic problem in data lakes is arrival time versus event time. Data is generally written in
+arrival-time order, while query predicates are typically on event time. With clustering, you can rewrite your data,
+sorting it by the columns in your query predicates, so that data skipping becomes very efficient and your queries can
+avoid scanning a lot of unnecessary data.
+
+![Clustering by sort key](/assets/images/clustering_sort.gif)
+
+## Clustering Strategies
+
+On a high level, clustering creates a plan based on a configurable strategy, groups eligible files based on specific
+criteria, and then executes the plan. As mentioned before, both the clustering plan and its execution depend on a
+configurable strategy. These strategies can be broadly classified into three types: clustering plan strategy, execution
+strategy and update strategy.
+
+### Plan Strategy
+
+This strategy comes into play while creating the clustering plan. It helps decide which file groups should be clustered
+and how many output file groups the clustering should produce. Note that these strategies are easily pluggable using the
+config 
[hoodie.clustering.plan.strategy.class](/docs/configurations#hoodieclusteringplanstrategyclass).
+
+Different plan strategies are as follows:
+
+#### Size-based clustering strategies
+
+This strategy creates clustering groups based on the max size allowed per group. Also, it excludes files that are
+larger than the small file limit from the clustering plan. The available strategies, depending on the write client,
+are: `SparkSizeBasedClusteringPlanStrategy`, 
`FlinkSizeBasedClusteringPlanStrategy`
+and `JavaSizeBasedClusteringPlanStrategy`. Furthermore, Hudi provides the flexibility to include or exclude partitions
+for clustering, and to tune the file size limits and the maximum number of output groups, as we will see below.
+
+`hoodie.clustering.plan.strategy.partition.selected`: Comma separated list of 
partitions to be considered for
+clustering.
+
+`hoodie.clustering.plan.strategy.partition.regex.pattern`: Filters clustering to partitions that match the regex pattern.
+
+`hoodie.clustering.plan.partition.filter.mode`: In addition to the above filters, this mode provides further partition
+filtering. The possible values are `NONE`, `RECENT_DAYS`, `SELECTED_PARTITIONS` and `DAY_ROLLING`.
+
+- `NONE`: Do not filter partitions; the clustering plan will include all partitions that have clustering candidates.
+- `RECENT_DAYS`: keep a continuous range of partitions, works together with 
the below configs:
+   - `hoodie.clustering.plan.strategy.daybased.lookback.partitions`: Number of 
partitions to list to create
+     ClusteringPlan.
+   - `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions`: 
Number of partitions to skip from latest when
+     choosing partitions to create ClusteringPlan. As the name implies, 
applicable only if partitioning is day based.
+- `SELECTED_PARTITIONS`: keep partitions that are in the specified range based 
on below configs:
+   - `hoodie.clustering.plan.strategy.cluster.begin.partition`: Begin 
partition used to filter partition (inclusive).
+   - `hoodie.clustering.plan.strategy.cluster.end.partition`: End partition 
used to filter partition (inclusive).
+- `DAY_ROLLING`: cluster partitions on a rolling basis by the hour to avoid 
clustering all partitions each time.
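
+To make the partition filters concrete, here is a sketch of wiring `RECENT_DAYS` filtering through Spark DataFrame
+options (the variables `df` and `basePath` and the numeric values are illustrative, not defaults):

```scala
// Sketch: restrict clustering planning to recent day-based partitions.
// `df` and `basePath` are assumed to exist; the numeric values are illustrative.
df.write.format("hudi").
  option("hoodie.clustering.inline", "true").
  option("hoodie.clustering.plan.partition.filter.mode", "RECENT_DAYS").
  // Consider the last 3 partitions when building the clustering plan...
  option("hoodie.clustering.plan.strategy.daybased.lookback.partitions", "3").
  // ...but skip the most recent one, which may still be receiving writes.
  option("hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions", "1").
  mode("append").
  save(basePath)
```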
+
+**Small file limit**
+
+`hoodie.clustering.plan.strategy.small.file.limit`: Files smaller than the size in bytes specified here are candidates
+for clustering. Larger file groups will be ignored.
+
+**Max number of groups**
+
+`hoodie.clustering.plan.strategy.max.num.groups`: Maximum number of groups to create as part of the ClusteringPlan.
+Increasing the number of groups increases parallelism. Note that this is not the number of output file groups; it refers
+to the clustering groups (parallel tasks/threads that work towards producing the output file groups). The total number
+of output file groups is also determined by the target file size, which we will discuss shortly.
+
+**Max bytes per group**
+
+`hoodie.clustering.plan.strategy.max.bytes.per.group`: Each clustering operation can create multiple output file groups.
+The total amount of data processed by one clustering operation is bounded by (max bytes per group * max number of
+groups). Thus, this config caps the maximum amount of data to be included in one group.
+
+**Target file size max**
+
+`hoodie.clustering.plan.strategy.target.file.max.bytes`: Each group can produce 'N' (max group size / target file size)
+output file groups.
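
+As a worked example of how these three size knobs interact (the byte values below are made up for illustration, not
+Hudi defaults):

```scala
// Illustrative arithmetic only; these values are examples, not Hudi defaults.
val maxNumGroups     = 30L                      // hoodie.clustering.plan.strategy.max.num.groups
val maxBytesPerGroup = 2L * 1024 * 1024 * 1024  // hoodie.clustering.plan.strategy.max.bytes.per.group (2 GB)
val targetFileBytes  = 256L * 1024 * 1024       // hoodie.clustering.plan.strategy.target.file.max.bytes (256 MB)

// Each group can produce up to N = max group size / target file size output file groups.
val outputFileGroupsPerGroup = maxBytesPerGroup / targetFileBytes   // 8
// One clustering operation processes at most max bytes per group * max num groups.
val maxBytesPerOperation = maxBytesPerGroup * maxNumGroups          // 60 GB
```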
+
+#### SparkSingleFileSortPlanStrategy
+
+In this strategy, the clustering group for each partition is built in the same way as
+`SparkSizeBasedClusteringPlanStrategy`. The difference is that there is a single output file group and the file group id
+remains the same, while `SparkSizeBasedClusteringPlanStrategy` can create multiple file groups with newer fileIds.
+
+#### SparkConsistentBucketClusteringPlanStrategy
+
+This strategy is specifically used for the consistent bucket index and is leveraged to expand your bucket index (from
+static to dynamic). Typically, users don't need to use this strategy directly; Hudi uses it internally to
+dynamically expand the buckets for bucket index datasets.
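
+For completeness, here is a rough sketch of the configs involved when pairing the consistent hashing bucket index with
+this plan strategy. The exact config keys and the fully qualified class name below are assumptions; verify them against
+your Hudi version. `df` and `basePath` are assumed to exist.

```scala
// Sketch only: consistent hashing bucket index with its clustering plan strategy.
// Verify the config keys and class name against your Hudi version.
df.write.format("hudi").
  option("hoodie.index.type", "BUCKET").
  option("hoodie.index.bucket.engine", "CONSISTENT_HASHING").
  option("hoodie.clustering.plan.strategy.class",
    "org.apache.hudi.client.clustering.plan.strategy.SparkConsistentBucketClusteringPlanStrategy").
  mode("append").
  save(basePath)
```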
+
+:::note
+The latter two strategies are applicable only to the Spark engine.
+:::
+
+### Execution Strategy
+
+After building the clustering groups in the planning phase, Hudi applies an execution strategy to each group, primarily
+based on sort columns and size. The strategy can be specified using the
+config [hoodie.clustering.execution.strategy.class](/docs/configurations/#hoodieclusteringexecutionstrategyclass). By
+default, Hudi sorts the file groups in the plan by the specified columns, while meeting the configured target file
+sizes.
+
+The available strategies are as follows:
+
+1. `SPARK_SORT_AND_SIZE_EXECUTION_STRATEGY`: Uses bulk_insert to re-write data 
from input file groups.
+   1. Set `hoodie.clustering.execution.strategy.class`
+      to 
`org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy`.
+   2. `hoodie.clustering.plan.strategy.sort.columns`: Comma-separated list of columns to sort the data by while
+      clustering. This works in conjunction with layout optimization strategies, depending on your query predicates.
+2. `JAVA_SORT_AND_SIZE_EXECUTION_STRATEGY`: Similar to 
`SPARK_SORT_AND_SIZE_EXECUTION_STRATEGY`, for the Java and Flink
+   engines. Set `hoodie.clustering.execution.strategy.class`
+   to 
`org.apache.hudi.client.clustering.run.strategy.JavaSortAndSizeExecutionStrategy`.
+3. `SPARK_CONSISTENT_BUCKET_EXECUTION_STRATEGY`: As the name implies, this is 
applicable to dynamically expand
+   consistent bucket index and only applicable to the Spark engine. Set 
`hoodie.clustering.execution.strategy.class`
+   to 
`org.apache.hudi.client.clustering.run.strategy.SparkConsistentBucketClusteringExecutionStrategy`.
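
+Putting the default strategy together with sort columns, a sketch using Spark DataFrame options (the column names
+`city,event_ts` and the variables `df`/`basePath` are hypothetical placeholders):

```scala
// Sketch: sort-and-size execution strategy with user-chosen sort columns.
// Column names are hypothetical; replace them with columns from your query predicates.
df.write.format("hudi").
  option("hoodie.clustering.inline", "true").
  option("hoodie.clustering.execution.strategy.class",
    "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy").
  option("hoodie.clustering.plan.strategy.sort.columns", "city,event_ts").
  mode("append").
  save(basePath)
```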
+
+### Update Strategy
+
+Currently, clustering can only be scheduled for tables/partitions not receiving any concurrent updates. By default,
+the [config for update strategy](/docs/configurations/#hoodieclusteringupdatesstrategy) is set to
+***SparkRejectUpdateStrategy***. If some file group receives updates during clustering, then the updates are rejected
+and an exception is thrown. However, in some use cases updates are very sparse and do not touch most file groups, and
+simply rejecting all updates is overly conservative. In such use cases, users can set the config to
+***SparkAllowUpdateStrategy***.
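
+For such sparse-update workloads, switching the update strategy might look like the sketch below (the fully qualified
+class name is an assumption; check it against your Hudi version; `df` and `basePath` are assumed):

```scala
// Sketch: allow concurrent updates to file groups that are being clustered.
// The fully qualified class name below is assumed; verify it for your Hudi version.
df.write.format("hudi").
  option("hoodie.clustering.updates.strategy",
    "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy").
  mode("append").
  save(basePath)
```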
+
+We discussed the critical strategy configurations. All other configurations 
related to clustering are
+listed [here](/docs/configurations/#Clustering-Configs). Out of this list, a 
few configurations that will be very useful
+for inline or async clustering are shown below with code samples.
+
+## Inline clustering
+
+Inline clustering happens synchronously with the regular ingestion writer, which means the next round of ingestion
+cannot proceed until the clustering is complete. Inline clustering can be set up easily using Spark DataFrame options.
+See the sample below:
 
 ```scala
 import org.apache.hudi.QuickstartUtils._
@@ -80,83 +230,20 @@ df.write.format("org.apache.hudi").
         save("dfs://location");
 ```
 
-## Async Clustering - Strategies
-For more advanced usecases, async clustering pipeline can also be setup. See 
an example 
[here](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-SetupforAsyncclusteringJob).
+## Async Clustering
 
-On a high level, clustering creates a plan based on a configurable strategy, 
groups eligible files based on specific
-criteria and then executes the plan. Hudi supports 
[multi-writers](https://hudi.apache.org/docs/concurrency_control#enabling-multi-writing)
 which provides
+Async clustering runs the clustering table service in the background without blocking the regular ingestion writers.
+Hudi supports 
[multi-writers](https://hudi.apache.org/docs/concurrency_control#enabling-multi-writing)
 which provides
 snapshot isolation between multiple table services, thus allowing writers to 
continue with ingestion while clustering
 runs in the background.
 
-As mentioned before, clustering plan as well as execution depends on 
configurable strategy. These strategies can be
-broadly classified into three types: clustering plan strategy, execution 
strategy and update strategy.
-
-### Plan Strategy
-
-This strategy comes into play while creating clustering plan. It helps to 
decide what file groups should be clustered.
-Let's look at different plan strategies that are available with Hudi. Note 
that these strategies are easily pluggable
-using this [config](/docs/configurations#hoodieclusteringplanstrategyclass).
-
-1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
-   the [small file 
limit](/docs/configurations/#hoodieclusteringplanstrategysmallfilelimit)
-   of base files and creates clustering groups upto max file size allowed per 
group. The max size can be specified using
-   this 
[config](/docs/configurations/#hoodieclusteringplanstrategymaxbytespergroup). 
This
-   strategy is useful for stitching together medium-sized files into larger 
ones to reduce lot of files spread across
-   cold partitions.
-2. `SparkRecentDaysClusteringPlanStrategy`: It looks back previous 'N' days 
partitions and creates a plan that will
-   cluster the 'small' file slices within those partitions. This is the 
default strategy. It could be useful when the
-   workload is predictable and data is partitioned by time.
-3. `SparkSelectedPartitionsClusteringPlanStrategy`: In case you want to 
cluster only specific partitions within a range,
-   no matter how old or new are those partitions, then this strategy could be 
useful. To use this strategy, one needs
-   to set below two configs additionally (both begin and end partitions are 
inclusive):
-
-```
-hoodie.clustering.plan.strategy.cluster.begin.partition
-hoodie.clustering.plan.strategy.cluster.end.partition
-```
-
-:::note
-All the strategies are partition-aware and the latter two are still bound by 
the size limits of the first strategy.
-:::
-
-### Execution Strategy
-
-After building the clustering groups in the planning phase, Hudi applies 
execution strategy, for each group, primarily
-based on sort columns and size. The strategy can be specified using this 
[config](/docs/configurations/#hoodieclusteringexecutionstrategyclass).
-
-`SparkSortAndSizeExecutionStrategy` is the default strategy. Users can specify 
the columns to sort the data by, when
-clustering using
-this [config](/docs/configurations/#hoodieclusteringplanstrategysortcolumns). 
Apart from
-that, we can also set [max file 
size](/docs/configurations/#hoodieparquetmaxfilesize)
-for the parquet files produced due to clustering. The strategy uses bulk 
insert to write data into new files, in which
-case, Hudi implicitly uses a partitioner that does sorting based on specified 
columns. In this way, the strategy changes
-the data layout in a way that not only improves query performance but also 
balance rewrite overhead automatically.
-
-Now this strategy can be executed either as a single spark job or multiple 
jobs depending on number of clustering groups
-created in the planning phase. By default, Hudi will submit multiple spark 
jobs and union the results. In case you want
-to force Hudi to use single spark job, set the execution strategy
-class [config](/docs/configurations/#hoodieclusteringexecutionstrategyclass)
-to `SingleSparkJobExecutionStrategy`.
-
-### Update Strategy
-
-Currently, clustering can only be scheduled for tables/partitions not 
receiving any concurrent updates. By default,
-the [config for update 
strategy](/docs/configurations/#hoodieclusteringupdatesstrategy) is
-set to ***SparkRejectUpdateStrategy***. If some file group has updates during 
clustering then it will reject updates and
-throw an exception. However, in some use-cases updates are very sparse and do 
not touch most file groups. The default
-strategy to simply reject updates does not seem fair. In such use-cases, users 
can set the config to ***SparkAllowUpdateStrategy***.
-
-We discussed the critical strategy configurations. All other configurations 
related to clustering are
-listed [here](/docs/configurations/#Clustering-Configs). Out of this list, a 
few
-configurations that will be very useful are:
-
 |  Config key  | Remarks | Default |
 |  -----------  | -------  | ------- |
 | `hoodie.clustering.async.enabled` | Enable running of the clustering service asynchronously as writes happen on the table. | False |
 | `hoodie.clustering.async.max.commits` | Control frequency of async 
clustering by specifying after how many commits clustering should be triggered. 
| 4 |
 | `hoodie.clustering.preserve.commit.metadata` | When rewriting data, 
preserves existing _hoodie_commit_time. This means users can run incremental 
queries on clustered data without any side-effects. | False |
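
+The table's options can be combined on the ingestion writer as in this sketch (`df` and `basePath` are assumed; the
+commit count is illustrative):

```scala
// Sketch: trigger async clustering every 2 commits while ingestion continues.
df.write.format("hudi").
  option("hoodie.clustering.async.enabled", "true").
  option("hoodie.clustering.async.max.commits", "2").
  option("hoodie.clustering.preserve.commit.metadata", "true").
  mode("append").
  save(basePath)
```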
 
-## Asynchronous Clustering
+## Set up Asynchronous Clustering
 Users can leverage 
[HoodieClusteringJob](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-SetupforAsyncclusteringJob)
 to setup 2-step asynchronous clustering.
 
diff --git a/website/static/assets/images/clustering_small_files.gif 
b/website/static/assets/images/clustering_small_files.gif
new file mode 100644
index 00000000000..d1ce621e752
Binary files /dev/null and 
b/website/static/assets/images/clustering_small_files.gif differ
diff --git a/website/static/assets/images/clustering_sort.gif 
b/website/static/assets/images/clustering_sort.gif
new file mode 100644
index 00000000000..f7e4ed4fdf5
Binary files /dev/null and b/website/static/assets/images/clustering_sort.gif 
differ
