[
https://issues.apache.org/jira/browse/HUDI-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405306#comment-17405306
]
ASF GitHub Bot commented on HUDI-2346:
--------------------------------------
codope commented on a change in pull request #3525:
URL: https://github.com/apache/hudi/pull/3525#discussion_r696744033
##########
File path: website/blog/2021-08-23-async-clustering.md
##########
@@ -0,0 +1,153 @@
+---
+title: "Asynchronous Clustering using Hudi"
+excerpt: "How to set up Hudi for asynchronous clustering"
+author: codope
+category: blog
+---
+
+In one of the [previous blog](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro) posts, we introduced a new
+kind of table service called clustering to reorganize data for improved query performance without compromising on
+ingestion speed. We learnt how to set up inline clustering. In this post, we will discuss what has changed since then and
+see how asynchronous clustering can be set up using HoodieClusteringJob as well as the DeltaStreamer utility.
+
+## Introduction
+
+At a high level, clustering creates a plan based on a configurable strategy, groups eligible files based on specific
+criteria and then executes the plan. Hudi's [MVCC model](https://hudi.apache.org/docs/concurrency_control) provides
+snapshot isolation between multiple table services, which allows writers to continue with ingestion while clustering
+runs in the background. For a more detailed overview of the clustering architecture, please check out the previous blog
+post.
+
+## Clustering Strategies
+
+As mentioned before, the clustering plan as well as its execution depend on a configurable strategy. These strategies can be
+broadly classified into three types: clustering plan strategy, execution strategy and update strategy.
+
+### Plan Strategy
+
+This strategy comes into play while creating the clustering plan. It helps decide which file groups should be clustered.
+Let's look at the different plan strategies that are available with Hudi. Note that these strategies are easily pluggable
+using this [config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).
+
+1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
+ the [small file limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+ of base files and creates clustering groups up to the max file size allowed per group. The max size can be specified using
Review comment:
Good suggestion. Added a couple of lines.
> Publish blog on async clustering usage
> --------------------------------------
>
> Key: HUDI-2346
> URL: https://issues.apache.org/jira/browse/HUDI-2346
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Major
> Labels: pull-request-available
>