[jira] [Commented] (HUDI-2346) Publish blog on async clustering usage

ASF GitHub Bot (Jira) Thu, 26 Aug 2021 08:31:37 -0700


    [ https://issues.apache.org/jira/browse/HUDI-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405306#comment-17405306 ]


ASF GitHub Bot commented on HUDI-2346:
--------------------------------------

codope commented on a change in pull request #3525:
URL: https://github.com/apache/hudi/pull/3525#discussion_r696744033



##########
File path: website/blog/2021-08-23-async-clustering.md
##########
@@ -0,0 +1,153 @@
+---
+title: "Asynchronous Clustering using Hudi"
+excerpt: "How to setup Hudi for asynchronous clustering"
+author: codope 
+category: blog
+---
+
+In one of the [previous blog](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro) posts, we introduced a new
+kind of table service called clustering to reorganize data for improved query performance without compromising on
+ingestion speed. We learnt how to setup inline clustering. In this post, we will discuss what has changed since then and
+see how asynchronous clustering can be setup using HoodieClusteringJob as well as DeltaStreamer utility.
+
+## Introduction
+
+On a high level, clustering creates a plan based on a configurable strategy, groups eligible files based on specific
+criteria and then executes the plan. Hudi's [MVCC model](https://hudi.apache.org/docs/concurrency_control) provides
+snapshot isolation between multiple table services, which allows writers to continue with ingestion while clustering
+runs in the background. For a more detailed overview of the clustering architecture please check out the previous blog
+post.
+
+## Clustering Strategies
+
+As mentioned before, clustering plan as well as execution depends on configurable strategy. These strategies can be
+broadly classified into three types: clustering plan strategy, execution strategy and update strategy.
+
+### Plan Strategy
+
+This strategy comes into play while creating clustering plan. It helps to decide what file groups should be clustered.
+Let's look at different plan strategies that are available with Hudi. Note that these strategies are easily pluggable
+using this [config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).
+
+1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
+   the [small file limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+   of base files and creates clustering groups upto max file size allowed per group. The max size can be specified using

Review comment:
       Good suggestion. Added a couple of lines.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


> Publish blog on async clustering usage
> --------------------------------------
>
>                 Key: HUDI-2346
>                 URL: https://issues.apache.org/jira/browse/HUDI-2346
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: Sagar Sumit
>            Assignee: Sagar Sumit
>            Priority: Major
>              Labels: pull-request-available
>




