codope commented on code in PR #7985:
URL: https://github.com/apache/hudi/pull/7985#discussion_r1109827000


##########
website/docs/clustering.md:
##########
@@ -51,45 +62,32 @@ NOTE: Clustering can only be scheduled for tables / partitions not receiving any
 ![Clustering example](/assets/images/blog/clustering/example_perf_improvement.png)
 _Figure: Illustrating query performance improvements by clustering_
 
-### Setting up clustering
-Inline clustering can be setup easily using spark dataframe options. See sample below
+## Clustering Use Cases
 
-```scala
-import org.apache.hudi.QuickstartUtils._
-import scala.collection.JavaConversions._
-import org.apache.spark.sql.SaveMode._
-import org.apache.hudi.DataSourceReadOptions._
-import org.apache.hudi.DataSourceWriteOptions._
-import org.apache.hudi.config.HoodieWriteConfig._
+### Batching small files
 
+As mentioned in the intro, streaming ingestion generally results in smaller files in your data lake. But having a lot
+of such small files can hurt your query latency. From our experience supporting community users, quite a few adopt
+Hudi just for its small-file handling capabilities. So, you can employ clustering to batch many such small files into
+larger ones.
 
-val df =  //generate data frame
-df.write.format("org.apache.hudi").
-        options(getQuickstartWriteConfigs).
-        option(PRECOMBINE_FIELD_OPT_KEY, "ts").
-        option(RECORDKEY_FIELD_OPT_KEY, "uuid").
-        option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
-        option(TABLE_NAME, "tableName").
-        option("hoodie.parquet.small.file.limit", "0").
-        option("hoodie.clustering.inline", "true").
-        option("hoodie.clustering.inline.max.commits", "4").
-        option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824").
-        option("hoodie.clustering.plan.strategy.small.file.limit", "629145600").
-        option("hoodie.clustering.plan.strategy.sort.columns", "column1,column2"). //optional, if sorting is needed as part of rewriting data
-        mode(Append).
-        save("dfs://location");
-```
+![Batching small files](/assets/images/clustering_small_files.gif)
 
-## Async Clustering - Strategies
-For more advanced usecases, async clustering pipeline can also be setup. See an example [here](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-SetupforAsyncclusteringJob).
+### Cluster by sort key
 
-On a high level, clustering creates a plan based on a configurable strategy, groups eligible files based on specific
-criteria and then executes the plan. Hudi supports [multi-writers](https://hudi.apache.org/docs/concurrency_control#enabling-multi-writing) which provides
-snapshot isolation between multiple table services, thus allowing writers to continue with ingestion while clustering
-runs in the background.
+Another classic problem in data lakes is arrival time vs. event time. Generally, you write data in arrival-time order,
+while query predicates filter on event time and so do not line up with that layout. With clustering, you can rewrite
+your data sorted by the query predicate columns, so data skipping becomes very efficient and queries can avoid
+scanning a lot of unnecessary data.
 
-As mentioned before, clustering plan as well as execution depends on configurable strategy. These strategies can be
-broadly classified into three types: clustering plan strategy, execution strategy and update strategy.
+![Clustering by sort key](/assets/images/clustering_sort.gif)
+
+## Clustering Strategies
+
+On a high level, clustering creates a plan based on a configurable strategy, groups eligible files based on specific
+criteria and then executes the plan. As mentioned before, both the clustering plan and its execution depend on the
+configured strategy. These strategies can be broadly classified into three types: clustering plan strategy, execution strategy and

Review Comment:
   Great point! I totally missed it. I have added now.
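
   For readers following along: the inline clustering setup this diff removes from the page can still be sketched as below. It reuses the exact option keys and values from the deleted example; `tableName` and `dfs://location` are placeholders, and Spark plus the Hudi bundle are assumed to be on the classpath.

   ```scala
   import org.apache.hudi.QuickstartUtils._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._

   val df = // generate data frame
   df.write.format("org.apache.hudi").
           options(getQuickstartWriteConfigs).
           option(PRECOMBINE_FIELD_OPT_KEY, "ts").
           option(RECORDKEY_FIELD_OPT_KEY, "uuid").
           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
           option(TABLE_NAME, "tableName").
           option("hoodie.parquet.small.file.limit", "0").              // disable small-file handling on write so clustering has files to batch
           option("hoodie.clustering.inline", "true").                  // run clustering inline with the write
           option("hoodie.clustering.inline.max.commits", "4").         // schedule clustering every 4 commits
           option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824"). // ~1 GB target file size
           option("hoodie.clustering.plan.strategy.small.file.limit", "629145600").       // files under ~600 MB are clustering candidates
           option("hoodie.clustering.plan.strategy.sort.columns", "column1,column2").     // optional, if sorting is needed as part of rewriting data
           mode(Append).
           save("dfs://location")
   ```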


