[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

GitBox Sun, 14 Aug 2022 23:41:00 -0700


boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1214672107


   The CI failure does not seem to be related to this PR.
   
   Thanks to @voonhous, who tested two cases, clustering individual parquet files of ~500MB into groups of up to 10GB.
   
   With `hoodie.clustering.as.row` enabled, clustering runtime dropped by roughly 30% (about 38% in Test 1 and 28% in Test 2).
   
   ### Test 1
   | clustering as row enabled |Partition hour| total size | runtime(min) |
   | :----:| :----:|:----:|:----:|
   |true|dt=2022-07-28/hh=23|2.0T|76|
   |false|dt=2022-07-28/hh=00|2.0T|123|
   
   ### Test 2
   | clustering as row enabled |Partition hour| total size | File Count | runtime(min) |
   | :----:| :----:|:----:|:----:|:----:|
   |true|dt=2022-07-28/hh=14|2.5T|7792|92|
   |false|dt=2022-07-28/hh=15|2.5T|7771|128|
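
As a quick check of the figures above, here is a minimal sketch that derives the runtime reduction and speedup from the two tables. The helper names are illustrative only and are not part of Hudi. Note that the enabled and disabled runs clustered different hourly partitions of the same total size, so the comparison is approximate.

```python
# Runtimes in minutes, copied from the Test 1 and Test 2 tables.
runs = {
    "Test 1": {"row_writer": 76, "baseline": 123},
    "Test 2": {"row_writer": 92, "baseline": 128},
}


def runtime_reduction(baseline, improved):
    """Fraction of wall-clock time saved relative to the baseline run."""
    return (baseline - improved) / baseline


def speedup(baseline, improved):
    """How many times faster the improved run is than the baseline."""
    return baseline / improved


for name, r in runs.items():
    reduction = runtime_reduction(r["baseline"], r["row_writer"])
    factor = speedup(r["baseline"], r["row_writer"])
    print(f"{name}: {reduction:.0%} less runtime, {factor:.2f}x speedup")
```
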
   
   The Spark configuration used:
   
   ```bash
       --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
       --conf 'spark.rpc.askTimeout=600s' \
       --conf 'spark.driver.extraJavaOptions=-Djava.util.concurrent.ForkJoinPool.common.parallelism=250' \
       --conf 'spark.sql.parquet.columnarReaderBatchSize=1024' \
       --conf 'spark.yarn.maxAppAttempts=1' \
       --num-executors 64 \
       --driver-memory 20g \
       --driver-cores 1 \
       --executor-memory 15g \
       --executor-cores 2 \
       --class org.apache.hudi.utilities.HoodieClusteringJob \
       hudi-utilities-bundle_2.12-0.12.0-SNAPSHOT.jar \
       --props hdfs://test/2022-07-24_clustering/clusteringjob_optimized.properties \
       --mode scheduleAndExecute \
       --base-path hdfs://test/test/hudi/voon_kafka_test__test_hudi_011_04/ \
       --table-name rank_server_log_hudi_test_1h \
       --spark-memory 15g \
       --parallelism 32
   ```
   
   clusteringjob.properties
   
   ```bash
   hoodie.clustering.async.enabled=true
   hoodie.clustering.async.max.commits=2
   hoodie.clustering.plan.strategy.max.bytes.per.group=10737418240
   hoodie.clustering.plan.strategy.target.file.max.bytes=11811160064
   hoodie.clustering.plan.strategy.small.file.limit=6442450944
   hoodie.clustering.plan.strategy.max.num.groups=10000
   hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
   hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
   hoodie.clustering.plan.partition.filter.mode=SELECTED_PARTITIONS
   hoodie.clustering.plan.strategy.cluster.begin.partition=dt=2022-07-28/hh=15
   hoodie.clustering.plan.strategy.cluster.end.partition=dt=2022-07-28/hh=15
   hoodie.clustering.plan.strategy.sort.columns=partition,offset
   ```
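
The size thresholds in the properties above are given in raw bytes, which are hard to read at a glance. This small sketch converts them to GiB; the values are copied verbatim from the properties, while the `to_gib` helper is only an illustration.

```python
GIB = 1024 ** 3  # bytes per gibibyte

# Byte-valued settings copied from the clustering properties.
size_settings = {
    "hoodie.clustering.plan.strategy.max.bytes.per.group": 10737418240,
    "hoodie.clustering.plan.strategy.target.file.max.bytes": 11811160064,
    "hoodie.clustering.plan.strategy.small.file.limit": 6442450944,
}


def to_gib(num_bytes):
    """Convert a byte count to gibibytes."""
    return num_bytes / GIB


for key, value in size_settings.items():
    print(f"{key} = {to_gib(value):g} GiB")
```

In other words, files under 6 GiB are treated as clustering candidates, each group is capped at 10 GiB, and output files may grow to 11 GiB, which matches the "up to 10GB groups" description of the test.
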
   
   Gentle ping @xiarixiaoyao @XuQianJin-Stars @codope, could you help review this when you have time?


