boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1214672107
The CI failure does not seem to be related to this PR.
Thanks to @voonhous, who tested two cases: clustering individual Parquet files
of ~500MB into groups of up to 10GB.
After enabling `hoodie.clustering.as.row`, runtime dropped by roughly 30%
(about 38% in Test 1 and 28% in Test 2).
### Test 1
| clustering as row enabled |Partition hour| total size | runtime(min) |
| :----:| :----:|:----:|:----:|
|true|dt=2022-07-28/hh=23|2.0T|76|
|false|dt=2022-07-28/hh=00|2.0T|123|
### Test 2
| clustering as row enabled |Partition hour| total size | File Count | runtime(min) |
| :----:| :----:|:----:|:----:|:----:|
|true|dt=2022-07-28/hh=14|2.5T|7792|92|
|false|dt=2022-07-28/hh=15|2.5T|7771|128|
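
As a quick check on the "nearly 30%" figure, the sketch below recomputes the runtime reduction and the equivalent speedup from the runtimes in the two tables. The helper names (`runtime_reduction`, `speedup`) are only illustrative and are not part of Hudi. Note that each pair of runs covers different partitions of the same total size, so this is an approximate comparison rather than a controlled benchmark.

```python
# Runtimes in minutes, copied from the Test 1 and Test 2 tables.
runs = {
    "Test 1": {"row_enabled": 76, "row_disabled": 123},
    "Test 2": {"row_enabled": 92, "row_disabled": 128},
}


def runtime_reduction(enabled_min, disabled_min):
    """Fraction of runtime saved relative to the run with the flag disabled."""
    return (disabled_min - enabled_min) / disabled_min


def speedup(enabled_min, disabled_min):
    """How many times faster the enabled run is than the disabled run."""
    return disabled_min / enabled_min


for name, r in runs.items():
    reduction = runtime_reduction(r["row_enabled"], r["row_disabled"])
    factor = speedup(r["row_enabled"], r["row_disabled"])
    print(f"{name}: {reduction:.1%} less runtime, {factor:.2f}x speedup")
```

Both tests land in the same range, with Test 1 showing the larger gain.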
The Spark configuration used:
```bash
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.rpc.askTimeout=600s' \
--conf 'spark.driver.extraJavaOptions=-Djava.util.concurrent.ForkJoinPool.common.parallelism=250' \
--conf 'spark.sql.parquet.columnarReaderBatchSize=1024' \
--conf 'spark.yarn.maxAppAttempts=1' \
--num-executors 64 \
--driver-memory 20g \
--driver-cores 1 \
--executor-memory 15g \
--executor-cores 2 \
--class org.apache.hudi.utilities.HoodieClusteringJob \
hudi-utilities-bundle_2.12-0.12.0-SNAPSHOT.jar \
--props hdfs://test/2022-07-24_clustering/clusteringjob_optimized.properties \
--mode scheduleAndExecute \
--base-path hdfs://test/test/hudi/voon_kafka_test__test_hudi_011_04/ \
--table-name rank_server_log_hudi_test_1h \
--spark-memory 15g \
--parallelism 32
```
clusteringjob.properties
```properties
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=2
hoodie.clustering.plan.strategy.max.bytes.per.group=10737418240
hoodie.clustering.plan.strategy.target.file.max.bytes=11811160064
hoodie.clustering.plan.strategy.small.file.limit=6442450944
hoodie.clustering.plan.strategy.max.num.groups=10000
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
hoodie.clustering.plan.partition.filter.mode=SELECTED_PARTITIONS
hoodie.clustering.plan.strategy.cluster.begin.partition=dt=2022-07-28/hh=15
hoodie.clustering.plan.strategy.cluster.end.partition=dt=2022-07-28/hh=15
hoodie.clustering.plan.strategy.sort.columns=partition,offset
```
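
The byte values in the properties above are easier to read in GiB. The following sketch converts the three size-related settings, with the values copied verbatim from the properties file. The `to_gib` helper is only illustrative and is not part of Hudi.

```python
GIB = 1024 ** 3  # bytes per GiB

# Size-related settings copied from the properties file above.
sizes = {
    "hoodie.clustering.plan.strategy.max.bytes.per.group": 10737418240,
    "hoodie.clustering.plan.strategy.target.file.max.bytes": 11811160064,
    "hoodie.clustering.plan.strategy.small.file.limit": 6442450944,
}


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


for key, value in sizes.items():
    print(f"{key} = {to_gib(value):g} GiB")
```

This gives 10 GiB per clustering group, an 11 GiB target file cap, and a 6 GiB small-file limit, which matches the "up to 10GB groups" description of the test.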
Gentle ping @xiarixiaoyao @XuQianJin-Stars @codope, could you help review
this when you have time?