rubenssoto opened a new issue #2484:
URL: https://github.com/apache/hudi/issues/2484
Hello,
I want to start using Hudi on my data lake, so I'm running some performance tests comparing current processing time with and without Hudi. We have a lot of tables in our data lake, so we process these tables in groups within the same Spark context using different threads.
I ran a test that reprocesses all table sources: with regular parquet it took 15 minutes, with Hudi bulk insert 29 minutes. Hudi performs some operations that regular parquet doesn't, for example sorting, but the big performance difference was in writing the parquet files. Is there any difference between writing parquet through Hudi and writing regular parquet? I used the gzip codec in both cases.
In Hudi I configured bulk insert parallelism to 20, and for regular parquet I did a coalesce(20).
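For context, this is roughly what the non-Hudi baseline looks like (a minimal sketch; `sourceDf` and `parquetOutputPath` are placeholder names, not the actual job code):
```scala
import org.apache.spark.sql.SaveMode

// Baseline: plain parquet with gzip, forced to 20 output files via coalesce
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
sourceDf
  .coalesce(20)
  .write
  .mode(SaveMode.Overwrite)
  .parquet(parquetOutputPath)
```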
Hudi Version: 0.8.0-SNAPSHOT
Spark Version: 3.0.1
11 Executors with 5 cores each and 35g of memory
spark submit:
```
spark-submit --deploy-mode cluster \
  --conf spark.executor.cores=5 \
  --conf spark.executor.memoryOverhead=3000 \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.executor.memory=35g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --packages org.apache.spark:spark-avro_2.12:2.4.4 \
  --jars s3://dl/lib/spark-daria_2.12-0.38.2.jar,s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar \
  --class TableProcessorWrapper \
  s3://dl/code/projects/data_projects/batch_processor_engine/batch-processor-engine_2.12-3.0.1_0.5.jar \
  courier_api_group01
```
```
val hudiOptions = Map[String, String](
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.operation" -> "bulk_insert",
  "hoodie.bulkinsert.shuffle.parallelism" -> "20",
  "hoodie.parquet.small.file.limit" -> "536870912",
  "hoodie.parquet.max.file.size" -> "1073741824",
  "hoodie.parquet.block.size" -> "536870912",
  "hoodie.copyonwrite.record.size.estimate" -> "1024",
  "hoodie.datasource.write.precombine.field" -> deduplicationColumn,
  "hoodie.datasource.write.recordkey.field" -> primaryKey.mkString(","),
  "hoodie.datasource.write.keygenerator.class" -> (if (primaryKey.size == 1) {
    "org.apache.hudi.keygen.SimpleKeyGenerator"
  } else {
    "org.apache.hudi.keygen.ComplexKeyGenerator"
  }),
  "hoodie.datasource.write.partitionpath.field" -> partitionColumn,
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.datasource.write.table.name" -> tableName,
  "hoodie.datasource.hive_sync.table" -> tableName,
  "hoodie.datasource.hive_sync.database" -> databaseName,
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.hive_sync.partition_fields" -> partitionColumn,
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.jdbcurl" -> "jdbc:hive2://ip-10-0-19-157.us-west-2.compute.internal:10000"
)
```
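These options are passed to a standard datasource write, roughly like this (a sketch; `sourceDf` and `targetPath` are placeholder names, not the actual job code):
```scala
import org.apache.spark.sql.SaveMode

// Hudi path: bulk_insert using the hudiOptions map above
sourceDf
  .write
  .format("hudi")
  .options(hudiOptions)
  .mode(SaveMode.Overwrite)
  .save(targetPath)
```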
Regular Parquet
<img width="1680" alt="Captura de Tela 2021-01-24 às 12 42 13"
src="https://user-images.githubusercontent.com/36298331/105635487-c40cf200-5e41-11eb-99f8-7dd069b26b4e.png">
Hudi has an RDD conversion part
<img width="1680" alt="Captura de Tela 2021-01-24 às 12 45 14"
src="https://user-images.githubusercontent.com/36298331/105635542-1cdc8a80-5e42-11eb-8c2c-e0d394a4f8c5.png">
The Hudi write took double the time
<img width="1680" alt="Captura de Tela 2021-01-24 às 12 46 37"
src="https://user-images.githubusercontent.com/36298331/105635569-4e555600-5e42-11eb-85fe-4b924f61b024.png">
<img width="1680" alt="Captura de Tela 2021-01-24 às 12 47 48"
src="https://user-images.githubusercontent.com/36298331/105635590-67f69d80-5e42-11eb-9a6f-7be470417ab8.png">
This was one real-world job that I tried, but I notice this slow writing in every job where I use Hudi.
Is it normal? Is there any way to tune it? Am I doing something wrong?
Thank you so much!!!!!