rubenssoto opened a new issue #2484:
URL: https://github.com/apache/hudi/issues/2484
Hello,
I want to start using Hudi on my data lake, so I'm running some performance tests comparing current processing time with and without Hudi. We have a lot of tables in our data lake, so we process these tables in groups within the same Spark context using different threads.
I ran a test that reprocesses all table sources: with regular parquet it took 15 minutes, with Hudi bulk insert 29 minutes. Hudi performs some operations that regular parquet doesn't, for example sorting, but the big performance difference was in writing the parquet files. Is there any difference between writing parquet through Hudi and writing regular parquet? I used the gzip codec in both cases.
In Hudi I configured bulk insert parallelism to 20, and for regular parquet I did a coalesce(20).
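For context, this is roughly what the non-Hudi baseline looks like (a minimal sketch; `sourceDf` and `parquetOutputPath` are placeholder names, not the actual job code):
```scala
import org.apache.spark.sql.SaveMode

// Baseline: plain parquet with gzip, forced to 20 output files via coalesce
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
sourceDf
  .coalesce(20)
  .write
  .mode(SaveMode.Overwrite)
  .parquet(parquetOutputPath)
```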
Hudi Version: 0.8.0-SNAPSHOT
Spark Version: 3.0.1
11 Executors with 5 cores each and 35g of memory
spark submit:
```
spark-submit --deploy-mode cluster \
  --conf spark.executor.cores=5 \
  --conf spark.executor.memoryOverhead=3000 \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.executor.memory=35g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --packages org.apache.spark:spark-avro_2.12:2.4.4 \
  --jars s3://dl/lib/spark-daria_2.12-0.38.2.jar,s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar \
  --class TableProcessorWrapper \
  s3://dl/code/projects/data_projects/batch_processor_engine/batch-processor-engine_2.12-3.0.1_0.5.jar \
  courier_api_group01
```
```
val hudiOptions = Map[String, String](
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.operation" -> "bulk_insert",
  "hoodie.bulkinsert.shuffle.parallelism" -> "20",
  "hoodie.parquet.small.file.limit" -> "536870912",
  "hoodie.parquet.max.file.size" -> "1073741824",
  "hoodie.parquet.block.size" -> "536870912",
  "hoodie.copyonwrite.record.size.estimate" -> "1024",
  "hoodie.datasource.write.precombine.field" -> deduplicationColumn,
  "hoodie.datasource.write.recordkey.field" -> primaryKey.mkString(","),
  "hoodie.datasource.write.keygenerator.class" -> (if (primaryKey.size == 1) {
    "org.apache.hudi.keygen.SimpleKeyGenerator"
  } else {
    "org.apache.hudi.keygen.ComplexKeyGenerator"
  }),
  "hoodie.datasource.write.partitionpath.field" -> partitionColumn,
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.datasource.write.table.name" -> tableName,
  "hoodie.datasource.hive_sync.table" -> tableName,
  "hoodie.datasource.hive_sync.database" -> databaseName,
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.hive_sync.partition_fields" -> partitionColumn,
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.jdbcurl" -> "jdbc:hive2://ip-10-0-19-157.us-west-2.compute.internal:10000"
)
```
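These options are passed to a standard datasource write, roughly like this (a sketch; `sourceDf` and `targetPath` are placeholder names, not the actual job code):
```scala
import org.apache.spark.sql.SaveMode

// Hudi path: bulk_insert using the hudiOptions map above
sourceDf
  .write
  .format("hudi")
  .options(hudiOptions)
  .mode(SaveMode.Overwrite)
  .save(targetPath)
```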
Regular Parquet
<img width="1680" alt="Captura de Tela 2021-01-24 às 12 42 13"
src="https://user-images.githubusercontent.com/36298331/105635487-c40cf200-5e41-11eb-99f8-7dd069b26b4e.png">
Hudi has an RDD conversion part
<img width="1680" alt="Captura de Tela 2021-01-24 às 12 45 14"
src="https://user-images.githubusercontent.com/36298331/105635542-1cdc8a80-5e42-11eb-8c2c-e0d394a4f8c5.png">
The Hudi write took double the time
<img width="1680" alt="Captura de Tela 2021-01-24 às 12 46 37"
src="https://user-images.githubusercontent.com/36298331/105635569-4e555600-5e42-11eb-85fe-4b924f61b024.png">
<img width="1680" alt="Captura de Tela 2021-01-24 às 12 47 48"
src="https://user-images.githubusercontent.com/36298331/105635590-67f69d80-5e42-11eb-9a6f-7be470417ab8.png">
This was one real-world job that I tried, but I notice this slow writing in every job where I use Hudi.
Is it normal? Is there any way to tune it? Am I doing something wrong?
Thank you so much!!!!!