[GitHub] [hudi] somebol opened a new issue #1757: Slow Bulk Insert Performance [SUPPORT]

GitBox Mon, 22 Jun 2020 20:09:03 -0700


somebol opened a new issue #1757:
URL: https://github.com/apache/hudi/issues/1757



   Hi Team,
   
   We are trying to load a very large dataset into Hudi. The bulk insert job 
took ~16.5 hours to complete. The job was run with default settings, without 
any tuning.
   How can we tune the job to make it run faster?
   
   **Dataset**
   
   Data stored in HDFS / parquet
   size: 5.5 TB
   number of files: 27000
   number of records: ~300 billion
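
A rough sketch of what these figures imply per file and per record. It assumes "TB" means decimal terabytes (10^12 bytes), and the class and method names are illustrative only:

```java
// Back-of-the-envelope statistics for the input dataset described above.
// Assumption: "TB" is a decimal terabyte (10^12 bytes), not a tebibyte.
final class DatasetStats {
    static final double BYTES_PER_TB = 1e12;

    // Average on-disk size of one input parquet file, in megabytes (10^6 bytes).
    static double avgFileSizeMb(double totalTb, long fileCount) {
        return totalTb * BYTES_PER_TB / fileCount / 1e6;
    }

    // Average on-disk (compressed) footprint of one record, in bytes.
    static double avgBytesPerRecord(double totalTb, long recordCount) {
        return totalTb * BYTES_PER_TB / recordCount;
    }

    public static void main(String[] args) {
        System.out.printf("avg file size: %.1f MB%n", avgFileSizeMb(5.5, 27_000L));
        System.out.printf("avg bytes per record: %.1f%n",
                avgBytesPerRecord(5.5, 300_000_000_000L));
    }
}
```

With the numbers above this works out to roughly 200 MB per file and under 20 bytes per record on disk, so the records are small and the record count, rather than the byte volume alone, is likely to dominate the cost.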
   
   **hudi options**
   .option(RECORDKEY_FIELD_OPT_KEY(), "id")
   .option(PARTITIONPATH_FIELD_OPT_KEY(), "<partition>")
   .option(PRECOMBINE_FIELD_OPT_KEY(), "<ts>")
   .option(HIVE_STYLE_PARTITIONING_OPT_KEY(), "true")
   .option(TABLE_TYPE_OPT_KEY(), MOR_TABLE_TYPE_OPT_VAL())
   .option(OPERATION_OPT_KEY(), BULK_INSERT_OPERATION_OPT_VAL())
   .option(TABLE_NAME, "<name>")
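
For reference, a minimal sketch of the same options as a plain key/value map, with one extra knob that is not set above. The key strings are what these constants are believed to resolve to in Hudi 0.5.x and should be checked against `DataSourceWriteOptions` and `HoodieWriteConfig` for the version in use. The `hoodie.bulkinsert.shuffle.parallelism` entry and its value are an assumption added for illustration, not part of the original job:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Builds the writer options as string keys instead of the Hudi constants.
// Key names are assumed from Hudi 0.5.x; verify them against your Hudi version.
final class HudiBulkInsertOptions {
    static Map<String, String> build(String tableName,
                                     String partitionField,
                                     String precombineField,
                                     int bulkInsertParallelism) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.datasource.write.recordkey.field", "id");
        opts.put("hoodie.datasource.write.partitionpath.field", partitionField);
        opts.put("hoodie.datasource.write.precombine.field", precombineField);
        opts.put("hoodie.datasource.write.hive_style_partitioning", "true");
        opts.put("hoodie.datasource.write.table.type", "MERGE_ON_READ");
        opts.put("hoodie.datasource.write.operation", "bulk_insert");
        opts.put("hoodie.table.name", tableName);
        // Not in the original job: number of shuffle partitions used by bulk insert.
        opts.put("hoodie.bulkinsert.shuffle.parallelism",
                Integer.toString(bulkInsertParallelism));
        return opts;
    }

    public static void main(String[] args) {
        // Placeholder arguments; the real names are elided in the issue.
        build("<name>", "<partition>", "<ts>", 3000)
                .forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

A map like this can be handed to Spark's `DataFrameWriter.options(java.util.Map)` in place of the chained `.option(...)` calls. The value 3000 is only a placeholder; a suitable parallelism depends on the data volume and the target file size.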
   
   **spark conf**
   conf.set("spark.debug.maxToStringFields", "100");
   conf.set("spark.sql.shuffle.partitions", "2001");
   conf.set("spark.sql.warehouse.dir", "/user/hive/warehouse");
   conf.set("spark.sql.autoBroadcastJoinThreshold", "31457280");
   conf.set("spark.sql.hive.filesourcePartitionFileCacheSize", "2000000000");
   conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic");
   conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true");
   conf.set("spark.storage.replication.proactive", "true");
   
   **spark submit**
   SPARK_CMD="spark2-submit  \
   --files log4j.properties \
   --conf "spark.driver.extraJavaOptions=${log4j_setting}" \
   --conf "spark.executor.extraJavaOptions=${log4j_setting}" \
   --conf spark.kryoserializer.buffer.max=2040M \
   --num-executors 25 \
   --executor-cores 5 \
   --driver-memory 8G \
   --executor-memory 21G \
   --master yarn \
   --deploy-mode client
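
The resources requested above can be put in perspective with a small calculation. This is a sketch only: it assumes one core per task (Spark's default `spark.task.cpus=1`), and the class and method names are hypothetical:

```java
// Rough capacity arithmetic for the spark-submit settings above.
// Assumption: each task occupies exactly one core.
final class ClusterSizing {
    // Total task slots available across all executors.
    static int totalCores(int executors, int coresPerExecutor) {
        return executors * coresPerExecutor;
    }

    // Total executor heap requested, in GB.
    static int totalExecutorMemoryGb(int executors, int memoryPerExecutorGb) {
        return executors * memoryPerExecutorGb;
    }

    // Minimum number of scheduling rounds needed to run `tasks` tasks
    // on `cores` slots (ceiling division).
    static long taskWaves(long tasks, int cores) {
        return (tasks + cores - 1) / cores;
    }

    public static void main(String[] args) {
        int cores = totalCores(25, 5);
        System.out.println("total cores: " + cores);
        System.out.println("total executor memory (GB): " + totalExecutorMemoryGb(25, 21));
        System.out.println("waves for 2001 shuffle partitions: " + taskWaves(2001, cores));
        System.out.println("waves for 27000 input files: " + taskWaves(27_000, cores));
    }
}
```

With 25 executors of 5 cores each there are 125 task slots, so a stage with one task per input file needs at least 216 rounds of tasks. That makes the executor count a natural first thing to look at for a dataset of this size.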
   
   **Environment Description**
   
   * Hudi version : 0.5.3
   
   * Spark version : 2.4.0
   
   * Cloudera version : 6.3.3
   
   * Hadoop version : 3.0.0
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : No
   
   **screenshot of spark stages**
   
![image](https://user-images.githubusercontent.com/29965228/85355750-aa749680-b550-11ea-995b-429bca2c283f.png)
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
