rancylive opened a new issue #2887:
URL: https://github.com/apache/hudi/issues/2887


   **Describe the problem you faced**
   
   We are writing Hudi files with Hive sync enabled, using the Spark Hudi library on a Dataproc cluster. The write job is extremely slow: for a dataset of 560 MB, it takes more than an hour to complete. We also observed a very high shuffle volume, more than 42 GB for only 560 MB of input data. Please consider the details below and help us with proper guidance on this.
   Cluster Size:
   1 Master Node
   5 Worker Node
   8 Cores
   32 GB Memory
   50 GB Disk Space
   
   Spark Submit Options:
   `--driver-memory 2G --executor-memory 2G --executor-cores 2 --num-executors 2 --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"`
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   1. Create a Spark job writing Hudi files with the options below:
   ```java
   private static void writeHudiODS(String source, String destination, String db,
                                    String tableName, SparkSession spark) {
       Dataset<Row> inputDS = spark.read().format("parquet").load(source);
       inputDS.write()
               .format("org.apache.hudi")
               .option(HoodieWriteConfig.TABLE_NAME, tableName)
               .option(DataSourceWriteOptions.OPERATION_OPT_KEY(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL())
               .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), "COPY_ON_WRITE")
               // composite record key, resolved by the ComplexKeyGenerator configured below
               .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "Tracking_Nbr,Invoice_Id")
               .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "Last_Modified_Date")
               // non-partitioned table: empty partition path + NonPartitionedExtractor for Hive sync
               .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), StringUtils.EMPTY)
               .option(HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP, 3)
               .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY(), true)
               .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY(), NonPartitionedExtractor.class.getName())
               .option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY(), "false")
               .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY(), "true")
               .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY(), tableName)
               .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY(), db)
               .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY(), ComplexKeyGenerator.class.getName())
               .mode(SaveMode.Overwrite)
               .save(destination);
   }
   ```
   2. Submit it in a cluster of above size
   3. Keep the spark-submit options as `--driver-memory 2G --executor-memory 2G --executor-cores 2 --num-executors 2 --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"`
   4. Observe the time taken by hudi writer
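
   As context for the record-key options above: with `RECORDKEY_FIELD_OPT_KEY` set to `"Tracking_Nbr,Invoice_Id"` and `ComplexKeyGenerator` as the key generator, Hudi composes each record key from the two fields as `field:value` pairs. A minimal sketch of that key format follows; the class and method names are illustrative, not the actual Hudi `ComplexKeyGenerator` implementation:

   ```java
   public class ComplexKeySketch {
       // Sketch of the composite record key layout produced by Hudi's
       // ComplexKeyGenerator for "Tracking_Nbr,Invoice_Id" (illustration only):
       // each configured field contributes a "field:value" pair, comma-joined.
       static String recordKey(String trackingNbr, String invoiceId) {
           return "Tracking_Nbr:" + trackingNbr + ",Invoice_Id:" + invoiceId;
       }

       public static void main(String[] args) {
           System.out.println(recordKey("1Z999AA1", "INV-42"));
       }
   }
   ```

   Every upsert has to look up these keys across existing files (the index lookup stage), which is one place where shuffle volume can far exceed the input size.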
   
   **Expected behavior**
   
   The Hudi writer should not take this long for a 560 MB input.
   
   **Environment Description**
   
   * Hudi version : org.apache.hudi:hudi-spark-bundle_2.11:0.7.0
   
   * Spark version : 2.4.5
   
   * Hive version :  NA
   
   * Hadoop version : Dataproc cluster
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : Dataproc cluster
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   
```[hudi_raw_to_catg_log.txt](https://github.com/apache/hudi/files/6385008/hudi_raw_to_catg_log.txt)```
   
   

