wkhappy1 opened a new issue, #10979:
URL: https://github.com/apache/hudi/issues/10979
We have a table with 149,865,845 rows and 1,650 columns; its size in HDFS is 27.1 GB.
The table type is COW (copy-on-write).
Then we use code like the following to overwrite the table, where `hudiOptions` holds the write options listed below and `basePath` is the table path:

```scala
import org.apache.spark.sql.SaveMode

val dataFrame = otherTable
dataFrame.write.format("hudi").options(hudiOptions).mode(SaveMode.Append).save(basePath)
```
We give each Spark executor 60 GB of memory; there are 6 executors, each with 6 cores.
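As a minimal sketch (the application name is a placeholder, and the same settings can equally be passed via `spark-submit`), these resources correspond to:

```scala
import org.apache.spark.sql.SparkSession

// Resource settings described above; not the reporter's actual launch code.
val spark = SparkSession.builder()
  .appName("hudi-overwrite-job")
  .config("spark.executor.instances", "6")
  .config("spark.executor.memory", "60g")
  .config("spark.executor.cores", "6")
  .getOrCreate()
```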
We overwrite the table with the following Spark and Hudi configuration:

```scala
// Spark conf
spark.memory.storageFraction=0.6

// Hudi write options
"hoodie.datasource.write.operation" -> "insert_overwrite_table",
"hoodie.insert.shuffle.parallelism" -> "50",
"hoodie.upsert.shuffle.parallelism" -> "50",
RECORDKEY_FIELD_OPT_KEY -> "id",
PRECOMBINE_FIELD_OPT_KEY -> "ts",
PARTITIONPATH_FIELD_OPT_KEY -> "tenant_id",
PAYLOAD_CLASS_OPT_KEY -> classOf[OverwriteWithLatestAvroPayload].getName,
"hoodie.parquet.compression.ratio" -> "2.0",
"hoodie.parquet.max.file.size" -> "41943040"
```
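As a sketch, these options can be assembled into the `hudiOptions` map used by the write call above (assuming the deprecated option-key constants that `org.apache.hudi.DataSourceWriteOptions` still exposes in 0.11.x):

```scala
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload

// Mirrors the configuration listed above, collected into one map.
val hudiOptions: Map[String, String] = Map(
  "hoodie.datasource.write.operation" -> "insert_overwrite_table",
  "hoodie.insert.shuffle.parallelism" -> "50",
  "hoodie.upsert.shuffle.parallelism" -> "50",
  RECORDKEY_FIELD_OPT_KEY -> "id",
  PRECOMBINE_FIELD_OPT_KEY -> "ts",
  PARTITIONPATH_FIELD_OPT_KEY -> "tenant_id",
  PAYLOAD_CLASS_OPT_KEY -> classOf[OverwriteWithLatestAvroPayload].getName,
  "hoodie.parquet.compression.ratio" -> "2.0",
  "hoodie.parquet.max.file.size" -> "41943040"
)
```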
We execute this job every hour, but each run takes about 2 hours to finish.
Cache memory usage is shown below:
<img width="746" alt="1"
src="https://github.com/apache/hudi/assets/54095696/60ed0497-9345-4467-8bc5-d88ee4d2a424">
Also, the step "Getting ExistingFileIds of all partitions"
(`count at HoodieSparkSqlWriter.scala:645`) is slow.
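For context (our own rough estimate, not a figure from the Hudi docs): with `hoodie.parquet.max.file.size` set to 40 MB, a 27.1 GB table needs at least 27.1 × 1024 / 40 ≈ 694 data files, and partitioning by `tenant_id` likely pushes the real count much higher, so the file-ID listing has many file groups to enumerate. A sketch for counting the data files (assuming `basePath` and an active `spark` session; this is a plain HDFS listing, not a Hudi API):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Count parquet data files under the table path to gauge how many
// file groups the "Getting ExistingFileIds" step has to scan.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val it = fs.listFiles(new Path(basePath), true) // recursive
var parquetFiles = 0L
while (it.hasNext) {
  if (it.next().getPath.getName.endsWith(".parquet")) parquetFiles += 1
}
println(s"parquet data files: $parquetFiles")
```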
**Environment Description**

* Hudi version: 0.11.1
* Spark version: 3.2.2
* Hive version: 3.1.3
* Hadoop version: 3.3.2