wkhappy1 opened a new issue, #10979:
URL: https://github.com/apache/hudi/issues/10979
We have a table with 149,865,845 rows and 1,650 columns; its size in HDFS is 27.1 GB.
The table type is COW (copy-on-write).
Then we use code like the following to overwrite the table, where `hudiOptions` holds the write options listed below and `basePath` is the table path:

```scala
import org.apache.spark.sql.SaveMode

val dataFrame = otherTable
dataFrame.write.format("hudi").options(hudiOptions).mode(SaveMode.Append).save(basePath)
```
We give each Spark executor 60 GB of memory; there are 6 executors, each with 6 cores.
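As a minimal sketch (the application name is a placeholder, and the same settings can equally be passed via `spark-submit`), these resources correspond to:

```scala
import org.apache.spark.sql.SparkSession

// Resource settings described above; not the reporter's actual launch code.
val spark = SparkSession.builder()
  .appName("hudi-overwrite-job")
  .config("spark.executor.instances", "6")
  .config("spark.executor.memory", "60g")
  .config("spark.executor.cores", "6")
  .getOrCreate()
```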
We overwrite the table with the following Spark and Hudi configuration:

```scala
// Spark conf
spark.memory.storageFraction=0.6

// Hudi write options
"hoodie.datasource.write.operation" -> "insert_overwrite_table",
"hoodie.insert.shuffle.parallelism" -> "50",
"hoodie.upsert.shuffle.parallelism" -> "50",
RECORDKEY_FIELD_OPT_KEY -> "id",
PRECOMBINE_FIELD_OPT_KEY -> "ts",
PARTITIONPATH_FIELD_OPT_KEY -> "tenant_id",
PAYLOAD_CLASS_OPT_KEY -> classOf[OverwriteWithLatestAvroPayload].getName,
"hoodie.parquet.compression.ratio" -> "2.0",
"hoodie.parquet.max.file.size" -> "41943040"
```
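As a sketch, these options can be assembled into the `hudiOptions` map used by the write call above (assuming the deprecated option-key constants that `org.apache.hudi.DataSourceWriteOptions` still exposes in 0.11.x):

```scala
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload

// Mirrors the configuration listed above, collected into one map.
val hudiOptions: Map[String, String] = Map(
  "hoodie.datasource.write.operation" -> "insert_overwrite_table",
  "hoodie.insert.shuffle.parallelism" -> "50",
  "hoodie.upsert.shuffle.parallelism" -> "50",
  RECORDKEY_FIELD_OPT_KEY -> "id",
  PRECOMBINE_FIELD_OPT_KEY -> "ts",
  PARTITIONPATH_FIELD_OPT_KEY -> "tenant_id",
  PAYLOAD_CLASS_OPT_KEY -> classOf[OverwriteWithLatestAvroPayload].getName,
  "hoodie.parquet.compression.ratio" -> "2.0",
  "hoodie.parquet.max.file.size" -> "41943040"
)
```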
We execute this job every hour, but each run takes about 2 hours to finish.
Cache memory usage is shown below:
<img width="746" alt="1"
src="https://github.com/apache/hudi/assets/54095696/60ed0497-9345-4467-8bc5-d88ee4d2a424">
Also, the step "Getting ExistingFileIds of all partitions"
(`count at HoodieSparkSqlWriter.scala:645`) is slow.
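For context (our own rough estimate, not a figure from the Hudi docs): with `hoodie.parquet.max.file.size` set to 40 MB, a 27.1 GB table needs at least 27.1 × 1024 / 40 ≈ 694 data files, and partitioning by `tenant_id` likely pushes the real count much higher, so the file-ID listing has many file groups to enumerate. A sketch for counting the data files (assuming `basePath` and an active `spark` session; this is a plain HDFS listing, not a Hudi API):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Count parquet data files under the table path to gauge how many
// file groups the "Getting ExistingFileIds" step has to scan.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val it = fs.listFiles(new Path(basePath), true) // recursive
var parquetFiles = 0L
while (it.hasNext) {
  if (it.next().getPath.getName.endsWith(".parquet")) parquetFiles += 1
}
println(s"parquet data files: $parquetFiles")
```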
**Environment Description**

* Hudi version: 0.11.1
* Spark version: 3.2.2
* Hive version: 3.1.3
* Hadoop version: 3.3.2