codejoyan opened a new issue #2620:
URL: https://github.com/apache/hudi/issues/2620
Hi,
I am seeing performance issues while upserting data, especially in the following two jobs:
* Job 15 (SparkUpsertCommitActionExecutor)
* Job 17 (UpsertPartitioner)

Attached are some stats on the slow jobs/stages.
**Configurations used:**
--driver-memory 5G --executor-memory 10G --executor-cores 5 --num-executors 10
Upsert config parameters:
option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator").
option("hoodie.upsert.shuffle.parallelism", "2").
option("hoodie.insert.shuffle.parallelism", "2").
option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 128 * 1024 * 1024).
option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 128 * 1024 * 1024).
option("hoodie.copyonwrite.record.size.estimate", "40")
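For context, these options are applied in a chained DataFrame write roughly as sketched below. This is a minimal reconstruction, not the exact job: `df`, `basePath`, and the `mode`/`save` calls are hypothetical placeholders I have added, while the option keys and values are the ones listed above. Note that `hoodie.upsert.shuffle.parallelism` and `hoodie.insert.shuffle.parallelism` control the number of shuffle tasks Hudi uses for these stages, so a value of 2 caps them at 2 tasks regardless of the 50 executor cores requested.

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.{HoodieCompactionConfig, HoodieStorageConfig}
import org.apache.spark.sql.SaveMode

// Hypothetical sketch: `df` and `basePath` are placeholders,
// not taken from the original report.
df.write.format("hudi").
  option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY,
    "org.apache.hudi.keygen.ComplexKeyGenerator").
  // Only 2 shuffle tasks for the upsert/insert stages, regardless of
  // how many executor cores the job was launched with.
  option("hoodie.upsert.shuffle.parallelism", "2").
  option("hoodie.insert.shuffle.parallelism", "2").
  option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 128 * 1024 * 1024).
  option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 128 * 1024 * 1024).
  option("hoodie.copyonwrite.record.size.estimate", "40").
  mode(SaveMode.Append).
  save(basePath)
```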
Could you please guide me on how to approach tuning this performance problem? Let me know if you need any further details.
Below are some of the stats:
<img width="724" alt="Screenshot 2021-03-03 at 1 52 04 AM"
src="https://user-images.githubusercontent.com/48707638/109709933-1240ae00-7bc3-11eb-98ff-2fbdc3c4dc67.png">
**Environment Description**
* Hudi version : 0.7.0
* Storage : GCS
* Running on Docker? (yes/no) : No