Leon Lin created HUDI-9313:
------------------------------
Summary: Job "Doing partition and writing data" performance
regression from Hudi 0.9.0 to 0.14.0
Key: HUDI-9313
URL: https://issues.apache.org/jira/browse/HUDI-9313
Project: Apache Hudi
Issue Type: Bug
Reporter: Leon Lin
The same Hudi upsert application, with identical Hudi configurations, was run
on Hudi versions 0.9.0 and 0.14.0 for performance benchmarking.
The results show roughly a 2x performance regression in the job
{code:java}
Doing partition and writing data{code}
which takes ~2.5 minutes on 0.9.0 and ~5 minutes on 0.14.0.
Is this a known performance regression, and if so, what is the cause?
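For anyone reproducing the comparison, below is a minimal sketch of how the two timings could be collected and compared. The helper names `time_call` and `slowdown` are hypothetical and not part of Hudi; the durations in the usage lines are the approximate figures quoted above, not new measurements.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def time_call(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def slowdown(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times slower the candidate run is than the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return candidate_seconds / baseline_seconds


# Approximate durations reported for the job on each version (in seconds).
ratio = slowdown(2.5 * 60, 5 * 60)
print(f"0.14.0 is about {ratio:.1f}x slower than 0.9.0")
```

In a real run, `time_call` would wrap the write call on each Hudi version, ideally repeated several times so that a single noisy run does not dominate the comparison.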
Hudi config
{code:python}
upsert_hudi_config = {
    "hoodie.table.name": "[table_name]",
    "hoodie.database.name": "[database_name]",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "[precombine_key]",
    "hoodie.datasource.write.recordkey.field": "[record_key]",
    "hoodie.datasource.write.table.name": "[table_name]",
    "hoodie.index.type": "BLOOM",
    "hoodie.metadata.enable": False,
    "hoodie.upsert.shuffle.parallelism": 3,
}{code}
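For context, a configuration like this is typically passed to the Spark DataFrame writer roughly as sketched below. The function name `upsert_to_hudi`, the `base_path` argument, and the `append` save mode are assumptions for illustration; the actual application code is not included in this report.

```python
from typing import Any, Dict


def upsert_to_hudi(df: Any, base_path: str, hudi_config: Dict[str, Any]) -> None:
    """Write df to a Hudi table at base_path using the given options.

    df is expected to be a pyspark.sql.DataFrame; it is typed as Any so this
    sketch can be imported without Spark installed.
    """
    # Normalize every value to a string up front, so that booleans are
    # passed as "true"/"false" and integers as their decimal text.
    options = {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in hudi_config.items()
    }
    (
        df.write.format("hudi")
        .options(**options)
        .mode("append")  # assumption: the upsert targets an existing table
        .save(base_path)
    )
```

A call would look like `upsert_to_hudi(df, base_path, upsert_hudi_config)`, where `df` and `base_path` are placeholders for the input DataFrame and the table location.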
Data Characteristics
{code:java}
Table size: ~5GB uncompressed parquet data
Column count: 310 columns
High NULL density:
- Average NULLs per row: 217.74
- Min NULLs per row: 185
- Max NULLs per row: 230{code}
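To put the NULL counts in proportion to the column count, here is a small sketch of the arithmetic. The function name `null_fraction` is hypothetical; the inputs are the figures listed above.

```python
def null_fraction(nulls_per_row: float, column_count: int) -> float:
    """Return the share of a row's columns that are NULL, between 0 and 1."""
    if column_count <= 0:
        raise ValueError("column_count must be positive")
    return nulls_per_row / column_count


COLUMN_COUNT = 310

# Per-row NULL counts reported for the table.
for label, nulls in (("average", 217.74), ("min", 185), ("max", 230)):
    share = null_fraction(nulls, COLUMN_COUNT)
    print(f"{label}: {share:.1%} of columns are NULL")
```

With these figures, roughly 70% of the columns in an average row are NULL, ranging from about 60% to about 74% across rows.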