Leon Lin created HUDI-9313:
------------------------------

             Summary: Job "Doing partition and writing data" performance 
regression from Hudi 0.9.0 to 0.14.0
                 Key: HUDI-9313
                 URL: https://issues.apache.org/jira/browse/HUDI-9313
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Leon Lin


The user runs the same Hudi upsert application with the same Hudi configuration 
on Hudi 0.9.0 and 0.14.0 for performance benchmarking.
The results show a roughly 2x performance regression in the job
{code:java}
Doing partition and writing data{code}
which takes ~2.5 min on 0.9.0 and ~5 min on 0.14.0. 
Is this a known regression, and if so, what causes it?

Hudi config

{code:python}
upsert_hudi_config = {
    "hoodie.table.name": "[table_name]",
    "hoodie.database.name": "[database_name]",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "[precombine_key]",
    "hoodie.datasource.write.recordkey.field": "[record_key]",
    "hoodie.datasource.write.table.name": "[table_name]",
    "hoodie.index.type": "BLOOM",
    "hoodie.metadata.enable": False,
    "hoodie.upsert.shuffle.parallelism": 3,
}
{code}
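For anyone trying to reproduce the comparison, the following is a minimal sketch of how a config like the one above might be applied and timed from PySpark. It is not the reporter's application: the helper names `build_upsert_config`, `timed_upsert`, and `slowdown` are hypothetical, the base path is whatever the caller supplies, and the measured wall-clock time covers the whole write rather than only the "Doing partition and writing data" job, so the Spark UI is still needed for per-job numbers.

```python
import time


def build_upsert_config(table_name, database_name, record_key, precombine_key):
    """Return the upsert options from the report with placeholders filled in.

    Hypothetical helper: the option keys and fixed values are copied from the
    report, the function itself is only for illustration.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.database.name": database_name,
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.precombine.field": precombine_key,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.index.type": "BLOOM",
        "hoodie.metadata.enable": False,
        "hoodie.upsert.shuffle.parallelism": 3,
    }


def timed_upsert(df, base_path, hudi_config):
    """Upsert `df` into the Hudi table at `base_path`; return elapsed seconds.

    `df` is expected to be a Spark DataFrame. The time measured here is the
    wall-clock time of the whole write, not of a single Spark job.
    """
    start = time.perf_counter()
    df.write.format("hudi").options(**hudi_config).mode("append").save(base_path)
    return time.perf_counter() - start


def slowdown(baseline_seconds, candidate_seconds):
    """Ratio of candidate time to baseline time (2.0 means twice as slow)."""
    return candidate_seconds / baseline_seconds


# Timings quoted in the report: ~2.5 min on 0.9.0 and ~5 min on 0.14.0.
print(slowdown(2.5 * 60, 5 * 60))  # 2.0
```

Running `timed_upsert` with the same input DataFrame on a cluster for each Hudi version, and passing the two results to `slowdown`, would give a figure comparable to the 2x quoted above, assuming the rest of the write path takes similar time on both versions.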
Data Characteristics

{code:java}
Table size: ~5GB uncompressed parquet data
Column count: 310 columns
High NULL density:
  - Average NULLs per row: 217.74
  - Min NULLs per row: 185
  - Max NULLs per row: 230{code}
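To put the NULL density in perspective, the per-row counts above can be expressed as a fraction of the 310 columns. This is only arithmetic on the figures already given; the helper name `null_fraction` is made up for illustration.

```python
COLUMN_COUNT = 310

# NULLs-per-row statistics as reported above.
null_stats = {"avg": 217.74, "min": 185, "max": 230}


def null_fraction(nulls_per_row, column_count=COLUMN_COUNT):
    """Fraction of a row's columns that are NULL."""
    return nulls_per_row / column_count


for name, nulls in null_stats.items():
    print(f"{name}: {null_fraction(nulls):.1%}")
# avg: 70.2%
# min: 59.7%
# max: 74.2%
```

So on average about 70% of the columns in each row are NULL.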



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
