Vsevolod3 commented on issue #8071:
URL: https://github.com/apache/hudi/issues/8071#issuecomment-1582970788

   We're seeing a similar issue with write performance. The Hudi stream_write 
task takes 8-10 minutes for a MoR table and 9-11 minutes for a CoW table to 
write 600K records. It is this slow even when the destination is completely 
empty and the corresponding Glue Catalog table doesn't exist yet. Details of 
our testing are below:
   
   ## Test parameters:
   - 600,000 records
   - About 128MB of input data
   - UPSERT mode (both CoW and MoR)
   - Flink 1.15.2
   - Hudi 0.13.0
   - Partitioning: the year and month are extracted from each record's 
created_dttm into de_year and de_month columns, which serve as the partition 
fields; this yields hive-style partitions in S3, like de_year=X/de_month=Y.
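
   The partitioning step above can be sketched roughly as follows. This is an 
illustration only: the helper name and the created_dttm timestamp format are 
assumptions, not taken from our actual job.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class PartitionFields {
    // Hypothetical helper: derives the de_year / de_month partition values
    // from a record's created_dttm string (format assumed for illustration).
    static String[] partitionValues(String createdDttm) {
        LocalDateTime ts = LocalDateTime.parse(createdDttm,
                DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
        String deYear = String.valueOf(ts.getYear());
        String deMonth = String.format("%02d", ts.getMonthValue());
        return new String[] {deYear, deMonth};
    }

    public static void main(String[] args) {
        String[] p = partitionValues("2023-06-08 12:34:56");
        // Hive-style partition path component, e.g. de_year=2023/de_month=06
        System.out.println("de_year=" + p[0] + "/de_month=" + p[1]);
    }
}
```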
   
   ## CoW Testing
   
   ### CoW results (sample):
   - For operators:
     - map tasks (4 parallel slots): 174ms
     - bucket assigner (4 parallel slots): 1s
     - stream_write (6 parallel slots): 9m 49s
   - File sizes: ~450KB
   
   ### CoW options being set on HoodiePipeline.Builder:
   ```
   'connector' = 'hudi',
     'index.type' = 'BLOOM',
     'compaction.schedule.enabled' = 'true',
     'clustering.plan.strategy.sort.columns' = 'acct_id',
     'compaction.delta_seconds' = '720',
     'clustering.delta_commits' = '4',
     'clustering.plan.strategy.small.file.limit' = '600',
     'compaction.async.enabled' = 'true',
     'compaction.max_memory' = '100',
     'hoodie.parquet.max.file.size' = '125829120',
     'read.streaming.enabled' = 'false',
     'path' = 's3://*****/*/account/',
     'hoodie.logfile.max.size' = '1073741824',
     'hoodie.datasource.write.hive_style_partitioning' = 'true',
     'hoodie.parquet.compression.ratio' = '0.1',
     'hoodie.parquet.small.file.limit' = '104857600',
     'compaction.tasks' = '4',
     'precombine.field' = 'update_ts',
     'write.task.max.size' = '1024.0',
     'hoodie.parquet.compression.codec' = 'snappy',
     'compaction.delta_commits' = '3',
     'clustering.tasks' = '4',
     'compaction.trigger.strategy' = 'num_or_time',
     'read.tasks' = '4',
     'compaction.timeout.seconds' = '1200',
     'clustering.async.enabled' = 'false',
     'table.type' = 'COPY_ON_WRITE',
     'metadata.compaction.delta_commits' = '10',
     'clustering.plan.strategy.max.num.groups' = '30',
     'write.tasks' = '6',
     'clustering.schedule.enabled' = 'false',
     'hoodie.logfile.data.block.format' = 'avro',
     'write.batch.size' = '256.0',
     'write.sort.memory' = '128'
   ```
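
   For context, a minimal sketch of how we wire options like these into 
HoodiePipeline.Builder. The table name, column definitions, and key are 
placeholders for illustration; only a few representative options are shown.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.hudi.util.HoodiePipeline;

public class HudiSinkSketch {
    // Sketch only: column/key names are placeholders, and only a subset of
    // the option list above is repeated here.
    public static HoodiePipeline.Builder cowBuilder() {
        Map<String, String> options = new HashMap<>();
        options.put("table.type", "COPY_ON_WRITE");
        options.put("precombine.field", "update_ts");
        options.put("write.tasks", "6");
        // ... remaining options from the list above ...

        return HoodiePipeline.builder("account")
                .column("acct_id BIGINT")
                .column("update_ts TIMESTAMP(3)")
                .column("de_year STRING")
                .column("de_month STRING")
                .pk("acct_id")
                .partition("de_year", "de_month")
                .options(options);
        // The builder is then attached to the stream, e.g.
        // builder.sink(rowDataStream, false);
    }
}
```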
   
   ### CoW DAG
   Attached:
   
![CoW_DAG](https://github.com/apache/hudi/assets/40812010/e1d49b08-1bb1-47f8-b530-3d7610f9705f)
    
   ## MoR Testing
   
   ### MoR results (sample):
   - For operators:
     - map tasks (4 parallel slots): 120ms
     - bucket assigner (4 parallel slots): 1s
     - stream_write (6 parallel slots): 9m 29s
     - compact_plan_generate (1 parallel slot): 78ms
     - compact_task (4 parallel slots): 129ms
     - compact_commit (6 parallel slots): 119ms
   - File sizes: ~450KB
   
   
   ### MoR options being set on HoodiePipeline.Builder (identical to the CoW options except 'table.type'):
   ```
   'connector' = 'hudi',
     'index.type' = 'BLOOM',
     'compaction.schedule.enabled' = 'true',
     'clustering.plan.strategy.sort.columns' = 'acct_id',
     'compaction.delta_seconds' = '720',
     'clustering.delta_commits' = '4',
     'clustering.plan.strategy.small.file.limit' = '600',
     'compaction.async.enabled' = 'true',
     'compaction.max_memory' = '100',
     'hoodie.parquet.max.file.size' = '125829120',
     'read.streaming.enabled' = 'false',
     'path' = 's3://*****/*/account/',
     'hoodie.logfile.max.size' = '1073741824',
     'hoodie.datasource.write.hive_style_partitioning' = 'true',
     'hoodie.parquet.compression.ratio' = '0.1',
     'hoodie.parquet.small.file.limit' = '104857600',
     'compaction.tasks' = '4',
     'precombine.field' = 'update_ts',
     'write.task.max.size' = '1024.0',
     'hoodie.parquet.compression.codec' = 'snappy',
     'compaction.delta_commits' = '3',
     'clustering.tasks' = '4',
     'compaction.trigger.strategy' = 'num_or_time',
     'read.tasks' = '4',
     'compaction.timeout.seconds' = '1200',
     'clustering.async.enabled' = 'false',
     'table.type' = 'MERGE_ON_READ',
     'metadata.compaction.delta_commits' = '10',
     'clustering.plan.strategy.max.num.groups' = '30',
     'write.tasks' = '6',
     'clustering.schedule.enabled' = 'false',
     'hoodie.logfile.data.block.format' = 'avro',
     'write.batch.size' = '256.0',
     'write.sort.memory' = '128'
   ```
   
   ### MoR DAG
   Attached:
   
![MoR_DAG](https://github.com/apache/hudi/assets/40812010/f693c336-eee9-4c94-912f-813bfb85e6b2)
   
   
   

