Vsevolod3 commented on issue #8071:
URL: https://github.com/apache/hudi/issues/8071#issuecomment-1587601662
@danny0405: as an update, I tried the bucket index, and the run time is
still in the 9-10 minute range. Here are the timings for the tasks:
- For operators:
  - map tasks (4 parallel slots): 51ms
  - bucket assigner (4 parallel slots): 1s
  - stream_write (6 parallel slots): 9m 36s
Here are the Hudi properties submitted to the sink builder:
```
'connector' = 'hudi',
'compaction.schedule.enabled' = 'true',
'hoodie.index.bucket.engine' = 'SIMPLE',
'hoodie.index.type' = 'BUCKET',
'clustering.plan.strategy.sort.columns' = 'acct_id',
'write.bucket_assign.tasks' = '6',
'compaction.delta_seconds' = '720',
'clustering.delta_commits' = '4',
'clustering.plan.strategy.small.file.limit' = '600',
'compaction.async.enabled' = 'true',
'compaction.max_memory' = '100',
'hoodie.parquet.max.file.size' = '125829120',
'read.streaming.enabled' = 'false',
'path' = 's3://*****/*/account/',
'hoodie.logfile.max.size' = '1073741824',
'hoodie.datasource.write.hive_style_partitioning' = 'true',
'hoodie.parquet.compression.ratio' = '0.1',
'hoodie.parquet.small.file.limit' = '104857600',
'hoodie.bucket.index.hash.field' = 'acct_id',
'compaction.tasks' = '4',
'precombine.field' = 'update_ts',
'write.task.max.size' = '1024.0',
'hoodie.parquet.compression.codec' = 'snappy',
'compaction.delta_commits' = '3',
'clustering.tasks' = '4',
'compaction.trigger.strategy' = 'num_or_time',
'hoodie.bucket.index.num.buckets' = '256',
'read.tasks' = '4',
'compaction.timeout.seconds' = '1200',
'clustering.async.enabled' = 'false',
'table.type' = 'COPY_ON_WRITE',
'metadata.compaction.delta_commits' = '10',
'clustering.plan.strategy.max.num.groups' = '30',
'write.tasks' = '6',
'clustering.schedule.enabled' = 'false',
'hoodie.logfile.data.block.format' = 'avro',
'write.batch.size' = '256.0',
'write.sort.memory' = '128'
```
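For readability, the three byte-valued `hoodie.*` size settings above can be converted to MiB. This is only arithmetic on the values already listed; it assumes these three properties are expressed in bytes, and the helper names `to_mib` and `describe` are made up for illustration.

```python
# Byte-valued settings copied verbatim from the property list above.
# Assumption: all three are expressed in bytes.
SIZE_SETTINGS = {
    "hoodie.parquet.max.file.size": 125829120,
    "hoodie.parquet.small.file.limit": 104857600,
    "hoodie.logfile.max.size": 1073741824,
}


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


def describe(settings: dict) -> dict:
    """Return the same keys with their values converted to MiB."""
    return {key: to_mib(value) for key, value in settings.items()}


if __name__ == "__main__":
    for key, mib in describe(SIZE_SETTINGS).items():
        print(f"{key} = {mib:.0f} MiB")
```

With these values, the max parquet file size comes out to 120 MiB, the small file limit to 100 MiB, and the max log file size to 1024 MiB.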
Also, the metadata in the parquet files written by Hudi still contains
hoodie_bloom_filter_type_code=DYNAMIC_V0 and org.apache.hudi.bloomfilter,
as if it were still using the BLOOM index. Is this expected?
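For anyone who wants to reproduce the metadata check, here is a minimal sketch. It assumes `pyarrow` is installed and that one of the parquet files has been copied locally; the helper names `bloom_entries` and `read_footer` are hypothetical, and only the two key names mentioned above come from the actual files.

```python
import sys
from typing import Dict, Mapping

# The two footer keys observed in the files written by Hudi.
BLOOM_MARKERS = ("hoodie_bloom_filter_type_code", "org.apache.hudi.bloomfilter")


def bloom_entries(footer: Mapping[bytes, bytes]) -> Dict[str, str]:
    """Pick the bloom-filter-related entries out of parquet key-value metadata."""
    found = {}
    for raw_key, raw_value in footer.items():
        key = raw_key.decode("utf-8", errors="replace")
        if key in BLOOM_MARKERS:
            value = raw_value.decode("utf-8", errors="replace")
            # The serialized filter can be large, so keep only a short prefix.
            found[key] = value[:40]
    return found


def read_footer(path: str) -> Mapping[bytes, bytes]:
    """Read the key-value metadata of a local parquet file (requires pyarrow)."""
    import pyarrow.parquet as pq  # assumption: pyarrow is available

    return pq.read_metadata(path).metadata or {}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_footer.py <local-copy-of-a-parquet-file>
    for key, value in bloom_entries(read_footer(sys.argv[1])).items():
        print(f"{key} = {value}")
```

This only reports whether the keys are present in a file; it says nothing about whether the writer actually consulted the bloom filter for indexing.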