Guanpx opened a new issue #5150:
URL: https://github.com/apache/hudi/issues/5150
**Describe the problem you faced**
Flink + Hudi COW table + BUCKET index + bulk_insert
bucket_bulk_insert is **very slow** and generates **too many small HDFS files**.
**To Reproduce**
Steps to reproduce the behavior:
1. Use bulk_insert into a COW table with the Flink BUCKET index; the data size is about 5 million rows (~5 GB), running in Flink batch mode (see the SQL sketch under Additional context).
**Expected behavior**
Read the source data from Hive and sink it to Hudi with Flink, **without** producing too many small files.
**Environment Description**
* Flink version : 1.14.3
* Hudi version : master-0.11.0 (2022-03-28 10:00am, UTC+8)
* Hadoop version : 3.0.0
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no
**Additional context**
* Hudi config
```
'connector' = 'hudi',
'path' = 'hdfs://nameservice-ha/hudi/dw/rds.db/xxxx',
'hoodie.parquet.compression.codec'= 'snappy',
'index.type' = 'BUCKET',
'table.type' = 'COPY_ON_WRITE',
'write.operation' = 'bulk_insert',
'write.tasks' = '6',
'hoodie.bucket.index.num.buckets' = '6',
'write.sort.memory' = '256',
'hoodie.bucket.index.hash.field' = 'id'
```
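For reference, a minimal sketch of the Flink SQL job in question (the schema and the `hudi_sink` / `hive_catalog.rds.source_table` names are hypothetical; the WITH options are the exact config above):
```
-- Sketch only: table and column names are hypothetical.
CREATE TABLE hudi_sink (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://nameservice-ha/hudi/dw/rds.db/xxxx',
  'hoodie.parquet.compression.codec' = 'snappy',
  'index.type' = 'BUCKET',
  'table.type' = 'COPY_ON_WRITE',
  'write.operation' = 'bulk_insert',
  'write.tasks' = '6',
  'hoodie.bucket.index.num.buckets' = '6',
  'write.sort.memory' = '256',
  'hoodie.bucket.index.hash.field' = 'id'
);

-- Source is a Hive table; the job runs in Flink batch mode.
INSERT INTO hudi_sink
SELECT id, name, ts
FROM hive_catalog.rds.source_table;
```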
* bucket_bulk_insert is very slow: about 4,000 records/min
<img width="1515" alt="image"
src="https://user-images.githubusercontent.com/29246713/160319772-8e01087a-98b6-44d8-a0fc-f2aebdd39c49.png">
* too many small HDFS files
<img width="1014" alt="image"
src="https://user-images.githubusercontent.com/29246713/160319884-de51b98f-3099-4c5b-97af-0133426476d2.png">
**Stacktrace**
None.