Guanpx opened a new issue #5150:
URL: https://github.com/apache/hudi/issues/5150
**Describe the problem you faced**
Flink + Hudi COW table + BUCKET index + bulk_insert
bucket_bulk_insert is **very slow** and generates **too many small HDFS files**.
**To Reproduce**
Steps to reproduce the behavior:
1. Use bulk_insert into a COW table with the Flink BUCKET index; the data size is about 5 million rows (~5 GB), running in Flink batch mode (see the SQL sketch under Additional context).
**Expected behavior**
Read the source data from Hive and sink it to Hudi with Flink, **without** producing too many small files.
**Environment Description**
* Flink version : 1.14.3
* Hudi version : master-0.11.0 (2022-03-28 10:00am, UTC+8)
* Hadoop version : 3.0.0
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no
**Additional context**
* Hudi config
```
'connector' = 'hudi',
'path' = 'hdfs://nameservice-ha/hudi/dw/rds.db/xxxx',
'hoodie.parquet.compression.codec'= 'snappy',
'index.type' = 'BUCKET',
'table.type' = 'COPY_ON_WRITE',
'write.operation' = 'bulk_insert',
'write.tasks' = '6',
'hoodie.bucket.index.num.buckets' = '6',
'write.sort.memory' = '256',
'hoodie.bucket.index.hash.field' = 'id'
```
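For reference, a minimal sketch of the Flink SQL job in question (the schema and the `hudi_sink` / `hive_catalog.rds.source_table` names are hypothetical; the WITH options are the exact config above):
```
-- Sketch only: table and column names are hypothetical.
CREATE TABLE hudi_sink (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://nameservice-ha/hudi/dw/rds.db/xxxx',
  'hoodie.parquet.compression.codec' = 'snappy',
  'index.type' = 'BUCKET',
  'table.type' = 'COPY_ON_WRITE',
  'write.operation' = 'bulk_insert',
  'write.tasks' = '6',
  'hoodie.bucket.index.num.buckets' = '6',
  'write.sort.memory' = '256',
  'hoodie.bucket.index.hash.field' = 'id'
);

-- Source is a Hive table; the job runs in Flink batch mode.
INSERT INTO hudi_sink
SELECT id, name, ts
FROM hive_catalog.rds.source_table;
```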
* bucket_bulk_insert is very slow: about 4,000 records/min
<img width="1515" alt="image"
src="https://user-images.githubusercontent.com/29246713/160319772-8e01087a-98b6-44d8-a0fc-f2aebdd39c49.png">
* too many small HDFS files
<img width="1014" alt="image"
src="https://user-images.githubusercontent.com/29246713/160319884-de51b98f-3099-4c5b-97af-0133426476d2.png">
**Stacktrace**
None.