[I] Clustering not working as expected(Addressing Small File Issues in Hudi with Flink) [hudi]

via GitHub Sun, 22 Jun 2025 23:52:13 -0700


maheshguptags opened a new issue, #13475:
URL: https://github.com/apache/hudi/issues/13475


   Hi Team,
   I'm encountering small file issues while storing data in S3 using Hudi with 
Flink. The dataset comprises approximately 3–4 billion records, yet the table 
size has reached around 130 TB.
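
For scale, here is a rough back-of-the-envelope check of the figures above. It is only a sketch: it assumes decimal terabytes (1 TB = 10**12 bytes) and treats the whole 130 TB as live data, whereas older file versions retained by the cleaner would inflate the real figure.

```python
# Rough average storage per record implied by the numbers in this report.
# Assumption: 1 TB = 10**12 bytes; the table size may include older file versions.
TABLE_BYTES = 130 * 10**12


def avg_bytes_per_record(table_bytes: int, records: int) -> float:
    """Return the average number of stored bytes per record."""
    return table_bytes / records


for records in (3_000_000_000, 4_000_000_000):
    kb = avg_bytes_per_record(TABLE_BYTES, records) / 1000
    print(f"{records:,} records -> ~{kb:.1f} KB per record")
```

Under these assumptions the table stores tens of kilobytes per record, which is why the table size looks out of proportion to the record count.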
   
   Config:
   
   ```
   'table.type' = 'COPY_ON_WRITE',
   'hoodie.clean.max.commits'='8',
   'hoodie.clean.trigger.strategy'='NUM_COMMITS',
   'hoodie.cleaner.commits.retained'='6',
   'hoodie.cleaner.parallelism'='100',
   'hoodie.clean.automatic' = 'true',
   'hoodie.clean.async'='true',
   'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS',
   'hoodie.parquet.small.file.limit'='104857600',
   'hoodie.index.type'= 'BUCKET',
   'hoodie.index.bucket.engine' = 'SIMPLE',
   'hoodie.bucket.index.num.buckets'='16',
   'hoodie.bucket.index.hash.field'='x',
   'hoodie.archive.automatic'='true',
   'hoodie.keep.max.commits'= '45',
   'hoodie.keep.min.commits'= '30',
   'hoodie.parquet.compression.codec'='snappy',
   'write.operation'='upsert',
   'hoodie.write.concurrency.mode'='optimistic_concurrency_control',
   'hoodie.write.lock.provider'='org.apache.hudi.client.transaction.lock.InProcessLockProvider',
   'hoodie.schema.on.read.enable'= 'true'
   ```
   
   Table structure (partitioned as cid/date_partition; clustering only picks up the first partition value):
   ```
   cl-821/
     ├── 2025-04-04/   <- only this partition is clustered
     ├── 2025-04-05/
     ├── 2025-04-06/
     ├── ...
   
   ```
   clustering.prop:
   
   ```
   hoodie.clustering.async.enabled=true
   hoodie.clustering.async.max.commits=0
   hoodie.cleaner.commits.retained=5
   hoodie.clustering.plan.strategy.max.num.groups=50000
   hoodie.clustering.plan.strategy.max.bytes.per.group=21474836480
   hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
   hoodie.clustering.plan.strategy.small.file.limit=629145600
   
   ```
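
The byte-valued settings quoted in this report are easier to compare in binary units. The sketch below only converts the numbers as given; the `human` helper is a hypothetical name for illustration and is not part of Hudi.

```python
# Convert the byte-valued settings quoted in this report into binary units.
SETTINGS = {
    "hoodie.parquet.small.file.limit": 104857600,
    "hoodie.clustering.plan.strategy.small.file.limit": 629145600,
    "hoodie.clustering.plan.strategy.target.file.max.bytes": 1073741824,
    "hoodie.clustering.plan.strategy.max.bytes.per.group": 21474836480,
}


def human(n: int) -> str:
    """Format a byte count as GiB or MiB (hypothetical helper)."""
    for unit, size in (("GiB", 1024**3), ("MiB", 1024**2)):
        if n >= size:
            return f"{n / size:g} {unit}"
    return f"{n} B"


for key, value in SETTINGS.items():
    print(f"{key} = {human(value)}")
```

In other words, the write-side small-file limit is 100 MiB, while the clustering settings use a 600 MiB small-file limit, a 1 GiB target file size, and 20 GiB per clustering group.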
   
   Current Table Configuration:
   Hudi Version: 0.15
   Flink Version: 1.18
   Table Type: Copy-On-Write (COW)
   Write Operation: Upsert (later changed to Insert)
   
   * Storage (HDFS/S3/GCS..) : s3
   
   
   **Expected behavior**
   
   Clustering should run on all partitions of the table, not only the first one.
   
   **Current behavior**
   
   Clustering runs only on the first partition of the table.
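
One way to confirm which partitions still hold small files is to summarise a file listing per partition. The following is a minimal sketch, not a Hudi API: `small_files_by_partition` is a hypothetical helper, and the sample listing is made up purely for illustration. Only the 629145600-byte limit comes from the clustering properties in this report.

```python
from collections import defaultdict

# hoodie.clustering.plan.strategy.small.file.limit as given in the report
SMALL_FILE_LIMIT = 629145600


def small_files_by_partition(listing, limit=SMALL_FILE_LIMIT):
    """Count Parquet files smaller than `limit` in each partition.

    `listing` is an iterable of (relative_path, size_in_bytes) pairs; the
    partition is everything before the final path component.
    """
    counts = defaultdict(int)
    for path, size in listing:
        partition, _, name = path.rpartition("/")
        if name.endswith(".parquet") and size < limit:
            counts[partition] += 1
    return dict(counts)


# Hypothetical listing, made up for illustration only.
sample = [
    ("cl-821/2025-04-04/a.parquet", 900 * 1024**2),
    ("cl-821/2025-04-05/b.parquet", 40 * 1024**2),
    ("cl-821/2025-04-05/c.parquet", 55 * 1024**2),
    ("cl-821/2025-04-06/d.parquet", 12 * 1024**2),
]
print(small_files_by_partition(sample))
```

Partitions that still appear in the result after a clustering run are the ones the plan did not cover.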
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
