maheshguptags opened a new issue, #13475:
URL: https://github.com/apache/hudi/issues/13475
Hi Team,
I'm encountering small file issues while storing data in S3 using Hudi with
Flink. The dataset comprises approximately 3–4 billion records, yet the table
size has reached around 130 TB.
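To put those numbers in perspective, here is a back-of-the-envelope sketch of the average footprint per record they imply. It assumes "TB" means decimal terabytes (10**12 bytes); the helper name `avg_bytes_per_record` is only illustrative.

```python
# Rough average on-disk footprint per record, using the figures quoted above.
# Assumption: "TB" is read as decimal terabytes (10**12 bytes).
TABLE_BYTES = 130 * 10**12


def avg_bytes_per_record(table_bytes: int, records: int) -> float:
    """Return the mean number of stored bytes per record."""
    return table_bytes / records


# Evaluate both ends of the "3-4 billion records" range.
for records in (3_000_000_000, 4_000_000_000):
    kb = avg_bytes_per_record(TABLE_BYTES, records) / 1000
    print(f"{records:,} records -> ~{kb:.1f} kB per record")
```

That works out to roughly 32 to 43 kB of storage per record, which is why the table size looks out of proportion to the record count.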
Config:
```
'table.type' = 'COPY_ON_WRITE',
'hoodie.clean.max.commits'='8',
'hoodie.clean.trigger.strategy'='NUM_COMMITS',
'hoodie.cleaner.commits.retained'='6',
'hoodie.cleaner.parallelism'='100',
'hoodie.clean.automatic' = 'true',
'hoodie.clean.async'='true',
'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS',
'hoodie.parquet.small.file.limit'='104857600',
'hoodie.index.type'= 'BUCKET',
'hoodie.index.bucket.engine' = 'SIMPLE',
'hoodie.bucket.index.num.buckets'='16',
'hoodie.bucket.index.hash.field'='x',
'hoodie.archive.automatic'='true',
'hoodie.keep.max.commits'= '45',
'hoodie.keep.min.commits'= '30',
'hoodie.parquet.compression.codec'='snappy',
'write.operation'='upsert',
'hoodie.write.concurrency.mode'='optimistic_concurrency_control',
'hoodie.write.lock.provider'='org.apache.hudi.client.transaction.lock.InProcessLockProvider',
'hoodie.schema.on.read.enable'= 'true'
```
Table structure:
```
Example: with partition path cid/date_partition, clustering picks up only the first partition value
cl-821/
├── 2025-04-04/   <- only this partition is clustered
├── 2025-04-05/
├── 2025-04-06/
├── ...
```
clustering.prop
```
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=0
hoodie.cleaner.commits.retained=5
hoodie.clustering.plan.strategy.max.num.groups=50000
hoodie.clustering.plan.strategy.max.bytes.per.group=21474836480
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
```
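For readability, the byte values in these settings (plus `hoodie.parquet.small.file.limit` from the writer config) can be converted to binary units. This is only a small conversion sketch; the helper name `human` is made up for illustration and is not part of Hudi.

```python
def human(n: float) -> str:
    """Format a byte count using binary units (KiB, MiB, GiB, TiB)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:g} {unit}"
        n /= 1024


# Values copied from the configuration shown in this issue.
settings = {
    "hoodie.parquet.small.file.limit": 104857600,
    "hoodie.clustering.plan.strategy.small.file.limit": 629145600,
    "hoodie.clustering.plan.strategy.target.file.max.bytes": 1073741824,
    "hoodie.clustering.plan.strategy.max.bytes.per.group": 21474836480,
}

for key, value in settings.items():
    print(f"{key} = {human(value)}")
```

So the clustering plan treats files under 600 MiB as small, targets 1 GiB output files, and caps each group at 20 GiB, i.e. about 20 target-sized files per group by simple division.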
Current Table Configuration:
Hudi Version: 0.15
Flink Version: 18
Table Type: Copy-On-Write (COW)
Write Operation: Upsert, later changed to Insert
Storage (HDFS/S3/GCS..): S3
**Expected behavior**
Clustering should run on all partitions of the table, not only the first
one.
**Current behavior**
Clustering runs only on the first partition of the table.
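One way to check which partitions were actually clustered is to count the files below the clustering small-file limit in each partition. The sketch below assumes a listing of `(object key, size in bytes)` pairs has already been obtained from S3 by some other means. The file names and sizes in `listing` are invented for illustration, and the count covers every parquet file under the path, including older file versions, not only the latest ones.

```python
from collections import Counter
from typing import Iterable, Tuple

# hoodie.clustering.plan.strategy.small.file.limit from clustering.prop
SMALL_FILE_LIMIT = 629_145_600


def small_files_per_partition(
    files: Iterable[Tuple[str, int]], limit: int = SMALL_FILE_LIMIT
) -> Counter:
    """Count parquet files smaller than `limit` bytes, keyed by partition path."""
    counts: Counter = Counter()
    for key, size in files:
        if key.endswith(".parquet") and size < limit:
            # The partition path is everything before the file name.
            partition = key.rsplit("/", 1)[0]
            counts[partition] += 1
    return counts


# Hypothetical listing: names and sizes are made up for illustration only.
listing = [
    ("cl-821/2025-04-04/a.parquet", 900_000_000),
    ("cl-821/2025-04-05/b.parquet", 40_000_000),
    ("cl-821/2025-04-05/c.parquet", 35_000_000),
    ("cl-821/2025-04-06/d.parquet", 50_000_000),
]

for partition, count in sorted(small_files_per_partition(listing).items()):
    print(f"{partition}: {count} small file(s)")
```

A partition that still reports many small files after a clustering run would be one that the plan did not cover.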