dineshbganesan opened a new issue, #9024: URL: https://github.com/apache/hudi/issues/9024
### **Problem Description**

We have a table partitioned by a date field, and we use inline clustering to resize the smaller files. Per our configuration, clustering kicks off every 2 commits and resizes the files. However, only a few partitions are picked up for clustering; we expected all partitions to be picked up and all eligible files (small files < 384 MB) to be resized. The table has daily partitions from 2017 to 2023, all of which are eligible for clustering, but only a few dates are actually picked up. The most recent replacecommit is attached. Appreciate any help/inputs.

### **Hudi Configuration**

```
'hoodie.table.type': 'COPY_ON_WRITE',
'hoodie.table.name': 'd1_msrmt',
'hoodie.datasource.write.operation': 'UPSERT',
'hoodie.datasource.hive_sync.database': 'testdb',
'hoodie.datasource.hive_sync.table': 'd1_msrmt',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.support_timestamp': 'true',
'hoodie.datasource.hive_sync.mode': 'hms',
'hoodie.datasource.hive_sync.use_jdbc': 'false',
'hoodie.datasource.write.precombine.field': 'update_ts_dms',
'hoodie.datasource.write.recordkey.field': 'measr_comp_id,msrmt_dttm',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
'hoodie.datasource.meta_sync.condition.sync': 'true',
'hoodie.datasource.write.partitionpath.field': 'msrmt_dt',
'hoodie.datasource.hive_sync.partition_fields': 'msrmt_dt',
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.HiveStylePartitionValueExtractor',
'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': 'true',
'hoodie.index.type': 'BLOOM',
'hoodie.metadata.enable': 'true',
'hoodie.schema.on.read.enable': 'true',
'hoodie.parquet.writelegacyformat.enabled': 'true',
'hoodie.parquet.small.file.limit': 0,
'hoodie.clustering.plan.partition.filter.mode': 'NONE',
'hoodie.clustering.inline': 'true',
'hoodie.clustering.inline.max.commits': 2,
'hoodie.clustering.plan.strategy.target.file.max.bytes': 536870912,  # 512 MB
'hoodie.clustering.plan.strategy.small.file.limit': 402653184,  # 384 MB
'hoodie.clustering.plan.strategy.sort.columns': 'id',
'hoodie.clustering.plan.strategy.max.bytes.per.group': 2147483648,  # 2 GB
'hoodie.clustering.plan.strategy.max.num.groups': 30,
'hoodie.upsert.shuffle.parallelism': 200,
'hoodie.combine.before.insert': 'true',
'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
'hoodie.cleaner.commits.retained': 10
```

### **Environment Description**

* Platform: AWS Glue v4.0
* Hudi version: 0.12.1
* Spark version: 3.3
* Storage (HDFS/S3/GCS..): S3

Attached replacecommit: [20230618145329612.replacecommit.txt](https://github.com/apache/hudi/files/11804939/20230618145329612.replacecommit.txt)
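For context, the clustering thresholds above can be sanity-checked with a small Python snippet. The dict keys and byte values are taken verbatim from the configuration; the commented-out `df.write` call and `target_path` are hypothetical, assuming a standard PySpark/Glue Hudi writer. Note that `max.num.groups * max.bytes.per.group` puts an upper bound on how much data a single clustering plan can cover, which is simple arithmetic on the stated values:

```python
# Clustering-related Hudi options, assembled as they might be in a
# PySpark/Glue job (values copied from the configuration above).
MB = 1024 * 1024

hudi_options = {
    'hoodie.clustering.inline': 'true',
    'hoodie.clustering.inline.max.commits': '2',
    'hoodie.clustering.plan.strategy.small.file.limit': str(384 * MB),       # 402653184
    'hoodie.clustering.plan.strategy.target.file.max.bytes': str(512 * MB),  # 536870912
    'hoodie.clustering.plan.strategy.max.bytes.per.group': str(2048 * MB),   # 2147483648
    'hoodie.clustering.plan.strategy.max.num.groups': '30',
}

# Upper bound on bytes a single clustering plan can cover:
# max.num.groups * max.bytes.per.group = 30 * 2 GiB = 60 GiB
plan_cap_bytes = 30 * 2048 * MB
print(plan_cap_bytes)  # 64424509440

# Hypothetical write call (df and target_path are placeholders):
# df.write.format('hudi').options(**hudi_options).mode('append').save(target_path)
```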
