dineshbganesan opened a new issue, #9024: URL: https://github.com/apache/hudi/issues/9024
### **Problem Description**

We have a table partitioned by a date field, and we use inline clustering to resize the smaller files. Per our configuration, clustering kicks off every 2 commits and resizes the files. However, only a few partitions are picked up for clustering; we expected all partitions to be picked up and all eligible files (small files < 384 MB) to be resized. The table has daily partitions from 2017 to 2023, all of which are eligible for clustering, but only a few dates are actually picked up. The most recent replacecommit is attached. Appreciate any help/inputs.

### **Hudi Configuration**

```
'hoodie.table.type': 'COPY_ON_WRITE',
'hoodie.table.name': 'd1_msrmt',
'hoodie.datasource.write.operation': 'UPSERT',
'hoodie.datasource.hive_sync.database': 'testdb',
'hoodie.datasource.hive_sync.table': 'd1_msrmt',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.support_timestamp': 'true',
'hoodie.datasource.hive_sync.mode': 'hms',
'hoodie.datasource.hive_sync.use_jdbc': 'false',
'hoodie.datasource.write.precombine.field': 'update_ts_dms',
'hoodie.datasource.write.recordkey.field': 'measr_comp_id,msrmt_dttm',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
'hoodie.datasource.meta_sync.condition.sync': 'true',
'hoodie.datasource.write.partitionpath.field': 'msrmt_dt',
'hoodie.datasource.hive_sync.partition_fields': 'msrmt_dt',
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.HiveStylePartitionValueExtractor',
'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': 'true',
'hoodie.index.type': 'BLOOM',
'hoodie.metadata.enable': 'true',
'hoodie.schema.on.read.enable': 'true',
'hoodie.parquet.writelegacyformat.enabled': 'true',
'hoodie.parquet.small.file.limit': 0,
'hoodie.clustering.plan.partition.filter.mode': 'NONE',
'hoodie.clustering.inline': 'true',
'hoodie.clustering.inline.max.commits': 2,
'hoodie.clustering.plan.strategy.target.file.max.bytes': 536870912,  # 512 MB
'hoodie.clustering.plan.strategy.small.file.limit': 402653184,  # 384 MB
'hoodie.clustering.plan.strategy.sort.columns': 'id',
'hoodie.clustering.plan.strategy.max.bytes.per.group': 2147483648,  # 2 GB
'hoodie.clustering.plan.strategy.max.num.groups': 30,
'hoodie.upsert.shuffle.parallelism': 200,
'hoodie.combine.before.insert': 'true',
'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
'hoodie.cleaner.commits.retained': 10
```

### **Environment Description**

* Platform: AWS Glue v4.0
* Hudi version: 0.12.1
* Spark version: 3.3
* Storage (HDFS/S3/GCS..): S3

Attached replacecommit: [20230618145329612.replacecommit.txt](https://github.com/apache/hudi/files/11804939/20230618145329612.replacecommit.txt)
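For context, the clustering thresholds above can be sanity-checked with a small Python snippet. The dict keys and byte values are taken verbatim from the configuration; the commented-out `df.write` call and `target_path` are hypothetical, assuming a standard PySpark/Glue Hudi writer. Note that `max.num.groups * max.bytes.per.group` puts an upper bound on how much data a single clustering plan can cover, which is simple arithmetic on the stated values:

```python
# Clustering-related Hudi options, assembled as they might be in a
# PySpark/Glue job (values copied from the configuration above).
MB = 1024 * 1024

hudi_options = {
    'hoodie.clustering.inline': 'true',
    'hoodie.clustering.inline.max.commits': '2',
    'hoodie.clustering.plan.strategy.small.file.limit': str(384 * MB),       # 402653184
    'hoodie.clustering.plan.strategy.target.file.max.bytes': str(512 * MB),  # 536870912
    'hoodie.clustering.plan.strategy.max.bytes.per.group': str(2048 * MB),   # 2147483648
    'hoodie.clustering.plan.strategy.max.num.groups': '30',
}

# Upper bound on bytes a single clustering plan can cover:
# max.num.groups * max.bytes.per.group = 30 * 2 GiB = 60 GiB
plan_cap_bytes = 30 * 2048 * MB
print(plan_cap_bytes)  # 64424509440

# Hypothetical write call (df and target_path are placeholders):
# df.write.format('hudi').options(**hudi_options).mode('append').save(target_path)
```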
