JnaneshwarikTR opened a new issue, #7910:
URL: https://github.com/apache/hudi/issues/7910

   Hi,
   
   * Hudi version :0.11.1
   
   * Spark version :3.2.1
   
   * Hive version : NA
   
   * Hadoop version : NA
   
   * Storage (HDFS/S3/GCS..) :S3
   
   * Running on Docker? (yes/no) : no
   
   We have a Spark Streaming application running with a batch interval of 5 minutes. We added the configs below to avoid creating small files.
   
   HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key() -> String.valueOf(104857600)
   HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key() -> String.valueOf(125829120)
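
For readers unfamiliar with the byte values, the two sizing settings above work out to 100 MiB and 120 MiB. The sketch below restates them with the raw property keys that these constants are understood to resolve to (`hoodie.parquet.small.file.limit` and `hoodie.parquet.max.file.size`). The names `MiB` and `sizingOpts` are illustrative only and do not appear in the original application.

```scala
// Illustrative restatement of the two sizing options; not the reporter's code.
// MiB and sizingOpts are hypothetical names introduced for this sketch.
val MiB: Long = 1024L * 1024L

val sizingOpts: Map[String, String] = Map(
  // Base files below this size are treated as "small" and are candidates
  // to receive new inserts on later writes.
  "hoodie.parquet.small.file.limit" -> (100 * MiB).toString, // 104857600
  // Target upper bound for a Parquet base file.
  "hoodie.parquet.max.file.size" -> (120 * MiB).toString // 125829120
)
```

A map like this would typically be merged into the writer options alongside the rest of the Hudi config.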
   
   However, when I run the application, the Parquet files are created with sizes smaller than the configured small file limit.
   
   Here is the complete Hudi config we are using in the application.
   
   HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key() -> String.valueOf(104857600),
   HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key() -> String.valueOf(125829120),
   HoodieCompactionConfig.INLINE_COMPACT_TRIGGER_STRATEGY.key() -> CompactionTriggerStrategy.TIME_ELAPSED.name,
   HoodieCompactionConfig.INLINE_COMPACT_TIME_DELTA_SECONDS.key() -> String.valueOf(60 * 60),
   HoodieCompactionConfig.CLEANER_POLICY.key() -> HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name(),
   HoodieCompactionConfig.CLEANER_COMMITS_RETAINED.key() -> "936",
   HoodieCompactionConfig.MIN_COMMITS_TO_KEEP.key() -> "937",
   HoodieCompactionConfig.MAX_COMMITS_TO_KEEP.key() -> "960",
   HoodieCompactionConfig.ASYNC_CLEAN.key() -> "false",
   HoodieCompactionConfig.INLINE_COMPACT.key() -> "true",
   HoodieMetricsConfig.TURN_METRICS_ON.key() -> "true",
   HoodieMetricsConfig.METRICS_REPORTER_TYPE_VALUE.key() -> MetricsReporterType.DATADOG.name(),
   HoodieMetricsDatadogConfig.API_SITE_VALUE.key() -> "US",
   HoodieMetricsDatadogConfig.METRIC_PREFIX_VALUE.key() -> "tacticalnovusingest.hudi",
   HoodieMetricsDatadogConfig.API_KEY_SUPPLIER.key() -> "com.tr.indigo.tacticalnovusingest.utils.DatadogKeySupplier",
   HoodieMetadataConfig.ENABLE.key() -> "false",
   HoodieWriteConfig.ROLLBACK_USING_MARKERS_ENABLE.key() -> "false",
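
To put the retention and compaction numbers in this config into wall-clock terms, here is a small arithmetic sketch. It assumes one commit per 5-minute batch, which the report does not state explicitly. All variable names are illustrative.

```scala
// Rough arithmetic only; assumes exactly one Hudi commit per streaming batch.
val batchIntervalMinutes = 5          // batch interval stated in the report
val cleanerCommitsRetained = 936      // CLEANER_COMMITS_RETAINED from the config
val compactionDeltaSeconds = 60 * 60  // INLINE_COMPACT_TIME_DELTA_SECONDS from the config

// Hours of history covered by the retained commits: 936 * 5 / 60 = 78.0
val retainedHours: Double = cleanerCommitsRetained * batchIntervalMinutes / 60.0

// Number of batches that fit in one compaction time window: 3600 / 300 = 12
val batchesPerCompactionWindow: Int = compactionDeltaSeconds / (batchIntervalMinutes * 60)
```

Under that assumption, the cleaner keeps roughly 78 hours of commits, and the time-based compaction trigger spans about 12 batches.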
   
   
   The Parquet files that were created are shown below.
   
   
![image](https://user-images.githubusercontent.com/112955571/217815538-3ba9b42d-4ea6-40b2-b571-32eb5146fe26.png)
   
   How can we avoid creating small files?
   
   @koochiswathiTR is my teammate, in case you need more info.
   
   We appreciate all the help you provide.
   
   Thanks, JK
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
