Cpandey43 opened a new issue, #10183:
URL: https://github.com/apache/hudi/issues/10183
**Describe the problem you faced**
**Issue 1**
I configured the job with async compaction, async clustering, and async cleaning, but none of these services behaves as configured:
async compaction - never runs
async clustering - never runs
async cleaning - runs inline after every commit instead of asynchronously
**Issue 2**
The configured table type is MOR with the upsert operation, but Hudi is not creating any .log files in the partitions and is producing parquet files of at most 18 MB. The application created two partitions, and both contain files no larger than 18 MB. The contents of both partitions are attached below for reference.
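The manual check behind Issue 2 (no delta log files, small base files) can be scripted against a partition listing. A minimal sketch with a hypothetical helper; the file names and sizes below are made up for illustration, shaped like the attached partition contents:

```python
def partition_summary(files):
    """Summarize a Hudi partition listing as (has_log_files, max_parquet_mb).

    Hypothetical helper mirroring the manual check in this report;
    `files` is a list of (name, size_in_bytes) pairs. MOR delta log
    files carry ".log." in their names.
    """
    has_logs = any(".log." in name for name, _ in files)
    parquet_sizes = [size for name, size in files if name.endswith(".parquet")]
    max_mb = max(parquet_sizes, default=0) / (1024 * 1024)
    return has_logs, max_mb

# Illustrative listing: only base parquet files around 18 MB, no .log.* files.
files = [
    ("f1-0_0-1-1_20231121.parquet", 18 * 1024 * 1024),
    ("f2-0_0-2-2_20231121.parquet", 17 * 1024 * 1024),
]
has_logs, max_mb = partition_summary(files)
```

For a healthy MOR upsert workload one would expect `has_logs` to be true once updates arrive; here it stays false.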
**Environment Description**
* Hudi version : 0.13.1
* Spark version : 3.2.4
* Storage (HDFS/S3/GCS..) : MinIO as an S3-compatible store
* Running on k8s : yes, using the Spark Operator
**Hudi conf used in spark application**
```scala
df.write
.format("org.apache.hudi")
// Write Config
.option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
.option("hoodie.datasource.write.precombine.field", "ts")
.option("hoodie.datasource.write.recordkey.field", "recordkey")
.option("hoodie.datasource.write.partitionpath.field", "date")
.option("hoodie.datasource.write.table.name", "spark_streaming")
.option("hoodie.table.name", "spark_streaming")
.option("hoodie.datasource.write.operation", "upsert")
.option("hoodie.merge.small.file.group.candidates.limit", "1")
// Hive Sync
.option("hoodie.datasource.hive_sync.mode", "hms")
.option("hoodie.datasource.hive_sync.metastore.uris",
"thrift://hive-metastore:9090")
.option("hoodie.datasource.hive_sync.database", "hudi_test")
.option("hoodie.datasource.hive_sync.table", "test_table")
.option("hoodie.datasource.hive_sync.partition_fields", "date")
.option("hoodie.datasource.hive_sync.enable", "true")
// Compaction
.option("hoodie.compact.inline", "false")
.option("hoodie.compact.inline.max.delta.commits", "6")
.option("hoodie.datasource.compaction.async.enable", "true")
.option("hoodie.parquet.small.file.limit", "104857600")
// Clustering
.option("hoodie.clustering.async.enabled", "true")
.option("hoodie.clustering.async.max.commits", "1")
.option("hoodie.clustering.execution.strategy.class",
"org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy")
// Cleaning
.option("hoodie.clean.async", "true")
.option("hoodie.cleaner.commits.retained", "3")
// Archive
.option("hoodie.archive.async", "true")
// Index
.option("hoodie.index.type", "BLOOM")
// Payload
.option("hoodie.payload.event.time.field", "ts")
.option("hoodie.payload.ordering.field", "ts")
// KeyGenerator
.option("hoodie.datasource.write.hive_style_partitioning", "true")
.option("hoodie.datasource.write.keygenerator.class",
"org.apache.hudi.keygen.SimpleKeyGenerator")
// Marker
.option("hoodie.rollback.using.markers", "true")
// Multi-writes
.option("hoodie.write.concurrency.mode",
"optimistic_concurrency_control")
.option("hoodie.write.lock.provider",
"org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
.option("hoodie.write.lock.zookeeper.url", "zookeeper")
.option("hoodie.write.lock.zookeeper.port", "2181")
.option("hoodie.write.lock.zookeeper.base_path", "/hoodie")
.option("hoodie.cleaner.policy.failed.writes", "LAZY")
.option("hoodie.write.lock.zookeeper.lock_key", "lock_v1")
.option("hoodie.write.lock.zookeeper.connection_timeout_ms", "30000")
.option("hoodie.write.lock.wait_time_ms", "600000")
// Multi-Modal Index
.option("hoodie.metadata.index.bloom.filter.enable", "true")
.option("hoodie.metadata.index.column.stats.enable", "true")
// MetaData
.option("hoodie.metadata.enable", "true")
// Data Skipping
.option("hoodie.enable.data.skipping", "true")
// Storage
.option("hoodie.parquet.max.file.size", "125829120")
.option("hoodie.logfile.max.size", "1073741824")
.option("hoodie.logfile.to.parquet.compression.ratio", "0.35")
```
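To substantiate "not working at all" for compaction and clustering, the `.hoodie` timeline listing (hoodie.txt above) can be summarized by instant type. A sketch, assuming the usual `<timestamp>.<action>[.state]` naming of timeline files; the listing below is made up for illustration and mirrors what the attachments show (only deltacommits and cleans, no compaction or replacecommit instants):

```python
from collections import Counter

def summarize_timeline(instant_files):
    """Count Hudi timeline actions by instant-file suffix.

    Hypothetical helper for eyeballing hoodie.txt: instants are named
    `<timestamp>.<action>` (e.g. `20231121202700000.deltacommit`), with
    `.inflight` / `.requested` suffixes for in-progress ones.
    """
    counts = Counter()
    for name in instant_files:
        parts = name.split(".")
        if not parts[0].isdigit():
            continue  # skip non-instant files such as hoodie.properties
        counts[".".join(parts[1:])] += 1
    return counts

# Illustrative listing shaped like the attached hoodie.txt:
listing = [
    "20231121202700000.deltacommit",
    "20231121202700000.deltacommit.inflight",
    "20231121202800000.clean",
    "hoodie.properties",
]
summary = summarize_timeline(listing)
```

If async compaction or clustering had ever been scheduled, `compaction.requested` or `replacecommit.requested` entries would appear in the summary.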
**hoodie.properties file content**
```properties
#Updated at 2023-11-21T20:26:57.745347Z
#Tue Nov 21 20:26:57 UTC 2023
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
hoodie.table.type=MERGE_ON_READ
hoodie.table.metadata.partitions=bloom_filters,column_stats,files
hoodie.table.precombine.field=ts
hoodie.table.partition.fields=date
hoodie.archivelog.folder=archived
hoodie.table.cdc.enabled=false
hoodie.timeline.layout.version=1
hoodie.table.checksum=3323849328
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.timeline.timezone=LOCAL
hoodie.table.name=spark_streaming
hoodie.table.recordkey.fields=recordkey
hoodie.compaction.record.merger.strategy=eeb8d96f-b1e4-49fd-bbf8-28ac514178e5
hoodie.datasource.write.hive_style_partitioning=true
hoodie.partition.metafile.use.base.format=false
hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.populate.meta.fields=true
hoodie.table.base.file.format=PARQUET
hoodie.database.name=
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.version=5
```
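For scripted checks, key=value content like the hoodie.properties dump above can be loaded with a small parser. A minimal sketch (full Java `.properties` escaping is ignored for simplicity; the sample text is an abridged copy of the dump):

```python
def parse_hoodie_properties(text):
    """Parse simple Java-style key=value lines into a dict.

    Comment lines (#) and blank lines are skipped; values may be empty.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

sample = """\
#Updated at 2023-11-21T20:26:57.745347Z
hoodie.table.type=MERGE_ON_READ
hoodie.table.version=5
hoodie.database.name=
"""
props = parse_hoodie_properties(sample)
```

This confirms, for instance, that the table really was created as MERGE_ON_READ despite the missing log files.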
**Contents of the .hoodie folder are captured in the attached file hoodie.txt:**
[hoodie.txt](https://github.com/apache/hudi/files/13479589/hoodie.txt)
**A recursive listing of all folders under .hoodie is captured in the attached file hoodie-recursive.txt:**
[hoodie-recursive.txt](https://github.com/apache/hudi/files/13479587/hoodie-recursive.txt)
**The application processed two days of data (1st and 2nd November). The contents of both partitions are attached below:**
[date=2023-11-01.txt](https://github.com/apache/hudi/files/13479585/date.2023-11-01.txt)
[date=2023-11-02.txt](https://github.com/apache/hudi/files/13479586/date.2023-11-02.txt)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]