Cpandey43 opened a new issue, #10183:
URL: https://github.com/apache/hudi/issues/10183
**Describe the problem you faced**
**Issue 1**
I configured the job with async compaction, async clustering, and async cleaning, but none of these services behaves as configured:
async compaction - never runs
async clustering - never runs
async cleaning - runs inline after every commit instead of asynchronously
**Issue 2**
The configured table type is MOR with the upsert operation, but Hudi is not creating any .log files in the partitions and is producing parquet files of at most 18 MB. The application created two partitions, and both contain files no larger than 18 MB. The contents of both partitions are attached below for reference.
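The manual check behind Issue 2 (no delta log files, small base files) can be scripted against a partition listing. A minimal sketch with a hypothetical helper; the file names and sizes below are made up for illustration, shaped like the attached partition contents:

```python
def partition_summary(files):
    """Summarize a Hudi partition listing as (has_log_files, max_parquet_mb).

    Hypothetical helper mirroring the manual check in this report;
    `files` is a list of (name, size_in_bytes) pairs. MOR delta log
    files carry ".log." in their names.
    """
    has_logs = any(".log." in name for name, _ in files)
    parquet_sizes = [size for name, size in files if name.endswith(".parquet")]
    max_mb = max(parquet_sizes, default=0) / (1024 * 1024)
    return has_logs, max_mb

# Illustrative listing: only base parquet files around 18 MB, no .log.* files.
files = [
    ("f1-0_0-1-1_20231121.parquet", 18 * 1024 * 1024),
    ("f2-0_0-2-2_20231121.parquet", 17 * 1024 * 1024),
]
has_logs, max_mb = partition_summary(files)
```

For a healthy MOR upsert workload one would expect `has_logs` to be true once updates arrive; here it stays false.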
**Environment Description**
* Hudi version : 0.13.1
* Spark version : 3.2.4
* Storage (HDFS/S3/GCS..) : MinIO as an S3-compatible store
* Running on k8s : yes, using the Spark Operator
**Hudi conf used in spark application**
```scala
df.write
.format("org.apache.hudi")
// Write Config
.option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
.option("hoodie.datasource.write.precombine.field", "ts")
.option("hoodie.datasource.write.recordkey.field", "recordkey")
.option("hoodie.datasource.write.partitionpath.field", "date")
.option("hoodie.datasource.write.table.name", "spark_streaming")
.option("hoodie.table.name", "spark_streaming")
.option("hoodie.datasource.write.operation", "upsert")
.option("hoodie.merge.small.file.group.candidates.limit", "1")
// Hive Sync
.option("hoodie.datasource.hive_sync.mode", "hms")
.option("hoodie.datasource.hive_sync.metastore.uris",
"thrift://hive-metastore:9090")
.option("hoodie.datasource.hive_sync.database", "hudi_test")
.option("hoodie.datasource.hive_sync.table", "test_table")
.option("hoodie.datasource.hive_sync.partition_fields", "date")
.option("hoodie.datasource.hive_sync.enable", "true")
// Compaction
.option("hoodie.compact.inline", "false")
.option("hoodie.compact.inline.max.delta.commits", "6")
.option("hoodie.datasource.compaction.async.enable", "true")
.option("hoodie.parquet.small.file.limit", "104857600")
// Clustering
.option("hoodie.clustering.async.enabled", "true")
.option("hoodie.clustering.async.max.commits", "1")
.option("hoodie.clustering.execution.strategy.class",
"org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy")
// Cleaning
.option("hoodie.clean.async", "true")
.option("hoodie.cleaner.commits.retained", "3")
// Archive
.option("hoodie.archive.async", "true")
// Index
.option("hoodie.index.type", "BLOOM")
// Payload
.option("hoodie.payload.event.time.field", "ts")
.option("hoodie.payload.ordering.field", "ts")
// KeyGenerator
.option("hoodie.datasource.write.hive_style_partitioning", "true")
.option("hoodie.datasource.write.keygenerator.class",
"org.apache.hudi.keygen.SimpleKeyGenerator")
// Marker
.option("hoodie.rollback.using.markers", "true")
// Multi-writes
.option("hoodie.write.concurrency.mode",
"optimistic_concurrency_control")
.option("hoodie.write.lock.provider",
"org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
.option("hoodie.write.lock.zookeeper.url", "zookeeper")
.option("hoodie.write.lock.zookeeper.port", "2181")
.option("hoodie.write.lock.zookeeper.base_path", "/hoodie")
.option("hoodie.cleaner.policy.failed.writes", "LAZY")
.option("hoodie.write.lock.zookeeper.lock_key", "lock_v1")
.option("hoodie.write.lock.zookeeper.connection_timeout_ms", "30000")
.option("hoodie.write.lock.wait_time_ms", "600000")
// Multi-Modal Index
.option("hoodie.metadata.index.bloom.filter.enable", "true")
.option("hoodie.metadata.index.column.stats.enable", "true")
// MetaData
.option("hoodie.metadata.enable", "true")
// Data Skipping
.option("hoodie.enable.data.skipping", "true")
// Storage
.option("hoodie.parquet.max.file.size", "125829120")
.option("hoodie.logfile.max.size", "1073741824")
.option("hoodie.logfile.to.parquet.compression.ratio", "0.35")
```
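To substantiate "not working at all" for compaction and clustering, the `.hoodie` timeline listing (hoodie.txt above) can be summarized by instant type. A sketch, assuming the usual `<timestamp>.<action>[.state]` naming of timeline files; the listing below is made up for illustration and mirrors what the attachments show (only deltacommits and cleans, no compaction or replacecommit instants):

```python
from collections import Counter

def summarize_timeline(instant_files):
    """Count Hudi timeline actions by instant-file suffix.

    Hypothetical helper for eyeballing hoodie.txt: instants are named
    `<timestamp>.<action>` (e.g. `20231121202700000.deltacommit`), with
    `.inflight` / `.requested` suffixes for in-progress ones.
    """
    counts = Counter()
    for name in instant_files:
        parts = name.split(".")
        if not parts[0].isdigit():
            continue  # skip non-instant files such as hoodie.properties
        counts[".".join(parts[1:])] += 1
    return counts

# Illustrative listing shaped like the attached hoodie.txt:
listing = [
    "20231121202700000.deltacommit",
    "20231121202700000.deltacommit.inflight",
    "20231121202800000.clean",
    "hoodie.properties",
]
summary = summarize_timeline(listing)
```

If async compaction or clustering had ever been scheduled, `compaction.requested` or `replacecommit.requested` entries would appear in the summary.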
**hoodie.properties file content**
```properties
#Updated at 2023-11-21T20:26:57.745347Z
#Tue Nov 21 20:26:57 UTC 2023
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
hoodie.table.type=MERGE_ON_READ
hoodie.table.metadata.partitions=bloom_filters,column_stats,files
hoodie.table.precombine.field=ts
hoodie.table.partition.fields=date
hoodie.archivelog.folder=archived
hoodie.table.cdc.enabled=false
hoodie.timeline.layout.version=1
hoodie.table.checksum=3323849328
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.timeline.timezone=LOCAL
hoodie.table.name=spark_streaming
hoodie.table.recordkey.fields=recordkey
hoodie.compaction.record.merger.strategy=eeb8d96f-b1e4-49fd-bbf8-28ac514178e5
hoodie.datasource.write.hive_style_partitioning=true
hoodie.partition.metafile.use.base.format=false
hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.populate.meta.fields=true
hoodie.table.base.file.format=PARQUET
hoodie.database.name=
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.version=5
```
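For scripted checks, key=value content like the hoodie.properties dump above can be loaded with a small parser. A minimal sketch (full Java `.properties` escaping is ignored for simplicity; the sample text is an abridged copy of the dump):

```python
def parse_hoodie_properties(text):
    """Parse simple Java-style key=value lines into a dict.

    Comment lines (#) and blank lines are skipped; values may be empty.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

sample = """\
#Updated at 2023-11-21T20:26:57.745347Z
hoodie.table.type=MERGE_ON_READ
hoodie.table.version=5
hoodie.database.name=
"""
props = parse_hoodie_properties(sample)
```

This confirms, for instance, that the table really was created as MERGE_ON_READ despite the missing log files.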
**Contents of the .hoodie folder are captured in the attached file hoodie.txt:**
[hoodie.txt](https://github.com/apache/hudi/files/13479589/hoodie.txt)
**A recursive listing of all folders under .hoodie is captured in the attached file hoodie-recursive.txt:**
[hoodie-recursive.txt](https://github.com/apache/hudi/files/13479587/hoodie-recursive.txt)
**The application processed two days of data (1st and 2nd November). The contents of both partitions are attached below:**
[date=2023-11-01.txt](https://github.com/apache/hudi/files/13479585/date.2023-11-01.txt)
[date=2023-11-02.txt](https://github.com/apache/hudi/files/13479586/date.2023-11-02.txt)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]