ksrihari93 opened a new issue, #5822:
URL: https://github.com/apache/hudi/issues/5822
**Describe the problem you faced**

Hudi clustering is not working.

I'm running the Hudi DeltaStreamer in continuous mode with a Kafka source. The Kafka topic has 120 partitions and the ingestion rate is ~200k records per minute. We use BULK_INSERT to write to the target location, which generates a lot of small files. To overcome this small-file problem we enabled Hudi clustering, but we can see that the files are still not being merged.
Configuration for the job:

```properties
#base properties
hoodie.insert.shuffle.parallelism=50
hoodie.bulkinsert.shuffle.parallelism=200
hoodie.embed.timeline.server=true
hoodie.filesystem.view.type=EMBEDDED_KV_STORE
hoodie.compact.inline=false
hoodie.bulkinsert.sort.mode=none

#cleaner properties
hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
hoodie.cleaner.fileversions.retained=60
hoodie.clean.async=true

#archival
hoodie.keep.min.commits=12
hoodie.keep.max.commits=15

#datasource properties
hoodie.deltastreamer.schemaprovider.registry.url=
hoodie.datasource.write.recordkey.field=
hoodie.deltastreamer.source.kafka.topic=
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
hoodie.datasource.write.partitionpath.field=timestamp:TIMESTAMP
hoodie.deltastreamer.kafka.source.maxEvents=600000000
hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
hoodie.deltastreamer.keygen.timebased.input.timezone=UTC
hoodie.deltastreamer.keygen.timebased.output.timezone=UTC
hoodie.deltastreamer.keygen.timebased.output.dateformat='dt='yyyy-MM-dd

#clustering properties
hoodie.clustering.async.enabled=true
hoodie.clustering.plan.strategy.target.file.max.bytes=3000000000
hoodie.clustering.plan.strategy.small.file.limit=200000001
hoodie.clustering.async.max.commits=1
hoodie.clustering.plan.strategy.max.num.groups=10

#kafka props
bootstrap.servers=
schema.registry.url=
```
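For context on how these clustering properties interact, here is an illustrative sketch (not Hudi's actual implementation) of what the size-based plan strategy does with them: files smaller than `small.file.limit` are clustering candidates, candidates are packed into groups up to `target.file.max.bytes`, and the plan is capped at `max.num.groups` groups per clustering instant.

```python
# Illustrative sketch only -- not Hudi's real planner code. Constants mirror
# the values from the job configuration above.
SMALL_FILE_LIMIT = 200_000_001          # hoodie.clustering.plan.strategy.small.file.limit
TARGET_FILE_MAX_BYTES = 3_000_000_000   # hoodie.clustering.plan.strategy.target.file.max.bytes
MAX_NUM_GROUPS = 10                     # hoodie.clustering.plan.strategy.max.num.groups

def plan_clustering(file_sizes):
    """Group small files into clustering groups, mimicking the size-based strategy."""
    # Only files below the small-file limit are eligible for clustering.
    candidates = [s for s in file_sizes if s < SMALL_FILE_LIMIT]
    groups, current, current_size = [], [], 0
    for size in candidates:
        # Start a new group once adding this file would exceed the target size.
        if current and current_size + size > TARGET_FILE_MAX_BYTES:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    # The plan never contains more than max.num.groups groups.
    return groups[:MAX_NUM_GROUPS]

# Example: twenty 50 MB files all fit within one 3 GB target group.
print(len(plan_clustering([50_000_000] * 20)))  # -> 1
```

Note that with this strategy a file at or above the 200 MB small-file limit is never picked up for clustering, so only the small bulk-insert outputs should be candidates.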
DeltaStreamer class arguments:

```yaml
- "--table-type"
- "COPY_ON_WRITE"
- "--props"
- "/opt/spark/hudi/config/source.properties"
- "--schemaprovider-class"
- "org.apache.hudi.utilities.schema.SchemaRegistryProvider"
- "--source-class"
- "org.apache.hudi.utilities.sources.JsonKafkaSource"
- "--target-base-path"
- ""
- "--target-table"
- ""
- "--op"
- "BULK_INSERT"
- "--source-ordering-field"
- "timestamp"
- "--continuous"
- "--min-sync-interval-seconds"
- "60"
```
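For reference, the argument list above corresponds roughly to a `spark-submit` launch like the following. This is a sketch only: the bundle jar path/version, base path, and table name are placeholders, not taken from the report.

```shell
# Sketch of the equivalent launch command -- jar path and <...> values are
# placeholders, assumed for illustration.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  /opt/spark/jars/hudi-utilities-bundle_2.11-0.9.0.jar \
  --table-type COPY_ON_WRITE \
  --props /opt/spark/hudi/config/source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --target-base-path <base-path> \
  --target-table <table-name> \
  --op BULK_INSERT \
  --source-ordering-field timestamp \
  --continuous \
  --min-sync-interval-seconds 60
```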
**Environment Description**

* Hudi version : 0.9
* Spark version : 2.4.4
* Storage (HDFS/S3/GCS..) : BLOB
* Running on Docker? (yes/no) : no (running on Kubernetes)
**Stacktrace**

```
22/06/09 22:01:36 INFO ClusteringUtils: Found 0 files in pending clustering operations
22/06/09 22:11:07 INFO ClusteringUtils: Found 0 files in pending clustering operations
22/06/09 22:11:07 INFO RocksDbBasedFileSystemView: Resetting file groups in pending clustering to ROCKSDB based file-system view at /tmp/hoodie_timeline_rocksdb, Total file-groups=0
```
--
This is an automated message from the Apache Git Service.