kepplertreet opened a new issue, #9361: URL: https://github.com/apache/hudi/issues/9361
Hi. I'm using a Spark Structured Streaming application running on EMR-6.11.0 to write into a Hudi MOR table.

Hudi version: 0.13.0
Spark version: 3.3.2

```
'hoodie.table.name': <table_name>,
'hoodie.datasource.write.recordkey.field': <column_name>,
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
'hoodie.datasource.write.table.type': "MERGE_ON_READ",
'hoodie.datasource.write.partitionpath.field': <year_month>,
'hoodie.datasource.write.table.name': <table_name>,
'hoodie.datasource.write.precombine.field': <commit_time_ms>,
"hoodie.table.version": 5,
"hoodie.datasource.write.commitmeta.key.prefix": "_",
"hoodie.datasource.write.hive_style_partitioning": 'true',
"hoodie.datasource.meta.sync.enable": 'false',
"hoodie.datasource.hive_sync.enable": 'true',
"hoodie.datasource.hive_sync.auto_create_database": 'true',
"hoodie.datasource.hive_sync.skip_ro_suffix": 'true',
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.parquet.small.file.limit": 125217728,
"hoodie.parquet.max.file.size": 134217728,

# Compaction Configs
"hoodie.compact.inline": "false",
"hoodie.compact.schedule.inline": "false",
"hoodie.datasource.compaction.async.enable": "true",
"hoodie.compact.inline.trigger.strategy": "NUM_COMMITS",
"hoodie.compact.inline.max.delta.commits": 3,

# --- Cleaner Configs ----
"hoodie.clean.automatic": 'true',
"hoodie.clean.async": 'true',
"hoodie.cleaner.policy.failed.writes": "LAZY",
"hoodie.clean.trigger.strategy": "NUM_COMMITS",
"hoodie.clean.max.commits": 7,
"hoodie.cleaner.commits.retained": 3,
"hoodie.cleaner.fileversions.retained": 1,
"hoodie.cleaner.hours.retained": 1,
"hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
"hoodie.parquet.compression.codec": "snappy",
"hoodie.embed.timeline.server": 'true',
"hoodie.embed.timeline.server.async": 'false',
"hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
"hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider",
"hoodie.index.type": "BLOOM",
"hoodie.datasource.write.streaming.checkpoint.identifier": <streaming_app_identifier>,

# Metadata Configs
"hoodie.metadata.enable": 'true',
"hoodie.bloom.index.use.metadata": 'true',
"hoodie.metadata.index.async": 'false',
"hoodie.metadata.clean.async": 'true',
"hoodie.metadata.index.bloom.filter.enable": 'true',
"hoodie.metadata.index.column.stats.enable": 'true',
"hoodie.metadata.index.bloom.filter.column.list": <record_key_field>,
"hoodie.metadata.index.column.stats.column.list": <record_key_field>,
"hoodie.metadata.metrics.enable": 'true',
"hoodie.keep.max.commits": 50,
"hoodie.archive.async": 'true',
"hoodie.archive.merge.enable": 'false',
"hoodie.archive.beyond.savepoint": 'true',
"hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
"hoodie.cleaner.hours.retained": 1
```

Issues faced:

As the configs show, we have OCC and the metadata table enabled. My concern is that I never see log files being written into the main table, so compaction is never scheduled or triggered for it; all incoming data is written directly into parquet files. The metadata table's timeline, however, does show compaction being scheduled and executed, so a **commit** is reflected there.

- Is this normal, expected behaviour?
- Does Hudi internally weigh the cost of writing log files and later compacting them against writing the data directly to parquet, and choose whichever is cheaper?
- Is there some defined threshold on ingress batches beyond which Hudi writes data into log files?

Thanks

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
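As an aside: the options above look like a Python dict literal, and a Python dict literal silently keeps only the *last* occurrence of a duplicate key. Both `hoodie.cleaner.policy` (set to `KEEP_LATEST_COMMITS` and again to `KEEP_LATEST_BY_HOURS`) and `hoodie.cleaner.hours.retained` appear twice, so only the later values take effect. A minimal sketch of this Python behaviour (key names copied from the configs above; unrelated to the log-file question itself):

```python
# Duplicate keys in a Python dict literal: the last value silently wins.
hudi_options = {
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",   # earlier entry
    "hoodie.cleaner.hours.retained": 1,
    # ... many other options elided ...
    "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",  # this one is kept
    "hoodie.cleaner.hours.retained": 1,
}

print(hudi_options["hoodie.cleaner.policy"])  # -> KEEP_LATEST_BY_HOURS
print(len(hudi_options))                      # -> 2 (duplicates collapsed)
```

Worth deduplicating regardless of the question above, since the effective cleaner policy may not be the one intended.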
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
