kepplertreet opened a new issue, #9361:
URL: https://github.com/apache/hudi/issues/9361

   Hi.
   
   I'm using a Spark Structured Streaming application running on EMR-6.11.0 to 
write into a Hudi MOR table. 
   
   Hudi version: 0.13.0
   Spark version: 3.3.2
   ``` 
   'hoodie.table.name': <table_name>,
   'hoodie.datasource.write.recordkey.field': <column_name> ,
   'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
   'hoodie.datasource.write.table.type': "MERGE_ON_READ",
   'hoodie.datasource.write.partitionpath.field': <year_month>,
   'hoodie.datasource.write.table.name': <table_name>,
   'hoodie.datasource.write.precombine.field': <commit_time_ms>,
   "hoodie.table.version": 5,
   "hoodie.datasource.write.commitmeta.key.prefix": "_",
   "hoodie.datasource.write.hive_style_partitioning": 'true',
   "hoodie.datasource.meta.sync.enable": 'false',
   "hoodie.datasource.hive_sync.enable": 'true',
   "hoodie.datasource.hive_sync.auto_create_database": 'true',
   "hoodie.datasource.hive_sync.skip_ro_suffix": 'true',
   "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
   "hoodie.parquet.small.file.limit": 125217728,
   "hoodie.parquet.max.file.size": 134217728,
   
   # Compaction Configs 
   "hoodie.compact.inline" : "false", 
   "hoodie.compact.schedule.inline" : "false", 
   "hoodie.datasource.compaction.async.enable": "true",
   "hoodie.compact.inline.trigger.strategy": "NUM_COMMITS",
   "hoodie.compact.inline.max.delta.commits": 3,
   
   # --- Cleaner Configs ---- 
   "hoodie.clean.automatic": 'true',
   "hoodie.clean.async": 'true',
   "hoodie.cleaner.policy.failed.writes": "LAZY",
   "hoodie.clean.trigger.strategy" : "NUM_COMMITS", 
   "hoodie.clean.max.commits" : 7, 
   "hoodie.cleaner.commits.retained" : 3, 
   "hoodie.cleaner.fileversions.retained": 1, 
   "hoodie.cleaner.hours.retained": 1, 
   "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
   
   "hoodie.parquet.compression.codec": "snappy",
   "hoodie.embed.timeline.server": 'true',
   "hoodie.embed.timeline.server.async": 'false',
   "hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
   "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider",
   "hoodie.index.type": "BLOOM",
   "hoodie.datasource.write.streaming.checkpoint.identifier": <streaming_app_identifier>,
   
   # Metadata Configs 
   "hoodie.metadata.enable": 'true',
   "hoodie.bloom.index.use.metadata": 'true',
   "hoodie.metadata.index.async": 'false',
   "hoodie.metadata.clean.async": 'true',
   "hoodie.metadata.index.bloom.filter.enable": 'true',
   "hoodie.metadata.index.column.stats.enable" : 'true', 
   "hoodie.metadata.index.bloom.filter.column.list": <record_key_field>, 
   "hoodie.metadata.index.column.stats.column.list" : <record_key_field>,
   "hoodie.metadata.metrics.enable": 'true', 
   
   "hoodie.keep.max.commits": 50,
   "hoodie.archive.async": 'true',
   "hoodie.archive.merge.enable": 'false',
   "hoodie.archive.beyond.savepoint": 'true',
   "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
   "hoodie.cleaner.hours.retained": 1
   ```
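   One thing worth noting about the options as pasted: `hoodie.cleaner.policy` and `hoodie.cleaner.hours.retained` each appear twice in the dict. Since these are Python dict literals, the later entry silently overwrites the earlier one, so only the second value reaches the Hudi writer. A minimal sketch (the option values are taken from the config above, everything else is trimmed for illustration):

```python
# In a Python dict literal, a repeated key silently keeps the *last* value.
# The options above set "hoodie.cleaner.policy" twice, so the writer only
# ever sees the second setting.
hudi_options = {
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    # ... other options elided ...
    "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
}

print(hudi_options["hoodie.cleaner.policy"])  # KEEP_LATEST_BY_HOURS
print(len(hudi_options))                      # 1 -- duplicates collapse
```

   If the intent was `KEEP_LATEST_COMMITS` with `hoodie.cleaner.commits.retained`, the duplicate entries should be removed.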
   Issue faced: As the configs show, we have OCC and the metadata table enabled 
for this table. 
   
   My only concern for now is that I never see log files being written into the 
main table, so a compaction is never scheduled nor triggered for the main 
table, i.e. all incoming data is written directly into parquet files. The 
metadata table's timeline, on the other hand, does show compactions being 
scheduled and executed, so a **commit** is reflected in that timeline. 
   
   Is this normal, expected behaviour? Is Hudi internally weighing the trade-off 
between the cost of writing log files and then compacting them versus writing 
the data directly to parquet, and choosing whichever turns out less expensive? 
Is there some defined threshold for incoming batches beyond which Hudi starts 
writing data into log files?
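   
   For context on what I'd expect to happen: my understanding (hedged, this is an 
illustrative model of the write path, not Hudi internals) is that with a 
`BLOOM` index on a MOR table, incoming records are first tagged against 
existing file groups; records that match an existing key (updates) are routed 
to delta log files, while unmatched records (inserts) are written directly as 
base parquet files. Under that model, an effectively append-only stream would 
never produce log files, and compaction would have nothing to do. The 
`route_records` helper below is hypothetical, purely to illustrate the routing:

```python
# Illustrative model only (not Hudi's actual code): with a Bloom index on a
# MOR table, tagged records (updates) go to delta log files, while untagged
# records (inserts) are written directly to base parquet files.
def route_records(incoming_keys, existing_keys):
    routed = {"log": [], "parquet": []}
    for key in incoming_keys:
        if key in existing_keys:
            routed["log"].append(key)      # update -> appended to a log file
        else:
            routed["parquet"].append(key)  # insert -> new base parquet file
    return routed

# An all-new-keys batch produces no log writes at all:
print(route_records(["k3", "k4"], existing_keys={"k1", "k2"}))
# {'log': [], 'parquet': ['k3', 'k4']}
```

   If that model is right, the absence of log files would simply mean our stream 
is not producing updates to existing record keys.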
   
   
   Thanks  
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
