pravin1406 opened a new issue, #17643:
URL: https://github.com/apache/hudi/issues/17643

   We are consistently hitting executor OOM during upsert on a large Hudi MOR 
table, specifically in the Global Simple Index reconcile path. We would like to 
understand whether this is a known scalability limitation or if there are 
recommended configurations / architectural patterns for this workload.
   
   ⸻
   
   Environment
        •       Hudi version: 0.14.1
        •       Spark version: 3.3.2
        •       Java: 11 (G1GC)
        •       Storage: S3
        •       Source: Kafka (120 partitions)
        •       Table type: MERGE_ON_READ (MOR)
   
   ⸻
   
   Table & Workload Characteristics
        •       Total records already ingested: ~16 billion
        •       Expected daily ingestion: ~750M – 1B records/day (mostly 
upserts)
        •       Primary key (composite):
                •       bill_invoice_row BIGINT
                •       bill_ref_no BIGINT
                •       bill_ref_resets BIGINT
        •       Precombine key: op_ts TIMESTAMP
        •       Partition column: cunumber BIGINT (cardinality < 100)
        •       Index: GLOBAL_SIMPLE
        •       Record payload: DefaultHoodieRecordPayload
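
   For completeness, the key layout above corresponds to settings like the 
following (our reconstruction, not copied from the job; the 
ComplexKeyGenerator line in particular is an assumption for a composite key):

   hoodie.datasource.write.recordkey.field = bill_invoice_row,bill_ref_no,bill_ref_resets
   hoodie.datasource.write.precombine.field = op_ts
   hoodie.datasource.write.partitionpath.field = cunumber
   hoodie.datasource.write.keygenerator.class = org.apache.hudi.keygen.ComplexKeyGenerator
   hoodie.datasource.write.payload.class = org.apache.hudi.common.model.DefaultHoodieRecordPayload
   hoodie.index.type = GLOBAL_SIMPLE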
   
   ⸻
   
   Spark Cluster Configuration (tested)
        •       Executors: 40
        •       Executor cores: 6
        •       Executor memory: 140 GB
        •       Executor memory overhead: 60 GB
        •       Driver memory: 40 GB
   
   Key Spark configs:
   
   spark.io.compression.codec=zstd
   spark.rdd.compress=true
   spark.memory.fraction=0.2
   spark.memory.storageFraction=0.25
   spark.network.timeout=240s
   
   
   ⸻
   
   Hudi Write Configuration (relevant)
   
   hoodie.upsert.shuffle.parallelism = 400 – 10000 (tested full range)
   
   hoodie.schema.on.read.enable = true
   hoodie.datasource.write.reconcile.schema = true
   hoodie.datasource.hive_sync.support_timestamp = true
   
   hoodie.compact.inline = true
   hoodie.compact.inline.trigger.strategy = NUM_OR_TIME
   hoodie.compact.inline.max.delta.seconds = 43200
   hoodie.compact.inline.max.delta.commits = 24
   
   hoodie.cleaner.policy = KEEP_LATEST_BY_HOURS
   hoodie.cleaner.hours.retained = 240
   hoodie.cleaner.fileversions.retained = 240
   hoodie.cleaner.commits.retained = 240
   
   hoodie.keep.min.commits = 360
   hoodie.keep.max.commits = 480
   
   hoodie.simple.index.update.partition.path = true
   
   hoodie.parquet.compression.codec = zstd
   hoodie.parquet.max.file.size = 536870912
   hoodie.parquet.small.file.limit = 134217728
   
   Inline compaction and cleaner are enabled.
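
   Put together, the write path looks roughly like the PySpark sketch below. 
Option values are copied verbatim from the lists above; the table name, base 
path, and the upsert() wrapper are placeholders we introduce for illustration 
and are not from the actual job.

```python
# All hoodie.* values below are verbatim from the configuration listed above,
# except the four entries marked as assumed write-side basics.
hudi_options = {
    "hoodie.upsert.shuffle.parallelism": "4000",  # tested 400-10000
    "hoodie.schema.on.read.enable": "true",
    "hoodie.datasource.write.reconcile.schema": "true",
    "hoodie.datasource.hive_sync.support_timestamp": "true",
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.trigger.strategy": "NUM_OR_TIME",
    "hoodie.compact.inline.max.delta.seconds": "43200",
    "hoodie.compact.inline.max.delta.commits": "24",
    "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
    "hoodie.cleaner.hours.retained": "240",
    "hoodie.cleaner.fileversions.retained": "240",
    "hoodie.cleaner.commits.retained": "240",
    "hoodie.keep.min.commits": "360",
    "hoodie.keep.max.commits": "480",
    "hoodie.simple.index.update.partition.path": "true",
    "hoodie.parquet.compression.codec": "zstd",
    "hoodie.parquet.max.file.size": "536870912",
    "hoodie.parquet.small.file.limit": "134217728",
    # assumed write-side basics; table name is a placeholder
    "hoodie.table.name": "bills_mor",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.index.type": "GLOBAL_SIMPLE",
}

def upsert(df, base_path):
    """Append-mode Hudi upsert; requires a live SparkSession (shape only)."""
    (df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save(base_path))
```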
   
   ⸻
   
   Observed Behavior
        •       Executors consistently OOM during upsert, failing in:
                •       mergeForPartitionUpdatesIfNeeded
                •       tagGlobalLocationBackToRecords
        •       GC logs show:
                •       healthy G1GC behavior
                •       rapid growth of live heap
                •       failure with “To-space exhausted”
        •       OOM occurs even with a very large executor heap + overhead.
        •       Changing hoodie.upsert.shuffle.parallelism (low or high) only 
shifts the failure earlier or later.
        •       Stage retries fail immediately due to executor loss.
   
   This strongly suggests algorithmic memory pressure during Global Simple 
Index reconcile, rather than Spark or GC misconfiguration.
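
   To illustrate the scale (back-of-envelope only; the 200-byte cost per 
tagged record is our assumption, not a measurement), the simple index 
materializes a (key, location) pair for every existing record before joining 
against the incoming batch, while spark.memory.fraction=0.2 leaves only a 
small slice of each heap to the unified execution+storage pool:

```python
# Illustrative arithmetic only; bytes_per_tagged_record is an assumed JVM
# object cost (3 BIGINT key parts + partition path + file id + instant).
existing_records = 16e9        # records already in the table (from the report)
bytes_per_tagged_record = 200  # assumption, not measured
executors = 40                 # from the report
executor_heap_gb = 140         # from the report
memory_fraction = 0.2          # spark.memory.fraction from the report

tagged_bytes_total = existing_records * bytes_per_tagged_record
per_executor_gib = tagged_bytes_total / executors / 2**30
unified_pool_gb = executor_heap_gb * memory_fraction  # ignores reserved memory

print(f"tagged index state per executor if evenly spread: {per_executor_gib:.1f} GiB")
print(f"approx. execution+storage pool per executor: {unified_pool_gb:.0f} GB")
```

Even spread under these assumptions would put ~74 GiB of tagged index state 
per executor against a ~28 GB unified pool, which would match the observed 
behavior: shuffle parallelism changes only move the failure point, they do 
not change the total state that must be held.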
   
   ⸻
   
   Questions
        1.      Is the GLOBAL_SIMPLE index expected to scale for:
                •       multi-billion total records
                •       ~1B daily upserts
                •       low partition cardinality (<100)
                •       MOR tables on object storage (S3)?
        2.      Are there known limitations or issues with:
                •       the Global Simple Index reconcile path (mergeForPartitionUpdatesIfNeeded)
                •       MOR with high upsert cardinality?
        3.      Are there recommended patterns at this scale, such as:
                •       splitting insert and update workloads
                •       disabling inline compaction/cleaner during heavy upsert windows
                •       alternative index types (if a switch is unavoidable)
                •       configs to bound reconcile memory usage?
        4.      Does this behavior improve in Hudi versions newer than 0.14.x?
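
   For question 3, one alternative that could look like the following sketch 
is the record-level index backed by the metadata table (config names are 
based on our reading of the Hudi 0.14 release notes and are unverified on 
this workload):

   hoodie.metadata.enable = true
   hoodie.metadata.record.index.enable = true
   hoodie.index.type = RECORD_INDEX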
   
   
   
   Summary
   
   We have exhausted the following without resolving the executor OOMs:
        •       Spark memory tuning
        •       GC tuning (G1GC)
        •       Shuffle parallelism tuning
        •       Executor sizing
   

