pravin1406 opened a new issue, #17643:
URL: https://github.com/apache/hudi/issues/17643
We are consistently hitting executor OOM during upsert on a large Hudi MOR
table, specifically in the Global Simple Index reconcile path. We would like to
understand whether this is a known scalability limitation or if there are
recommended configurations / architectural patterns for this workload.
⸻
Environment
• Hudi version: 0.14.1
• Spark version: 3.3.2
• Java: 11 (G1GC)
• Storage: S3
• Source: Kafka (120 partitions)
• Table type: MERGE_ON_READ (MOR)
⸻
Table & Workload Characteristics
• Total records already ingested: ~16 billion
• Expected daily ingestion: ~750M – 1B records/day (mostly
upserts)
• Primary key (composite):
• bill_invoice_row BIGINT
• bill_ref_no BIGINT
• bill_ref_resets BIGINT
• Precombine key: op_ts TIMESTAMP
• Partition column: cunumber BIGINT
• Cardinality: < 100
• Index: GLOBAL_SIMPLE
• Record payload: DefaultHoodieRecordPayload
⸻
Spark Cluster Configuration (tested)
• Executors: 40
• Executor cores: 6
• Executor memory: 140 GB
• Executor memory overhead: 60 GB
• Driver memory: 40 GB
Key Spark configs:
spark.io.compression.codec=zstd
spark.rdd.compress=true
spark.memory.fraction=0.2
spark.memory.storageFraction=0.25
spark.network.timeout=240s
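For completeness, here is how the executor sizing and the Spark configs above fit together on a single launch command. This is a sketch only: the entry class and application jar are placeholders (not from our actual job), and the resource values are copied from the sections above.

```shell
# Sketch: cluster sizing + key Spark configs combined on spark-submit.
# com.example.HudiIngestJob and hudi-ingest-job.jar are placeholders.
spark-submit \
  --num-executors 40 \
  --executor-cores 6 \
  --executor-memory 140g \
  --driver-memory 40g \
  --conf spark.executor.memoryOverhead=60g \
  --conf spark.io.compression.codec=zstd \
  --conf spark.rdd.compress=true \
  --conf spark.memory.fraction=0.2 \
  --conf spark.memory.storageFraction=0.25 \
  --conf spark.network.timeout=240s \
  --class com.example.HudiIngestJob \
  hudi-ingest-job.jar
```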
⸻
Hudi Write Configuration (relevant)
hoodie.upsert.shuffle.parallelism = 400 – 10000 (tested full range)
hoodie.schema.on.read.enable = true
hoodie.datasource.write.reconcile.schema = true
hoodie.datasource.hive_sync.support_timestamp = true
hoodie.compact.inline = true
hoodie.compact.inline.trigger.strategy = NUM_OR_TIME
hoodie.compact.inline.max.delta.seconds = 43200
hoodie.compact.inline.max.delta.commits = 24
hoodie.cleaner.policy = KEEP_LATEST_BY_HOURS
hoodie.cleaner.hours.retained = 240
hoodie.cleaner.fileversions.retained = 240
hoodie.cleaner.commits.retained = 240
hoodie.keep.min.commits = 360
hoodie.keep.max.commits = 480
hoodie.simple.index.update.partition.path = true
hoodie.parquet.compression.codec = zstd
hoodie.parquet.max.file.size = 536870912
hoodie.parquet.small.file.limit = 134217728
Inline compaction and cleaner are enabled.
⸻
Observed Behavior
• Executors consistently OOM during upsert, failing in:
• mergeForPartitionUpdatesIfNeeded
• tagGlobalLocationBackToRecords
• GC logs show:
• Healthy G1GC behavior initially
• Rapid growth of the live heap as the stage progresses
• Eventual failure with “To-space exhausted”
REMOVE_MARKER
• OOM occurs even with very large executor heap + overhead.
• Changing hoodie.upsert.shuffle.parallelism (low or high) only
shifts the failure earlier/later.
• Stage retries fail immediately due to executor loss.
This strongly suggests algorithmic memory pressure during Global Simple
Index reconcile, rather than Spark or GC misconfiguration.
⸻
Questions
1. Is GLOBAL_SIMPLE index expected to scale for:
• multi-billion total records
• ~1B daily upserts
• low partition cardinality (<100)
• MOR tables on object storage (S3)?
2. Are there known limitations or issues with:
• Global Simple index reconcile (mergeForPartitionUpdatesIfNeeded)
• MOR + high upsert cardinality?
3. Are there recommended patterns for this scale, such as:
• splitting insert vs update workloads
• disabling inline compaction/cleaner during heavy upsert windows
• index-type alternatives (if unavoidable)
• configs to bound reconcile memory usage?
4. Does this behavior improve in newer Hudi versions beyond 0.14.x?
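To make the "splitting insert vs update workloads" pattern in question 3 concrete, here is a rough, pure-Python illustration of the key-split step (not our production code; in Spark this would be a join against a snapshot of existing keys). The idea is that genuinely new keys could go through `bulk_insert` and only updates would pay the global-index tagging cost.

```python
# Illustrative sketch only: split an incoming batch into inserts (new keys)
# and updates (existing keys). In Spark this split would be implemented as
# a join against a snapshot of the table's existing record keys.

def split_batch(incoming, existing_keys):
    """incoming: list of (key, record); existing_keys: set of keys."""
    inserts = [(k, r) for k, r in incoming if k not in existing_keys]
    updates = [(k, r) for k, r in incoming if k in existing_keys]
    return inserts, updates

# Example with the composite key represented as a tuple
existing = {(1, 10, 0), (2, 20, 0)}
batch = [((1, 10, 0), "upd"), ((3, 30, 0), "new")]
inserts, updates = split_batch(batch, existing)
# inserts -> [((3, 30, 0), "new")]; updates -> [((1, 10, 0), "upd")]
```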
⸻
Summary
We have exhausted:
• Spark memory tuning
• GC tuning (G1GC)
• Shuffle parallelism tuning
• Executor sizing