vinishjail97 opened a new pull request, #18849: URL: https://github.com/apache/hudi/pull/18849
### Change Logs

When `hoodie.errortable.source.rdd.persist=true` is set on tables backed by `RECORD_INDEX` (or `PARTITIONED_RECORD_INDEX`), the source DataFrame is cached three times at `MEMORY_AND_DISK_SER`:

1. `Source#persist`: gated by `hoodie.errortable.source.rdd.persist`
2. `SparkMetadataTableRecordIndex#tagLocation`: gated by `hoodie.record.index.use.caching` (default true)
3. `BaseSparkCommitActionExecutor#execute`: unconditional persist of tagged records

All three live on the same executor heap and contend with the MDT `HoodieAppendHandle` log-block buffer (`hoodie.logfile.data.block.max.size`, default 256 MB per task). On wide-row tables (log/JSON ingest with large raw blobs in the source schema) this has caused container OOMKills.

This PR makes `Source#isAllowSourcePersistRdd` return `false` when the configured index type is `RECORD_INDEX` or `PARTITIONED_RECORD_INDEX`, even when the persist flag is true. The downstream RLI-layer persist already covers the re-read between lookup and tagging, so removing the top-level source persist is safe and drops the widest (and largest) of the three caches.

`ProtoKafkaSource#isAllowSourcePersistRdd` now delegates to `super.isAllowSourcePersistRdd()` so the RLI bypass is honored on the proto path too.

### Behavior

| `errortable.source.rdd.persist` | Index type | Before | After |
|---|---|---|---|
| `false` | any | no persist | no persist |
| `true` | `BLOOM` / `SIMPLE` / `HBASE` / others | persist | persist (unchanged) |
| `true` | `RECORD_INDEX` | persist | **no persist** |
| `true` | `PARTITIONED_RECORD_INDEX` | persist | **no persist** |
| `true` | unset / blank | persist | persist (unchanged) |

### Impact

Error-table writes have to re-read the source from upstream when the gate fires. For wide-row pipelines this is cheap relative to the heap cost of triple-caching; for narrow tables the saving is modest but never negative.

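The gate described under Change Logs, together with the rows of the Behavior table, can be sketched as a small standalone decision function. This is an illustration only, not the code in this PR: the class `SourcePersistGate` and the method `allowSourcePersist` are hypothetical names, the inputs are passed in as plain values rather than read from Hudi config objects, and trimming plus upper-casing the index type is an assumption about how the configured value is normalized.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical standalone sketch; the real check lives in Source#isAllowSourcePersistRdd.
public class SourcePersistGate {

    // Index types for which the top-level source persist is skipped (per the Behavior table).
    private static final Set<String> RLI_INDEX_TYPES =
        new HashSet<>(Arrays.asList("RECORD_INDEX", "PARTITIONED_RECORD_INDEX"));

    /**
     * @param persistFlag value of hoodie.errortable.source.rdd.persist
     * @param indexType   value of hoodie.index.type; may be null or blank when unset
     * @return true if the source should be persisted at the top level
     */
    public static boolean allowSourcePersist(boolean persistFlag, String indexType) {
        if (!persistFlag) {
            return false; // flag off: never persist, regardless of index type
        }
        if (indexType == null || indexType.trim().isEmpty()) {
            return true; // unset / blank index type: behavior unchanged
        }
        // Assumption: the configured value is compared after trimming and upper-casing.
        String normalized = indexType.trim().toUpperCase(Locale.ROOT);
        return !RLI_INDEX_TYPES.contains(normalized);
    }

    public static void main(String[] args) {
        String[] indexTypes = {"BLOOM", "SIMPLE", "RECORD_INDEX", "PARTITIONED_RECORD_INDEX", ""};
        for (String type : indexTypes) {
            System.out.println("persist=true, index='" + type + "' -> "
                + allowSourcePersist(true, type));
        }
    }
}
```

Keeping the decision in one function lets a subclass delegate to it instead of re-implementing the check, which is the role `super.isAllowSourcePersistRdd()` plays for `ProtoKafkaSource` in the description above.
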
### Risk level

low: gated by `hoodie.index.type`, no behavior change for non-RLI tables, default `hoodie.errortable.source.rdd.persist` is false.

### Documentation Update

n/a

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly consistent with the [PR title](https://hudi.apache.org/contribute/developer-setup#pull-request-and-issue-management)
- [x] Adequate tests were added if applicable
- [x] CI passed
