vinishjail97 opened a new pull request, #18849:
URL: https://github.com/apache/hudi/pull/18849

   ### Change Logs
   
   When `hoodie.errortable.source.rdd.persist=true` is set on tables backed by 
`RECORD_INDEX` (or `PARTITIONED_RECORD_INDEX`), the source DataFrame is cached 
three times at `MEMORY_AND_DISK_SER`:
   
   1. `Source#persist` — gated by `hoodie.errortable.source.rdd.persist`
   2. `SparkMetadataTableRecordIndex#tagLocation` — gated by 
`hoodie.record.index.use.caching` (default true)
   3. `BaseSparkCommitActionExecutor#execute` — unconditional persist of tagged 
records
   
   All three live on the same executor heap and contend with the MDT 
`HoodieAppendHandle` log-block buffer (`hoodie.logfile.data.block.max.size`, 
default 256 MB per task). On wide-row tables (log/JSON ingest with large raw 
blobs in the source schema) this has caused container OOMKills.
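
The configuration that leads to all three persists can be sketched as plain Java properties. The keys are the ones named above; the class and method names are hypothetical and exist only for illustration.

```java
import java.util.Properties;

// Hypothetical helper, not part of Hudi: builds the settings under which
// the source DataFrame ends up persisted three times.
final class TripleCacheConfigSketch {
  private TripleCacheConfigSketch() {}

  static Properties tripleCacheProps() {
    Properties props = new Properties();
    // 1. Enables the top-level Source#persist.
    props.setProperty("hoodie.errortable.source.rdd.persist", "true");
    // Selects the record-level index, whose tagLocation path also persists.
    props.setProperty("hoodie.index.type", "RECORD_INDEX");
    // 2. Gates the persist in the record index lookup (stated default: true).
    props.setProperty("hoodie.record.index.use.caching", "true");
    // 3. The commit action executor persist is unconditional, so it has no key here.
    return props;
  }
}
```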
   
   This PR makes `Source#isAllowSourcePersistRdd` return `false` when the 
configured index type is `RECORD_INDEX` or `PARTITIONED_RECORD_INDEX`, even 
when the persist flag is true. The downstream RLI-layer persist already covers 
the re-read between lookup and tagging, so removing the top-level source 
persist is safe and drops the widest (and largest) of the three caches.
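
A minimal sketch of the gate described above, assuming the decision depends only on the persist flag and the configured index type. `SourcePersistGate` and `allowSourcePersist` are hypothetical names, not the actual Hudi code, and the case-insensitive comparison is an assumption.

```java
import java.util.Locale;

// Hypothetical stand-in for the check in Source#isAllowSourcePersistRdd.
final class SourcePersistGate {
  private SourcePersistGate() {}

  /**
   * @param persistFlag value of hoodie.errortable.source.rdd.persist
   * @param indexType   value of hoodie.index.type; may be null or blank
   * @return true if the top-level source persist should still happen
   */
  static boolean allowSourcePersist(boolean persistFlag, String indexType) {
    if (!persistFlag) {
      return false; // flag off: never persist, same as before
    }
    if (indexType == null || indexType.trim().isEmpty()) {
      return true; // unset or blank index type: behavior unchanged
    }
    // Assumption: index type names are compared case-insensitively.
    String normalized = indexType.trim().toUpperCase(Locale.ROOT);
    boolean isRecordIndex = normalized.equals("RECORD_INDEX")
        || normalized.equals("PARTITIONED_RECORD_INDEX");
    return !isRecordIndex; // RLI variants skip the source persist
  }
}
```

For example, `SourcePersistGate.allowSourcePersist(true, "RECORD_INDEX")` returns `false`, while the same call with `"BLOOM"` returns `true`.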
   
   `ProtoKafkaSource#isAllowSourcePersistRdd` now delegates to 
`super.isAllowSourcePersistRdd()` so the RLI bypass is honored on the proto 
path too.
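
The delegation can be illustrated with simplified, hypothetical class shapes. `SketchSource` and `SketchProtoKafkaSource` are stand-ins; the real Hudi classes carry many more members and may differ in signatures.

```java
// Hypothetical base class holding the shared gate.
abstract class SketchSource {
  private final boolean persistFlag; // hoodie.errortable.source.rdd.persist
  private final String indexType;    // hoodie.index.type, may be null

  SketchSource(boolean persistFlag, String indexType) {
    this.persistFlag = persistFlag;
    this.indexType = indexType;
  }

  public boolean isAllowSourcePersistRdd() {
    boolean isRecordIndex = "RECORD_INDEX".equals(indexType)
        || "PARTITIONED_RECORD_INDEX".equals(indexType);
    return persistFlag && !isRecordIndex;
  }
}

// Hypothetical subclass: defers to the base gate, so the RLI bypass
// applies on this path as well.
final class SketchProtoKafkaSource extends SketchSource {
  SketchProtoKafkaSource(boolean persistFlag, String indexType) {
    super(persistFlag, indexType);
  }

  @Override
  public boolean isAllowSourcePersistRdd() {
    return super.isAllowSourcePersistRdd();
  }
}
```

Keeping the decision in the base class means any later change to the gate reaches every subclass that delegates.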
   
   ### Behavior
   
   | `errortable.source.rdd.persist` | Index type | Before | After |
   |---|---|---|---|
   | `false` | any | no persist | no persist |
   | `true` | `BLOOM` / `SIMPLE` / `HBASE` / others | persist | persist (unchanged) |
   | `true` | `RECORD_INDEX` | persist | **no persist** |
   | `true` | `PARTITIONED_RECORD_INDEX` | persist | **no persist** |
   | `true` | unset / blank | persist | persist (unchanged) |
   
   ### Impact
   
   When the gate fires (persist flag set to true on an RLI-backed table), 
error-table writes re-read the source from upstream instead of from the cache. 
For wide-row pipelines this re-read is cheap relative to the heap cost of 
triple-caching; for narrow tables the heap saving is modest but never negative.
   
   ### Risk level
   
   low — gated by `hoodie.index.type`, no behavior change for non-RLI tables, 
default `hoodie.errortable.source.rdd.persist` is false.
   
   ### Documentation Update
   
   n/a
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly consistent with the [PR 
title](https://hudi.apache.org/contribute/developer-setup#pull-request-and-issue-management)
   - [x] Adequate tests were added if applicable
   - [x] CI passed

