suryaprasanna opened a new issue, #19055:
URL: https://github.com/apache/hudi/issues/19055
### Feature Description
**What the feature achieves:**
Make incremental queries (`hoodie.datasource.query.type=incremental`,
`incremental.format=latest_state`) on Merge-on-Read (MoR) tables return
**complete records** even when the table is written with a **partial-update
payload / custom merger** — i.e. when each update writes only `{recordKey,
preCombineField, changedColumns}` to the delta log rather than the full record
image.
Today, snapshot queries on such tables return fully-merged rows, but
incremental queries can return **partial rows** (only the columns present in
the in-window log records are populated; all other columns come back `null`).
This feature makes the incremental read reconstruct the full record for
affected file groups so downstream/derived datasets receive the complete record
— the same way Copy-on-Write (CoW) already does.
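
The difference between the two read paths can be illustrated with a toy model. This is only a sketch, not Hudi code: the column names, the `merge_partial` helper, and the use of `ts` as both commit time and preCombine value are assumptions made for illustration.

```python
# Toy model of one file group: a full base record plus partial-update log records.
# All names are illustrative; none of this is a Hudi API.
COLUMNS = ("id", "ts", "name", "city")

base_record = {"id": 1, "ts": 100, "name": "alice", "city": "Oslo"}  # commit 100
log_records = [
    {"id": 1, "ts": 200, "city": "Rome"},    # commit 200: only `city` changed
    {"id": 1, "ts": 300, "name": "alicia"},  # commit 300: only `name` changed
]

def merge_partial(older, newer):
    """Overlay the columns present in `newer` onto `older`; absent columns become None."""
    merged = dict(older) if older else {}
    merged.update(newer)
    return {c: merged.get(c) for c in COLUMNS}

def snapshot_read():
    """Merge the base record with every log record, oldest first."""
    row = base_record
    for rec in sorted(log_records, key=lambda r: r["ts"]):
        row = merge_partial(row, rec)
    return row

def incremental_read(begin_ts):
    """Merge only the log records committed after `begin_ts` (in-window files only)."""
    row = None
    for rec in sorted(log_records, key=lambda r: r["ts"]):
        if rec["ts"] > begin_ts:
            row = merge_partial(row, rec)
    return row

full_row = snapshot_read()            # every column populated
partial_row = incremental_read(200)   # `city` is None: its value lives outside the window
```

In this model the snapshot read sees the base record and both log records, while the incremental read starting after commit 200 sees only the `name` update and therefore has nothing to fill `city` from.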
**Why this feature is needed:**
For partial-update workloads, the upstream intentionally emits only the
changed columns per update. On a CoW source this is invisible to consumers,
because every commit rewrites the **full** record into the in-window base file,
so an incremental read returns the complete row. On a MoR source the same
incremental read does not, which makes a CoW → MoR migration
**non-transparent** for any derived dataset that consumes the source
incrementally (e.g. via Hudi Streamer `HoodieIncrSource`) and assumes the full
record arrives from upstream. Downstream merge/upsert logic then overwrites
good columns with `null`.
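
The downstream effect can be sketched as follows. The `naive_upsert` function is a hypothetical stand-in for a derived dataset's merge logic that trusts every incoming row to be complete; it is not taken from Hudi Streamer.

```python
def naive_upsert(target, incoming):
    """Replace the stored row with the incoming row, assuming the incoming row is complete."""
    for row in incoming:
        target[row["id"]] = dict(row)
    return target

# Derived dataset currently holds a complete row for key 1.
derived = {1: {"id": 1, "ts": 100, "name": "alice", "city": "Oslo"}}

# A partial row as an incremental read on the MoR source might return it:
# only `name` changed in the window, so `city` comes back as None.
partial_row = {"id": 1, "ts": 300, "name": "alicia", "city": None}

naive_upsert(derived, [partial_row])
# The previously good `city` value has been replaced by None.
```

With a CoW source the incoming row would carry `city`, so the same upsert logic would be harmless.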
Root cause (read path): the MoR incremental relation builds the queried file
slices from **in-window files only**, then merges and applies a post-merge
`_hoodie_commit_time` filter.
This "in-window files only" optimization is correct under the implicit
assumption that every log record is a complete record image — true for
`OverwriteWithLatestAvroPayload` / `DefaultHoodieRecordPayload`, but not for
partial-update / custom merge payloads. The feature lifts that assumption for
the affected payload types. (Applies to both
`MergeOnReadIncrementalRelationV1` for table version 6 and
`MergeOnReadIncrementalRelationV2` for table version 8.)
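
One way to picture the file-slice difference is the sketch below. It is a simplified model under stated assumptions, not the actual `collectFileSplits` logic: `FileEntry`, both slice functions, and the window semantics (begin exclusive, end inclusive) are hypothetical, and the widened slice is only one possible way to lift the assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileEntry:
    file_group: str
    commit_time: int
    kind: str  # "base" or "log"

def in_window(f, begin, end):
    # Assumed window semantics: begin exclusive, end inclusive.
    return begin < f.commit_time <= end

def slice_in_window_only(files, begin, end):
    """Behavior described above: keep only files committed inside the window."""
    return sorted((f for f in files if in_window(f, begin, end)),
                  key=lambda f: f.commit_time)

def slice_with_history(files, begin, end):
    """Hypothetical widening: for a file group touched in the window, also include
    the latest base file at or before `end` and every log written after it."""
    visible = [f for f in files if f.commit_time <= end]
    if not any(in_window(f, begin, end) for f in visible):
        return []  # file group untouched in the window, nothing to read
    base_times = [f.commit_time for f in visible if f.kind == "base"]
    floor = max(base_times) if base_times else min(f.commit_time for f in visible)
    return sorted((f for f in visible if f.commit_time >= floor),
                  key=lambda f: f.commit_time)

file_group = [
    FileEntry("fg-1", 100, "base"),
    FileEntry("fg-1", 200, "log"),
    FileEntry("fg-1", 300, "log"),
]
narrow = slice_in_window_only(file_group, begin=200, end=300)  # only the commit-300 log
wide = slice_with_history(file_group, begin=200, end=300)      # base + both logs
```

In this model the post-merge `_hoodie_commit_time` filter would still run after merging the wider slice, so only records changed inside the window are returned, but each of them is merged against its full history.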
### User Experience
**How users will use this feature:** _Work In Progress_
- Configuration changes needed: _Work In Progress_
- API changes:
  - No public/user-facing API signature changes. The change is internal to
    `MergeOnReadIncrementalRelation.collectFileSplits` (file-slice construction)
    plus one new advanced read config. `HoodieIncrSource` / Hudi Streamer require
    no code change; the option is passed through as a datasource read option.
- Usage examples: _Work In Progress_
### Hudi RFC Requirements
**RFC PR link:** _Work In Progress_
**Why RFC is/isn't needed:**
- Does this change public interfaces/APIs? **No** (adds one advanced read
config; no API signature changes; behavior can be auto-derived from existing
table merge-mode config).
- Does this change storage format? **No** (read-path only; no changes to
base/log file layout, timeline, or table properties on disk).
- Justification: