voonhous opened a new pull request, #18744:
URL: https://github.com/apache/hudi/pull/18744
### Describe the issue this Pull Request addresses
Closes: #18742
### Summary and Changelog
Flip the default of `hoodie.read.blob.inline.mode` from `CONTENT` to
`DESCRIPTOR` so plain reads on Lance tables stop materializing inline blob
payloads on every row. `CONTENT` remains available as an opt-in.
Changes:
- `HoodieReaderConfig`: default flipped to `DESCRIPTOR`; doc updated.
- `BatchedBlobReader`: the INLINE branch now falls back to the synthesized
reference when `inline_data` is null, so `read_blob()` returns bytes under both
modes. Previously it short-circuited on `storage_type=INLINE` and read the null
`inline_data` field.
- `TestLanceDataSource`:
  - `testBlobInlineRoundTrip` now opts into `CONTENT` explicitly (it exists
    to validate `CONTENT` semantics).
  - New `testBlobInlineCompactionRoundTrip` verifies INLINE bytes survive
    MOR compaction under the new default, asserted via the realistic user paths
    (plain read returns descriptor shape; `read_blob()` returns bytes).
- `rfc/rfc-100/rfc-100.md`: replaced the ASCII visual with two mermaid
diagrams covering row shape per (storage type × query × mode × file format) and
`read_blob()` byte-resolution hop counts.
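
The `BatchedBlobReader` fallback described above can be sketched roughly as follows. This is a minimal model in plain Python, not the actual Hudi code: `BlobRow`, `BlobReference`, `resolve_blob`, and `make_pread` are hypothetical names, and the `(path, offset, length)` shape of the reference is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BlobReference:
    """Hypothetical stand-in for the synthesized reference (file, offset, length)."""
    path: str
    offset: int
    length: int


@dataclass
class BlobRow:
    """Hypothetical row shape; field names are illustrative, not Hudi's schema."""
    storage_type: str
    inline_data: Optional[bytes]
    reference: Optional[BlobReference]


# A positioned-read callback: (path, offset, length) -> bytes.
PRead = Callable[[str, int, int], bytes]


def resolve_blob(row: BlobRow, pread: PRead) -> bytes:
    """Return the blob bytes whether or not the row carries them inline."""
    if row.storage_type == "INLINE" and row.inline_data is not None:
        # CONTENT-style row: the bytes are already on the row.
        return row.inline_data
    if row.reference is None:
        raise ValueError("row has neither inline bytes nor a reference")
    # DESCRIPTOR-style row (inline_data is null): follow the reference
    # with one positioned read instead of returning the null field.
    ref = row.reference
    return pread(ref.path, ref.offset, ref.length)


def make_pread(files: dict) -> PRead:
    """Build an in-memory pread over a {path: bytes} mapping, for demonstration only."""
    def pread(path: str, offset: int, length: int) -> bytes:
        return files[path][offset:offset + length]
    return pread
```

In this model a row with `inline_data` populated and a row with `inline_data=None` resolve to the same bytes, which is the property the fallback is meant to preserve across both modes.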
The existing compaction-side CONTENT pin in `HoodieSparkLanceReader` is
already in place and unchanged.
### Impact
User-facing behavior change for Lance INLINE blobs: `SELECT *` now returns
`data=NULL` plus a populated descriptor by default; callers materializing bytes
should use `read_blob()` (or explicitly set
`hoodie.read.blob.inline.mode=CONTENT`). Parquet is unaffected (the Parquet
reader does not honor `BLOB_INLINE_READ_MODE` today).
Performance: plain reads skip blob decoding; `read_blob()` on INLINE rows
now issues one extra raw `pread` (2 hops total) instead of taking the bytes
directly from the row.
### Risk Level
Medium. The default flip changes observable read behavior for Lance INLINE
blob columns. Mitigations:
- The `BatchedBlobReader` fallback keeps `read_blob()` working in both
modes, so the canonical bytes API is unaffected by the flip.
- The compaction-side reader is hard-pinned to `CONTENT`, and the new compaction
test exercises the round trip end to end.
- Users on the prior default can restore it via
`hoodie.read.blob.inline.mode=CONTENT`.
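
For callers who want the prior behavior, the opt-in could look something like the sketch below. Only the key `hoodie.read.blob.inline.mode` and the values `CONTENT` and `DESCRIPTOR` come from this PR; the helper `blob_read_options` is hypothetical, and passing the key as a per-read datasource option (rather than a session-level setting) is an assumption.

```python
# Config key and mode names as given in this PR description.
BLOB_INLINE_MODE_KEY = "hoodie.read.blob.inline.mode"
VALID_MODES = ("CONTENT", "DESCRIPTOR")


def blob_read_options(mode: str = "DESCRIPTOR") -> dict:
    """Hypothetical helper: build reader options that pin the inline blob read mode."""
    normalized = mode.upper()
    if normalized not in VALID_MODES:
        raise ValueError(f"unknown blob inline mode: {mode!r}")
    return {BLOB_INLINE_MODE_KEY: normalized}


# Usage with PySpark, assuming an existing SparkSession `spark` with the Hudi
# bundle on the classpath and a table at `table_path` (both assumptions):
#
#   df = (spark.read.format("hudi")
#         .options(**blob_read_options("CONTENT"))
#         .load(table_path))
```

Pinning the mode explicitly at the call site keeps a job's behavior stable regardless of which default the deployed Hudi version ships with.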
### Documentation Update
- `HoodieReaderConfig.BLOB_INLINE_READ_MODE` description updated to reflect
the new default.
- RFC-100 visual section refreshed with mermaid diagrams covering the new
default.
### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable