voonhous opened a new pull request, #18744:
URL: https://github.com/apache/hudi/pull/18744

   ### Describe the issue this Pull Request addresses
   
   Closes: #18742
   
   ### Summary and Changelog
   
   Flip the default of `hoodie.read.blob.inline.mode` from `CONTENT` to 
`DESCRIPTOR` so plain reads on Lance tables stop materializing inline blob 
payloads on every row. `CONTENT` remains available as an opt-in.
   
   Changes:
   - `HoodieReaderConfig`: default flipped to `DESCRIPTOR`; doc updated.
   - `BatchedBlobReader`: INLINE branch now falls back to the synthesized 
reference when `inline_data` is null, so `read_blob()` returns bytes under both 
modes. Previously it short-circuited on `storage_type=INLINE` and read the null 
`inline_data` field.
   - `TestLanceDataSource`:
     - `testBlobInlineRoundTrip` now opts into `CONTENT` explicitly (it exists 
to validate `CONTENT` semantics).
     - New `testBlobInlineCompactionRoundTrip` verifies INLINE bytes survive 
MOR compaction under the new default, asserted via the realistic user paths 
(plain read returns descriptor shape; `read_blob()` returns bytes).
   - `rfc/rfc-100/rfc-100.md`: replaced the ASCII visual with two mermaid 
diagrams covering row shape per (storage type × query × mode × file format) and 
`read_blob()` byte-resolution hop counts.
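
   The `BatchedBlobReader` fallback listed above can be modeled roughly as follows. This is a minimal Python sketch of the control flow only, not Hudi's Java implementation: the `BlobRow` and `BlobReference` shapes, the `reference` field, and the `read_range` callback are hypothetical stand-ins; only `storage_type` and `inline_data` are names taken from this description.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BlobReference:
    # Hypothetical descriptor shape: where the blob bytes live on storage.
    path: str
    offset: int
    length: int


@dataclass
class BlobRow:
    storage_type: str                    # e.g. "INLINE"
    inline_data: Optional[bytes]         # None when the reader ran in DESCRIPTOR mode
    reference: Optional[BlobReference]   # stand-in for the synthesized reference


def resolve_blob(row: BlobRow, read_range: Callable[[str, int, int], bytes]) -> bytes:
    """Return blob bytes whether or not the row carries them inline."""
    # CONTENT mode: the bytes are already on the row, so no extra read is needed.
    if row.storage_type == "INLINE" and row.inline_data is not None:
        return row.inline_data
    # DESCRIPTOR mode (or a non-inline blob): fall back to a ranged read
    # through the reference instead of returning the null field.
    if row.reference is None:
        raise ValueError("row has neither inline bytes nor a reference")
    ref = row.reference
    return read_range(ref.path, ref.offset, ref.length)
```

   The point of the sketch is the ordering of the checks: the `storage_type` test alone no longer decides the outcome, and a null `inline_data` routes to the reference path.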
   
   The compaction-side `CONTENT` pin in `HoodieSparkLanceReader` already 
exists and is unchanged.
   
   ### Impact
   
   User-facing behavior change for Lance INLINE blobs: `SELECT *` now returns 
`data=NULL` plus a populated descriptor by default; callers materializing bytes 
should use `read_blob()` (or explicitly set 
`hoodie.read.blob.inline.mode=CONTENT`). Parquet is unaffected (the Parquet 
reader does not honor `BLOB_INLINE_READ_MODE` today).
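
   For callers who want the previous behavior, the opt-in might look like the PySpark sketch below. It assumes the key is honored when passed as a DataFrame read option, which this description does not confirm (a session-level setting may be required instead); `blob_read_options` and `load_table` are hypothetical helper names.

```python
BLOB_INLINE_MODE_KEY = "hoodie.read.blob.inline.mode"


def blob_read_options(materialize_inline: bool = False) -> dict:
    """Build reader options for the inline blob read mode.

    DESCRIPTOR is the new default; CONTENT restores inline bytes on plain reads.
    """
    mode = "CONTENT" if materialize_inline else "DESCRIPTOR"
    return {BLOB_INLINE_MODE_KEY: mode}


def load_table(spark, base_path: str, materialize_inline: bool = False):
    # Assumption: the config is picked up as a read option on the Hudi source.
    # `spark` is an existing SparkSession with the Hudi bundle on its classpath.
    return (
        spark.read.format("hudi")
        .options(**blob_read_options(materialize_inline))
        .load(base_path)
    )
```

   Calling `load_table(spark, path, materialize_inline=True)` would then request `CONTENT` explicitly, while the default call relies on `DESCRIPTOR`.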
   
   Performance: plain reads skip blob decoding; `read_blob()` on INLINE rows 
now does one extra raw `pread` (2 hops total) instead of getting bytes off the 
row.
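
   For readers unfamiliar with the term, a raw `pread` is a positional read of a byte range that does not depend on a shared file cursor. The sketch below illustrates the idea against a local file using Python's `os.pread` (Unix only); it is an illustration under that assumption, and Hudi performs the equivalent read through its own storage layer rather than this code.

```python
import os


def read_range(path: str, offset: int, length: int) -> bytes:
    """Read up to `length` bytes starting at `offset` using positional reads."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        pos = offset
        remaining = length
        while remaining > 0:
            # os.pread reads at an explicit offset and leaves the fd's cursor alone.
            chunk = os.pread(fd, remaining, pos)
            if not chunk:  # reached end of file before `length` bytes
                break
            chunks.append(chunk)
            pos += len(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

   Under `DESCRIPTOR`, a read of this kind replaces taking the bytes directly off the row, which is the extra hop noted above.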
   
   ### Risk Level
   
   Medium. The default flip changes observable read behavior for Lance INLINE 
blob columns. Mitigations:
   - The `BatchedBlobReader` fallback keeps `read_blob()` working in both 
modes, so the canonical bytes API is unaffected by the flip.
   - The compaction-side reader is hard-pinned to `CONTENT`, and the new 
compaction test exercises the round trip end to end.
   - Users on the prior default can restore it via 
`hoodie.read.blob.inline.mode=CONTENT`.
   
   ### Documentation Update
   
   - `HoodieReaderConfig.BLOB_INLINE_READ_MODE` description updated to reflect 
the new default.
   - RFC-100 visual section refreshed with mermaid diagrams covering the new 
default.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   

