[I] Default hoodie.read.blob.inline.mode to DESCRIPTOR for Lance (compaction pinned to CONTENT) [hudi]

via GitHub Thu, 14 May 2026 21:42:28 -0700


rahil-c opened a new issue, #18742:
URL: https://github.com/apache/hudi/issues/18742


   ### Summary
   
   Change the default of `hoodie.read.blob.inline.mode` from `CONTENT` to 
`DESCRIPTOR` so plain column reads (e.g. `SELECT *`) over Lance tables no 
longer pay the I/O cost of materializing large inline blob payloads. 
`read_blob()` remains the canonical bytes-materializing API and always returns 
full bytes regardless of this setting.
   
   The compaction path must continue to read in `CONTENT` mode, since 
compaction reads the base file in order to rewrite it and would otherwise lose 
blob bytes on rewrite.
   
   Parquet is out of scope: the Parquet reader path does not currently 
implement DESCRIPTOR semantics, so this change has no observable effect on 
Parquet tables. Parquet behavior remains as-is.
   
   ### Motivation
   
   `BLOB_INLINE_READ_MODE` was introduced in 1.2.0 (see 
[`HoodieReaderConfig.java`](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java))
 with valid values `CONTENT` (current default) and `DESCRIPTOR`.
   
   For Lance:
   - `CONTENT` returns the raw inline bytes in the data field on every read — 
wasteful when users don't reference the blob column.
   - `DESCRIPTOR` suppresses the inline bytes and populates the reference 
struct with blob stream coordinates, letting `read_blob()` be the explicit, 
opt-in path for materializing bytes.
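
   As a rough illustration of the two modes, the following is a toy in-memory model, not Hudi or Lance code. `ToyBlobFile`, `BlobValue`, `BlobReference`, and the `bytes_fetched` counter are hypothetical names introduced here, and leaving the reference empty in `CONTENT` mode is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional

CONTENT = "CONTENT"
DESCRIPTOR = "DESCRIPTOR"


@dataclass(frozen=True)
class BlobReference:
    """Hypothetical stand-in for the reference struct's blob stream coordinates."""
    offset: int
    length: int


@dataclass(frozen=True)
class BlobValue:
    """Hypothetical shape of a blob column value as returned by a plain column read."""
    data: Optional[bytes]
    reference: Optional[BlobReference]


class ToyBlobFile:
    """In-memory stand-in for a file that stores inline blob payloads back to back."""

    def __init__(self, payloads):
        self._stream = b"".join(payloads)
        self._coords = []
        offset = 0
        for payload in payloads:
            self._coords.append(BlobReference(offset, len(payload)))
            offset += len(payload)
        self.bytes_fetched = 0  # payload bytes actually read so far

    def _fetch(self, ref):
        self.bytes_fetched += ref.length
        return self._stream[ref.offset:ref.offset + ref.length]

    def read_column(self, row, mode):
        """Plain column read (what SELECT * would see) under the given mode."""
        ref = self._coords[row]
        if mode == CONTENT:
            # Bytes are materialized on every read, referenced or not.
            # Assumption of this sketch: the reference is left empty here.
            return BlobValue(data=self._fetch(ref), reference=None)
        if mode == DESCRIPTOR:
            # Bytes are suppressed; only the coordinates are returned.
            return BlobValue(data=None, reference=ref)
        raise ValueError(f"unknown blob inline read mode: {mode}")

    def read_blob(self, value):
        """Models read_blob(): returns full bytes regardless of the read mode."""
        if value.data is not None:
            return value.data
        return self._fetch(value.reference)
```

   In this model a `DESCRIPTOR` read leaves `bytes_fetched` at zero until `read_blob` is called on the returned value, which is the I/O saving the issue is after.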
   
   For typical query workloads on Lance, `DESCRIPTOR` is the better default. 
`CONTENT` is needed only when blob bytes must round-trip through a rewrite, 
as in compaction.
   
   ### Proposed Changes
   
   1. **Flip the default** of `HoodieReaderConfig.BLOB_INLINE_READ_MODE` from 
`BLOB_INLINE_READ_MODE_CONTENT` to `BLOB_INLINE_READ_MODE_DESCRIPTOR`.
   2. **Pin compaction to CONTENT.** In the Lance compaction read path, 
explicitly override the blob read mode to `CONTENT` so compaction reliably 
round-trips inline blob bytes. The override should be applied at the 
reader-context layer rather than via the user-facing config, so users cannot 
accidentally disable it.
   3. **Update the config Javadoc** to reflect the new default and call out the 
compaction exception.
   4. **Note in release notes** that this is a behavior change for `SELECT *` 
on Lance blob tables; users who want the prior behavior can set 
`hoodie.read.blob.inline.mode=CONTENT` explicitly.
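
   Items 1 and 2 could be sketched as follows. This is a minimal model of the resolution order only, assuming a hypothetical `ReaderContext` class with a `for_compaction` flag; neither is Hudi's actual API. Only the config key and the two mode names come from this issue.

```python
from enum import Enum

# Config key as named in this issue.
BLOB_INLINE_READ_MODE_KEY = "hoodie.read.blob.inline.mode"


class BlobInlineReadMode(Enum):
    CONTENT = "CONTENT"
    DESCRIPTOR = "DESCRIPTOR"


# Proposed new default (item 1).
DEFAULT_MODE = BlobInlineReadMode.DESCRIPTOR


class ReaderContext:
    """Hypothetical reader context; a stand-in, not Hudi's real class."""

    def __init__(self, user_config, for_compaction=False):
        self._user_config = dict(user_config)
        self._for_compaction = for_compaction

    def blob_inline_read_mode(self):
        # Item 2: the compaction pin is checked before the user-facing config,
        # so no user setting can switch compaction away from CONTENT.
        if self._for_compaction:
            return BlobInlineReadMode.CONTENT
        raw = self._user_config.get(BLOB_INLINE_READ_MODE_KEY)
        if raw is None:
            return DEFAULT_MODE
        # Raises ValueError for anything other than CONTENT or DESCRIPTOR.
        return BlobInlineReadMode(raw)
```

   Placing the pin ahead of the config lookup is what keeps it out of users' reach: a table configured with `DESCRIPTOR` still compacts in `CONTENT` mode in this sketch.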
   
   ### Out of Scope
   
   - Parquet behavior — the Parquet path does not currently honor DESCRIPTOR 
mode and is not changed by this issue.
   - Any new DESCRIPTOR semantics for Parquet (tracked separately if/when 
implemented).
   
   ### Testing
   
   - Run the existing Lance-related blob test suites and fix any tests that 
hardcode assumptions around the CONTENT default:
     - `TestReadBlobSQL` (hudi-spark)
     - `TestBlobSupport` (hudi-spark)
     - `TestLanceDataSource` (hudi-spark, Lance encoding paths)
   - Add a Lance compaction-with-blobs test that verifies inline blob bytes 
survive a compaction round-trip under the new default. Without the CONTENT pin, 
this test should fail; with the pin, the compacted base file must contain the 
original blob bytes.
   - Add a Lance query-side test confirming that, under the new default, 
`SELECT *` returns a null `data` field with a populated reference struct, and 
that `read_blob()` still returns full bytes.
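
   The shape of the compaction round-trip test can be sketched with a toy model. The functions below are hypothetical and stand in for the real Lance read and rewrite paths; a base file is modeled as a plain mapping from record key to inline blob bytes.

```python
def read_base_file(base_file, mode):
    """Return rows as a reader in the given mode would see them (toy model)."""
    if mode == "CONTENT":
        return dict(base_file)
    if mode == "DESCRIPTOR":
        # Inline bytes are suppressed, so the reader only sees nulls.
        return {key: None for key in base_file}
    raise ValueError(f"unknown blob inline read mode: {mode}")


def compact(base_file, read_mode):
    """Toy compaction: read every row, then write what was read to a new file."""
    rows = read_base_file(base_file, read_mode)
    return dict(rows)


def test_compaction_round_trip():
    original = {"k1": b"\x00\x01", "k2": b"payload"}

    # With the CONTENT pin, the rewritten file holds the original bytes.
    assert compact(original, "CONTENT") == original

    # Without the pin (new DESCRIPTOR default), the bytes are lost on rewrite.
    unpinned = compact(original, "DESCRIPTOR")
    assert all(value is None for value in unpinned.values())
```

   The second assertion documents the failure that the pin is meant to prevent; the real test would assert against the compacted base file rather than a dictionary.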
   
   ### Compatibility
   
   This is a behavior change for existing Lance users relying on `SELECT *` 
returning raw blob bytes. The config is marked `@Advanced` and 
`sinceVersion("1.2.0")`, so the surface area is small. Users can restore prior 
behavior by setting `hoodie.read.blob.inline.mode=CONTENT`.
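
   For users who want to opt back in per read, a small helper along these lines could build the option map. `blob_read_options` is a hypothetical name, and the commented PySpark usage assumes that the Hudi reader honors this key when it is passed as a datasource option, which this issue does not confirm.

```python
# Config key and valid values as named in this issue.
BLOB_INLINE_READ_MODE_KEY = "hoodie.read.blob.inline.mode"
VALID_MODES = ("CONTENT", "DESCRIPTOR")


def blob_read_options(mode):
    """Build reader options that select the blob inline read mode explicitly."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return {BLOB_INLINE_READ_MODE_KEY: mode}


# Usage with an existing SparkSession `spark` and a table path `table_path`
# (neither is created here):
# df = (spark.read.format("hudi")
#       .options(**blob_read_options("CONTENT"))
#       .load(table_path))
```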


