rahil-c opened a new issue, #18742: URL: https://github.com/apache/hudi/issues/18742
### Summary

Change the default of `hoodie.read.blob.inline.mode` from `CONTENT` to `DESCRIPTOR` so plain column reads (e.g. `SELECT *`) over Lance tables no longer pay the I/O cost of materializing large inline blob payloads. `read_blob()` remains the canonical bytes-materializing API and always returns full bytes regardless of this setting.

The compaction path must continue to read in `CONTENT` mode, since compaction reads the base file in order to rewrite it and would otherwise lose blob bytes on rewrite.

Parquet is out of scope: the Parquet reader path does not currently implement `DESCRIPTOR` semantics, so this change has no observable effect on Parquet tables. Parquet behavior remains as-is.

### Motivation

`BLOB_INLINE_READ_MODE` was introduced in 1.2.0 (see [`HoodieReaderConfig.java`](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java)) with valid values `CONTENT` (current default) and `DESCRIPTOR`. For Lance:

- `CONTENT` returns the raw inline bytes in the data field on every read, which is wasteful when users don't reference the blob column.
- `DESCRIPTOR` suppresses the inline bytes and populates the reference struct with blob stream coordinates, letting `read_blob()` be the explicit, opt-in path for materializing bytes.

For typical query workloads on Lance, `DESCRIPTOR` is the right default. `CONTENT` only makes sense when blob bytes need to round-trip through a rewrite (compaction).

### Proposed Changes

1. **Flip the default** of `HoodieReaderConfig.BLOB_INLINE_READ_MODE` from `BLOB_INLINE_READ_MODE_CONTENT` to `BLOB_INLINE_READ_MODE_DESCRIPTOR`.
2. **Pin compaction to CONTENT.** In the Lance compaction read path, explicitly override the blob read mode to `CONTENT` so compaction reliably round-trips inline blob bytes. The override should be applied at the reader-context layer rather than via the user-facing config, so users cannot accidentally disable it.
3. **Update the config Javadoc** to reflect the new default and call out the compaction exception.
4. **Note in release notes** that this is a behavior change for `SELECT *` on Lance blob tables; users who want the prior behavior can set `hoodie.read.blob.inline.mode=CONTENT` explicitly.

### Out of Scope

- Parquet behavior: the Parquet path does not currently honor `DESCRIPTOR` mode and is not changed by this issue.
- Any new `DESCRIPTOR` semantics for Parquet (tracked separately if/when implemented).

### Testing

- Run the existing Lance-related blob test suites and fix any tests that hardcode assumptions around the `CONTENT` default:
  - `TestReadBlobSQL` (hudi-spark)
  - `TestBlobSupport` (hudi-spark)
  - `TestLanceDataSource` (hudi-spark, Lance encoding paths)
- Add a Lance compaction-with-blobs test that verifies inline blob bytes survive a compaction round-trip under the new default. Without the `CONTENT` pin, this test should fail; with the pin, the compacted base file must contain the original blob bytes.
- Add a Lance query-side test confirming `SELECT *` returns null `data` / populated descriptor under the new default, and that `read_blob()` still returns full bytes.

### Compatibility

This is a behavior change for existing Lance users relying on `SELECT *` returning raw blob bytes. The config is marked `@Advanced` and `sinceVersion("1.2.0")`, so the surface area is small. Users can restore prior behavior by setting `hoodie.read.blob.inline.mode=CONTENT`.
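
To make proposed changes 1 and 2 concrete, here is a minimal, self-contained sketch of how the effective blob read mode might be resolved. It is not Hudi code: `BlobReadModeSketch`, `ReadPurpose`, and `resolve` are hypothetical names used only for illustration, and the real reader-context plumbing may look quite different. Only the config key and the two mode names come from this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these types exist in Hudi.
class BlobReadModeSketch {

    // Config key as named in this issue.
    static final String BLOB_INLINE_READ_MODE_KEY = "hoodie.read.blob.inline.mode";

    // The two valid values described in this issue.
    enum BlobInlineReadMode { CONTENT, DESCRIPTOR }

    // Assumed stand-in for "who is reading": a user query or the compactor.
    enum ReadPurpose { QUERY, COMPACTION }

    // Proposed change 1: the default becomes DESCRIPTOR.
    static final BlobInlineReadMode PROPOSED_DEFAULT = BlobInlineReadMode.DESCRIPTOR;

    static BlobInlineReadMode resolve(Map<String, String> userConfig, ReadPurpose purpose) {
        // Proposed change 2: compaction is pinned to CONTENT before the
        // user-facing config is consulted, so the config cannot disable it.
        if (purpose == ReadPurpose.COMPACTION) {
            return BlobInlineReadMode.CONTENT;
        }
        String raw = userConfig.get(BLOB_INLINE_READ_MODE_KEY);
        if (raw == null) {
            return PROPOSED_DEFAULT;
        }
        // An explicit user setting wins for ordinary query reads.
        return BlobInlineReadMode.valueOf(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // No explicit setting: queries get the proposed default.
        System.out.println(resolve(conf, ReadPurpose.QUERY));      // DESCRIPTOR

        // Users can opt back into the prior behavior.
        conf.put(BLOB_INLINE_READ_MODE_KEY, "CONTENT");
        System.out.println(resolve(conf, ReadPurpose.QUERY));      // CONTENT

        // Even an explicit DESCRIPTOR setting does not affect compaction.
        conf.put(BLOB_INLINE_READ_MODE_KEY, "DESCRIPTOR");
        System.out.println(resolve(conf, ReadPurpose.COMPACTION)); // CONTENT
    }
}
```

The point of checking the read purpose first is that the pin lives below the user-facing config, which is what the issue asks for when it says the override should sit at the reader-context layer.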
