This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch release-1.2.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 657405a944801634dccb23d42c32e3ddad347d74
Author: Rahil C <[email protected]>
AuthorDate: Thu May 14 15:05:31 2026 -0700

    feat(blob): RFC-100: Clarify inline vs out-of-line blob read behavior (#18728)
    
    Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
    Co-authored-by: Y Ethan Guo <[email protected]>
---
 rfc/rfc-100/rfc-100.md | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/rfc/rfc-100/rfc-100.md b/rfc/rfc-100/rfc-100.md
index e66ab97c57a4..2c637ccb2db2 100644
--- a/rfc/rfc-100/rfc-100.md
+++ b/rfc/rfc-100/rfc-100.md
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can leverage to materiali
 SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
 ```
 
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing raw blob bytes in a query. It always returns the underlying `bytes` regardless of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT *`) returns the underlying `Blob` struct as-is. The contents of that struct depend on the storage strategy, the file format, and the read mode, as summarized below.
+
+**Reader Configuration**
+
+- `hoodie.read.blob.inline.mode` — values `CONTENT` (default) | `DESCRIPTOR`.
+  - `CONTENT`: the engine eagerly materializes inline bytes into the struct's `data` field.
+  - `DESCRIPTOR`: the engine returns an `OUT_OF_LINE`-shaped descriptor in the `reference` field where the underlying file format supports it (Lance today), enabling lazy byte materialization via `read_blob`. For file formats without a native descriptor for inline payloads (Parquet), both `data` and `reference` are returned `NULL`, and the caller must use `read_blob` to retrieve bytes.
+  - This config governs `INLINE` reads only. For `OUT_OF_LINE` storage, the engine always returns a populated `reference` regardless of this setting.
+
+**Behavior matrix**
+
+| Access pattern | Storage | File format | `hoodie.read.blob.inline.mode` | `data` field | `reference` field | Raw bytes available? |
+|----------------|---------|-------------|--------------------------------|--------------|-------------------|----------------------|
+| `SELECT read_blob(col) FROM table` | INLINE | Parquet | (any) | n/a | n/a | Yes — returns bytes |
+| `SELECT read_blob(col) FROM table` | INLINE | Lance | (any) | n/a | n/a | Yes — returns bytes |
+| `SELECT read_blob(col) FROM table` | OUT_OF_LINE | (any) | (any) | n/a | n/a | Yes — returns bytes |
+| `SELECT col FROM table` | INLINE | Parquet | `CONTENT` (default) | bytes | NULL | Yes — via `data` |
+| `SELECT col FROM table` | INLINE | Parquet | `DESCRIPTOR` | **NULL** | **NULL** | No — must call `read_blob` |
+| `SELECT col FROM table` | INLINE | Lance | `CONTENT` (default) | bytes | NULL | Yes — via `data` |
+| `SELECT col FROM table` | INLINE | Lance | `DESCRIPTOR` | NULL | populated (Lance blob enc.) | No — descriptor visible; use `read_blob` for bytes |
+| `SELECT col FROM table` | OUT_OF_LINE | (any) | (irrelevant) | NULL | populated | No — must call `read_blob` |
+
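The `SELECT col` rows of the matrix above can be restated as a small lookup. The following is only an illustrative model of the documented behavior, not Hudi code: `BlobShape`, `select_col_shape`, and `needs_read_blob` are hypothetical names introduced here, and the string values simply mirror the labels used in the table.

```python
from typing import NamedTuple


class BlobShape(NamedTuple):
    """Which fields of the Blob struct are populated (hypothetical helper type)."""
    data_populated: bool
    reference_populated: bool


def select_col_shape(storage: str, file_format: str,
                     inline_mode: str = "CONTENT") -> BlobShape:
    """Model what `SELECT col` returns, following the behavior matrix."""
    if storage == "OUT_OF_LINE":
        # The inline mode is irrelevant: a reference is always returned.
        return BlobShape(data_populated=False, reference_populated=True)
    if storage != "INLINE":
        raise ValueError(f"unknown storage strategy: {storage}")
    if inline_mode == "CONTENT":
        # Inline bytes are eagerly materialized into `data`.
        return BlobShape(data_populated=True, reference_populated=False)
    if inline_mode == "DESCRIPTOR":
        # Only Lance exposes a native descriptor for inline payloads.
        return BlobShape(data_populated=False,
                         reference_populated=(file_format == "Lance"))
    raise ValueError(f"unknown inline mode: {inline_mode}")


def needs_read_blob(shape: BlobShape) -> bool:
    """Raw bytes are directly available only when `data` is populated."""
    return not shape.data_populated
```

Under this model, `needs_read_blob(select_col_shape("INLINE", "Parquet", "DESCRIPTOR"))` is true, which matches the row where both fields are `NULL`.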
+**Why Parquet and Lance differ in `DESCRIPTOR` mode**
+
+Lance's native blob encoding stores blobs in a way that already exposes a `(file, offset, length)` descriptor cheaply, so `DESCRIPTOR` mode surfaces it directly in the `reference` field — effectively letting INLINE blobs be read with the same deferred-materialization path used for OUT_OF_LINE references. Parquet has no equivalent native descriptor for an inline byte array, so both fields are `NULL` in `DESCRIPTOR` mode and the caller must use `read_blob` to materialize bytes.
+
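To make the idea of deferred materialization from a `(file, offset, length)` descriptor concrete, here is a minimal sketch of a byte-range read. It assumes the descriptor points at a plain local file whose payload bytes are stored unencoded, which is a simplification and not a description of how Hudi or Lance resolve references. `Descriptor` and `read_range` are hypothetical names.

```python
import os
import tempfile
from typing import NamedTuple


class Descriptor(NamedTuple):
    """Hypothetical (file, offset, length) reference to a blob payload."""
    path: str
    offset: int
    length: int


def read_range(desc: Descriptor) -> bytes:
    """Read exactly `length` bytes starting at `offset` in the referenced file."""
    with open(desc.path, "rb") as f:
        f.seek(desc.offset)
        payload = f.read(desc.length)
    if len(payload) != desc.length:
        raise EOFError("descriptor points past the end of the file")
    return payload


if __name__ == "__main__":
    # Build a throwaway file with a payload embedded between other bytes.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"headerPAYLOADtrailer")
    try:
        print(read_range(Descriptor(tmp.name, offset=6, length=7)))
    finally:
        os.remove(tmp.name)
```

The point of the sketch is that a descriptor lets the caller postpone the read until the bytes are needed, which is the role `read_blob` plays in a query.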
+**Visual**
+
+```
+  ┌──────────────────────────────────────────────────────────────────┐
+  │  read_blob(col)        ── universal, always materializes bytes ──│
+  │       │                                                          │
+  │       ▼                                                          │
+  │  ┌─────────────┐    INLINE ───► read inline payload              │
+  │  │ Hudi reader │ ──┤                                             │
+  │  └─────────────┘    OUT_OF_LINE ► follow reference → read bytes  │
+  └──────────────────────────────────────────────────────────────────┘
+
+  ┌──────────────────────────────────────────────────────────────────┐
+  │  SELECT col  (returns Blob struct as-is)                         │
+  │       │                                                          │
+  │       ▼                                                          │
+  │  storage = OUT_OF_LINE  ─────────────► data=NULL, reference=set  │
+  │                                                                  │
+  │  storage = INLINE,                                               │
+  │   inline.mode = CONTENT (default) ───► data=<bytes>, ref=NULL    │
+  │                                                                  │
+  │  storage = INLINE,                                               │
+  │   inline.mode = DESCRIPTOR                                       │
+  │     ├─ Parquet  ─────────────────────► data=NULL, ref=NULL       │
+  │     └─ Lance    ─────────────────────► data=NULL, ref=set        │
+  └──────────────────────────────────────────────────────────────────┘
+```
+
 ### 3. Writer
 #### Phase 1: External Blob Support
 The writer will be updated to support writing blob data as out-of-line references. 
