Re: [PR] feat(blob): RFC-100: Clarify inline vs out-of-line blob read behavior [hudi]

via GitHub Thu, 14 May 2026 14:58:35 -0700


yihua commented on code in PR #18728:
URL: https://github.com/apache/hudi/pull/18728#discussion_r3244521737



##########
rfc/rfc-100/rfc-100.md:
##########
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can leverage to materiali
 SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
 ```
 
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing raw blob bytes in a query. It always returns the underlying `bytes` regardless of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT *`) returns the underlying `Blob` struct as-is. The contents of that struct depend on the storage strategy, the file format, and the read mode, as summarized below.
+
+**Reader Configuration**
+
+- `hoodie.read.blob.inline.mode` — values `CONTENT` (default) | `DESCRIPTOR`.
+  - `CONTENT`: the engine eagerly materializes inline bytes into the struct's `data` field.
+  - `DESCRIPTOR`: the engine returns an `OUT_OF_LINE`-shaped descriptor in the `reference` field where the underlying file format supports it (Lance today), enabling lazy byte materialization via `read_blob`. For file formats without a native descriptor for inline payloads (Parquet), both `data` and `reference` are returned `NULL`, and the caller must use `read_blob` to retrieve bytes.
+  - This config governs `INLINE` reads only. For `OUT_OF_LINE` storage, the engine always returns a populated `reference` regardless of this setting.
+
+**Behavior matrix**
+
+| Access pattern | Storage | File format | `hoodie.read.blob.inline.mode` | `data` field | `reference` field | Raw bytes available? |
+|----------------|---------|-------------|--------------------------------|--------------|-------------------|----------------------|
+| `SELECT read_blob(col) FROM table` | INLINE | Parquet | (any) | n/a | n/a | Yes — returns bytes |
+| `SELECT read_blob(col) FROM table` | INLINE | Lance | (any) | n/a | n/a | Yes — returns bytes |
+| `SELECT read_blob(col) FROM table` | OUT_OF_LINE | (any) | (any) | n/a | n/a | Yes — returns bytes |
+| `SELECT col FROM table` | INLINE | Parquet | `CONTENT` (default) | bytes | NULL | Yes — via `data` |
+| `SELECT col FROM table` | INLINE | Parquet | `DESCRIPTOR` | **NULL** | **NULL** | No — must call `read_blob` |
+| `SELECT col FROM table` | INLINE | Lance | `CONTENT` (default) | bytes | NULL | Yes — via `data` |
+| `SELECT col FROM table` | INLINE | Lance | `DESCRIPTOR` | NULL | populated (Lance blob enc.) | No — descriptor visible; use `read_blob` for bytes |
+| `SELECT col FROM table` | OUT_OF_LINE | (any) | (irrelevant) | NULL | populated | No — must call `read_blob` |
+
+**Why Parquet and Lance differ in `DESCRIPTOR` mode**
+
+Lance's native blob encoding stores blobs in a way that already exposes a `(file, offset, length)` descriptor cheaply, so `DESCRIPTOR` mode surfaces it directly in the `reference` field — effectively letting INLINE blobs be read with the same deferred-materialization path used for OUT_OF_LINE references. Parquet has no equivalent native descriptor for an inline byte array, so both fields are `NULL` in `DESCRIPTOR` mode and the caller must use `read_blob` to materialize bytes.
+
+**Visual**
+
+```
+  ┌──────────────────────────────────────────────────────────────────┐
+  │  read_blob(col)        ── universal, always materializes bytes ──│
+  │       │                                                          │
+  │       ▼                                                          │
+  │  ┌─────────────┐    INLINE ───► read inline payload              │
+  │  │ Hudi reader │ ──┤                                             │
+  │  └─────────────┘    OUT_OF_LINE ► follow reference → read bytes  │
+  └──────────────────────────────────────────────────────────────────┘
+
+  ┌──────────────────────────────────────────────────────────────────┐
+  │  SELECT *  (returns Blob struct as-is)                           │

Review Comment:
   ```suggestion
     │  SELECT col  (returns Blob struct as-is)                         │
   ```
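
For readers checking the `SELECT col` rows of the behavior matrix in the hunk above, the rules can be restated as a small lookup. This is only a sketch of the table as written in the diff, not Hudi code: `BlobStruct`, `select_col_shape`, and `raw_bytes_available` are hypothetical names, the strings `"bytes"` and `"populated"` are placeholders for the cell values, and treating every non-Lance format like Parquet is an assumption (the RFC only names Parquet and Lance).

```python
from typing import NamedTuple, Optional


class BlobStruct(NamedTuple):
    """Hypothetical stand-in for the Blob struct's two fields (None models NULL)."""
    data: Optional[str]        # "bytes" when inline bytes are materialized
    reference: Optional[str]   # "populated" when a descriptor is returned


# Assumption: only Lance exposes a native descriptor for inline payloads today.
FORMATS_WITH_INLINE_DESCRIPTOR = {"Lance"}


def select_col_shape(storage: str, file_format: str,
                     inline_mode: str = "CONTENT") -> BlobStruct:
    """Shape of the struct returned by `SELECT col`, per the matrix in the diff."""
    if storage == "OUT_OF_LINE":
        # The inline-mode config is irrelevant for out-of-line storage.
        return BlobStruct(data=None, reference="populated")
    if storage != "INLINE":
        raise ValueError(f"unknown storage strategy: {storage}")
    if inline_mode == "CONTENT":
        return BlobStruct(data="bytes", reference=None)
    if inline_mode == "DESCRIPTOR":
        if file_format in FORMATS_WITH_INLINE_DESCRIPTOR:
            return BlobStruct(data=None, reference="populated")
        # Parquet: both fields NULL, caller must use read_blob.
        return BlobStruct(data=None, reference=None)
    raise ValueError(f"unknown inline mode: {inline_mode}")


def raw_bytes_available(shape: BlobStruct) -> bool:
    """Raw bytes are directly usable only when `data` is populated."""
    return shape.data is not None
```

In this model, `read_blob(col)` corresponds to the rows where none of the three inputs matter, which is why it is left out of the lookup.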



