Re: [PR] [DOCS] RFC-100: Clarify inline vs out-of-line blob read behavior [hudi]

via GitHub Wed, 13 May 2026 14:59:30 -0700


hudi-agent commented on code in PR #18728:
URL: https://github.com/apache/hudi/pull/18728#discussion_r3237698767



##########
rfc/rfc-100/rfc-100.md:
##########
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can 
leverage to materiali
 SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
 ```
 
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing 
raw blob bytes in a query. It always returns the underlying `bytes` regardless 
of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT 
*`) returns the underlying `Blob` struct as-is. The contents of that struct 
depend on the storage strategy, the file format, and the read mode, as 
summarized below.
+
+**Reader Configuration**
+
+- `hoodie.read.blob.inline.mode` — values `CONTENT` (default) | `DESCRIPTOR`.
+  - `CONTENT`: the engine eagerly materializes inline bytes into the struct's 
`data` field.
+  - `DESCRIPTOR`: the engine returns an `OUT_OF_LINE`-shaped descriptor in the 
`reference` field where the underlying file format supports it (Lance today), 
enabling lazy byte materialization via `read_blob`. For file formats without a 
native descriptor for inline payloads (Parquet), both `data` and `reference` 
are returned `NULL`, and the caller must use `read_blob` to retrieve bytes.
+  - This config governs `INLINE` reads only. For `OUT_OF_LINE` storage, the 
engine always returns a populated `reference` regardless of this setting.
+
+**Behavior matrix**
+
+| Access pattern   | Storage      | File format | `hoodie.read.blob.inline.mode` | `data` field | `reference` field            | Raw bytes available?                              |
+|------------------|--------------|-------------|--------------------------------|--------------|------------------------------|---------------------------------------------------|
+| `read_blob(col)` | INLINE       | Parquet     | (any)                          | n/a          | n/a                          | Yes — returns bytes                               |
+| `read_blob(col)` | INLINE       | Lance       | (any)                          | n/a          | n/a                          | Yes — returns bytes                               |
+| `read_blob(col)` | OUT_OF_LINE  | (any)       | (any)                          | n/a          | n/a                          | Yes — returns bytes                               |
+| `SELECT *`       | INLINE       | Parquet     | `CONTENT` (default)            | bytes        | NULL                         | Yes — via `data`                                  |
+| `SELECT *`       | INLINE       | Parquet     | `DESCRIPTOR`                   | **NULL**     | **NULL**                     | No — must call `read_blob`                        |
+| `SELECT *`       | INLINE       | Lance       | `CONTENT` (default)            | bytes        | NULL                         | Yes — via `data`                                  |

Review Comment:
   🤖 For `SELECT *` with INLINE + Parquet + `DESCRIPTOR`, both `data` and 
`reference` return NULL. Could you clarify the rationale — is `DESCRIPTOR` 
simply a no-op on Parquet, or should the reader instead fall back to `CONTENT` 
(or fail fast) so users aren't silently handed a row that looks empty? It might 
be worth calling out the intended UX here.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>
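
   To make the case in question concrete, here is a minimal Python sketch of the `SELECT *` rules as the quoted RFC text describes them. It is a model for discussion only, not Hudi code: `BlobShape`, `select_star_shape`, and `looks_empty` are hypothetical names. Two cells are assumptions because the quoted hunk does not show them: that `data` is NULL for `OUT_OF_LINE` storage, and that `data` is NULL for INLINE + Lance + `DESCRIPTOR`.

```python
from typing import NamedTuple


class BlobShape(NamedTuple):
    """Which fields of the Blob struct are non-NULL (hypothetical model)."""
    data_populated: bool
    reference_populated: bool


def select_star_shape(storage: str, file_format: str,
                      inline_mode: str = "CONTENT") -> BlobShape:
    """Model what selecting the blob column directly returns, per the RFC text."""
    if storage == "OUT_OF_LINE":
        # The config governs INLINE reads only; reference is always populated.
        # Assumption: data is NULL here (not stated in the quoted hunk).
        return BlobShape(data_populated=False, reference_populated=True)
    if storage != "INLINE":
        raise ValueError(f"unknown storage strategy: {storage}")
    if file_format not in ("Parquet", "Lance"):
        raise ValueError(f"format not covered by the quoted text: {file_format}")
    if inline_mode == "CONTENT":
        # Inline bytes are eagerly materialized into the data field.
        return BlobShape(data_populated=True, reference_populated=False)
    if inline_mode == "DESCRIPTOR":
        if file_format == "Lance":
            # Assumption: data is NULL when a descriptor is returned.
            return BlobShape(data_populated=False, reference_populated=True)
        # Parquet has no native descriptor for inline payloads.
        return BlobShape(data_populated=False, reference_populated=False)
    raise ValueError(f"unknown inline mode: {inline_mode}")


def looks_empty(shape: BlobShape) -> bool:
    """True when the struct carries neither bytes nor a descriptor."""
    return not (shape.data_populated or shape.reference_populated)
```

   Under this model, `looks_empty(select_star_shape("INLINE", "Parquet", "DESCRIPTOR"))` is the only documented combination that evaluates to `True`, which is the row the comment above asks about.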



##########
rfc/rfc-100/rfc-100.md:
##########
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can 
leverage to materiali
 SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
 ```
 
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing 
raw blob bytes in a query. It always returns the underlying `bytes` regardless 
of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT 
*`) returns the underlying `Blob` struct as-is. The contents of that struct 
depend on the storage strategy, the file format, and the read mode, as 
summarized below.
+
+**Reader Configuration**

Review Comment:
   🤖 Could you clarify the schema contract for `read_blob`? Specifically: does 
it always return `BINARY`/`bytes` (and what about NULL handling if the 
underlying row has no blob), and what is the column lineage for 
predicate/projection pushdown when `read_blob(col)` appears in the SELECT list?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
rfc/rfc-100/rfc-100.md:
##########
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can 
leverage to materiali
 SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
 ```
 
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing 
raw blob bytes in a query. It always returns the underlying `bytes` regardless 
of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT 
*`) returns the underlying `Blob` struct as-is. The contents of that struct 
depend on the storage strategy, the file format, and the read mode, as 
summarized below.
+
+**Reader Configuration**
+
+- `hoodie.read.blob.inline.mode` — values `CONTENT` (default) | `DESCRIPTOR`.
+  - `CONTENT`: the engine eagerly materializes inline bytes into the struct's 
`data` field.
+  - `DESCRIPTOR`: the engine returns an `OUT_OF_LINE`-shaped descriptor in the 
`reference` field where the underlying file format supports it (Lance today), 
enabling lazy byte materialization via `read_blob`. For file formats without a 
native descriptor for inline payloads (Parquet), both `data` and `reference` 
are returned `NULL`, and the caller must use `read_blob` to retrieve bytes.
+  - This config governs `INLINE` reads only. For `OUT_OF_LINE` storage, the 
engine always returns a populated `reference` regardless of this setting.

Review Comment:
   🤖 Since this config now has user-visible semantics described in an RFC, it 
would be helpful to note its stability/compatibility commitment — is 
`DESCRIPTOR` mode considered stable, experimental, or Lance-only-for-now? 
Default-flipping later could be surprising.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
rfc/rfc-100/rfc-100.md:
##########
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can 
leverage to materiali
 SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
 ```
 
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing 
raw blob bytes in a query. It always returns the underlying `bytes` regardless 
of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT 
*`) returns the underlying `Blob` struct as-is. The contents of that struct 
depend on the storage strategy, the file format, and the read mode, as 
summarized below.
+
+**Reader Configuration**
+
+- `hoodie.read.blob.inline.mode` — values `CONTENT` (default) | `DESCRIPTOR`.

Review Comment:
   🤖 The doc only shows the Spark SQL surface. How do `read_blob` and 
`hoodie.read.blob.inline.mode` behave across Flink/Trino/Presto readers? A 
short note (even "out of scope for this RFC, tracked elsewhere") would help 
readers understand the multi-engine story.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
