hudi-agent commented on code in PR #18728:
URL: https://github.com/apache/hudi/pull/18728#discussion_r3237698767
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can
leverage to materiali
SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
```
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing
raw blob bytes in a query. It always returns the underlying `bytes` regardless
of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT
*`) returns the underlying `Blob` struct as-is. The contents of that struct
depend on the storage strategy, the file format, and the read mode, as
summarized below.
+
+**Reader Configuration**
+
+- `hoodie.read.blob.inline.mode` — values `CONTENT` (default) | `DESCRIPTOR`.
+ - `CONTENT`: the engine eagerly materializes inline bytes into the struct's
`data` field.
+ - `DESCRIPTOR`: the engine returns an `OUT_OF_LINE`-shaped descriptor in the
`reference` field where the underlying file format supports it (Lance today),
enabling lazy byte materialization via `read_blob`. For file formats without a
native descriptor for inline payloads (Parquet), both `data` and `reference`
are returned `NULL`, and the caller must use `read_blob` to retrieve bytes.
+ - This config governs `INLINE` reads only. For `OUT_OF_LINE` storage, the
engine always returns a populated `reference` regardless of this setting.
+
+**Behavior matrix**
+
+| Access pattern | Storage | File format | `hoodie.read.blob.inline.mode` | `data` field | `reference` field | Raw bytes available? |
+|------------------|-------------|-------------|--------------------------------|--------------|-------------------|----------------------------|
+| `read_blob(col)` | INLINE | Parquet | (any) | n/a | n/a | Yes — returns bytes |
+| `read_blob(col)` | INLINE | Lance | (any) | n/a | n/a | Yes — returns bytes |
+| `read_blob(col)` | OUT_OF_LINE | (any) | (any) | n/a | n/a | Yes — returns bytes |
+| `SELECT *` | INLINE | Parquet | `CONTENT` (default) | bytes | NULL | Yes — via `data` |
+| `SELECT *` | INLINE | Parquet | `DESCRIPTOR` | **NULL** | **NULL** | No — must call `read_blob` |
+| `SELECT *` | INLINE | Lance | `CONTENT` (default) | bytes | NULL | Yes — via `data` |
Review Comment:
🤖 For `SELECT *` with INLINE + Parquet + `DESCRIPTOR`, both `data` and
`reference` return NULL. Could you clarify the rationale — is `DESCRIPTOR`
simply a no-op on Parquet, or should the reader instead fall back to `CONTENT`
(or fail fast) so users aren't silently handed a row that looks empty? It might
be worth calling out the intended UX here.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
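To make the question concrete, here is a minimal Python sketch of what `SELECT *` returns according to the quoted text. It is an illustrative model, not Hudi code: `BlobView`, `select_star_view`, and `looks_empty` are hypothetical names, and the strings `"bytes"` and `"descriptor"` stand in for real values. Two details are assumptions that the quoted hunk does not state: that `data` is NULL whenever `reference` is populated, and that every format other than Lance lacks a native descriptor for inline payloads.

```python
from typing import NamedTuple, Optional


class BlobView(NamedTuple):
    """Hypothetical stand-in for the Blob struct returned by SELECT *."""
    data: Optional[str]       # "bytes" when populated, None for NULL
    reference: Optional[str]  # "descriptor" when populated, None for NULL


def select_star_view(storage: str, file_format: str,
                     inline_mode: str = "CONTENT") -> BlobView:
    """Model which struct fields are populated for a direct column select."""
    if storage == "OUT_OF_LINE":
        # Quoted text: reference is always populated, config is ignored.
        # Assumption: data is NULL in this case.
        return BlobView(data=None, reference="descriptor")
    if storage != "INLINE":
        raise ValueError(f"unknown storage strategy: {storage}")

    if inline_mode == "CONTENT":
        # Inline bytes are eagerly materialized into `data`.
        return BlobView(data="bytes", reference=None)
    if inline_mode == "DESCRIPTOR":
        if file_format == "Lance":
            # Lance exposes an OUT_OF_LINE-shaped descriptor.
            # Assumption: data is NULL alongside it.
            return BlobView(data=None, reference="descriptor")
        # Parquet (and, by assumption, any other format): both fields NULL.
        return BlobView(data=None, reference=None)
    raise ValueError(f"unknown inline mode: {inline_mode}")


def looks_empty(view: BlobView) -> bool:
    """True when a non-null blob is indistinguishable from an empty struct."""
    return view.data is None and view.reference is None
```

Under this model, `looks_empty(select_star_view("INLINE", "Parquet", "DESCRIPTOR"))` is true, which is exactly the case the comment asks about: the row carries a blob, but nothing in the struct says so.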
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can
leverage to materiali
SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
```
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing
raw blob bytes in a query. It always returns the underlying `bytes` regardless
of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT
*`) returns the underlying `Blob` struct as-is. The contents of that struct
depend on the storage strategy, the file format, and the read mode, as
summarized below.
+
+**Reader Configuration**
Review Comment:
🤖 Could you clarify the schema contract for `read_blob`? Specifically: does
it always return `BINARY`/`bytes` (and what about NULL handling if the
underlying row has no blob), and what is the column lineage for
predicate/projection pushdown when `read_blob(col)` appears in the SELECT list?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can
leverage to materiali
SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
```
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing
raw blob bytes in a query. It always returns the underlying `bytes` regardless
of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT
*`) returns the underlying `Blob` struct as-is. The contents of that struct
depend on the storage strategy, the file format, and the read mode, as
summarized below.
+
+**Reader Configuration**
+
+- `hoodie.read.blob.inline.mode` — values `CONTENT` (default) | `DESCRIPTOR`.
+ - `CONTENT`: the engine eagerly materializes inline bytes into the struct's
`data` field.
+ - `DESCRIPTOR`: the engine returns an `OUT_OF_LINE`-shaped descriptor in the
`reference` field where the underlying file format supports it (Lance today),
enabling lazy byte materialization via `read_blob`. For file formats without a
native descriptor for inline payloads (Parquet), both `data` and `reference`
are returned `NULL`, and the caller must use `read_blob` to retrieve bytes.
+ - This config governs `INLINE` reads only. For `OUT_OF_LINE` storage, the
engine always returns a populated `reference` regardless of this setting.
Review Comment:
🤖 Since this config now has user-visible semantics described in an RFC, it
would be helpful to note its stability/compatibility commitment — is
`DESCRIPTOR` mode considered stable, experimental, or Lance-only-for-now?
Default-flipping later could be surprising.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can
leverage to materiali
SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
```
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing
raw blob bytes in a query. It always returns the underlying `bytes` regardless
of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT
*`) returns the underlying `Blob` struct as-is. The contents of that struct
depend on the storage strategy, the file format, and the read mode, as
summarized below.
+
+**Reader Configuration**
+
+- `hoodie.read.blob.inline.mode` — values `CONTENT` (default) | `DESCRIPTOR`.
Review Comment:
🤖 The doc only shows the Spark SQL surface. How do `read_blob` and
`hoodie.read.blob.inline.mode` behave across Flink/Trino/Presto readers? A
short note (even "out of scope for this RFC, tracked elsewhere") would help
readers understand the multi-engine story.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>