yihua commented on code in PR #18728:
URL: https://github.com/apache/hudi/pull/18728#discussion_r3244521737
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can
leverage to materiali
SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
```
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing raw blob bytes in a query. It always returns the underlying `bytes` regardless of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT *`) returns the underlying `Blob` struct as-is. The contents of that struct depend on the storage strategy, the file format, and the read mode, as summarized below.
+
+**Reader Configuration**
+
+- `hoodie.read.blob.inline.mode` — values `CONTENT` (default) | `DESCRIPTOR`.
+  - `CONTENT`: the engine eagerly materializes inline bytes into the struct's `data` field.
+  - `DESCRIPTOR`: the engine returns an `OUT_OF_LINE`-shaped descriptor in the `reference` field where the underlying file format supports it (Lance today), enabling lazy byte materialization via `read_blob`. For file formats without a native descriptor for inline payloads (Parquet), both `data` and `reference` are returned `NULL`, and the caller must use `read_blob` to retrieve bytes.
+  - This config governs `INLINE` reads only. For `OUT_OF_LINE` storage, the engine always returns a populated `reference` regardless of this setting.
+
+**Behavior matrix**
+
+| Access pattern | Storage | File format | `hoodie.read.blob.inline.mode` | `data` field | `reference` field | Raw bytes available? |
+|------------------|--------------|-------------|--------------------------------|--------------|------------------------------|---------------------------------------------------|
+| `SELECT read_blob(col) FROM table` | INLINE | Parquet | (any) | n/a | n/a | Yes — returns bytes |
+| `SELECT read_blob(col) FROM table` | INLINE | Lance | (any) | n/a | n/a | Yes — returns bytes |
+| `SELECT read_blob(col) FROM table` | OUT_OF_LINE | (any) | (any) | n/a | n/a | Yes — returns bytes |
+| `SELECT col FROM table` | INLINE | Parquet | `CONTENT` (default) | bytes | NULL | Yes — via `data` |
+| `SELECT col FROM table` | INLINE | Parquet | `DESCRIPTOR` | **NULL** | **NULL** | No — must call `read_blob` |
+| `SELECT col FROM table` | INLINE | Lance | `CONTENT` (default) | bytes | NULL | Yes — via `data` |
+| `SELECT col FROM table` | INLINE | Lance | `DESCRIPTOR` | NULL | populated (Lance blob enc.) | No — descriptor visible; use `read_blob` for bytes |
+| `SELECT col FROM table` | OUT_OF_LINE | (any) | (irrelevant) | NULL | populated | No — must call `read_blob` |
+
+**Why Parquet and Lance differ in `DESCRIPTOR` mode**
+
+Lance's native blob encoding stores blobs in a way that already exposes a `(file, offset, length)` descriptor cheaply, so `DESCRIPTOR` mode surfaces it directly in the `reference` field — effectively letting INLINE blobs be read with the same deferred-materialization path used for OUT_OF_LINE references. Parquet has no equivalent native descriptor for an inline byte array, so both fields are `NULL` in `DESCRIPTOR` mode and the caller must use `read_blob` to materialize bytes.
+
+**Visual**
+
+```
+ ┌──────────────────────────────────────────────────────────────────┐
+ │ read_blob(col) ── universal, always materializes bytes ──│
+ │ │ │
+ │ ▼ │
+ │ ┌─────────────┐ INLINE ───► read inline payload │
+ │ │ Hudi reader │ ──┤ │
+ │ └─────────────┘ OUT_OF_LINE ► follow reference → read bytes │
+ └──────────────────────────────────────────────────────────────────┘
+
+ ┌──────────────────────────────────────────────────────────────────┐
+ │ SELECT * (returns Blob struct as-is) │
Review Comment:
```suggestion
│ SELECT col (returns Blob struct as-is) │
```
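
For anyone checking the `SELECT col` rows of the behavior matrix against this diagram, the struct contents can be sketched as a small lookup. This is only an illustrative Python model of the table quoted in this hunk, not Hudi code: `BlobView`, `select_col_view`, and `DESCRIPTOR_CAPABLE_FORMATS` are hypothetical names, and the strings `bytes`, `populated`, and `NULL` simply mirror the table cells.

```python
from typing import NamedTuple

# Assumption: only Lance exposes a native descriptor for inline blobs,
# matching the "(Lance today)" wording in the quoted RFC text.
DESCRIPTOR_CAPABLE_FORMATS = {"Lance"}


class BlobView(NamedTuple):
    """What a plain `SELECT col` shows for one combination (hypothetical model)."""
    data: str                  # "bytes" or "NULL"
    reference: str             # "populated" or "NULL"
    raw_bytes_available: bool  # True only when `data` already holds the bytes


def select_col_view(storage: str, file_format: str,
                    inline_mode: str = "CONTENT") -> BlobView:
    """Mirror the `SELECT col` rows of the behavior matrix."""
    if storage == "OUT_OF_LINE":
        # The inline mode is irrelevant here: the reference is always populated.
        return BlobView("NULL", "populated", False)
    if storage != "INLINE":
        raise ValueError(f"unknown storage strategy: {storage}")
    if inline_mode == "CONTENT":
        # Inline bytes are eagerly materialized into `data`.
        return BlobView("bytes", "NULL", True)
    if inline_mode == "DESCRIPTOR":
        if file_format in DESCRIPTOR_CAPABLE_FORMATS:
            return BlobView("NULL", "populated", False)
        # No native descriptor (Parquet): both fields come back NULL.
        return BlobView("NULL", "NULL", False)
    raise ValueError(f"unknown inline mode: {inline_mode}")


if __name__ == "__main__":
    for fmt in ("Parquet", "Lance"):
        for mode in ("CONTENT", "DESCRIPTOR"):
            print(fmt, mode, select_col_view("INLINE", fmt, mode))
```

`read_blob(col)` is left out of the sketch because, per the matrix, it returns bytes in every combination.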