yihua commented on code in PR #18728:
URL: https://github.com/apache/hudi/pull/18728#discussion_r3244515044
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can
leverage to materiali
SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
```
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing raw blob bytes in a query. It always returns the underlying `bytes` regardless of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT *`) returns the underlying `Blob` struct as-is. The contents of that struct depend on the storage strategy, the file format, and the read mode, as summarized below.
+
+**Reader Configuration**
+
+- `hoodie.read.blob.inline.mode` — values `CONTENT` (default) | `DESCRIPTOR`.
+  - `CONTENT`: the engine eagerly materializes inline bytes into the struct's `data` field.
+  - `DESCRIPTOR`: the engine returns an `OUT_OF_LINE`-shaped descriptor in the `reference` field where the underlying file format supports it (Lance today), enabling lazy byte materialization via `read_blob`. For file formats without a native descriptor for inline payloads (Parquet), both `data` and `reference` are returned `NULL`, and the caller must use `read_blob` to retrieve bytes.
+  - This config governs `INLINE` reads only. For `OUT_OF_LINE` storage, the engine always returns a populated `reference` regardless of this setting.
+
+**Behavior matrix**
+
+| Access pattern   | Storage     | File format | `hoodie.read.blob.inline.mode` | `data` field | `reference` field           | Raw bytes available?                               |
+|------------------|-------------|-------------|--------------------------------|--------------|-----------------------------|----------------------------------------------------|
+| `read_blob(col)` | INLINE      | Parquet     | (any)                          | n/a          | n/a                         | Yes — returns bytes                                |
+| `read_blob(col)` | INLINE      | Lance       | (any)                          | n/a          | n/a                         | Yes — returns bytes                                |
+| `read_blob(col)` | OUT_OF_LINE | (any)       | (any)                          | n/a          | n/a                         | Yes — returns bytes                                |
+| `SELECT *`       | INLINE      | Parquet     | `CONTENT` (default)            | bytes        | NULL                        | Yes — via `data`                                   |
+| `SELECT *`       | INLINE      | Parquet     | `DESCRIPTOR`                   | **NULL**     | **NULL**                    | No — must call `read_blob`                         |
+| `SELECT *`       | INLINE      | Lance       | `CONTENT` (default)            | bytes        | NULL                        | Yes — via `data`                                   |
+| `SELECT *`       | INLINE      | Lance       | `DESCRIPTOR`                   | NULL         | populated (Lance blob enc.) | No — descriptor visible; use `read_blob` for bytes |
+| `SELECT *`       | OUT_OF_LINE | (any)       | (irrelevant)                   | NULL         | populated                   | No — must call `read_blob`                         |
Review Comment:
```suggestion
| `SELECT col FROM table` | INLINE      | Parquet | `CONTENT` (default) | bytes    | NULL                        | Yes — via `data`                                   |
| `SELECT col FROM table` | INLINE      | Parquet | `DESCRIPTOR`        | **NULL** | **NULL**                    | No — must call `read_blob`                         |
| `SELECT col FROM table` | INLINE      | Lance   | `CONTENT` (default) | bytes    | NULL                        | Yes — via `data`                                   |
| `SELECT col FROM table` | INLINE      | Lance   | `DESCRIPTOR`        | NULL     | populated (Lance blob enc.) | No — descriptor visible; use `read_blob` for bytes |
| `SELECT col FROM table` | OUT_OF_LINE | (any)   | (irrelevant)        | NULL     | populated                   | No — must call `read_blob`                         |
```
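One way to sanity-check the direct-select rows is to model them as a small lookup. This is only a sketch: `BlobShape` and `blob_struct_shape` are hypothetical names, not Hudi APIs, and the strings `"bytes"` and `"populated"` are placeholders for the cell values in the matrix. It also assumes Lance is the only format with a native descriptor for inline payloads, which is what the text says holds today.

```python
from typing import NamedTuple, Optional


class BlobShape(NamedTuple):
    """Hypothetical summary of one matrix row for a direct column select."""
    data: Optional[str]        # "bytes" when populated, None for NULL
    reference: Optional[str]   # "populated" when set, None for NULL
    raw_bytes_available: bool  # True only when bytes are readable from `data`


def blob_struct_shape(storage: str, file_format: str,
                      inline_mode: str = "CONTENT") -> BlobShape:
    """Model what `SELECT col FROM table` returns, per the behavior matrix."""
    if storage == "OUT_OF_LINE":
        # The inline-mode config governs INLINE reads only.
        return BlobShape(None, "populated", False)
    if storage != "INLINE":
        raise ValueError(f"unknown storage strategy: {storage}")
    if inline_mode == "CONTENT":
        # Inline bytes are eagerly materialized into `data`.
        return BlobShape("bytes", None, True)
    if inline_mode == "DESCRIPTOR":
        if file_format == "Lance":
            # Assumption: only Lance exposes a descriptor for inline payloads.
            return BlobShape(None, "populated", False)
        # Parquet and similar formats: both fields are NULL.
        return BlobShape(None, None, False)
    raise ValueError(f"unknown inline mode: {inline_mode}")
```

For example, `blob_struct_shape("INLINE", "Parquet", "DESCRIPTOR")` gives both fields as `None`, matching the bolded **NULL** row, which is the case where callers have no option other than `read_blob`.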
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can
leverage to materiali
SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
```
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing raw blob bytes in a query. It always returns the underlying `bytes` regardless of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT *`) returns the underlying `Blob` struct as-is. The contents of that struct depend on the storage strategy, the file format, and the read mode, as summarized below.
+
+**Reader Configuration**
+
+- `hoodie.read.blob.inline.mode` — values `CONTENT` (default) | `DESCRIPTOR`.
+  - `CONTENT`: the engine eagerly materializes inline bytes into the struct's `data` field.
+  - `DESCRIPTOR`: the engine returns an `OUT_OF_LINE`-shaped descriptor in the `reference` field where the underlying file format supports it (Lance today), enabling lazy byte materialization via `read_blob`. For file formats without a native descriptor for inline payloads (Parquet), both `data` and `reference` are returned `NULL`, and the caller must use `read_blob` to retrieve bytes.
+  - This config governs `INLINE` reads only. For `OUT_OF_LINE` storage, the engine always returns a populated `reference` regardless of this setting.
+
+**Behavior matrix**
+
+| Access pattern   | Storage     | File format | `hoodie.read.blob.inline.mode` | `data` field | `reference` field           | Raw bytes available?                               |
+|------------------|-------------|-------------|--------------------------------|--------------|-----------------------------|----------------------------------------------------|
+| `read_blob(col)` | INLINE      | Parquet     | (any)                          | n/a          | n/a                         | Yes — returns bytes                                |
+| `read_blob(col)` | INLINE      | Lance       | (any)                          | n/a          | n/a                         | Yes — returns bytes                                |
+| `read_blob(col)` | OUT_OF_LINE | (any)       | (any)                          | n/a          | n/a                         | Yes — returns bytes                                |
Review Comment:
```suggestion
| `SELECT read_blob(col) FROM table` | INLINE      | Parquet | (any) | n/a | n/a | Yes — returns bytes |
| `SELECT read_blob(col) FROM table` | INLINE      | Lance   | (any) | n/a | n/a | Yes — returns bytes |
| `SELECT read_blob(col) FROM table` | OUT_OF_LINE | (any)   | (any) | n/a | n/a | Yes — returns bytes |
```