hudi-agent commented on code in PR #18744:
URL: https://github.com/apache/hudi/pull/18744#discussion_r3247921125
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -139,9 +139,9 @@ Selecting the blob column directly (e.g. `SELECT image_blob
FROM t` or `SELECT *
**Reader Configuration**
-- `hoodie.read.blob.inline.mode` β values `CONTENT` (default) | `DESCRIPTOR`.
+- `hoodie.read.blob.inline.mode` β values `DESCRIPTOR` (default) | `CONTENT`.
+ - `DESCRIPTOR` (default): the engine returns an `OUT_OF_LINE`-shaped
descriptor in the `reference` field where the underlying file format supports
it (Lance today), enabling lazy byte materialization via `read_blob`. For file
formats without a native descriptor for inline payloads (Parquet), both `data`
and `reference` are returned `NULL`, and the caller must use `read_blob` to
retrieve bytes.
- `CONTENT`: the engine eagerly materializes inline bytes into the struct's
`data` field.
Review Comment:
π€ Could the RFC discuss the lifecycle of the synthetic DESCRIPTOR returned
for INLINE+Lance? The descriptor encodes `(base file path, offset, length)`
into a managed Hudi base file β if cleaning, clustering, or a subsequent
compaction rewrites/removes that file before the caller invokes `read_blob()`,
the deferred resolution will fail or read stale bytes. OUT_OF_LINE references
are user-supplied and assumed externally stable; a synthetic descriptor
pointing at a Hudi-managed file is not. Is there a contract about how long
these descriptors stay valid (e.g. only within a single query/stage), and how
that's enforced?
<sub><i>- AI-generated; verify before applying. React π/π to flag
quality.</i></sub>
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -151,44 +151,64 @@ Selecting the blob column directly (e.g. `SELECT
image_blob FROM t` or `SELECT *
| `SELECT read_blob(col) FROM table` | INLINE | Parquet | (any)
| n/a | n/a | Yes β
returns bytes |
| `SELECT read_blob(col) FROM table` | INLINE | Lance | (any)
| n/a | n/a | Yes β
returns bytes |
| `SELECT read_blob(col) FROM table` | OUT_OF_LINE | (any) | (any)
| n/a | n/a | Yes β
returns bytes |
-| `SELECT col FROM table` | INLINE | Parquet | `CONTENT`
(default) | bytes | NULL | Yes β via
`data` |
-| `SELECT col FROM table` | INLINE | Parquet | `DESCRIPTOR`
| **NULL** | **NULL** | No β must call
`read_blob` |
-| `SELECT col FROM table` | INLINE | Lance | `CONTENT`
(default) | bytes | NULL | Yes β via
`data` |
-| `SELECT col FROM table` | INLINE | Lance | `DESCRIPTOR`
| NULL | populated (Lance blob enc.) | No β descriptor
visible; use `read_blob` for bytes|
+| `SELECT col FROM table` | INLINE | Parquet | `CONTENT`
| bytes | NULL | Yes β via `data`
|
+| `SELECT col FROM table` | INLINE | Parquet | `DESCRIPTOR`
(default) | bytesΒΉ | NULLΒΉ | Yes β via
`data`ΒΉ |
+| `SELECT col FROM table` | INLINE | Lance | `CONTENT`
| bytes | NULL | Yes β via `data`
|
+| `SELECT col FROM table` | INLINE | Lance | `DESCRIPTOR`
(default) | NULL | populated (Lance blob enc.) | No β
descriptor visible; use `read_blob` for bytes|
| `SELECT col FROM table` | OUT_OF_LINE | (any) | (irrelevant)
| NULL | populated | No β must call
`read_blob` |
+ΒΉ **Parquet + DESCRIPTOR is a no-op today.** The Parquet reader path does not
currently honor `hoodie.read.blob.inline.mode`, so on Parquet tables the config
is ignored and INLINE rows always come back in `CONTENT` shape regardless of
the setting. The default flip to `DESCRIPTOR` therefore has no observable
effect on Parquet tables. The matrix row above shows what users will actually
see today, not the eventual spec.
+
+**Eventual spec for Parquet + DESCRIPTOR (not yet implemented)**
+
+When Parquet gains DESCRIPTOR support, INLINE rows in `DESCRIPTOR` mode are
intended to return `data = NULL` and `reference = NULL` (no native positional
handle exists in Parquet for an inline byte array), with `read_blob`
materializing bytes through a format-specific path. Until then, treat
`DESCRIPTOR` as effectively `CONTENT` on Parquet.
+
**Why Parquet and Lance differ in `DESCRIPTOR` mode**
-Lance's native blob encoding stores blobs in a way that already exposes a
`(file, offset, length)` descriptor cheaply, so `DESCRIPTOR` mode surfaces it
directly in the `reference` field β effectively letting INLINE blobs be read
with the same deferred-materialization path used for OUT_OF_LINE references.
Parquet has no equivalent native descriptor for an inline byte array, so both
fields are `NULL` in `DESCRIPTOR` mode and the caller must use `read_blob` to
materialize bytes.
+Lance's native blob encoding stores blobs in a way that already exposes a
`(file, offset, length)` descriptor cheaply, so `DESCRIPTOR` mode surfaces it
directly in the `reference` field β effectively letting INLINE blobs be read
with the same deferred-materialization path used for OUT_OF_LINE references.
Parquet has no equivalent native descriptor for an inline byte array, which is
why the DESCRIPTOR path is currently a no-op there.
**Visual**
+What the user gets back, grouped by storage type (set at write time) and then
by query shape:
+
+```mermaid
+flowchart TD
+ ST{storage_type}
+
+ ST -->|OUT_OF_LINE| QO{Query}
+ QO -->|"SELECT col"| OOL["type = OUT_OF_LINE<br/>inline_data =
NULL<br/>reference = user-supplied"]
+ QO -->|"SELECT read_blob(col)"| RBO(["bytes β materialized<br/>via the
external reference"])
+
+ ST -->|INLINE| QI{Query}
+ QI -->|"SELECT col"| M{hoodie.read.blob.inline.mode}
+ M -->|CONTENT| CONT["type = INLINE<br/>inline_data = bytes<br/>reference =
NULL"]
+ M -->|DESCRIPTOR default| F{file format}
+ F -->|Lance| LD["type = INLINE<br/>inline_data = NULL<br/>reference =
synthetic managed<br/>path, offset, length, is_managed=true"]
+ F -->|"Parquet (today: mode no-op)"| PD["Parquet reader does
not<br/>implement DESCRIPTOR yet β<br/>returns CONTENT shape:<br/>inline_data =
bytes, reference = NULL"]
+
+ QI -->|"SELECT read_blob(col)"| RM{hoodie.read.blob.inline.mode}
Review Comment:
π€ The Lance branch shows `reference = synthetic managed path, offset,
length, is_managed=true`. Could you spell out what `is_managed=true` means
semantically and how it differs from a user-supplied OUT_OF_LINE reference
downstream β e.g., does it gate cleaning, does `read_blob()` route differently,
can a writer round-trip a row carrying this flag back into the table? This flag
appears to be the only signal distinguishing "safe to clean β Hudi owns the
bytes" from "do not clean β user owns the bytes", which is load-bearing.
<sub><i>- AI-generated; verify before applying. React π/π to flag
quality.</i></sub>
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala:
##########
@@ -208,16 +208,44 @@ class BatchedBlobReader(
// Dispatch based on storage_type (field 0)
val storageType = accessor.getString(blobStruct, 0)
if (storageType == HoodieSchema.Blob.INLINE) {
- // Case 1: Inline β bytes are in field 1
- val bytes = accessor.getBytes(blobStruct, 1)
- batch += RowInfo[R](
- originalRow = row,
- filePath = "",
- offset = -1,
- length = -1,
- index = rowIndex,
- inlineBytes = Some(bytes)
- )
+ // INLINE rows can arrive in two shapes:
+ // - CONTENT-mode read: inline_data populated with raw bytes
(field 1).
+ // - DESCRIPTOR-mode read: inline_data is null and the
reference struct
+ // (field 2) carries a synthetic positional pointer into the
base file.
+ // read_blob() must return bytes in both cases, so we fall back
to the
+ // descriptor's range read when inline_data is absent.
+ val inlineIsNull = accessor.isNullAt(blobStruct, 1)
+ if (!inlineIsNull) {
+ val bytes = accessor.getBytes(blobStruct, 1)
+ batch += RowInfo[R](
+ originalRow = row,
+ filePath = "",
+ offset = -1,
+ length = -1,
+ index = rowIndex,
+ inlineBytes = Some(bytes)
+ )
+ } else {
+ require(!accessor.isNullAt(blobStruct, 2),
+ s"INLINE blob at row $rowIndex has null inline_data and null
reference; cannot resolve bytes")
+ val referenceStruct = accessor.getStruct(blobStruct, 2,
HoodieSchema.Blob.getReferenceFieldCount)
+ val filePath = accessor.getString(referenceStruct, 0)
+ require(filePath != null && filePath.nonEmpty,
+ s"INLINE blob descriptor at row $rowIndex must have
non-empty external_path")
+ require(!accessor.isNullAt(referenceStruct, 1) &&
!accessor.isNullAt(referenceStruct, 2),
+ s"INLINE blob descriptor at row $rowIndex must set both
offset and length")
+ val offset = accessor.getLong(referenceStruct, 1)
+ val length = accessor.getLong(referenceStruct, 2)
+ require(offset >= 0, s"INLINE blob descriptor offset must be
non-negative for '$filePath': $offset")
+ require(length >= 0, s"INLINE blob descriptor length must be
non-negative for '$filePath': $length")
+ batch += RowInfo[R](
+ originalRow = row,
+ filePath = filePath,
+ offset = offset,
+ length = length,
+ index = rowIndex
Review Comment:
π€ The `require(...)` calls here will throw `IllegalArgumentException`
mid-batch on any malformed INLINE+DESCRIPTOR row (null inline_data and
null/incomplete reference). On a corrupted file this kills the whole task and
surfaces as a confusing user error far from the cause. Would it be worth using
a Hudi-typed exception with the file path + row index, or at least a top-level
wrapper that names the column and source file? Same question for the
offset/length sign checks β those would only fire on writer bugs, but the
diagnostic is what determines whether anyone can debug it.
<sub><i>- AI-generated; verify before applying. React π/π to flag
quality.</i></sub>
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -151,44 +151,64 @@ Selecting the blob column directly (e.g. `SELECT
image_blob FROM t` or `SELECT *
| `SELECT read_blob(col) FROM table` | INLINE | Parquet | (any)
| n/a | n/a | Yes β
returns bytes |
| `SELECT read_blob(col) FROM table` | INLINE | Lance | (any)
| n/a | n/a | Yes β
returns bytes |
| `SELECT read_blob(col) FROM table` | OUT_OF_LINE | (any) | (any)
| n/a | n/a | Yes β
returns bytes |
-| `SELECT col FROM table` | INLINE | Parquet | `CONTENT`
(default) | bytes | NULL | Yes β via
`data` |
-| `SELECT col FROM table` | INLINE | Parquet | `DESCRIPTOR`
| **NULL** | **NULL** | No β must call
`read_blob` |
-| `SELECT col FROM table` | INLINE | Lance | `CONTENT`
(default) | bytes | NULL | Yes β via
`data` |
-| `SELECT col FROM table` | INLINE | Lance | `DESCRIPTOR`
| NULL | populated (Lance blob enc.) | No β descriptor
visible; use `read_blob` for bytes|
+| `SELECT col FROM table` | INLINE | Parquet | `CONTENT`
| bytes | NULL | Yes β via `data`
|
+| `SELECT col FROM table` | INLINE | Parquet | `DESCRIPTOR`
(default) | bytesΒΉ | NULLΒΉ | Yes β via
`data`ΒΉ |
+| `SELECT col FROM table` | INLINE | Lance | `CONTENT`
| bytes | NULL | Yes β via `data`
|
+| `SELECT col FROM table` | INLINE | Lance | `DESCRIPTOR`
(default) | NULL | populated (Lance blob enc.) | No β
descriptor visible; use `read_blob` for bytes|
Review Comment:
π€ The new compaction test reveals that `HoodieSparkLanceReader` must
hard-pin CONTENT regardless of `hoodie.read.blob.inline.mode` β otherwise the
compactor reads `data=NULL` for untouched rows and silently rewrites the base
file without bytes. That's a critical correctness invariant that the RFC
doesn't currently mention. Could you add a note (probably in this section or in
the Writer/Compaction section) that internal table-services readers MUST ignore
this config and always materialize CONTENT? Otherwise a future refactor that
"unifies" reader configs could re-introduce silent data loss. @yihua might want
to weigh in.
<sub><i>- AI-generated; verify before applying. React π/π to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]