hudi-agent commented on code in PR #18744:
URL: https://github.com/apache/hudi/pull/18744#discussion_r3247148447
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -163,32 +163,46 @@ Lance's native blob encoding stores blobs in a way that
already exposes a `(file
**Visual**
+What the user gets back, grouped by storage type (set at write time) and then
by query shape:
+
+```mermaid
+flowchart TD
+ ST{storage_type}
+
+ ST -->|OUT_OF_LINE| QO{Query}
+ QO -->|"SELECT col"| OOL["type = OUT_OF_LINE<br/>inline_data =
NULL<br/>reference = user-supplied"]
+ QO -->|"SELECT read_blob(col)"| RBO(["bytes — materialized<br/>via the
external reference"])
+
+ ST -->|INLINE| QI{Query}
+ QI -->|"SELECT col"| M{hoodie.read.blob.inline.mode}
+ M -->|CONTENT default| CONT["type = INLINE<br/>inline_data =
bytes<br/>reference = NULL"]
Review Comment:
🤖 The mermaid labels still read `CONTENT default` for the INLINE branches
(both the row-shape diagram and the `read_blob()` resolution diagram), but this
PR flips the default to DESCRIPTOR. Could you update the labels so the diagram
matches the actual default in `HoodieReaderConfig`? Otherwise readers will come
away believing CONTENT is still the default.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -163,32 +163,46 @@ Lance's native blob encoding stores blobs in a way that
already exposes a `(file
**Visual**
+What the user gets back, grouped by storage type (set at write time) and then
by query shape:
+
+```mermaid
+flowchart TD
+ ST{storage_type}
+
+ ST -->|OUT_OF_LINE| QO{Query}
+ QO -->|"SELECT col"| OOL["type = OUT_OF_LINE<br/>inline_data =
NULL<br/>reference = user-supplied"]
+ QO -->|"SELECT read_blob(col)"| RBO(["bytes — materialized<br/>via the
external reference"])
+
+ ST -->|INLINE| QI{Query}
+ QI -->|"SELECT col"| M{hoodie.read.blob.inline.mode}
+ M -->|CONTENT default| CONT["type = INLINE<br/>inline_data =
bytes<br/>reference = NULL"]
+ M -->|DESCRIPTOR| F{file format}
+ F -->|Lance| LD["type = INLINE<br/>inline_data = NULL<br/>reference =
synthetic managed<br/>path, offset, length, is_managed=true"]
+ F -->|Parquet| PD["type = INLINE<br/>inline_data = NULL<br/>reference =
NULL<br/>bytes only via read_blob"]
+
+ QI -->|"SELECT read_blob(col)"| RM{hoodie.read.blob.inline.mode}
+ RM -->|CONTENT default| RBC(["bytes from inline_data on the row<br/>1
hop"])
Review Comment:
🤖 The Parquet + DESCRIPTOR branch shows `reference = NULL` and then claims
bytes are available via `read_blob()` through a "format-specific
materialization path", but the PR description says the Parquet reader does not
honor `BLOB_INLINE_READ_MODE` today. Could you clarify what actually happens
for Parquet under the new default — does the reader silently fall back to
CONTENT shape, or is DESCRIPTOR a no-op? It would help to spell out the matrix
explicitly so users on Parquet aren't surprised by the default flip.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java:
##########
@@ -107,13 +107,13 @@ public class HoodieReaderConfig extends HoodieConfig {
public static final String BLOB_INLINE_READ_MODE_DESCRIPTOR = "DESCRIPTOR";
public static final ConfigProperty<String> BLOB_INLINE_READ_MODE =
ConfigProperty
.key("hoodie.read.blob.inline.mode")
- .defaultValue(BLOB_INLINE_READ_MODE_CONTENT)
+ .defaultValue(BLOB_INLINE_READ_MODE_DESCRIPTOR)
Review Comment:
🤖 This is a user-visible behavior change for `SELECT *` on INLINE blob
columns. The config carries `sinceVersion("1.2.0")` — has 1.2.0 already shipped
CONTENT as the default? If yes, the flip should bump `sinceVersion`/document
the breaking change explicitly; if no, it would help to call that out so
reviewers can confirm no released version locks in the old default. Either way,
a short "Migration / Upgrade" subsection in the RFC pointing users to set
`hoodie.read.blob.inline.mode=CONTENT` to preserve prior behavior would make
this easier to roll out.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -163,32 +163,46 @@ Lance's native blob encoding stores blobs in a way that
already exposes a `(file
**Visual**
Review Comment:
🤖 It would help to add a short "Alternatives considered" note for the
default flip — e.g., gating DESCRIPTOR per-table/per-query vs flipping the
global default, or making DESCRIPTOR the default only when the engine signals
projection without `read_blob()`. The current text presents the flip as the
only option; documenting why a global flip is preferable would strengthen the
RFC.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -163,32 +163,46 @@ Lance's native blob encoding stores blobs in a way that
already exposes a `(file
**Visual**
+What the user gets back, grouped by storage type (set at write time) and then
by query shape:
+
+```mermaid
+flowchart TD
+ ST{storage_type}
+
+ ST -->|OUT_OF_LINE| QO{Query}
+ QO -->|"SELECT col"| OOL["type = OUT_OF_LINE<br/>inline_data =
NULL<br/>reference = user-supplied"]
+ QO -->|"SELECT read_blob(col)"| RBO(["bytes — materialized<br/>via the
external reference"])
+
+ ST -->|INLINE| QI{Query}
+ QI -->|"SELECT col"| M{hoodie.read.blob.inline.mode}
+ M -->|CONTENT default| CONT["type = INLINE<br/>inline_data =
bytes<br/>reference = NULL"]
+ M -->|DESCRIPTOR| F{file format}
+ F -->|Lance| LD["type = INLINE<br/>inline_data = NULL<br/>reference =
synthetic managed<br/>path, offset, length, is_managed=true"]
+ F -->|Parquet| PD["type = INLINE<br/>inline_data = NULL<br/>reference =
NULL<br/>bytes only via read_blob"]
+
+ QI -->|"SELECT read_blob(col)"| RM{hoodie.read.blob.inline.mode}
+ RM -->|CONTENT default| RBC(["bytes from inline_data on the row<br/>1
hop"])
+ RM -->|DESCRIPTOR| RF{file format}
+ RF -->|Lance| RBL(["bytes via raw pread at<br/>synth reference
offset/length<br/>2 hops"])
+ RF -->|Parquet| RBP(["bytes via format-specific<br/>materialization
path<br/>row has no descriptor"])
```
- ┌──────────────────────────────────────────────────────────────────┐
- │ read_blob(col) ── universal, always materializes bytes ──│
- │ │ │
- │ ▼ │
- │ ┌─────────────┐ INLINE ───► read inline payload │
- │ │ Hudi reader │ ──┤ │
- │ └─────────────┘ OUT_OF_LINE ► follow reference → read bytes │
- └──────────────────────────────────────────────────────────────────┘
-
- ┌──────────────────────────────────────────────────────────────────┐
- │ SELECT col (returns Blob struct as-is) │
- │ │ │
- │ ▼ │
- │ storage = OUT_OF_LINE ─────────────► data=NULL, reference=set │
- │ │
- │ storage = INLINE, │
- │ inline.mode = CONTENT (default) ───► data=<bytes>, ref=NULL │
- │ │
- │ storage = INLINE, │
- │ inline.mode = DESCRIPTOR │
- │ ├─ Parquet ─────────────────────► data=NULL, ref=NULL │
- │ └─ Lance ─────────────────────► data=NULL, ref=set │
- └──────────────────────────────────────────────────────────────────┘
+
+`read_blob(col)` byte resolution — works regardless of storage type or mode,
with hop count depending on the row shape that arrives:
+
+```mermaid
+flowchart LR
+ RB[/"SELECT read_blob(col)"/] --> Scan["Scan emits Blob struct"]
+ Scan --> Shape{"row shape"}
+ Shape -->|"inline_data populated<br/>(CONTENT mode)"|
Direct["BatchedBlobReader:<br/>read inline_data off the row<br/><b>1 hop
total</b>"]
+ Shape -->|"reference populated<br/>(OUT_OF_LINE, or DESCRIPTOR + Lance)"|
Indirect["BatchedBlobReader:<br/>openSeekable(external_path)<br/>seek(offset),
readFully(length)<br/><b>2 hops total</b>"]
+ Direct --> Bytes(["bytes"])
+ Indirect --> Bytes
```
+Notes:
+- For DESCRIPTOR + Lance, hop 2 is a raw filesystem `pread` against the
`.lance` file at the descriptor's `(offset, length)`. It bypasses Lance's
decoder entirely — the blob encoding (`lance-encoding:blob=true` on a
`LargeBinary` column) stores blob bytes contiguously at the position Lance
reports, so direct byte access is safe.
+- Plain `SELECT col` (no `read_blob`) is always 1 hop. DESCRIPTOR's win is
that hop 1 skips blob decoding when bytes aren't needed.
+
### 3. Writer
Review Comment:
🤖 The RFC notes the DESCRIPTOR + Lance hop-2 read "bypasses Lance's decoder
entirely" via raw pread at the recorded offset/length. Is there a stability
contract from Lance that the `(offset, length)` reported on a blob-encoded
`LargeBinary` column is guaranteed to be the on-disk byte range across Lance
versions? If Lance ever switches blob storage to a chunked/compressed layout,
raw pread would silently return garbage. Worth either citing the Lance contract
here or noting the version pin / detection strategy.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala:
##########
@@ -208,16 +208,44 @@ class BatchedBlobReader(
// Dispatch based on storage_type (field 0)
val storageType = accessor.getString(blobStruct, 0)
if (storageType == HoodieSchema.Blob.INLINE) {
- // Case 1: Inline — bytes are in field 1
- val bytes = accessor.getBytes(blobStruct, 1)
- batch += RowInfo[R](
- originalRow = row,
- filePath = "",
- offset = -1,
- length = -1,
- index = rowIndex,
- inlineBytes = Some(bytes)
- )
+ // INLINE rows can arrive in two shapes:
+ // - CONTENT-mode read: inline_data populated with raw bytes
(field 1).
+ // - DESCRIPTOR-mode read: inline_data is null and the
reference struct
+ // (field 2) carries a synthetic positional pointer into the
base file.
+ // read_blob() must return bytes in both cases, so we fall back
to the
+ // descriptor's range read when inline_data is absent.
+ val inlineIsNull = accessor.isNullAt(blobStruct, 1)
+ if (!inlineIsNull) {
+ val bytes = accessor.getBytes(blobStruct, 1)
+ batch += RowInfo[R](
+ originalRow = row,
+ filePath = "",
+ offset = -1,
+ length = -1,
+ index = rowIndex,
+ inlineBytes = Some(bytes)
+ )
+ } else {
+ require(!accessor.isNullAt(blobStruct, 2),
+ s"INLINE blob at row $rowIndex has null inline_data and null
reference; cannot resolve bytes")
+ val referenceStruct = accessor.getStruct(blobStruct, 2,
HoodieSchema.Blob.getReferenceFieldCount)
+ val filePath = accessor.getString(referenceStruct, 0)
+ require(filePath != null && filePath.nonEmpty,
+ s"INLINE blob descriptor at row $rowIndex must have
non-empty external_path")
Review Comment:
🤖 The INLINE-with-null-inline_data branch silently performs a remote pread
against the synthetic descriptor's `external_path`. Could the RFC discuss the
failure mode and security implications of trusting that path? If a
malformed/forged base file produces a descriptor pointing at an arbitrary file
or with an `is_managed=true` claim that isn't validated against the table's
known file set, `read_blob()` could end up reading bytes outside the table.
Some validation (e.g. `is_managed` must be true and the path must lie under the
table base) seems worth calling out as a design invariant.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]