hudi-agent commented on code in PR #18744:
URL: https://github.com/apache/hudi/pull/18744#discussion_r3247148447


##########
rfc/rfc-100/rfc-100.md:
##########
@@ -163,32 +163,46 @@ Lance's native blob encoding stores blobs in a way that 
already exposes a `(file
 
 **Visual**
 
+What the user gets back, grouped by storage type (set at write time) and then 
by query shape:
+
+```mermaid
+flowchart TD
+    ST{storage_type}
+
+    ST -->|OUT_OF_LINE| QO{Query}
+    QO -->|"SELECT col"| OOL["type = OUT_OF_LINE<br/>inline_data = 
NULL<br/>reference = user-supplied"]
+    QO -->|"SELECT read_blob(col)"| RBO(["bytes — materialized<br/>via the 
external reference"])
+
+    ST -->|INLINE| QI{Query}
+    QI -->|"SELECT col"| M{hoodie.read.blob.inline.mode}
+    M -->|CONTENT default| CONT["type = INLINE<br/>inline_data = 
bytes<br/>reference = NULL"]

Review Comment:
   🤖 The mermaid labels still read `CONTENT default` for the INLINE branches 
(both the row-shape diagram and the `read_blob()` resolution diagram), but this 
PR flips the default to DESCRIPTOR. Could you update the labels so the diagram 
matches the actual default in `HoodieReaderConfig`? Otherwise readers will come 
away believing CONTENT is still the default.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
rfc/rfc-100/rfc-100.md:
##########
@@ -163,32 +163,46 @@ Lance's native blob encoding stores blobs in a way that 
already exposes a `(file
 
 **Visual**
 
+What the user gets back, grouped by storage type (set at write time) and then 
by query shape:
+
+```mermaid
+flowchart TD
+    ST{storage_type}
+
+    ST -->|OUT_OF_LINE| QO{Query}
+    QO -->|"SELECT col"| OOL["type = OUT_OF_LINE<br/>inline_data = 
NULL<br/>reference = user-supplied"]
+    QO -->|"SELECT read_blob(col)"| RBO(["bytes — materialized<br/>via the 
external reference"])
+
+    ST -->|INLINE| QI{Query}
+    QI -->|"SELECT col"| M{hoodie.read.blob.inline.mode}
+    M -->|CONTENT default| CONT["type = INLINE<br/>inline_data = 
bytes<br/>reference = NULL"]
+    M -->|DESCRIPTOR| F{file format}
+    F -->|Lance| LD["type = INLINE<br/>inline_data = NULL<br/>reference = 
synthetic managed<br/>path, offset, length, is_managed=true"]
+    F -->|Parquet| PD["type = INLINE<br/>inline_data = NULL<br/>reference = 
NULL<br/>bytes only via read_blob"]
+
+    QI -->|"SELECT read_blob(col)"| RM{hoodie.read.blob.inline.mode}
+    RM -->|CONTENT default| RBC(["bytes from inline_data on the row<br/>1 
hop"])

Review Comment:
   🤖 The Parquet + DESCRIPTOR branch shows `reference = NULL` and then claims 
bytes are available via `read_blob()` through a "format-specific 
materialization path", but the PR description says the Parquet reader does not 
honor `BLOB_INLINE_READ_MODE` today. Could you clarify what actually happens 
for Parquet under the new default — does the reader silently fall back to 
CONTENT shape, or is DESCRIPTOR a no-op? It would help to spell out the matrix 
explicitly so users on Parquet aren't surprised by the default flip.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java:
##########
@@ -107,13 +107,13 @@ public class HoodieReaderConfig extends HoodieConfig {
   public static final String BLOB_INLINE_READ_MODE_DESCRIPTOR = "DESCRIPTOR";
   public static final ConfigProperty<String> BLOB_INLINE_READ_MODE = 
ConfigProperty
       .key("hoodie.read.blob.inline.mode")
-      .defaultValue(BLOB_INLINE_READ_MODE_CONTENT)
+      .defaultValue(BLOB_INLINE_READ_MODE_DESCRIPTOR)

Review Comment:
   🤖 This is a user-visible behavior change for `SELECT *` on INLINE blob 
columns. The config carries `sinceVersion("1.2.0")` — has 1.2.0 already shipped 
CONTENT as the default? If yes, the flip should bump `sinceVersion`/document 
the breaking change explicitly; if no, it would help to call that out so 
reviewers can confirm no released version locks in the old default. Either way, 
a short "Migration / Upgrade" subsection in the RFC pointing users to set 
`hoodie.read.blob.inline.mode=CONTENT` to preserve prior behavior would make 
this easier to roll out.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
rfc/rfc-100/rfc-100.md:
##########
@@ -163,32 +163,46 @@ Lance's native blob encoding stores blobs in a way that 
already exposes a `(file
 
 **Visual**

Review Comment:
   🤖 It would help to add a short "Alternatives considered" note for the 
default flip — e.g., gating DESCRIPTOR per-table/per-query vs flipping the 
global default, or making DESCRIPTOR the default only when the engine signals 
projection without `read_blob()`. The current text presents the flip as the 
only option; documenting why a global flip is preferable would strengthen the 
RFC.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
rfc/rfc-100/rfc-100.md:
##########
@@ -163,32 +163,46 @@ Lance's native blob encoding stores blobs in a way that 
already exposes a `(file
 
 **Visual**
 
+What the user gets back, grouped by storage type (set at write time) and then 
by query shape:
+
+```mermaid
+flowchart TD
+    ST{storage_type}
+
+    ST -->|OUT_OF_LINE| QO{Query}
+    QO -->|"SELECT col"| OOL["type = OUT_OF_LINE<br/>inline_data = 
NULL<br/>reference = user-supplied"]
+    QO -->|"SELECT read_blob(col)"| RBO(["bytes — materialized<br/>via the 
external reference"])
+
+    ST -->|INLINE| QI{Query}
+    QI -->|"SELECT col"| M{hoodie.read.blob.inline.mode}
+    M -->|CONTENT default| CONT["type = INLINE<br/>inline_data = 
bytes<br/>reference = NULL"]
+    M -->|DESCRIPTOR| F{file format}
+    F -->|Lance| LD["type = INLINE<br/>inline_data = NULL<br/>reference = 
synthetic managed<br/>path, offset, length, is_managed=true"]
+    F -->|Parquet| PD["type = INLINE<br/>inline_data = NULL<br/>reference = 
NULL<br/>bytes only via read_blob"]
+
+    QI -->|"SELECT read_blob(col)"| RM{hoodie.read.blob.inline.mode}
+    RM -->|CONTENT default| RBC(["bytes from inline_data on the row<br/>1 
hop"])
+    RM -->|DESCRIPTOR| RF{file format}
+    RF -->|Lance| RBL(["bytes via raw pread at<br/>synth reference 
offset/length<br/>2 hops"])
+    RF -->|Parquet| RBP(["bytes via format-specific<br/>materialization 
path<br/>row has no descriptor"])
 ```
-  ┌──────────────────────────────────────────────────────────────────┐
-  │  read_blob(col)        ── universal, always materializes bytes ──│
-  │       │                                                          │
-  │       ▼                                                          │
-  │  ┌─────────────┐    INLINE ───► read inline payload              │
-  │  │ Hudi reader │ ──┤                                             │
-  │  └─────────────┘    OUT_OF_LINE ► follow reference → read bytes  │
-  └──────────────────────────────────────────────────────────────────┘
-
-  ┌──────────────────────────────────────────────────────────────────┐
-  │  SELECT col  (returns Blob struct as-is)                         │
-  │       │                                                          │
-  │       ▼                                                          │
-  │  storage = OUT_OF_LINE  ─────────────► data=NULL, reference=set  │
-  │                                                                  │
-  │  storage = INLINE,                                               │
-  │   inline.mode = CONTENT (default) ───► data=<bytes>, ref=NULL    │
-  │                                                                  │
-  │  storage = INLINE,                                               │
-  │   inline.mode = DESCRIPTOR                                       │
-  │     ├─ Parquet  ─────────────────────► data=NULL, ref=NULL       │
-  │     └─ Lance    ─────────────────────► data=NULL, ref=set        │
-  └──────────────────────────────────────────────────────────────────┘
+
+`read_blob(col)` byte resolution — works regardless of storage type or mode, 
with hop count depending on the row shape that arrives:
+
+```mermaid
+flowchart LR
+    RB[/"SELECT read_blob(col)"/] --> Scan["Scan emits Blob struct"]
+    Scan --> Shape{"row shape"}
+    Shape -->|"inline_data populated<br/>(CONTENT mode)"| 
Direct["BatchedBlobReader:<br/>read inline_data off the row<br/><b>1 hop 
total</b>"]
+    Shape -->|"reference populated<br/>(OUT_OF_LINE, or DESCRIPTOR + Lance)"| 
Indirect["BatchedBlobReader:<br/>openSeekable(external_path)<br/>seek(offset), 
readFully(length)<br/><b>2 hops total</b>"]
+    Direct --> Bytes(["bytes"])
+    Indirect --> Bytes
 ```
 
+Notes:
+- For DESCRIPTOR + Lance, hop 2 is a raw filesystem `pread` against the 
`.lance` file at the descriptor's `(offset, length)`. It bypasses Lance's 
decoder entirely — the blob encoding (`lance-encoding:blob=true` on a 
`LargeBinary` column) stores blob bytes contiguously at the position Lance 
reports, so direct byte access is safe.
+- Plain `SELECT col` (no `read_blob`) is always 1 hop. DESCRIPTOR's win is 
that hop 1 skips blob decoding when bytes aren't needed.
+
 ### 3. Writer

Review Comment:
   🤖 The RFC notes the DESCRIPTOR + Lance hop-2 read "bypasses Lance's decoder 
entirely" via raw pread at the recorded offset/length. Is there a stability 
contract from Lance that the `(offset, length)` reported on a blob-encoded 
`LargeBinary` column is guaranteed to be the on-disk byte range across Lance 
versions? If Lance ever switches blob storage to a chunked/compressed layout, 
raw pread would silently return garbage. Worth either citing the Lance contract 
here or noting the version pin / detection strategy.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala:
##########
@@ -208,16 +208,44 @@ class BatchedBlobReader(
             // Dispatch based on storage_type (field 0)
             val storageType = accessor.getString(blobStruct, 0)
             if (storageType == HoodieSchema.Blob.INLINE) {
-              // Case 1: Inline — bytes are in field 1
-              val bytes = accessor.getBytes(blobStruct, 1)
-              batch += RowInfo[R](
-                originalRow = row,
-                filePath = "",
-                offset = -1,
-                length = -1,
-                index = rowIndex,
-                inlineBytes = Some(bytes)
-              )
+              // INLINE rows can arrive in two shapes:
+              //  - CONTENT-mode read: inline_data populated with raw bytes 
(field 1).
+              //  - DESCRIPTOR-mode read: inline_data is null and the 
reference struct
+              //    (field 2) carries a synthetic positional pointer into the 
base file.
+              // read_blob() must return bytes in both cases, so we fall back 
to the
+              // descriptor's range read when inline_data is absent.
+              val inlineIsNull = accessor.isNullAt(blobStruct, 1)
+              if (!inlineIsNull) {
+                val bytes = accessor.getBytes(blobStruct, 1)
+                batch += RowInfo[R](
+                  originalRow = row,
+                  filePath = "",
+                  offset = -1,
+                  length = -1,
+                  index = rowIndex,
+                  inlineBytes = Some(bytes)
+                )
+              } else {
+                require(!accessor.isNullAt(blobStruct, 2),
+                  s"INLINE blob at row $rowIndex has null inline_data and null 
reference; cannot resolve bytes")
+                val referenceStruct = accessor.getStruct(blobStruct, 2, 
HoodieSchema.Blob.getReferenceFieldCount)
+                val filePath = accessor.getString(referenceStruct, 0)
+                require(filePath != null && filePath.nonEmpty,
+                  s"INLINE blob descriptor at row $rowIndex must have 
non-empty external_path")

Review Comment:
   🤖 The INLINE-with-null-inline_data branch silently performs a remote pread 
against the synthetic descriptor's `external_path`. Could the RFC discuss the 
failure mode and security implications of trusting that path? If a 
malformed/forged base file produces a descriptor pointing at an arbitrary file 
or with an `is_managed=true` claim that isn't validated against the table's 
known file set, `read_blob()` could end up reading bytes outside the table. 
Some validation (e.g. `is_managed` must be true and the path must lie under the 
table base) seems worth calling out as a design invariant.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to