hudi-agent commented on code in PR #18744:
URL: https://github.com/apache/hudi/pull/18744#discussion_r3247921125


##########
rfc/rfc-100/rfc-100.md:
##########
@@ -139,9 +139,9 @@ Selecting the blob column directly (e.g. `SELECT image_blob 
FROM t` or `SELECT *
 
 **Reader Configuration**
 
-- `hoodie.read.blob.inline.mode` β€” values `CONTENT` (default) | `DESCRIPTOR`.
+- `hoodie.read.blob.inline.mode` β€” values `DESCRIPTOR` (default) | `CONTENT`.
+  - `DESCRIPTOR` (default): the engine returns an `OUT_OF_LINE`-shaped 
descriptor in the `reference` field where the underlying file format supports 
it (Lance today), enabling lazy byte materialization via `read_blob`. For file 
formats without a native descriptor for inline payloads (Parquet), both `data` 
and `reference` are returned `NULL`, and the caller must use `read_blob` to 
retrieve bytes.
   - `CONTENT`: the engine eagerly materializes inline bytes into the struct's 
`data` field.

Review Comment:
   πŸ€– Could the RFC discuss the lifecycle of the synthetic DESCRIPTOR returned 
for INLINE+Lance? The descriptor encodes `(base file path, offset, length)` 
into a managed Hudi base file β€” if cleaning, clustering, or a subsequent 
compaction rewrites/removes that file before the caller invokes `read_blob()`, 
the deferred resolution will fail or read stale bytes. OUT_OF_LINE references 
are user-supplied and assumed externally stable; a synthetic descriptor 
pointing at a Hudi-managed file is not. Is there a contract about how long 
these descriptors stay valid (e.g. only within a single query/stage), and how 
that's enforced?
   
   <sub><i>- AI-generated; verify before applying. React πŸ‘/πŸ‘Ž to flag 
quality.</i></sub>



##########
rfc/rfc-100/rfc-100.md:
##########
@@ -151,44 +151,64 @@ Selecting the blob column directly (e.g. `SELECT 
image_blob FROM t` or `SELECT *
 | `SELECT read_blob(col) FROM table` | INLINE       | Parquet     | (any)      
                    | n/a          | n/a                          | Yes β€” 
returns bytes                               |
 | `SELECT read_blob(col) FROM table` | INLINE       | Lance       | (any)      
                    | n/a          | n/a                          | Yes β€” 
returns bytes                               |
 | `SELECT read_blob(col) FROM table` | OUT_OF_LINE  | (any)       | (any)      
                    | n/a          | n/a                          | Yes β€” 
returns bytes                               |
-| `SELECT col FROM table`       | INLINE       | Parquet     | `CONTENT` 
(default)            | bytes        | NULL                         | Yes β€” via 
`data`                                  |
-| `SELECT col FROM table`       | INLINE       | Parquet     | `DESCRIPTOR`    
               | **NULL**     | **NULL**                     | No β€” must call 
`read_blob`                        |
-| `SELECT col FROM table`       | INLINE       | Lance       | `CONTENT` 
(default)            | bytes        | NULL                         | Yes β€” via 
`data`                                  |
-| `SELECT col FROM table`       | INLINE       | Lance       | `DESCRIPTOR`    
               | NULL         | populated (Lance blob enc.)  | No β€” descriptor 
visible; use `read_blob` for bytes|
+| `SELECT col FROM table`       | INLINE       | Parquet     | `CONTENT`       
               | bytes        | NULL                         | Yes β€” via `data` 
                                 |
+| `SELECT col FROM table`       | INLINE       | Parquet     | `DESCRIPTOR` 
(default)         | bytesΒΉ       | NULLΒΉ                        | Yes β€” via 
`data`ΒΉ                                 |
+| `SELECT col FROM table`       | INLINE       | Lance       | `CONTENT`       
               | bytes        | NULL                         | Yes β€” via `data` 
                                 |
+| `SELECT col FROM table`       | INLINE       | Lance       | `DESCRIPTOR` 
(default)         | NULL         | populated (Lance blob enc.)  | No β€” 
descriptor visible; use `read_blob` for bytes|
 | `SELECT col FROM table`       | OUT_OF_LINE  | (any)       | (irrelevant)    
               | NULL         | populated                    | No β€” must call 
`read_blob`                        |
 
+ΒΉ **Parquet + DESCRIPTOR is a no-op today.** The Parquet reader path does not 
currently honor `hoodie.read.blob.inline.mode`, so on Parquet tables the config 
is ignored and INLINE rows always come back in `CONTENT` shape regardless of 
the setting. The default flip to `DESCRIPTOR` therefore has no observable 
effect on Parquet tables. The matrix row above shows what users will actually 
see today, not the eventual spec.
+
+**Eventual spec for Parquet + DESCRIPTOR (not yet implemented)**
+
+When Parquet gains DESCRIPTOR support, INLINE rows in `DESCRIPTOR` mode are 
intended to return `data = NULL` and `reference = NULL` (no native positional 
handle exists in Parquet for an inline byte array), with `read_blob` 
materializing bytes through a format-specific path. Until then, treat 
`DESCRIPTOR` as effectively `CONTENT` on Parquet.
+
 **Why Parquet and Lance differ in `DESCRIPTOR` mode**
 
-Lance's native blob encoding stores blobs in a way that already exposes a 
`(file, offset, length)` descriptor cheaply, so `DESCRIPTOR` mode surfaces it 
directly in the `reference` field β€” effectively letting INLINE blobs be read 
with the same deferred-materialization path used for OUT_OF_LINE references. 
Parquet has no equivalent native descriptor for an inline byte array, so both 
fields are `NULL` in `DESCRIPTOR` mode and the caller must use `read_blob` to 
materialize bytes.
+Lance's native blob encoding stores blobs in a way that already exposes a 
`(file, offset, length)` descriptor cheaply, so `DESCRIPTOR` mode surfaces it 
directly in the `reference` field β€” effectively letting INLINE blobs be read 
with the same deferred-materialization path used for OUT_OF_LINE references. 
Parquet has no equivalent native descriptor for an inline byte array, which is 
why the DESCRIPTOR path is currently a no-op there.
 
 **Visual**
 
+What the user gets back, grouped by storage type (set at write time) and then 
by query shape:
+
+```mermaid
+flowchart TD
+    ST{storage_type}
+
+    ST -->|OUT_OF_LINE| QO{Query}
+    QO -->|"SELECT col"| OOL["type = OUT_OF_LINE<br/>inline_data = 
NULL<br/>reference = user-supplied"]
+    QO -->|"SELECT read_blob(col)"| RBO(["bytes β€” materialized<br/>via the 
external reference"])
+
+    ST -->|INLINE| QI{Query}
+    QI -->|"SELECT col"| M{hoodie.read.blob.inline.mode}
+    M -->|CONTENT| CONT["type = INLINE<br/>inline_data = bytes<br/>reference = 
NULL"]
+    M -->|DESCRIPTOR default| F{file format}
+    F -->|Lance| LD["type = INLINE<br/>inline_data = NULL<br/>reference = 
synthetic managed<br/>path, offset, length, is_managed=true"]
+    F -->|"Parquet (today: mode no-op)"| PD["Parquet reader does 
not<br/>implement DESCRIPTOR yet β€”<br/>returns CONTENT shape:<br/>inline_data = 
bytes, reference = NULL"]
+
+    QI -->|"SELECT read_blob(col)"| RM{hoodie.read.blob.inline.mode}

Review Comment:
   πŸ€– The Lance branch shows `reference = synthetic managed path, offset, 
length, is_managed=true`. Could you spell out what `is_managed=true` means 
semantically and how it differs from a user-supplied OUT_OF_LINE reference 
downstream β€” e.g., does it gate cleaning, does `read_blob()` route differently, 
can a writer round-trip a row carrying this flag back into the table? This flag 
appears to be the only signal distinguishing "safe to clean β€” Hudi owns the 
bytes" from "do not clean β€” user owns the bytes", which is load-bearing.
   
   <sub><i>- AI-generated; verify before applying. React πŸ‘/πŸ‘Ž to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala:
##########
@@ -208,16 +208,44 @@ class BatchedBlobReader(
             // Dispatch based on storage_type (field 0)
             val storageType = accessor.getString(blobStruct, 0)
             if (storageType == HoodieSchema.Blob.INLINE) {
-              // Case 1: Inline β€” bytes are in field 1
-              val bytes = accessor.getBytes(blobStruct, 1)
-              batch += RowInfo[R](
-                originalRow = row,
-                filePath = "",
-                offset = -1,
-                length = -1,
-                index = rowIndex,
-                inlineBytes = Some(bytes)
-              )
+              // INLINE rows can arrive in two shapes:
+              //  - CONTENT-mode read: inline_data populated with raw bytes 
(field 1).
+              //  - DESCRIPTOR-mode read: inline_data is null and the 
reference struct
+              //    (field 2) carries a synthetic positional pointer into the 
base file.
+              // read_blob() must return bytes in both cases, so we fall back 
to the
+              // descriptor's range read when inline_data is absent.
+              val inlineIsNull = accessor.isNullAt(blobStruct, 1)
+              if (!inlineIsNull) {
+                val bytes = accessor.getBytes(blobStruct, 1)
+                batch += RowInfo[R](
+                  originalRow = row,
+                  filePath = "",
+                  offset = -1,
+                  length = -1,
+                  index = rowIndex,
+                  inlineBytes = Some(bytes)
+                )
+              } else {
+                require(!accessor.isNullAt(blobStruct, 2),
+                  s"INLINE blob at row $rowIndex has null inline_data and null 
reference; cannot resolve bytes")
+                val referenceStruct = accessor.getStruct(blobStruct, 2, 
HoodieSchema.Blob.getReferenceFieldCount)
+                val filePath = accessor.getString(referenceStruct, 0)
+                require(filePath != null && filePath.nonEmpty,
+                  s"INLINE blob descriptor at row $rowIndex must have 
non-empty external_path")
+                require(!accessor.isNullAt(referenceStruct, 1) && 
!accessor.isNullAt(referenceStruct, 2),
+                  s"INLINE blob descriptor at row $rowIndex must set both 
offset and length")
+                val offset = accessor.getLong(referenceStruct, 1)
+                val length = accessor.getLong(referenceStruct, 2)
+                require(offset >= 0, s"INLINE blob descriptor offset must be 
non-negative for '$filePath': $offset")
+                require(length >= 0, s"INLINE blob descriptor length must be 
non-negative for '$filePath': $length")
+                batch += RowInfo[R](
+                  originalRow = row,
+                  filePath = filePath,
+                  offset = offset,
+                  length = length,
+                  index = rowIndex

Review Comment:
   πŸ€– The `require(...)` calls here will throw `IllegalArgumentException` 
mid-batch on any malformed INLINE+DESCRIPTOR row (null inline_data and 
null/incomplete reference). On a corrupted file this kills the whole task and 
surfaces as a confusing user error far from the cause. Would it be worth using 
a Hudi-typed exception with the file path + row index, or at least a top-level 
wrapper that names the column and source file? Same question for the 
offset/length sign checks β€” those would only fire on writer bugs, but the 
diagnostic is what determines whether anyone can debug it.
   
   <sub><i>- AI-generated; verify before applying. React πŸ‘/πŸ‘Ž to flag 
quality.</i></sub>



##########
rfc/rfc-100/rfc-100.md:
##########
@@ -151,44 +151,64 @@ Selecting the blob column directly (e.g. `SELECT 
image_blob FROM t` or `SELECT *
 | `SELECT read_blob(col) FROM table` | INLINE       | Parquet     | (any)      
                    | n/a          | n/a                          | Yes β€” 
returns bytes                               |
 | `SELECT read_blob(col) FROM table` | INLINE       | Lance       | (any)      
                    | n/a          | n/a                          | Yes β€” 
returns bytes                               |
 | `SELECT read_blob(col) FROM table` | OUT_OF_LINE  | (any)       | (any)      
                    | n/a          | n/a                          | Yes β€” 
returns bytes                               |
-| `SELECT col FROM table`       | INLINE       | Parquet     | `CONTENT` 
(default)            | bytes        | NULL                         | Yes β€” via 
`data`                                  |
-| `SELECT col FROM table`       | INLINE       | Parquet     | `DESCRIPTOR`    
               | **NULL**     | **NULL**                     | No β€” must call 
`read_blob`                        |
-| `SELECT col FROM table`       | INLINE       | Lance       | `CONTENT` 
(default)            | bytes        | NULL                         | Yes β€” via 
`data`                                  |
-| `SELECT col FROM table`       | INLINE       | Lance       | `DESCRIPTOR`    
               | NULL         | populated (Lance blob enc.)  | No β€” descriptor 
visible; use `read_blob` for bytes|
+| `SELECT col FROM table`       | INLINE       | Parquet     | `CONTENT`       
               | bytes        | NULL                         | Yes β€” via `data` 
                                 |
+| `SELECT col FROM table`       | INLINE       | Parquet     | `DESCRIPTOR` 
(default)         | bytesΒΉ       | NULLΒΉ                        | Yes β€” via 
`data`ΒΉ                                 |
+| `SELECT col FROM table`       | INLINE       | Lance       | `CONTENT`       
               | bytes        | NULL                         | Yes β€” via `data` 
                                 |
+| `SELECT col FROM table`       | INLINE       | Lance       | `DESCRIPTOR` 
(default)         | NULL         | populated (Lance blob enc.)  | No β€” 
descriptor visible; use `read_blob` for bytes|

Review Comment:
   πŸ€– The new compaction test reveals that `HoodieSparkLanceReader` must 
hard-pin CONTENT regardless of `hoodie.read.blob.inline.mode` β€” otherwise the 
compactor reads `data=NULL` for untouched rows and silently rewrites the base 
file without bytes. That's a critical correctness invariant that the RFC 
doesn't currently mention. Could you add a note (probably in this section or in 
the Writer/Compaction section) that internal table-services readers MUST ignore 
this config and always materialize CONTENT? Otherwise a future refactor that 
"unifies" reader configs could re-introduce silent data loss. @yihua might want 
to weigh in.
   
   <sub><i>- AI-generated; verify before applying. React πŸ‘/πŸ‘Ž to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to