Re: [PR] docs(variant): add Spark 3.5 backward-compat read guide [hudi]

via GitHub Mon, 25 May 2026 07:32:13 -0700


hudi-agent commented on code in PR #18839:
URL: https://github.com/apache/hudi/pull/18839#discussion_r3298691005



##########
website/docs/variant_type.md:
##########
@@ -254,12 +257,94 @@ binary `value` field.
 | Engine | VARIANT Support |
 |:-------|:---------------|
 | **Spark 4.0+** | Native `VariantType` — full read/write/query |
-| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — backward compatible |
+| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — see [Reading from Spark 3.5](#reading-from-spark-35-backward-compatibility) |

Review Comment:
   🤖 The Cross-Engine Compatibility table still labels this row **Spark 3.x**, 
but the new section it links to is titled **Reading from Spark 3.5**. It might 
help to clarify whether the workaround applies to all Spark 3.x readers Hudi 
supports (3.3, 3.4, 3.5) or only to 3.5 — readers on 3.3/3.4 will likely follow 
this link expecting their version to be covered. If the path is identical for 
all 3.x, consider renaming the section heading to "Reading from Spark 3.x"; if 
3.5 is the only tested target, calling that out explicitly in the section intro 
would prevent confusion.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
website/docs/variant_type.md:
##########
@@ -254,12 +257,94 @@ binary `value` field.
 | Engine | VARIANT Support |
 |:-------|:---------------|
 | **Spark 4.0+** | Native `VariantType` — full read/write/query |
-| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — backward compatible |
+| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — see [Reading from Spark 3.5](#reading-from-spark-35-backward-compatibility) |
 | **Flink** | Reads as `ROW<metadata BYTES, value BYTES>` — cross-engine compatible |
 
 A VARIANT table written by Spark 4.0 can be read by Spark 3.x or Flink, and vice versa. The
 binary encoding is engine-independent.
 
+## Reading from Spark 3.5 (Backward Compatibility)
+
+Spark 3.5 cannot construct a native `VariantType`, so it cannot consume a VARIANT-typed Hudi
+table the same way Spark 4.0+ does. This section covers the supported read path: how to point
+a Spark 3.5 job at a 1.2.0 VARIANT table, what you get back, and what you give up.
+
+### Why a special path is needed
+
+If you let Hudi resolve the table schema from commit metadata on Spark 3.5, the read fails
+fast because the Spark 3.x adapter rejects the `VariantType` conversion:
+
+```python
+spark.read.format("hudi").load("/path/to/events").show()
+# org.apache.hudi.exception.HoodieSchemaException: ...
+# Caused by: java.lang.UnsupportedOperationException:
+#   VARIANT type is only supported in Spark 4.0+
+```
+
+The table data on disk is fine — Hudi just needs to be told the column's *physical* shape
+on the reader side so it can skip the unsupported logical conversion.
+
+### Pre-declare the table with a Variant-shaped struct
+
+Create a Hudi table at the existing table path, declaring the VARIANT column as a struct of
+two binary fields:
+
+```sql
+CREATE TABLE events (
+    event_id  STRING,
+    payload   STRUCT<value: BINARY, metadata: BINARY>,
+    ts        BIGINT
+) USING hudi
+LOCATION '/path/to/events'
+TBLPROPERTIES (
+    primaryKey      = 'event_id',
+    preCombineField = 'ts',
+    type            = 'cow'     -- or 'mor'
+);
+
+SELECT event_id, payload.value, payload.metadata FROM events;
+```
+
+Hudi recognizes `STRUCT<value: BINARY, metadata: BINARY>` as a logical VARIANT (via the
+`hudi_type=VARIANT` field-metadata tag stamped at write time) and routes the read through
+the binary path. This works against both COW tables and MOR tables — both `AVRO` and `SPARK`
+log-record types — because compaction and log-merge operate on the binary representation.
+
+### What you get, and what you don't
+
+On Spark 3.5 you get the raw VARIANT binary buffers:
+
+- `payload.metadata` — the field-name dictionary (binary).
+- `payload.value` — the encoded payload (binary).
+
+Both conform to the open [Apache Spark Variant binary

Review Comment:
   🤖 The caveat about shredded VARIANT tables says users "must declare 
[`typed_value` sub-columns] explicitly in your DDL" but the example DDL above 
only shows the unshredded `STRUCT<value: BINARY, metadata: BINARY>` shape. A 
one-line example of the shredded struct shape (or a forward reference to the 
shredding doc) would make this actionable — otherwise a reader on a shredded 
table won't know what shape to write. @yihua could you confirm whether the 
shredded struct layout is stable enough to document here for the 1.2.0 release?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
website/docs/variant_type.md:
##########
@@ -254,12 +257,94 @@ binary `value` field.
 | Engine | VARIANT Support |
 |:-------|:---------------|
 | **Spark 4.0+** | Native `VariantType` — full read/write/query |
-| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — backward compatible |
+| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — see [Reading from Spark 3.5](#reading-from-spark-35-backward-compatibility) |
 | **Flink** | Reads as `ROW<metadata BYTES, value BYTES>` — cross-engine compatible |
 
 A VARIANT table written by Spark 4.0 can be read by Spark 3.x or Flink, and vice versa. The
 binary encoding is engine-independent.
 
+## Reading from Spark 3.5 (Backward Compatibility)
+
+Spark 3.5 cannot construct a native `VariantType`, so it cannot consume a VARIANT-typed Hudi
+table the same way Spark 4.0+ does. This section covers the supported read path: how to point
+a Spark 3.5 job at a 1.2.0 VARIANT table, what you get back, and what you give up.
+
+### Why a special path is needed
+
+If you let Hudi resolve the table schema from commit metadata on Spark 3.5, the read fails
+fast because the Spark 3.x adapter rejects the `VariantType` conversion:
+
+```python
+spark.read.format("hudi").load("/path/to/events").show()
+# org.apache.hudi.exception.HoodieSchemaException: ...
+# Caused by: java.lang.UnsupportedOperationException:
+#   VARIANT type is only supported in Spark 4.0+
+```
+
+The table data on disk is fine — Hudi just needs to be told the column's *physical* shape
+on the reader side so it can skip the unsupported logical conversion.

Review Comment:
   🤖 When pre-declaring a table at an existing Hudi LOCATION, what is the 
expected behavior if `primaryKey`, `preCombineField`, or `type` in 
`TBLPROPERTIES` differ from what was stamped into the existing 
`hoodie.properties` at write time? It would be useful to either (a) state that 
these properties must match the existing table and point readers to 
`.hoodie/hoodie.properties` to find the right values, or (b) clarify that Hudi 
will ignore/override them from the on-disk metadata. As written, a user 
copy-pasting this DDL against a table whose preCombineField isn't `ts` (or 
whose type is `mor`) may either silently get wrong behavior or hit a confusing 
error.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>
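
For option (a) in the comment above, a reader would first need to pull the existing values out of `.hoodie/hoodie.properties`. The following is a minimal sketch of that lookup, not Hudi's own tooling. The three key names (`hoodie.table.type`, `hoodie.table.recordkey.fields`, `hoodie.table.precombine.field`) are assumptions about what the table was stamped with and may differ across Hudi releases; the helper names and the sample file contents are hypothetical.

```python
from typing import Dict

# Assumed key names; confirm them against the hoodie.properties of your table.
KEYS_OF_INTEREST = (
    "hoodie.table.type",
    "hoodie.table.recordkey.fields",
    "hoodie.table.precombine.field",
)


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that carry no '=' separator
            props[key.strip()] = value.strip()
    return props


def table_settings(text: str) -> Dict[str, str]:
    """Return only the settings that the pre-declaring DDL should agree with."""
    props = parse_properties(text)
    return {key: props[key] for key in KEYS_OF_INTEREST if key in props}


# Hypothetical file contents, for illustration only.
SAMPLE = """\
#Updated by a writer job
hoodie.table.name=events
hoodie.table.type=MERGE_ON_READ
hoodie.table.recordkey.fields=event_id
hoodie.table.precombine.field=ts
"""

settings = table_settings(SAMPLE)
for key, value in settings.items():
    print(f"{key} = {value}")
```

For a table on a local filesystem, the text could come from `open("/path/to/events/.hoodie/hoodie.properties").read()`; object stores would need their own client. A `MERGE_ON_READ` value here would correspond to `type = 'mor'` in the DDL.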


