hudi-agent commented on code in PR #18839:
URL: https://github.com/apache/hudi/pull/18839#discussion_r3298691005
##########
website/docs/variant_type.md:
##########
@@ -254,12 +257,94 @@ binary `value` field.
| Engine | VARIANT Support |
|:-------|:---------------|
| **Spark 4.0+** | Native `VariantType` — full read/write/query |
-| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — backward compatible |
+| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — see [Reading from Spark 3.5](#reading-from-spark-35-backward-compatibility) |
Review Comment:
🤖 The Cross-Engine Compatibility table still labels this row **Spark 3.x**,
but the new section it links to is titled **Reading from Spark 3.5**. It might
help to clarify whether the workaround applies to all Spark 3.x readers Hudi
supports (3.3, 3.4, 3.5) or only to 3.5 — readers on 3.3/3.4 will likely follow
this link expecting their version to be covered. If the path is identical for
all 3.x, consider renaming the section heading to "Reading from Spark 3.x"; if
3.5 is the only tested target, calling that out explicitly in the section intro
would prevent confusion.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
website/docs/variant_type.md:
##########
@@ -254,12 +257,94 @@ binary `value` field.
| Engine | VARIANT Support |
|:-------|:---------------|
| **Spark 4.0+** | Native `VariantType` — full read/write/query |
-| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — backward compatible |
+| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — see [Reading from Spark 3.5](#reading-from-spark-35-backward-compatibility) |
| **Flink** | Reads as `ROW<metadata BYTES, value BYTES>` — cross-engine compatible |
A VARIANT table written by Spark 4.0 can be read by Spark 3.x or Flink, and vice versa. The
binary encoding is engine-independent.
+## Reading from Spark 3.5 (Backward Compatibility)
+
+Spark 3.5 cannot construct a native `VariantType`, so it cannot consume a VARIANT-typed Hudi
+table the same way Spark 4.0+ does. This section covers the supported read path: how to point
+a Spark 3.5 job at a 1.2.0 VARIANT table, what you get back, and what you give up.
+
+### Why a special path is needed
+
+If you let Hudi resolve the table schema from commit metadata on Spark 3.5, the read fails
+fast because the Spark 3.x adapter rejects the `VariantType` conversion:
+
+```python
+spark.read.format("hudi").load("/path/to/events").show()
+# org.apache.hudi.exception.HoodieSchemaException: ...
+# Caused by: java.lang.UnsupportedOperationException:
+# VARIANT type is only supported in Spark 4.0+
+```
+
+The table data on disk is fine — Hudi just needs to be told the column's *physical* shape
+on the reader side so it can skip the unsupported logical conversion.
+
+### Pre-declare the table with a Variant-shaped struct
+
+Create a Hudi table at the existing table path, declaring the VARIANT column as a struct of
+two binary fields:
+
+```sql
+CREATE TABLE events (
+ event_id STRING,
+ payload STRUCT<value: BINARY, metadata: BINARY>,
+ ts BIGINT
+) USING hudi
+LOCATION '/path/to/events'
+TBLPROPERTIES (
+ primaryKey = 'event_id',
+ preCombineField = 'ts',
+ type = 'cow' -- or 'mor'
+);
+
+SELECT event_id, payload.value, payload.metadata FROM events;
+```
+
+Hudi recognizes `STRUCT<value: BINARY, metadata: BINARY>` as a logical VARIANT (via the
+`hudi_type=VARIANT` field-metadata tag stamped at write time) and routes the read through
+the binary path. This works against both COW tables and MOR tables — both `AVRO` and `SPARK`
+log-record types — because compaction and log-merge operate on the binary representation.
+
+### What you get, and what you don't
+
+On Spark 3.5 you get the raw VARIANT binary buffers:
+
+- `payload.metadata` — the field-name dictionary (binary).
+- `payload.value` — the encoded payload (binary).
+
+Both conform to the open [Apache Spark Variant binary
Review Comment:
🤖 The caveat about shredded VARIANT tables says users "must declare
[`typed_value` sub-columns] explicitly in your DDL", but the example DDL above
only shows the unshredded `STRUCT<value: BINARY, metadata: BINARY>` shape. A
one-line example of the shredded struct shape (or a forward reference to the
shredding doc) would make this actionable; otherwise a reader with a shredded
table won't know what shape to declare. @yihua could you confirm whether the
shredded struct layout is stable enough to document here for the 1.2.0 release?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
website/docs/variant_type.md:
##########
@@ -254,12 +257,94 @@ binary `value` field.
| Engine | VARIANT Support |
|:-------|:---------------|
| **Spark 4.0+** | Native `VariantType` — full read/write/query |
-| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — backward compatible |
+| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — see [Reading from Spark 3.5](#reading-from-spark-35-backward-compatibility) |
| **Flink** | Reads as `ROW<metadata BYTES, value BYTES>` — cross-engine compatible |
A VARIANT table written by Spark 4.0 can be read by Spark 3.x or Flink, and vice versa. The
binary encoding is engine-independent.
+## Reading from Spark 3.5 (Backward Compatibility)
+
+Spark 3.5 cannot construct a native `VariantType`, so it cannot consume a VARIANT-typed Hudi
+table the same way Spark 4.0+ does. This section covers the supported read path: how to point
+a Spark 3.5 job at a 1.2.0 VARIANT table, what you get back, and what you give up.
+
+### Why a special path is needed
+
+If you let Hudi resolve the table schema from commit metadata on Spark 3.5, the read fails
+fast because the Spark 3.x adapter rejects the `VariantType` conversion:
+
+```python
+spark.read.format("hudi").load("/path/to/events").show()
+# org.apache.hudi.exception.HoodieSchemaException: ...
+# Caused by: java.lang.UnsupportedOperationException:
+# VARIANT type is only supported in Spark 4.0+
+```
+
+The table data on disk is fine — Hudi just needs to be told the column's *physical* shape
+on the reader side so it can skip the unsupported logical conversion.
Review Comment:
🤖 When pre-declaring a table at an existing Hudi LOCATION, what is the
expected behavior if `primaryKey`, `preCombineField`, or `type` in
`TBLPROPERTIES` differ from what was stamped into the existing
`hoodie.properties` at write time? It would be useful to either (a) state that
these properties must match the existing table and point readers to
`.hoodie/hoodie.properties` to find the right values, or (b) clarify that Hudi
will ignore/override them from the on-disk metadata. As written, a user
copy-pasting this DDL against a table whose preCombineField isn't `ts` (or
whose type is `mor`) may either silently get wrong behavior or hit a confusing
error.
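
If option (a) is chosen, the doc could show readers how to look up the existing values before writing the DDL. A minimal sketch follows, assuming the file uses plain `key=value` lines and that the relevant keys are `hoodie.table.type`, `hoodie.table.recordkey.fields`, and `hoodie.table.precombine.field`. Those key names are an assumption on my part and should be checked against the 1.2.0 table config. The sample contents are made up for illustration.

```python
from typing import Dict

# ASSUMPTION: these key names are not confirmed for Hudi 1.2.0; verify them
# against the table's actual .hoodie/hoodie.properties before relying on this.
ASSUMED_KEYS = {
    "hoodie.table.type": "type",
    "hoodie.table.recordkey.fields": "primaryKey",
    "hoodie.table.precombine.field": "preCombineField",
}


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # / ! comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' separator
            props[key.strip()] = value.strip()
    return props


def ddl_properties(text: str) -> Dict[str, str]:
    """Map the assumed hoodie.properties keys to the TBLPROPERTIES names in the DDL."""
    props = parse_properties(text)
    return {ddl: props[key] for key, ddl in ASSUMED_KEYS.items() if key in props}


# Hypothetical file contents, for illustration only.
sample = """# sample header comment
hoodie.table.name=events
hoodie.table.type=MERGE_ON_READ
hoodie.table.recordkey.fields=event_id
hoodie.table.precombine.field=ts
"""

print(ddl_properties(sample))
```

The stored table type may be spelled differently from the `'cow'`/`'mor'` shorthand used in the DDL, so the doc would also need to say how the two map to each other.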
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.