Re: [PR] docs(variant): add Spark 3.5 backward-compat read guide [hudi]

via GitHub Tue, 26 May 2026 01:57:22 -0700


voonhous commented on code in PR #18839:
URL: https://github.com/apache/hudi/pull/18839#discussion_r3302534333



##########
website/docs/variant_type.md:
##########
@@ -254,12 +257,94 @@ binary `value` field.
 | Engine | VARIANT Support |
 |:-------|:---------------|
 | **Spark 4.0+** | Native `VariantType` — full read/write/query |
-| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — backward compatible |
+| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — see [Reading from Spark 3.5](#reading-from-spark-35-backward-compatibility) |
 | **Flink** | Reads as `ROW<metadata BYTES, value BYTES>` — cross-engine compatible |
 
 A VARIANT table written by Spark 4.0 can be read by Spark 3.x or Flink, and vice versa. The
 binary encoding is engine-independent.
 
+## Reading from Spark 3.5 (Backward Compatibility)
+
+Spark 3.5 cannot construct a native `VariantType`, so it cannot consume a VARIANT-typed Hudi
+table the same way Spark 4.0+ does. This section covers the supported read path: how to point
+a Spark 3.5 job at a 1.2.0 VARIANT table, what you get back, and what you give up.
+
+### Why a special path is needed
+
+If you let Hudi resolve the table schema from commit metadata on Spark 3.5, the read fails
+fast because the Spark 3.x adapter rejects the `VariantType` conversion:
+
+```python
+spark.read.format("hudi").load("/path/to/events").show()
+# org.apache.hudi.exception.HoodieSchemaException: ...
+# Caused by: java.lang.UnsupportedOperationException:
+#   VARIANT type is only supported in Spark 4.0+
+```
+
+The table data on disk is fine — Hudi just needs to be told the column's *physical* shape
+on the reader side so it can skip the unsupported logical conversion.
+
+### Pre-declare the table with a Variant-shaped struct
+
+Create a Hudi table at the existing table path, declaring the VARIANT column as a struct of
+two binary fields:
+
+```sql
+CREATE TABLE events (
+    event_id  STRING,
+    payload   STRUCT<value: BINARY, metadata: BINARY>,
+    ts        BIGINT
+) USING hudi
+LOCATION '/path/to/events'
+TBLPROPERTIES (
+    primaryKey      = 'event_id',
+    preCombineField = 'ts',
+    type            = 'cow'     -- or 'mor'
+);
+
+SELECT event_id, payload.value, payload.metadata FROM events;
+```
+
+Hudi recognizes `STRUCT<value: BINARY, metadata: BINARY>` as a logical VARIANT (via the
+`hudi_type=VARIANT` field-metadata tag stamped at write time) and routes the read through
+the binary path. This works against both COW tables and MOR tables — both `AVRO` and `SPARK`
+log-record types — because compaction and log-merge operate on the binary representation.
+
+### What you get, and what you don't
+
+On Spark 3.5 you get the raw VARIANT binary buffers:
+
+- `payload.metadata` — the field-name dictionary (binary).
+- `payload.value` — the encoded payload (binary).
+
+Both conform to the open [Apache Spark Variant binary

Review Comment:
   Our current code does not support shredding, so the docs should state that as well.
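
   For context on what a Spark 3.5 reader receives in `payload.metadata`, here is a minimal sketch of decoding the field-name dictionary from the raw buffer. It assumes the header layout from the open Variant binary encoding spec (version in the low 4 bits, offset size in the top 2 bits, little-endian offsets). The helper name `decode_variant_metadata` is hypothetical and not part of Hudi or Spark.

```python
from typing import List


def decode_variant_metadata(buf: bytes) -> List[str]:
    """Decode the field-name dictionary from a Variant metadata buffer.

    Assumed layout (per the open Variant binary encoding spec):
      byte 0            : header; bits 0-3 = version, bits 6-7 = offset_size - 1
      next offset_size  : dictionary_size (little-endian)
      next (n+1) * size : offsets into the string bytes (little-endian)
      remainder         : concatenated UTF-8 field names
    """
    if not buf:
        raise ValueError("empty metadata buffer")

    header = buf[0]
    version = header & 0x0F
    if version != 1:
        raise ValueError(f"unsupported Variant metadata version: {version}")

    offset_size = ((header >> 6) & 0x03) + 1  # 1 to 4 bytes per offset
    pos = 1

    dict_size = int.from_bytes(buf[pos:pos + offset_size], "little")
    pos += offset_size

    # There are dict_size + 1 offsets; consecutive pairs delimit each name.
    offsets = []
    for _ in range(dict_size + 1):
        offsets.append(int.from_bytes(buf[pos:pos + offset_size], "little"))
        pos += offset_size

    strings_start = pos
    return [
        buf[strings_start + offsets[i]:strings_start + offsets[i + 1]].decode("utf-8")
        for i in range(dict_size)
    ]
```

   On Spark 3.5 the buffer would come from a collected row, e.g. `bytes(row.payload.metadata)`; decoding `payload.value` is a separate step and is not shown here.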



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
