voonhous commented on code in PR #18839:
URL: https://github.com/apache/hudi/pull/18839#discussion_r3302534333
##########
website/docs/variant_type.md:
##########
@@ -254,12 +257,94 @@ binary `value` field.
| Engine | VARIANT Support |
|:-------|:---------------|
| **Spark 4.0+** | Native `VariantType` — full read/write/query |
-| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` —
backward compatible |
+| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — see
[Reading from Spark 3.5](#reading-from-spark-35-backward-compatibility) |
| **Flink** | Reads as `ROW<metadata BYTES, value BYTES>` — cross-engine
compatible |
A VARIANT table written by Spark 4.0 can be read by Spark 3.x or Flink, and
vice versa. The
binary encoding is engine-independent.
+## Reading from Spark 3.5 (Backward Compatibility)
+
+Spark 3.5 cannot construct a native `VariantType`, so it cannot consume a
VARIANT-typed Hudi
+table the same way Spark 4.0+ does. This section covers the supported read
path: how to point
+a Spark 3.5 job at a 1.2.0 VARIANT table, what you get back, and what you give
up.
+
+### Why a special path is needed
+
+If you let Hudi resolve the table schema from commit metadata on Spark 3.5,
the read fails
+fast because the Spark 3.x adapter rejects the `VariantType` conversion:
+
+```python
+spark.read.format("hudi").load("/path/to/events").show()
+# org.apache.hudi.exception.HoodieSchemaException: ...
+# Caused by: java.lang.UnsupportedOperationException:
+# VARIANT type is only supported in Spark 4.0+
+```
+
+The table data on disk is fine — Hudi just needs to be told the column's
*physical* shape
+on the reader side so it can skip the unsupported logical conversion.
+
+### Pre-declare the table with a Variant-shaped struct
+
+Create a Hudi table at the existing table path, declaring the VARIANT column
as a struct of
+two binary fields:
+
+```sql
+CREATE TABLE events (
+ event_id STRING,
+ payload STRUCT<value: BINARY, metadata: BINARY>,
+ ts BIGINT
+) USING hudi
+LOCATION '/path/to/events'
+TBLPROPERTIES (
+ primaryKey = 'event_id',
+ preCombineField = 'ts',
+ type = 'cow' -- or 'mor'
+);
+
+SELECT event_id, payload.value, payload.metadata FROM events;
+```
+
+Hudi recognizes `STRUCT<value: BINARY, metadata: BINARY>` as a logical VARIANT
(via the
+`hudi_type=VARIANT` field-metadata tag stamped at write time) and routes the
read through
+the binary path. This works against both COW tables and MOR tables — both
`AVRO` and `SPARK`
+log-record types — because compaction and log-merge operate on the binary
representation.
+
+### What you get, and what you don't
+
+On Spark 3.5 you get the raw VARIANT binary buffers:
+
+- `payload.metadata` — the field-name dictionary (binary).
+- `payload.value` — the encoded payload (binary).
+
+Both conform to the open [Apache Spark Variant binary
Review Comment:
Our current code does not support shredding, so we should also state that.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]