This is an automated email from the ASF dual-hosted git repository.
yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 68d8f02ced23 docs(variant): add Spark 3.x backward-compat read guide (#18839)
68d8f02ced23 is described below
commit 68d8f02ced230fe90be1b8a59cbe7f091882c273
Author: voonhous <[email protected]>
AuthorDate: Fri May 29 15:11:21 2026 +0800
docs(variant): add Spark 3.x backward-compat read guide (#18839)
Co-authored-by: Y Ethan Guo <[email protected]>
---
website/docs/sql_ddl.md | 54 ++++++++++++++++++++++++++++++++++-----------
website/docs/sql_queries.md | 9 +++++++-
2 files changed, 49 insertions(+), 14 deletions(-)
diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md
index 3792f3248b4d..f18f14812686 100644
--- a/website/docs/sql_ddl.md
+++ b/website/docs/sql_ddl.md
@@ -2,7 +2,7 @@
title: SQL DDL
summary: "In this page, we discuss using SQL DDL commands with Hudi"
toc: true
-last_modified_at: 2026-05-27T00:00:00-00:00
+last_modified_at: 2026-05-29T00:00:00-00:00
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
@@ -264,34 +264,62 @@ TBLPROPERTIES (
);
```
-Spark 3.x has no native `VariantType`. To read a VARIANT-bearing Hudi table written by Spark 4.x,
-declare the column as a binary struct in the `CREATE TABLE` DDL pointing at the table location:
+:::tip Reading a Spark 4.0+ VARIANT table from Spark 3.x?
+Without the explicit DDL below, the read throws `HoodieSchemaException` (caused by
+`UnsupportedOperationException: VARIANT type is only supported in Spark 4.0+`). See
+[Reading from Spark 3.x](#reading-variant-from-spark-3x-backward-compatibility).
+:::
+
+##### Reading VARIANT from Spark 3.x (backward compatibility) {#reading-variant-from-spark-3x-backward-compatibility}
+
+Hudi 1.2.0 supports Spark 3.4 and 3.5, but neither can construct a native `VariantType`. To
+read a Spark 4.0+ VARIANT table from Spark 3.x, create an external Hudi table at the existing
+location with the VARIANT column declared as a binary struct:
```sql
-CREATE TABLE events (
+CREATE TABLE IF NOT EXISTS events_3x (
event_id STRING,
payload STRUCT<value: BINARY, metadata: BINARY>,
ts BIGINT
) USING hudi
-LOCATION '<existing-table-path>'
+LOCATION '/path/to/events'
TBLPROPERTIES (
- primaryKey = 'event_id',
- preCombineField = 'ts'
+ primaryKey = 'event_id',
+ preCombineField = 'ts',
+ type = 'cow' -- or 'mor'
);
+
+SELECT event_id, payload.value, payload.metadata FROM events_3x;
```
-Spark 3.x then returns the raw `metadata` and `value` bytes; it does not surface the column as a
-logical VARIANT, and reading a VARIANT table without this explicit DDL (i.e. letting Hudi
-auto-resolve the schema from commit metadata) throws `"VARIANT type is only supported in Spark
-4.0+"`.
+`LOCATION` makes the table external (catalog-metadata only; `DROP TABLE` does not delete
+data). `primaryKey`, `preCombineField`, and `type` must match the table's
+`.hoodie/hoodie.properties`; a mismatch misleads the catalog but does not corrupt data.
+Works on COW and MOR (both `AVRO` and `SPARK` log-record types).
+
+:::caution Treat the mapping as read-only
+Spark SQL accepts DML against the mapping, but writes fail downstream in Hudi's Spark 3.x
+write path, possibly after partial work (marker files, log blocks). Only run `SELECT`s here;
+if a Spark 4.0+ writer is producing this table, register the mapping in a separate database
+to avoid collisions.
+:::
+
+What you get: raw `payload.metadata` and `payload.value` bytes (open
+[Spark Variant binary spec](https://github.com/apache/spark/blob/master/common/variant/README.md)),
+decodable in application code if needed. What you do **not** get on Spark 3.x:
+
+- `parse_json()`, `variant_get()`, `cast(payload as STRING)`: Spark 4.0+ only.
+- Predicate pushdown into VARIANT fields.
+- Schema auto-resolution: the DDL above is required; `spark.read.format("hudi").load(path)`
+ alone fails.
-Engine support for `VARIANT`:
+##### Engine support for `VARIANT`
| Engine | Behavior |
|:-------|:---------|
| Spark 4.0+ | Native `VariantType` for read/write/query on COW and MOR (CREATE TABLE with `VARIANT` or DataFrame writes with `VariantType`). |
| Spark 4.1 | Same as Spark 4.0. Spark 4.1's `PushVariantIntoScan` rewrites VARIANT projections into struct-of-extractions; Hudi recognizes that shape and returns the column as a logical VARIANT. |
-| Spark 3.x | No native VARIANT. Backward-compat read of a Spark 4.x-written table requires the explicit binary-struct DDL above; raw bytes only. |
+| Spark 3.x (3.4 / 3.5) | No native VARIANT. [Backward-compat read](#reading-variant-from-spark-3x-backward-compatibility) of a Spark 4.x-written table requires the explicit binary-struct DDL above; raw bytes only. |
| Flink < 2.1 | Throws `UnsupportedOperationException` on VARIANT columns. |
| Flink ≥ 2.1 | Surfaces VARIANT as `ROW<metadata BYTES, value BYTES>`. Flink can read the underlying struct but cannot decode it as a variant value. |
diff --git a/website/docs/sql_queries.md b/website/docs/sql_queries.md
index efa067c77039..5a87ead03370 100644
--- a/website/docs/sql_queries.md
+++ b/website/docs/sql_queries.md
@@ -2,7 +2,7 @@
title: SQL Queries
summary: "In this page, we go over querying Hudi tables using SQL"
toc: true
-last_modified_at: 2026-05-27T00:00:00-00:00
+last_modified_at: 2026-05-29T00:00:00-00:00
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
@@ -514,6 +514,13 @@ SELECT event_id, cast(payload as STRING) AS payload_json
FROM events;
VARIANT columns support `UPDATE`, `DELETE`, and `MERGE` on both COW and MOR tables.
+:::note Spark 3.x (3.4 / 3.5)
+`parse_json()`, `variant_get()`, and `cast(... as STRING)` require Spark 4.0+. On Spark 3.x,
+read raw binary buffers via the
+[backward-compat DDL](sql_ddl.md#reading-variant-from-spark-3x-backward-compatibility);
+predicate pushdown into VARIANT fields is not available.
+:::
+
#### End-to-end example
`hudi_vector_search` and `read_blob()` compose in a single query that returns
both the matching