(hudi) branch asf-site updated: docs(variant): add Spark 3.x backward-compat read guide (#18839)

yihua Fri, 29 May 2026 00:11:35 -0700

This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 68d8f02ced23 docs(variant): add Spark 3.x backward-compat read guide (#18839)
68d8f02ced23 is described below

commit 68d8f02ced230fe90be1b8a59cbe7f091882c273
Author: voonhous <[email protected]>
AuthorDate: Fri May 29 15:11:21 2026 +0800

    docs(variant): add Spark 3.x backward-compat read guide (#18839)
    
    Co-authored-by: Y Ethan Guo <[email protected]>
---
 website/docs/sql_ddl.md     | 54 ++++++++++++++++++++++++++++++++++-----------
 website/docs/sql_queries.md |  9 +++++++-
 2 files changed, 49 insertions(+), 14 deletions(-)

diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md
index 3792f3248b4d..f18f14812686 100644
--- a/website/docs/sql_ddl.md
+++ b/website/docs/sql_ddl.md
@@ -2,7 +2,7 @@
 title: SQL DDL
 summary: "In this page, we discuss using SQL DDL commands with Hudi"
 toc: true
-last_modified_at: 2026-05-27T00:00:00-00:00
+last_modified_at: 2026-05-29T00:00:00-00:00
 ---
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
@@ -264,34 +264,62 @@ TBLPROPERTIES (
 );
 ```
 
-Spark 3.x has no native `VariantType`. To read a VARIANT-bearing Hudi table written by Spark 4.x,
-declare the column as a binary struct in the `CREATE TABLE` DDL pointing at the table location:
+:::tip Reading a Spark 4.0+ VARIANT table from Spark 3.x?
+Without the explicit DDL below, the read throws `HoodieSchemaException` (caused by
+`UnsupportedOperationException: VARIANT type is only supported in Spark 4.0+`). See
+[Reading from Spark 3.x](#reading-variant-from-spark-3x-backward-compatibility).
+:::
+
+##### Reading VARIANT from Spark 3.x (backward compatibility) {#reading-variant-from-spark-3x-backward-compatibility}
+
+Hudi 1.2.0 supports Spark 3.4 and 3.5, but neither can construct a native `VariantType`. To
+read a Spark 4.0+ VARIANT table from Spark 3.x, create an external Hudi table at the existing
+location with the VARIANT column declared as a binary struct:
 
 ```sql
-CREATE TABLE events (
+CREATE TABLE IF NOT EXISTS events_3x (
     event_id  STRING,
     payload   STRUCT<value: BINARY, metadata: BINARY>,
     ts        BIGINT
 ) USING hudi
-LOCATION '<existing-table-path>'
+LOCATION '/path/to/events'
 TBLPROPERTIES (
-    primaryKey = 'event_id',
-    preCombineField = 'ts'
+    primaryKey      = 'event_id',
+    preCombineField = 'ts',
+    type            = 'cow'     -- or 'mor'
 );
+
+SELECT event_id, payload.value, payload.metadata FROM events_3x;
 ```
 
-Spark 3.x then returns the raw `metadata` and `value` bytes; it does not surface the column as a
-logical VARIANT, and reading a VARIANT table without this explicit DDL (i.e. letting Hudi
-auto-resolve the schema from commit metadata) throws `"VARIANT type is only supported in Spark
-4.0+"`.
+`LOCATION` makes the table external (catalog-metadata only; `DROP TABLE` does not delete
+data). `primaryKey`, `preCombineField`, and `type` must match the table's
+`.hoodie/hoodie.properties`; a mismatch misleads the catalog but does not corrupt data.
+Works on COW and MOR (both `AVRO` and `SPARK` log-record types).
+
+:::caution Treat the mapping as read-only
+Spark SQL accepts DML against the mapping, but writes fail downstream in Hudi's Spark 3.x
+write path, possibly after partial work (marker files, log blocks). Only run `SELECT`s here;
+if a Spark 4.0+ writer is producing this table, register the mapping in a separate database
+to avoid collisions.
+:::
+
+What you get: raw `payload.metadata` and `payload.value` bytes (open
+[Spark Variant binary spec](https://github.com/apache/spark/blob/master/common/variant/README.md)),
+decodable in application code if needed. What you do **not** get on Spark 3.x:
+
+- `parse_json()`, `variant_get()`, `cast(payload as STRING)`: Spark 4.0+ only.
+- Predicate pushdown into VARIANT fields.
+- Schema auto-resolution: the DDL above is required; `spark.read.format("hudi").load(path)`
+  alone fails.
 
-Engine support for `VARIANT`:
+##### Engine support for `VARIANT`
 
 | Engine | Behavior |
 |:-------|:---------|
 | Spark 4.0+ | Native `VariantType` for read/write/query on COW and MOR (CREATE TABLE with `VARIANT` or DataFrame writes with `VariantType`). |
 | Spark 4.1 | Same as Spark 4.0. Spark 4.1's `PushVariantIntoScan` rewrites VARIANT projections into struct-of-extractions; Hudi recognizes that shape and returns the column as a logical VARIANT. |
-| Spark 3.x | No native VARIANT. Backward-compat read of a Spark 4.x-written table requires the explicit binary-struct DDL above; raw bytes only. |
+| Spark 3.x (3.4 / 3.5) | No native VARIANT. [Backward-compat read](#reading-variant-from-spark-3x-backward-compatibility) of a Spark 4.x-written table requires the explicit binary-struct DDL above; raw bytes only. |
 | Flink &lt; 2.1 | Throws `UnsupportedOperationException` on VARIANT columns. |
 | Flink ≥ 2.1 | Surfaces VARIANT as `ROW<metadata BYTES, value BYTES>`. Flink can read the underlying struct but cannot decode it as a variant value. |
 
diff --git a/website/docs/sql_queries.md b/website/docs/sql_queries.md
index efa067c77039..5a87ead03370 100644
--- a/website/docs/sql_queries.md
+++ b/website/docs/sql_queries.md
@@ -2,7 +2,7 @@
 title: SQL Queries
 summary: "In this page, we go over querying Hudi tables using SQL"
 toc: true
-last_modified_at: 2026-05-27T00:00:00-00:00
+last_modified_at: 2026-05-29T00:00:00-00:00
 ---
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
@@ -514,6 +514,13 @@ SELECT event_id, cast(payload as STRING) AS payload_json FROM events;
 
 VARIANT columns support `UPDATE`, `DELETE`, and `MERGE` on both COW and MOR tables.
 
+:::note Spark 3.x (3.4 / 3.5)
+`parse_json()`, `variant_get()`, and `cast(... as STRING)` require Spark 4.0+. On Spark 3.x,
+read raw binary buffers via the
+[backward-compat DDL](sql_ddl.md#reading-variant-from-spark-3x-backward-compatibility);
+predicate pushdown into VARIANT fields is not available.
+:::
+
 #### End-to-end example
 
 `hudi_vector_search` and `read_blob()` compose in a single query that returns both the matching
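
Editorial sketch, not part of the commit: the added guide says the raw `payload.metadata` and `payload.value` bytes are "decodable in application code if needed". The Python below is a minimal illustration of what that could look like, assuming the header layout described in the linked Spark Variant binary spec (metadata header: version in the low 4 bits, sorted flag in bit 4, offset size minus one in the top 2 bits; value header: basic type in the low 2 bits). The function names `parse_variant_metadata` and `describe_variant_value` are hypothetical, and the sample bytes are hand-built rather than read from a real table. Check the layout against the spec before relying on it.

```python
# Hypothetical helpers; layout assumed from the Spark Variant binary spec.
BASIC_TYPES = {0: "primitive", 1: "short_string", 2: "object", 3: "array"}


def parse_variant_metadata(metadata: bytes) -> dict:
    """Decode the metadata header and the key dictionary it carries."""
    header = metadata[0]
    version = header & 0x0F                   # low 4 bits
    sorted_strings = bool((header >> 4) & 1)  # bit 4
    offset_size = ((header >> 6) & 0x3) + 1   # top 2 bits store size - 1

    pos = 1
    dict_size = int.from_bytes(metadata[pos:pos + offset_size], "little")
    pos += offset_size

    # dict_size + 1 little-endian offsets, relative to the start of the string bytes
    offsets = []
    for _ in range(dict_size + 1):
        offsets.append(int.from_bytes(metadata[pos:pos + offset_size], "little"))
        pos += offset_size

    keys = [
        metadata[pos + offsets[i]:pos + offsets[i + 1]].decode("utf-8")
        for i in range(dict_size)
    ]
    return {"version": version, "sorted": sorted_strings, "keys": keys}


def describe_variant_value(value: bytes) -> dict:
    """Classify the top-level value; only short strings are decoded here."""
    header = value[0]
    basic_type = header & 0x3  # low 2 bits
    type_info = header >> 2    # meaning depends on the basic type
    info = {"basic_type": BASIC_TYPES[basic_type]}
    if basic_type == 0:
        info["primitive_type_id"] = type_info
    elif basic_type == 1:
        # for short strings the upper 6 bits hold the byte length
        info["string"] = value[1:1 + type_info].decode("utf-8")
    return info


# Hand-built samples: a two-key dictionary ("a", "bc") and the short string "hi".
sample_metadata = bytes([0x01, 0x02, 0x00, 0x01, 0x03]) + b"abc"
sample_value = bytes([(2 << 2) | 1]) + b"hi"

print(parse_variant_metadata(sample_metadata))
print(describe_variant_value(sample_value))
```

On Spark 3.x the inputs would be the two binary fields selected from the mapped table, for example the values behind `payload.metadata` and `payload.value` after collecting rows to the driver. Objects, arrays, and most primitives are deliberately left undecoded in this sketch.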
