This is an automated email from the ASF dual-hosted git repository.
yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new 431b4d8e0e33 docs(spark): Update description of modules related to integration with Spark (#18219)
431b4d8e0e33 is described below
commit 431b4d8e0e338fd3123577665341a8bce03c9bff
Author: Geser Dugarov <[email protected]>
AuthorDate: Thu Feb 19 06:21:50 2026 +0700
docs(spark): Update description of modules related to integration with Spark (#18219)
---
hudi-spark-datasource/README.md | 83 ++++++++++++++++++++---------------------
1 file changed, 41 insertions(+), 42 deletions(-)
diff --git a/hudi-spark-datasource/README.md b/hudi-spark-datasource/README.md
index 8912bb2b228e..3762ec58d607 100644
--- a/hudi-spark-datasource/README.md
+++ b/hudi-spark-datasource/README.md
@@ -15,45 +15,44 @@
* See the License for the specific language governing permissions and
-->
-# Description of the relationship between each module
-
-This repo contains the code that integrate Hudi with Spark. The repo is split into the following modules
-
-`hudi-spark`
-`hudi-spark3.3.x`
-`hudi-spark3.4.x`
-`hudi-spark3.5.x`
-`hudi-spark4.0.x`
-`hudi-spark3-common`
-`hudi-spark-common`
-
-* hudi-spark is the module that contains the code that spark3 version would share.
-* hudi-spark3.3.x is the module that contains the code that compatible with spark3.3.x versions.
-* hudi-spark3.4.x is the module that contains the code that compatible with spark 3.4.x versions.
-* hudi-spark3.5.x is the module that contains the code that compatible with spark 3.5.x versions.
-* hudi-spark4.0.x is the module that contains the code that compatible with spark 4.0.x versions.
-* hudi-spark3-common is the module that contains the code that would be reused between spark3.x versions.
-* hudi-spark-common is the module that contains the code that would be reused between spark3.x and spark4.x versions.
-
-## Description of Time Travel
-* `HoodieSpark3_2ExtendedSqlAstBuilder` have comments in the spark3.2's code fork from `org.apache.spark.sql.catalyst.parser.AstBuilder`, and additional `withTimeTravel` method.
-* `SqlBase.g4` have comments in the code forked from spark3.2's parser, and add SparkSQL Syntax `TIMESTAMP AS OF` and `VERSION AS OF`.
-
-### Time Travel Support Spark Version:
-
-| version | support |
-| ------ | ------- |
-| 2.4.x | No |
-| 3.0.x | No |
-| 3.1.2 | No |
-| 3.2.0 | Yes |
-
-### To improve:
-Spark3.3 support time travel syntax link [SPARK-37219](https://issues.apache.org/jira/browse/SPARK-37219).
-Once Spark 3.3 released. The files in the following list will be removed:
-* hudi-spark3.3.x's `HoodieSpark3_3ExtendedSqlAstBuilder.scala`, `HoodieSpark3_3ExtendedSqlParser.scala`, `TimeTravelRelation.scala`, `SqlBase.g4`, `HoodieSqlBase.g4`
-Tracking Jira: [HUDI-4468](https://issues.apache.org/jira/browse/HUDI-4468)
-
-Some other improvements undergoing:
-* Port borrowed classes from Spark 3.3 [HUDI-4467](https://issues.apache.org/jira/browse/HUDI-4467)
-
+# `hudi-spark-datasource` module
+
+This module contains the Spark integration for Hudi, providing a DataSource API for reading and writing Hudi tables using Spark SQL and DataFrames.
+
+## Overview
+
+The `hudi-spark-datasource` aggregates multiple sub-modules that together provide comprehensive Spark support for Hudi.
+The modules are organized in a layered architecture to maximize code reuse across different Spark versions while maintaining version-specific optimizations.
+
+## Module Descriptions
+
+| Module | Description |
+|--------|-------------|
+| `hudi-spark-common` | Core Spark integration code shared across all Spark versions. Contains DataSource V1/V2 implementations, file indexing, SQL writers, and incremental read support. |
+| `hudi-spark3-common` | Code shared across Spark 3.x versions. Contains Spark 3 adapter interface, DML commands, and partition mapping. |
+| `hudi-spark4-common` | Code shared across Spark 4.x versions. Contains Spark 4 adapter interface and 4.x-specific implementations. |
+| `hudi-spark3.3.x` | Spark 3.3.x-specific adapter implementation with version-specific SQL parser and file readers. |
+| `hudi-spark3.4.x` | Spark 3.4.x-specific adapter implementation. |
+| `hudi-spark3.5.x` | Spark 3.5.x-specific adapter implementation (default). |
+| `hudi-spark4.0.x` | Spark 4.0.x-specific adapter implementation. |
+| `hudi-spark` | Main Spark datasource module containing Spark Session extensions, stored procedures, SQL parser, and logical plans. |
+
+## Spark Version Support
+
+| Spark Version | Module | Scala Version | Java Version | Build Profile |
+|---------------|--------|---------------|--------------|---------------|
+| 3.3.x | `hudi-spark3.3.x` | 2.12 | 11+ | `-Dspark3.3` |
+| 3.4.x | `hudi-spark3.4.x` | 2.12 | 11+ | `-Dspark3.4` |
+| 3.5.x (default) | `hudi-spark3.5.x` | 2.12, 2.13 | 11+ | `-Dspark3.5` |
+| 4.0.x | `hudi-spark4.0.x` | 2.13 | 17+ | `-Dspark4.0` |
+
+## Key Features
+
+- **DataSource V1 Support**: Full integration with Spark's DataSource API
+- **Spark SQL Integration**: Native SQL support for Hudi tables via Spark Session extensions
+- **Stored Procedures**: Built-in procedures for table management and operations
+- **Time Travel**: Query historical versions of tables
+- **Incremental Queries**: Efficient change data capture reads
+- **Index Support**: Bloom filters, column statistics, record-level index, and partition stats
+- **Streaming Support**: Structured Streaming source for continuous data ingestion
+- **CDC Support**: Change Data Capture for tracking row-level changes
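
Reading aid (not part of the commit): the sketch below encodes the new "Spark Version Support" table in Python and assembles a Maven command line for a chosen Spark version. The profile flags and the Scala and Java constraints are copied from the table. The `mvn clean package -DskipTests` goals, the `-Dscala-<version>` flag, and the helper name `maven_args` are assumptions made for illustration and should be checked against the repository's build instructions.

```python
"""Sketch: derive Maven arguments from the Spark Version Support table."""

# Rows transcribed from the table in the diff above.
SUPPORT = {
    "3.3": {"profile": "-Dspark3.3", "scala": ("2.12",), "min_java": 11},
    "3.4": {"profile": "-Dspark3.4", "scala": ("2.12",), "min_java": 11},
    "3.5": {"profile": "-Dspark3.5", "scala": ("2.12", "2.13"), "min_java": 11},
    "4.0": {"profile": "-Dspark4.0", "scala": ("2.13",), "min_java": 17},
}


def maven_args(spark, scala=None, java=None):
    """Return a Maven argument list for the given Spark line (hypothetical helper).

    spark: Spark minor line, e.g. "3.5".
    scala: Scala version; defaults to the first one listed for that Spark line.
    java:  optional major Java version to validate against the table's minimum.
    """
    if spark not in SUPPORT:
        raise ValueError(f"unsupported Spark version: {spark}")
    row = SUPPORT[spark]

    scala = scala or row["scala"][0]
    if scala not in row["scala"]:
        raise ValueError(f"Spark {spark} is not listed with Scala {scala}")

    if java is not None and java < row["min_java"]:
        raise ValueError(f"Spark {spark} requires Java {row['min_java']}+")

    # Assumption: goals and the -Dscala-<version> flag are illustrative only.
    return ["mvn", "clean", "package", "-DskipTests",
            row["profile"], f"-Dscala-{scala}"]


if __name__ == "__main__":
    print(" ".join(maven_args("4.0", java=17)))
```

Keeping the matrix in one dictionary means an unsupported pairing, such as Spark 4.0.x with Scala 2.12, fails fast with a `ValueError` instead of surfacing later as a build error.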