This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 8747433cbc5f chore(site): update notebooks (#14166)
8747433cbc5f is described below
commit 8747433cbc5f3e6a2bd90a4746b8c0df00b1d6e4
Author: Shiyan Xu <[email protected]>
AuthorDate: Mon Oct 27 14:43:51 2025 -0700
chore(site): update notebooks (#14166)
---
website/docs/notebooks.md | 74 +++++++++++++---------
.../version-1.0.2}/notebooks.md | 74 +++++++++++++---------
.../versioned_sidebars/version-1.0.2-sidebars.json | 4 +-
3 files changed, 88 insertions(+), 64 deletions(-)
diff --git a/website/docs/notebooks.md b/website/docs/notebooks.md
index 07fd2304353f..46a123b7bfed 100644
--- a/website/docs/notebooks.md
+++ b/website/docs/notebooks.md
@@ -1,6 +1,6 @@
---
title: "Notebooks"
-keywords: [ hudi, notebooks]
+keywords: [ hudi, notebooks ]
toc: true
last_modified_at: 2025-10-09T19:13:57+08:00
---
@@ -13,62 +13,74 @@ All you need is a cloned copy of the Hudi repository and
Docker installed on you
### Setup
- * Clone the [Hudi repository](https://github.com/apache/hudi) to your local
machine.
- * Docker Setup : For Mac, Please follow the steps as defined in [Install
Docker Desktop on Mac](https://docs.docker.com/desktop/install/mac-install/).
For running Spark-SQL queries, please ensure atleast 6 GB and 4 CPUs are
allocated to Docker (See Docker -> Preferences -> Advanced).
- * This setup also needs JDK 8 and maven installed on your system.
- * Build Docker Images
- ```sh
- cd hudi-notebooks
- sh build.sh
- ```
- * Start the Environment
- ```sh
- sh run_spark_hudi.sh start
- ```
+* Clone the [Hudi repository](https://github.com/apache/hudi) to your local
machine.
+* Docker Setup: For macOS, follow the steps in [Install Docker Desktop on
Mac](https://docs.docker.com/desktop/install/mac-install/). For Spark SQL
queries, ensure at least 6 GB of memory and 4 CPUs are allocated to Docker (see
Docker > Preferences > Advanced).
+* Build Docker Images
+
+ ```shell
+ # under Hudi repo root dir
+ cd hudi-notebooks
+ sh build.sh
+ ```
+
+* Start the Environment
+
+ ```shell
+ sh run_spark_hudi.sh start
+ ```
### Meet Your Notebooks
+
#### 1 - Getting Started with Apache Hudi: A Hands-On Guide to CRUD Operations
-This notebook is a beginner friendly, practical guide to working with Apache
Hudi using PySpark. It walks you through the essential CRUD operations (Create,
Read, Update, Delete) on Hudi tables, while also helping you understand key
table types such as Copy-On-Write (COW) and Merge-On-Read (MOR).
-For storage, we use MinIO as an S3-compatible backend, simulating a modern
datalake setup.
+This notebook is a beginner-friendly, practical guide to working with Apache
Hudi using PySpark. It walks you through the essential CRUD operations (Create,
Read, Update, Delete) on Hudi tables, while also helping you understand key
table types such as Copy-On-Write (COW) and Merge-On-Read (MOR).
+
+For storage, we use MinIO as an S3-compatible backend.
**What you will learn:**
-- How to create and update Hudi tables using PySpark
-- The difference between COW and MOR tables
-- Reading data using snapshot and incremental queries
-- How Hudi handles upserts and deletes
+
+* How to create and update Hudi tables using PySpark
+* The difference between COW and MOR tables
+* Reading data using snapshot and incremental queries
+* How Hudi handles upserts and deletes
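The upsert and delete behavior this notebook demonstrates can be sketched in plain Python. This is not Hudi's implementation, only the record-key semantics that an upsert-style write provides; the `trips` table, field names, and the precombine tie-breaking rule shown here are illustrative assumptions.

```python
# Pure-Python sketch of upsert/delete-by-record-key semantics (NOT Hudi code).
# One common payload behavior: on a key collision, the row with the larger
# precombine value (here "ts") wins.

def upsert(table, records, record_key="uuid", precombine="ts"):
    """Insert new records; on key collision keep the row with larger precombine."""
    for rec in records:
        key = rec[record_key]
        existing = table.get(key)
        if existing is None or rec[precombine] >= existing[precombine]:
            table[key] = rec
    return table

def delete(table, keys):
    """Remove records by key; missing keys are ignored."""
    for k in keys:
        table.pop(k, None)
    return table

trips = {}
upsert(trips, [{"uuid": "a", "ts": 1, "fare": 10.0},
               {"uuid": "b", "ts": 1, "fare": 25.0}])
upsert(trips, [{"uuid": "a", "ts": 2, "fare": 12.5}])  # newer ts wins
delete(trips, ["b"])
print(trips)  # {'a': {'uuid': 'a', 'ts': 2, 'fare': 12.5}}
```

In the notebook itself, the same effect comes from Hudi write operations against a real table rather than an in-memory dict.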
#### 2 - Deep Dive into Apache Hudi Table & Query Types: Snapshot, RO,
Incremental, Time Travel, CDC
+
This notebook is your hands-on guide to mastering Apache Hudi's advanced query
capabilities. You will explore practical examples of various read modes such as
Snapshot, Read-Optimized (RO), Incremental, Time Travel, and Change Data
Capture (CDC) so you can understand when and how to use each for building
efficient, real-world data pipelines.
**What you will learn:**
-- How to perform Snapshot and Read-Optimized queries
-- Using Incremental pulls for near real-time data processing
-- Querying historical data with Time Travel
-- Capturing changes with CDC for downstream consumption
+
+* How to perform Snapshot and Read-Optimized queries
+* Using incremental pulls for near-real-time data processing
+* Querying historical data with Time Travel
+* Capturing changes with CDC for downstream consumption
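The idea behind an incremental pull can be sketched without Spark: return only rows committed after a checkpoint instant. The in-memory rows and instant strings below are hypothetical; in the notebook this filtering is done by Hudi's incremental query mode, not by hand.

```python
# Sketch of what an incremental query returns: only records whose commit
# time falls strictly after a given begin instant (hypothetical data).

records = [
    {"uuid": "a", "_hoodie_commit_time": "20251027100000"},
    {"uuid": "b", "_hoodie_commit_time": "20251027110000"},
    {"uuid": "c", "_hoodie_commit_time": "20251027120000"},
]

def incremental_pull(rows, begin_instant):
    """Keep rows committed after the checkpoint instant."""
    return [r for r in rows if r["_hoodie_commit_time"] > begin_instant]

changed = incremental_pull(records, "20251027100000")
print([r["uuid"] for r in changed])  # ['b', 'c']
```

A downstream job would persist the last instant it processed and pass it as the next checkpoint, which is what makes near-real-time pipelines cheap to run.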
#### 3 - Implementing Slowly Changing Dimensions (SCD Type 2 & 4) with Apache
Hudi
+
Dive into this practical guide on implementing two key data warehousing
patterns - Slowly Changing Dimensions (SCD) Type 2 and Type 4 using Apache Hudi.
-SCDs help track changes in dimension data over time without losing historical
context. Instead of overwriting records, these patterns let you maintain a full
history of data changes. Leveraging Hudi's upsert capabilities and rich
metadata, this notebook simplifies what's traditionally a complex process.
+SCDs help track changes in dimension data over time without losing historical
context. Instead of overwriting records, these patterns let you maintain a full
history of data changes. Leveraging Hudi's upsert capabilities and rich
metadata, this notebook simplifies what is traditionally a complex process.
**What you will learn:**
-- SCD Type 2: How to track changes by adding new rows to your dimension tables
-- SCD Type 4: How to manage historical data in a separate history table
+
+* SCD Type 2: How to track changes by adding new rows to your dimension tables
+* SCD Type 4: How to manage historical data in a separate history table
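The SCD Type 2 pattern itself is small enough to sketch in plain Python: close the current row and append a new current one. This is a generic illustration with made-up dimension fields, not Hudi-specific code; the notebook achieves the same effect with Hudi upserts.

```python
from datetime import date

# Minimal SCD Type 2 sketch: track attribute changes by appending rows,
# never overwriting history (illustrative field names).

def scd2_apply(dim_rows, key, new_attrs, as_of):
    """Close the current version of `key` (if changed) and append a new one."""
    for row in dim_rows:
        if row["key"] == key and row["is_current"]:
            if row["attrs"] == new_attrs:
                return dim_rows          # no change, nothing to do
            row["is_current"] = False    # close the old version
            row["end_date"] = as_of
    dim_rows.append({"key": key, "attrs": new_attrs,
                     "start_date": as_of, "end_date": None,
                     "is_current": True})
    return dim_rows

dim = [{"key": "cust-1", "attrs": {"city": "Austin"},
        "start_date": date(2024, 1, 1), "end_date": None, "is_current": True}]
scd2_apply(dim, "cust-1", {"city": "Dallas"}, date(2025, 10, 27))
print(len(dim), dim[-1]["attrs"]["city"])  # 2 Dallas
```

Type 4 differs in that the closed versions would be moved to a separate history table instead of staying in the dimension table.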
#### 4 - Schema Evolution with Apache Hudi: Concepts and Practical Use
-In real-world data lake environments, schema changes are not just common but
they are expected. Whether you are adding new data attributes, adjusting
existing types, or refactoring nested structures, it's essential that your
pipelines adapt without introducing instability.
+
+In real-world data lakehouse environments, schema changes are not just
common—they are expected. Whether you are adding new data attributes, adjusting
existing types, or refactoring nested structures, it is essential that your
pipelines adapt without introducing instability.
Apache Hudi supports powerful schema evolution capabilities that help you
maintain schema flexibility while ensuring data consistency. In this notebook,
we will explore how Hudi enables safe and efficient schema changes, both at
write time and read time.
**What you will learn:**
-- Schema Evolution on Write:
+
+* Schema Evolution on Write:
Apache Hudi allows safe, backward-compatible schema changes during write
operations. This ensures that you can evolve your schema without rewriting
existing data or breaking your ingestion pipelines.
+* Schema Evolution on Read:
+Hudi also supports schema evolution during reads, enabling more flexible
transformations that do not require rewriting the dataset.
-- Schema Evolution on Read:
-Hudi also supports schema evolution during reads, enabling more flexible
transformations that don't require rewriting the dataset.
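The read-side idea can be sketched simply: rows written before a column existed are projected onto the latest schema with the missing field filled as null. This is a hypothetical pure-Python illustration of the concept, not Hudi's schema-handling code.

```python
# Read-time schema evolution sketch: project old rows onto the latest
# schema, filling columns that did not exist when the row was written.

latest_schema = ["uuid", "fare", "tip"]   # "tip" added after old files landed

def project(row, schema):
    """Return `row` with exactly the columns of `schema`; missing -> None."""
    return {col: row.get(col) for col in schema}

old_file_row = {"uuid": "a", "fare": 10.0}                 # pre-evolution
new_file_row = {"uuid": "b", "fare": 25.0, "tip": 3.0}     # post-evolution
print(project(old_file_row, latest_schema))
# {'uuid': 'a', 'fare': 10.0, 'tip': None}
```

Because the projection happens at read time, no existing data files need to be rewritten when such a column is added.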
+#### 5 - A Hands-On Guide to Hudi SQL Procedures
-#### 5 - A Hands-on Guide to Hudi SQL Procedures
Apache Hudi provides a suite of powerful built-in procedures that can be
executed directly from Spark SQL using the familiar CALL syntax.
These procedures enable you to perform advanced table maintenance, auditing,
and data management tasks without writing any custom code or scripts. Whether
you are compacting data, cleaning old versions, or retrieving metadata, Hudi
SQL procedures make it easy and SQL-friendly.
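As a rough sketch of the CALL syntax the notebook covers: procedures are invoked as Spark SQL statements. The table name below is a placeholder, and since this sketch has no live SparkSession the statements are only assembled and printed, not executed; in the notebook they would be passed to `spark.sql(...)`.

```python
# Illustrative Hudi SQL procedure calls (placeholder table name).
# In a notebook with a configured SparkSession: spark.sql(s).show()

stmts = [
    "CALL show_commits(table => 'hudi_trips', limit => 5)",
    "CALL run_clean(table => 'hudi_trips')",
]
for s in stmts:
    print(s)
```

`show_commits` and `run_clean` are examples of the auditing and maintenance procedures referred to above; consult the Hudi documentation for the full procedure list and their parameters.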
diff --git a/website/docs/notebooks.md
b/website/versioned_docs/version-1.0.2/notebooks.md
similarity index 64%
copy from website/docs/notebooks.md
copy to website/versioned_docs/version-1.0.2/notebooks.md
index 07fd2304353f..46a123b7bfed 100644
--- a/website/docs/notebooks.md
+++ b/website/versioned_docs/version-1.0.2/notebooks.md
@@ -1,6 +1,6 @@
---
title: "Notebooks"
-keywords: [ hudi, notebooks]
+keywords: [ hudi, notebooks ]
toc: true
last_modified_at: 2025-10-09T19:13:57+08:00
---
@@ -13,62 +13,74 @@ All you need is a cloned copy of the Hudi repository and
Docker installed on you
### Setup
- * Clone the [Hudi repository](https://github.com/apache/hudi) to your local
machine.
- * Docker Setup : For Mac, Please follow the steps as defined in [Install
Docker Desktop on Mac](https://docs.docker.com/desktop/install/mac-install/).
For running Spark-SQL queries, please ensure atleast 6 GB and 4 CPUs are
allocated to Docker (See Docker -> Preferences -> Advanced).
- * This setup also needs JDK 8 and maven installed on your system.
- * Build Docker Images
- ```sh
- cd hudi-notebooks
- sh build.sh
- ```
- * Start the Environment
- ```sh
- sh run_spark_hudi.sh start
- ```
+* Clone the [Hudi repository](https://github.com/apache/hudi) to your local
machine.
+* Docker Setup: For macOS, follow the steps in [Install Docker Desktop on
Mac](https://docs.docker.com/desktop/install/mac-install/). For Spark SQL
queries, ensure at least 6 GB of memory and 4 CPUs are allocated to Docker (see
Docker > Preferences > Advanced).
+* Build Docker Images
+
+ ```shell
+ # under Hudi repo root dir
+ cd hudi-notebooks
+ sh build.sh
+ ```
+
+* Start the Environment
+
+ ```shell
+ sh run_spark_hudi.sh start
+ ```
### Meet Your Notebooks
+
#### 1 - Getting Started with Apache Hudi: A Hands-On Guide to CRUD Operations
-This notebook is a beginner friendly, practical guide to working with Apache
Hudi using PySpark. It walks you through the essential CRUD operations (Create,
Read, Update, Delete) on Hudi tables, while also helping you understand key
table types such as Copy-On-Write (COW) and Merge-On-Read (MOR).
-For storage, we use MinIO as an S3-compatible backend, simulating a modern
datalake setup.
+This notebook is a beginner-friendly, practical guide to working with Apache
Hudi using PySpark. It walks you through the essential CRUD operations (Create,
Read, Update, Delete) on Hudi tables, while also helping you understand key
table types such as Copy-On-Write (COW) and Merge-On-Read (MOR).
+
+For storage, we use MinIO as an S3-compatible backend.
**What you will learn:**
-- How to create and update Hudi tables using PySpark
-- The difference between COW and MOR tables
-- Reading data using snapshot and incremental queries
-- How Hudi handles upserts and deletes
+
+* How to create and update Hudi tables using PySpark
+* The difference between COW and MOR tables
+* Reading data using snapshot and incremental queries
+* How Hudi handles upserts and deletes
#### 2 - Deep Dive into Apache Hudi Table & Query Types: Snapshot, RO,
Incremental, Time Travel, CDC
+
This notebook is your hands-on guide to mastering Apache Hudi's advanced query
capabilities. You will explore practical examples of various read modes such as
Snapshot, Read-Optimized (RO), Incremental, Time Travel, and Change Data
Capture (CDC) so you can understand when and how to use each for building
efficient, real-world data pipelines.
**What you will learn:**
-- How to perform Snapshot and Read-Optimized queries
-- Using Incremental pulls for near real-time data processing
-- Querying historical data with Time Travel
-- Capturing changes with CDC for downstream consumption
+
+* How to perform Snapshot and Read-Optimized queries
+* Using incremental pulls for near-real-time data processing
+* Querying historical data with Time Travel
+* Capturing changes with CDC for downstream consumption
#### 3 - Implementing Slowly Changing Dimensions (SCD Type 2 & 4) with Apache
Hudi
+
Dive into this practical guide on implementing two key data warehousing
patterns - Slowly Changing Dimensions (SCD) Type 2 and Type 4 using Apache Hudi.
-SCDs help track changes in dimension data over time without losing historical
context. Instead of overwriting records, these patterns let you maintain a full
history of data changes. Leveraging Hudi's upsert capabilities and rich
metadata, this notebook simplifies what's traditionally a complex process.
+SCDs help track changes in dimension data over time without losing historical
context. Instead of overwriting records, these patterns let you maintain a full
history of data changes. Leveraging Hudi's upsert capabilities and rich
metadata, this notebook simplifies what is traditionally a complex process.
**What you will learn:**
-- SCD Type 2: How to track changes by adding new rows to your dimension tables
-- SCD Type 4: How to manage historical data in a separate history table
+
+* SCD Type 2: How to track changes by adding new rows to your dimension tables
+* SCD Type 4: How to manage historical data in a separate history table
#### 4 - Schema Evolution with Apache Hudi: Concepts and Practical Use
-In real-world data lake environments, schema changes are not just common but
they are expected. Whether you are adding new data attributes, adjusting
existing types, or refactoring nested structures, it's essential that your
pipelines adapt without introducing instability.
+
+In real-world data lakehouse environments, schema changes are not just
common—they are expected. Whether you are adding new data attributes, adjusting
existing types, or refactoring nested structures, it is essential that your
pipelines adapt without introducing instability.
Apache Hudi supports powerful schema evolution capabilities that help you
maintain schema flexibility while ensuring data consistency. In this notebook,
we will explore how Hudi enables safe and efficient schema changes, both at
write time and read time.
**What you will learn:**
-- Schema Evolution on Write:
+
+* Schema Evolution on Write:
Apache Hudi allows safe, backward-compatible schema changes during write
operations. This ensures that you can evolve your schema without rewriting
existing data or breaking your ingestion pipelines.
+* Schema Evolution on Read:
+Hudi also supports schema evolution during reads, enabling more flexible
transformations that do not require rewriting the dataset.
-- Schema Evolution on Read:
-Hudi also supports schema evolution during reads, enabling more flexible
transformations that don't require rewriting the dataset.
+#### 5 - A Hands-On Guide to Hudi SQL Procedures
-#### 5 - A Hands-on Guide to Hudi SQL Procedures
Apache Hudi provides a suite of powerful built-in procedures that can be
executed directly from Spark SQL using the familiar CALL syntax.
These procedures enable you to perform advanced table maintenance, auditing,
and data management tasks without writing any custom code or scripts. Whether
you are compacting data, cleaning old versions, or retrieving metadata, Hudi
SQL procedures make it easy and SQL-friendly.
diff --git a/website/versioned_sidebars/version-1.0.2-sidebars.json
b/website/versioned_sidebars/version-1.0.2-sidebars.json
index 51c1310df69e..b8634292d6ab 100644
--- a/website/versioned_sidebars/version-1.0.2-sidebars.json
+++ b/website/versioned_sidebars/version-1.0.2-sidebars.json
@@ -10,6 +10,7 @@
"flink-quick-start-guide",
"python-rust-quick-start-guide",
"docker_demo",
+ "notebooks",
"use_cases"
]
},
@@ -131,8 +132,7 @@
]
}
]
- },
- "privacy"
+ }
],
"quick_links": [
{