rangareddy opened a new pull request, #18939: URL: https://github.com/apache/hudi/pull/18939
### Describe the issue this Pull Request addresses

The `hudi-notebooks` demo environment only shipped a Spark 3 image. This adds a parallel Spark 4 image and improves the existing notebooks. Changes are scoped entirely to the `hudi-notebooks` module (a Docker Compose demo); no production code or public APIs are touched.

### Summary and Changelog

**Spark 4 stack**

- `Dockerfile.spark4`: Spark 4.0.2 / Scala 2.13 / Java 17 / Hudi 1.1.1. Uses the `hudi-spark4.0-bundle_2.13` bundle and AWS SDK v2 (`software.amazon.awssdk:bundle`), since Hadoop 3.4.x migrated S3A off the v1 SDK.
- `conf/spark4/`, `requirements-spark4.txt` (adds the `hudi`/hudi-rs Python package), and a `spark4-hudi` `docker-compose` service on non-colliding ports with its own `data/spark4-events` mount.
- `build.sh` gains a parallel `SPARK4_*` version block and a build step tagging `apachehudi/spark4-hudi`.

**Notebook reorganization**

- Notebooks split into `common/` (shared, baked into both images) and `spark3/` + `spark4/`, each with its own `utils.py` (differing only in default Spark/Scala/Hudi versions).
- New Spark 4 hudi-rs example notebook: write with Spark, then query with the native `hudi-rs` reader (snapshot, partition-filter, time-travel, incremental).

**Runtime / fixes**

- Notebooks run against the in-container Spark standalone master with 4g driver/executor memory (in `spark-defaults.conf`); each service pins a deterministic `hostname`.
- The Hudi bundle is added to driver + executor `extraClassPath` (resolved locally, downloaded if missing) to avoid a metadata-table `ClassCastException` on the standalone cluster.
- `hoodie.write.table.version=6` added to the Presto example (commented in the Trino example) for query-engine compatibility, with explanatory notes.
- Silenced the AWS SDK v1 deprecation banner in the Presto image; fixed a Spark 4 ANSI-mode string-concat error in the SCD notebook (`+` -> `concat_ws`).

### Impact

No public API or production behavior change.
Affects only the `hudi-notebooks` demo/dev environment (new image, notebooks, docs).

### Risk Level

low

Isolated to the `hudi-notebooks` demo module. Configs and notebooks were validated statically (shell syntax, `docker compose config`, notebook JSON / Python). The Docker images have not yet been built/run end-to-end.

### Documentation Update

Updated `hudi-notebooks/README.md` and `hudi-notebooks/CLAUDE.md` for the new Spark 4 image, services, and notebook layout. No Hudi website/config changes.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
