# Modernize Bundle Validation CI by Migrating to Testcontainers

# Overview

This discussion proposes a significant modernization of Apache Hudi's bundle 
validation infrastructure. Currently, our bundle validation process relies on a 
complex combination of Docker-based shell scripts (`ci_run.sh` and 
`validate.sh`) to verify the integrity and functionality of our release 
artifacts (Spark, Flink, Utilities, etc.).

This infrastructure is critical as it powers three major workflows:

1. `bot.yml`: The active CI workflow running on PRs and commits. **[MAIN]**
2. `release_candidate_validation.yml`: Validation for release candidates 
(currently disabled).
3. `maven_artifact_validation.yml`: Post-release validation for Maven Central 
artifacts (currently disabled).

The current execution chain involves a GitHub Actions workflow triggering 
`ci_run.sh`, which sets up a Docker environment, mounts volumes, and then 
executes `validate.sh` inside the container. This script then sequentially runs 
a series of tests across various bundles.


# Current CI Structure

## Workflow Execution

- **Triggers**: The system runs on pushes/PRs to `master`, `release-*`, and `branch-0.x`.
- **Matrix Strategy / Coverage**: We maintain a comprehensive test matrix 
covering:
    - **Java**: 8, 11, 17
    - **Scala**: 2.12, 2.13
    - **Spark**: 3.3.x, 3.4.x, 3.5.x, 4.0.0
    - **Flink**: 1.17 - 1.20, 2.0
- **Status**: While `bot.yml` is active for standard CI, the release candidate and Maven artifact validations have been manually disabled in their YAML files.


## Validation Process

Inside the Docker container, `validate.sh` performs the following validation 
steps sequentially:

1. **Spark & Hadoop MR**: Starts Derby/Hive, runs Hive sync, and validates with 
Spark SQL and HiveQL.
2. **Utilities Bundle**: Runs HoodieDeltaStreamer and validates output size and 
content via Spark shell.
3. **Utilities Slim Bundle**: Validates the slim bundle in conjunction with the 
Spark bundle.
4. **Flink Bundle**: Starts a Flink cluster, runs SQL inserts, and validates 
via a compaction script.
5. **Kafka Connect Bundle**: Spins up ZooKeeper, Kafka, and Schema Registry to test the Hudi Sink connector.
6. **Metaserver Bundle**: Starts the Metaserver and validates read/write via 
Spark DataSource.
7. **CLI Bundle**: Executes a series of Hudi CLI commands to verify table 
management.

## Env Stack

- **Orchestration**: Bash scripts (`ci_run.sh`, `validate.sh`) managing the 
lifecycle.
- **Environment**: Custom Docker image (`hudi-ci-bundle-validation-base`) built in CI or pulled from Docker Hub, with tags varying by dependency versions.
- **Dependencies**: Manual management of Derby, Hive, Spark, Flink, Kafka, and ZooKeeper lifecycles within the script.


# Current Challenges

An analysis of the `packaging/bundle-validation/` directory reveals several pain points:

- **Sequential Bottlenecks**: Tests run strictly one after another. If the 
Spark validation takes time, Flink and Kafka tests must wait, extending the 
feedback loop.
- **Fragile Service Management**: Services like Hive, Derby, and ZooKeeper are started as background processes (`&`) with PIDs captured for later cleanup (`kill $PID`). This is prone to zombie processes or port conflicts if a script crashes early.
- **Debugging Complexity**: "Works on my machine" is hard to achieve. 
Reproducing a CI failure requires a developer to build the exact Docker image, 
mount the correct volumes, and run the shell script manually, mimicking the CI 
environment.
- **Opaque Observability**: Failures often manifest as a generic "exit code 1" 
from the shell script. We lack structured JUnit reports for individual 
validation steps, making it harder to parse which specific test case failed 
without digging into raw logs.
- **Maintenance Overhead**: Logic is split between the GitHub Actions YAML, `ci_run.sh` (host setup), and `validate.sh` (container execution). Adding a new bundle requires updates across multiple files and languages.


# Proposed Solution: Migrate to Testcontainers

## What is Testcontainers?

[Testcontainers](https://testcontainers.com/) is a Java library that supports 
JUnit tests, providing lightweight, throwaway instances of common databases, 
Selenium web browsers, or anything else that can run in a Docker container.
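
As a minimal sketch of the programming model (JUnit 5 with a stand-in Redis image; Redis is not part of Hudi's stack, just an easy-to-start service for illustration), a container can be declared, started, and torn down entirely from test code:

```java
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

@Testcontainers
class ExampleContainerTest {

  // Started before the test and reaped automatically afterwards,
  // even on abnormal exits -- no PID bookkeeping or manual `kill`.
  @Container
  private final GenericContainer<?> redis =
      new GenericContainer<>(DockerImageName.parse("redis:7-alpine"))
          .withExposedPorts(6379);

  @Test
  void containerIsReachable() {
    // The container port is mapped to a random free host port,
    // sidestepping port conflicts between concurrent tests.
    String endpoint = redis.getHost() + ":" + redis.getMappedPort(6379);
    Assertions.assertTrue(redis.isRunning(), "Expected container at " + endpoint);
  }
}
```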


## Benefits

- **Parallelization**: We can run bundle validations concurrently using JUnit 5's parallel execution features. Spark and Flink validations could run simultaneously in isolated containers (see the sketch after this list).
- **Developer Experience**: Developers can run bundle validation with a standard `mvn test` command. There is no need to manually build Docker images or set up volume mounts, so writing integration tests will no longer be painful and daunting.
- **Maintainability**: Validation logic moves from Bash to Java. We can 
leverage strong typing, code reuse, and standard Java libraries for assertions 
and flow control.
- **Observability**: We gain granular reporting. Each bundle validation becomes 
a distinct test case in the JUnit report. Failed tests provide standard stack 
traces and assertion errors.
- **Flexibility**: We can use wait strategies (`WaitStrategy`) to ensure services like the Hive Metastore are fully ready before tests start, replacing the arbitrary `sleep` commands found in the current scripts.
- **Consistency**: The test environment is defined in code, ensuring that local 
executions match CI executions perfectly.
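
To make the parallelization and wait-strategy points concrete, below is a rough sketch of what one validation test could look like. The image name, readiness log pattern, and class/method names are illustrative assumptions, not Hudi's actual setup:

```java
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.parallel.Execution;
import org.junit.jupiter.api.parallel.ExecutionMode;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.wait.strategy.Wait;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

@Testcontainers
@Execution(ExecutionMode.CONCURRENT) // may run alongside, e.g., a Flink validation class
class SparkBundleValidationTest { // hypothetical class name

  @Container
  private final GenericContainer<?> hiveMetastore =
      new GenericContainer<>(DockerImageName.parse("apache/hive:4.0.0")) // illustrative image/tag
          .withExposedPorts(9083)
          // Replaces arbitrary sleeps: block until the service signals readiness.
          // The exact log pattern here is an assumption.
          .waitingFor(Wait.forLogMessage(".*Starting Hive Metastore Server.*", 1));

  @Test
  void hiveSyncValidation() {
    // Placeholder assertion; a real test would run Hive sync and validate
    // via Spark SQL, mirroring step 1 of the current validate.sh.
    Assertions.assertTrue(hiveMetastore.isRunning());
    String thriftUri =
        "thrift://" + hiveMetastore.getHost() + ":" + hiveMetastore.getMappedPort(9083);
    Assertions.assertFalse(thriftUri.isEmpty());
  }
}
```

Note that class-level concurrency also requires enabling JUnit 5's parallel mode, e.g. `junit.jupiter.execution.parallel.enabled=true` in `junit-platform.properties`.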

# Success Metrics / End Goal

- **Code Quality**: Retire all Bash scripts in favor of typed, testable Java code.
- **Speed**: With parallelization, we should reduce overall CI runtime.
- **Developer Experience**: A contributor can run `mvn test -pl packaging/hudi-bundle-validation`, write validation tests just like unit tests, and run them from their IDE without having to switch between terminals and shell commands.

# References

[Testcontainers Documentation](https://java.testcontainers.org/)

[JUnit 5 Parallel Execution](https://junit.org/junit5/docs/current/user-guide/#writing-tests-parallel-execution)

[Apache Hudi Bundle Validation Readme](https://github.com/apache/hudi/tree/master/packaging/bundle-validation#readme)


# Next Steps

If the community is aligned with this proposal, we can start drafting more concrete plans for navigating this migration.

Looking forward to your thoughts!

GitHub link: https://github.com/apache/hudi/discussions/17638
