# Modernize Bundle Validation CI by Migrating to Testcontainers

# Overview

This discussion proposes a significant modernization of Apache Hudi's bundle 
validation infrastructure. Currently, our bundle validation process relies on a 
complex combination of Docker-based shell scripts (`ci_run.sh` and 
`validate.sh`) to verify the integrity and functionality of our release 
artifacts (Spark, Flink, Utilities, etc.).

This infrastructure is critical as it powers three major workflows:

1. `bot.yml`: The active CI workflow running on PRs and commits. **[MAIN]**
2. `release_candidate_validation.yml`: Validation for release candidates 
(currently disabled).
3. `maven_artifact_validation.yml`: Post-release validation for Maven Central 
artifacts (currently disabled).

The current execution chain involves a GitHub Actions workflow triggering 
`ci_run.sh`, which sets up a Docker environment, mounts volumes, and then 
executes `validate.sh` inside the container. This script then sequentially runs 
a series of tests across various bundles.


# Current CI Structure

## Workflow Execution

- **Triggers**: The system runs on pushes/PRs to `master`, `release-*`, and `branch-0.x`.
- **Matrix Strategy / Coverage**: We maintain a comprehensive test matrix 
covering:
    - **Java**: 8, 11, 17
    - **Scala**: 2.12, 2.13
    - **Spark**: 3.3.x, 3.4.x, 3.5.x, 4.0.0
    - **Flink**: 1.17 - 1.20, 2.0
- **Status**: While `bot.yml` is active for standard CI, the release candidate and Maven artifact validations have been manually disabled in their YAML files.


## Validation Process

Inside the Docker container, `validate.sh` performs the following validation 
steps sequentially:

1. **Spark & Hadoop MR**: Starts Derby/Hive, runs Hive sync, and validates with 
Spark SQL and HiveQL.
2. **Utilities Bundle**: Runs HoodieDeltaStreamer and validates output size and 
content via Spark shell.
3. **Utilities Slim Bundle**: Validates the slim bundle in conjunction with the 
Spark bundle.
4. **Flink Bundle**: Starts a Flink cluster, runs SQL inserts, and validates 
via a compaction script.
5. **Kafka Connect Bundle**: Spins up ZooKeeper, Kafka, and Schema Registry to test the Hudi Sink connector.
6. **Metaserver Bundle**: Starts the Metaserver and validates read/write via 
Spark DataSource.
7. **CLI Bundle**: Executes a series of Hudi CLI commands to verify table 
management.

## Env Stack

- **Orchestration**: Bash scripts (`ci_run.sh`, `validate.sh`) managing the 
lifecycle.
- **Environment**: Custom Docker image (`hudi-ci-bundle-validation-base`) built in CI or pulled from Docker Hub, with tags varying by dependency versions.
- **Dependencies**: Manual management of Derby, Hive, Spark, Flink, Kafka, and ZooKeeper lifecycles within the script.


# Current Challenges

An analysis of the `packaging/bundle-validation/` directory reveals several pain points:

- **Sequential Bottlenecks**: Tests run strictly one after another. If the 
Spark validation takes time, Flink and Kafka tests must wait, extending the 
feedback loop.
- **Fragile Service Management**: Services like Hive, Derby, and ZooKeeper are started as background processes (`&`) with PIDs captured for later cleanup (`kill $PID`). This is prone to zombie processes or port conflicts if a script crashes early.
- **Debugging Complexity**: "Works on my machine" is hard to achieve. 
Reproducing a CI failure requires a developer to build the exact Docker image, 
mount the correct volumes, and run the shell script manually, mimicking the CI 
environment.
- **Opaque Observability**: Failures often manifest as a generic "exit code 1" 
from the shell script. We lack structured JUnit reports for individual 
validation steps, making it harder to parse which specific test case failed 
without digging into raw logs.
- **Maintenance Overhead**: Logic is split between the GitHub Actions YAML, `ci_run.sh` (host setup), and `validate.sh` (container execution). Adding a new bundle requires updates across multiple files and languages.


# Proposed Solution: Migrate to Testcontainers

## What is Testcontainers?

[Testcontainers](https://testcontainers.com/) is a Java library that supports 
JUnit tests, providing lightweight, throwaway instances of common databases, 
Selenium web browsers, or anything else that can run in a Docker container.
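
As a minimal sketch of the programming model (JUnit 5 with a stand-in Redis image; Redis is not part of Hudi's stack, just an easy-to-start service for illustration), a container can be declared, started, and torn down entirely from test code:

```java
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

@Testcontainers
class ExampleContainerTest {

  // Started before the test and reaped automatically afterwards,
  // even on abnormal exits -- no PID bookkeeping or manual `kill`.
  @Container
  private final GenericContainer<?> redis =
      new GenericContainer<>(DockerImageName.parse("redis:7-alpine"))
          .withExposedPorts(6379);

  @Test
  void containerIsReachable() {
    // The container port is mapped to a random free host port,
    // sidestepping port conflicts between concurrent tests.
    String endpoint = redis.getHost() + ":" + redis.getMappedPort(6379);
    Assertions.assertTrue(redis.isRunning(), "Expected container at " + endpoint);
  }
}
```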


## Benefits

- **Parallelization**: We can run bundle validations concurrently using JUnit 5's parallel execution features. Spark and Flink validations could run simultaneously in isolated containers (see the sketch after this list).
- **Developer Experience**: Developers can run bundle validation with a standard `mvn test` command. There is no need to manually build Docker images or set up volume mounts, so writing integration tests will no longer be painful and daunting.
- **Maintainability**: Validation logic moves from Bash to Java. We can 
leverage strong typing, code reuse, and standard Java libraries for assertions 
and flow control.
- **Observability**: We gain granular reporting. Each bundle validation becomes 
a distinct test case in the JUnit report. Failed tests provide standard stack 
traces and assertion errors.
- **Flexibility**: We can use wait strategies (`WaitStrategy`) to ensure services like the Hive Metastore are fully ready before tests start, replacing the arbitrary `sleep` commands found in the current scripts.
- **Consistency**: The test environment is defined in code, ensuring that local 
executions match CI executions perfectly.
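
To make the parallelization and wait-strategy points concrete, below is a rough sketch of what one validation test could look like. The image name, readiness log pattern, and class/method names are illustrative assumptions, not Hudi's actual setup:

```java
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.parallel.Execution;
import org.junit.jupiter.api.parallel.ExecutionMode;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.wait.strategy.Wait;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

@Testcontainers
@Execution(ExecutionMode.CONCURRENT) // may run alongside, e.g., a Flink validation class
class SparkBundleValidationTest { // hypothetical class name

  @Container
  private final GenericContainer<?> hiveMetastore =
      new GenericContainer<>(DockerImageName.parse("apache/hive:4.0.0")) // illustrative image/tag
          .withExposedPorts(9083)
          // Replaces arbitrary sleeps: block until the service signals readiness.
          // The exact log pattern here is an assumption.
          .waitingFor(Wait.forLogMessage(".*Starting Hive Metastore Server.*", 1));

  @Test
  void hiveSyncValidation() {
    // Placeholder assertion; a real test would run Hive sync and validate
    // via Spark SQL, mirroring step 1 of the current validate.sh.
    Assertions.assertTrue(hiveMetastore.isRunning());
    String thriftUri =
        "thrift://" + hiveMetastore.getHost() + ":" + hiveMetastore.getMappedPort(9083);
    Assertions.assertFalse(thriftUri.isEmpty());
  }
}
```

Note that class-level concurrency also requires enabling JUnit 5's parallel mode, e.g. `junit.jupiter.execution.parallel.enabled=true` in `junit-platform.properties`.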

# Success Metrics / End Goal

- **Code Quality**: Retire all Bash scripts in favor of typed, testable Java code.
- **Speed**: With parallelization, we should reduce overall CI runtime.
- **Developer Experience**: A contributor can run `mvn test -pl packaging/hudi-bundle-validation`, write validation tests just like unit tests, and run them from their IDE without having to switch between terminals and shell commands.

# References

[Testcontainers Documentation](https://java.testcontainers.org/)

[JUnit 5 Parallel Execution](https://junit.org/junit5/docs/current/user-guide/#writing-tests-parallel-execution)

[Apache Hudi Bundle Validation Readme](https://github.com/apache/hudi/tree/master/packaging/bundle-validation#readme)


# Next Steps

If the community is aligned with this proposal, we can start drafting more concrete plans for navigating this migration.

Looking forward to your thoughts!

GitHub link: https://github.com/apache/hudi/discussions/17638
