Hi Aaron,

Thanks for initiating this discussion; we should definitely spend some time investigating this area.
I wanted to share a few ideas and observations:

* Enable parallel test execution
  We should enable the parallel tests profile for all modules that don't currently use it (similar to [1]) to reduce overall test time.

* Revisit Yetus responsibilities
  Currently, Apache Yetus handles builds and tests on JDK 17, and also performs a build on JDK 21. We could consider:
  - Keeping Yetus focused on JDK 17 (build + test)
  - Moving JDK 21 builds to GitHub Actions, potentially limiting them to trunk only

* Parallelize module test execution in Yetus
  Yetus identifies the modified modules, but when multiple modules are impacted, it runs their tests serially. This appears to be a major contributor to longer CI times; parallelizing this step could provide significant improvements.

* Move non-critical stages out of Yetus
  Stages like Javadoc generation could be shifted to GitHub Actions, reducing load and runtime in Yetus.

* Optimize slow tests
  We should identify the tests that take the most time and see where optimization is possible. For example, some test classes start and stop Mini*Cluster instances for each test; there may be scope to reuse them or reduce lifecycle overhead.

* Remove redundant / low-value tests
  It's also worth reviewing tests that are redundant or disproportionately expensive relative to the value they provide, and trimming or refactoring them where appropriate.

* Checkstyle / FindBugs handling
  I understand we run these checks on both trunk and branches because violation detection relies on diffs (i.e., no diff means no violations). Not sure if there's an alternative approach here, but it is worth exploring if we want to simplify the pipeline.

* Enable parallel builds
  We should also ensure parallel build execution is enabled (as in [2]) wherever possible.
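To illustrate the Mini*Cluster point above, here is a toy sketch of the lifecycle change: pay the startup cost once per test class instead of once per test. MiniDfsClusterStub is a hypothetical stand-in for an expensive fixture like Hadoop's MiniDFSCluster; a real migration would move the setup into a JUnit @BeforeClass/@AfterClass (or @BeforeAll/@AfterAll) pair.

```java
// Sketch only: MiniDfsClusterStub is a hypothetical stand-in for an
// expensive fixture such as Hadoop's MiniDFSCluster.
class MiniDfsClusterStub {
    static int startups = 0;              // counts expensive cluster boots
    MiniDfsClusterStub() { startups++; }  // "boot" the cluster
    void shutdown() { }
}

public class ClusterReuseDemo {
    public static void main(String[] args) {
        // Per-test lifecycle (e.g. JUnit @Before/@After): N boots for N tests.
        for (int i = 0; i < 3; i++) {
            MiniDfsClusterStub c = new MiniDfsClusterStub();
            c.shutdown();
        }
        int perTest = MiniDfsClusterStub.startups;

        // Per-class lifecycle (e.g. @BeforeClass/@AfterClass): one shared boot.
        MiniDfsClusterStub.startups = 0;
        MiniDfsClusterStub shared = new MiniDfsClusterStub();
        for (int i = 0; i < 3; i++) {
            // each "test" reuses 'shared' instead of booting its own cluster
        }
        shared.shutdown();

        System.out.println(perTest + " boots vs " + MiniDfsClusterStub.startups);
    }
}
```

The caveat, of course, is that shared clusters require tests to avoid leaking state (files, configs) into each other, so this only fits test classes whose cases are already independent of cluster contents.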
-Ayush

[1] https://issues.apache.org/jira/browse/HDFS-14888
[2] https://issues.apache.org/jira/browse/HADOOP-18394

On Fri, 3 Apr 2026 at 10:43, Cheng Pan <[email protected]> wrote:
>
> Thanks for raising this thread, it's really nice that one PMC member is
> taking the lead in improving Hadoop CI!
>
> I can share some of the issues I've observed so far.
>
> 1. The current CI pipeline runs in serial; I recall that, with good luck (no
> crash, no OOM), it could take over 28 hours to complete the entire CI
> pipeline.
> 2. Some jobs run twice: once on the trunk branch (the PR target branch) and
> once on the PR branch. I haven't looked into the underlying reasons, but I
> think we might be able to omit the trunk run.
> 3. Hadoop CI runs on Jenkins servers maintained by ASF, and these machines
> sometimes appear to have stability issues.
>
> I tried to improve it, but I found that I first had to figure out how Yetus
> works; it is a tool mainly composed of shell and scripting languages, which
> is quite a challenge for me. Subsequently, I changed direction and explored
> whether Hadoop CI could be migrated to GitHub Actions (GHA), and found some
> good news, but also some challenges.
>
> 1. We should test Hadoop inside a container instead of a virtual machine,
> for consistent native library installation; that means we should build the
> Dockerfile on the fly, then use the resulting image as the container for
> testing.
> 2. When running GHA jobs inside a container, there are some limitations [1];
> for example, the USER cannot be changed to non-root, which causes some HDFS
> tests, especially those related to permissions, to not work properly.
> 3. The standard GitHub-hosted runner [2] has 4 cores and 16 GB of RAM, which
> is not sufficient for some tests.
>
> Given the current situation, I think we can do the following immediately:
>
> 1. Move some CI jobs (e.g., native compile tests on Debian and Rocky) from
> Jenkins to GHA, and run them in parallel.
> 2. Investigate whether we can skip running CI on the trunk branch for PRs.
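On Cheng Pan's first immediate step, a GHA workflow that builds the CI image on the fly and runs the native compile per distro in parallel might look roughly like the sketch below. The workflow name, Dockerfile paths, and Maven flags here are assumptions for illustration, not our actual build layout:

```yaml
# Hypothetical sketch (names/paths are assumptions): build the CI image
# on the fly, then run the native compile inside it, one job per distro.
name: native-compile
on:
  pull_request:
jobs:
  native:
    strategy:
      matrix:
        distro: [debian, rocky]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build CI image from the Dockerfile
        run: docker build -t hadoop-ci-${{ matrix.distro }}
             -f dev-support/docker/Dockerfile_${{ matrix.distro }} .
      - name: Native compile inside the container
        run: docker run --rm -v "$PWD:/src" -w /src
             hadoop-ci-${{ matrix.distro }} mvn -q -Pnative -DskipTests compile
```

One design note: invoking `docker run` directly, rather than using GHA's `container:` key, sidesteps the root-only USER limitation Cheng Pan mentioned, since `docker run --user` can switch to a non-root user.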
>
> Additionally, I'm also investigating whether we can get rid of some native
> code/dependencies in the future, for example:
>
> - In HADOOP-19839, I found that modern JDKs already provide fast enough
> built-in CRC32/CRC32C implementations; do we still need to maintain Hadoop's
> native CRC32/CRC32C in `libhadoop`?
> - In HADOOP-19855, I'm investigating replacing the native zstd C bindings
> with the zstd-jni library, which is the de facto choice for the Zstandard
> compression algorithm in JVM applications.
>
> [1]
> https://docs.github.com/en/actions/reference/workflows-and-actions/dockerfile-support#user
> [2] https://docs.github.com/en/actions/reference/runners/github-hosted-runners
>
> Thanks,
> Cheng Pan
>
>
>
> > On Apr 3, 2026, at 06:28, Aaron Fabbri <[email protected]> wrote:
> >
> > I'd like to put some effort into improving our CI run time and reliability,
> > but I need your help. I don't know how everything works, and there is too
> > much work to do for one person.
> >
> > Join me in an informal "interest group" of folks that are interested in:
> >
> > - Reducing runtime of existing CI / branch tests.
> > - Eliminating flaky tests.
> > - Improving test coverage and tooling.
> >
> > Please reply to this thread if you are interested in helping, or if you
> > have ideas for specific technical issues to address. We can use this JIRA
> > to track related efforts:
> >
> > https://issues.apache.org/jira/browse/HADOOP-19820
> >
> > You can also tag me in the #hadoop channel on ASF Slack:
> > https://the-asf.slack.com/archives/CDSDT7A0H
> >
> > (I'll volunteer to keep this mailing list updated on any interesting
> > discussions there.)
> >
> > Thanks!
> > Aaron <[email protected]>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
