Hi Aaron,

Thanks for initiating this discussion; we should definitely spend some time investigating this area.
I wanted to share a few ideas and observations:

* Enable parallel test execution
  We should enable the parallel tests profile for all modules that don't currently use it (similar to [1]) to reduce overall test time.

* Revisit Yetus responsibilities
  Currently, Apache Yetus handles builds and tests on JDK 17, and also performs a build on JDK 21. We could consider:
  - Keeping Yetus focused on JDK 17 (build + test)
  - Moving JDK 21 builds to GitHub Actions, potentially limiting them to trunk only

* Parallelize module test execution in Yetus
  Yetus identifies the modified modules, but when multiple modules are impacted, it runs their tests serially. This appears to be a major contributor to longer CI times; parallelizing this step could provide significant improvements.

* Move non-critical stages out of Yetus
  Stages like Javadoc generation could be shifted to GitHub Actions, reducing load and runtime in Yetus.

* Optimize slow tests
  We should identify the tests that take the most time and see where optimization is possible. For example, some test classes start and stop Mini*Cluster instances for each test; there may be scope to reuse them or reduce lifecycle overhead.

* Remove redundant / low-value tests
  It's also worth reviewing tests that are redundant or disproportionately expensive relative to the value they provide, and trimming or refactoring them where appropriate.

* Checkstyle / FindBugs handling
  I understand we run these checks on both trunk and branches because violation detection relies on diffs (i.e., no diff means no violations). Not sure if there's an alternative approach here, but it is worth exploring if we want to simplify the pipeline.

* Enable parallel builds
  We should also ensure parallel build execution is enabled (as in [2]) wherever possible.
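To illustrate the Mini*Cluster point above, here is a toy sketch of the lifecycle change: pay the startup cost once per test class instead of once per test. MiniDfsClusterStub is a hypothetical stand-in for an expensive fixture like Hadoop's MiniDFSCluster; a real migration would move the setup into a JUnit @BeforeClass/@AfterClass (or @BeforeAll/@AfterAll) pair.

```java
// Sketch only: MiniDfsClusterStub is a hypothetical stand-in for an
// expensive fixture such as Hadoop's MiniDFSCluster.
class MiniDfsClusterStub {
    static int startups = 0;              // counts expensive cluster boots
    MiniDfsClusterStub() { startups++; }  // "boot" the cluster
    void shutdown() { }
}

public class ClusterReuseDemo {
    public static void main(String[] args) {
        // Per-test lifecycle (e.g. JUnit @Before/@After): N boots for N tests.
        for (int i = 0; i < 3; i++) {
            MiniDfsClusterStub c = new MiniDfsClusterStub();
            c.shutdown();
        }
        int perTest = MiniDfsClusterStub.startups;

        // Per-class lifecycle (e.g. @BeforeClass/@AfterClass): one shared boot.
        MiniDfsClusterStub.startups = 0;
        MiniDfsClusterStub shared = new MiniDfsClusterStub();
        for (int i = 0; i < 3; i++) {
            // each "test" reuses 'shared' instead of booting its own cluster
        }
        shared.shutdown();

        System.out.println(perTest + " boots vs " + MiniDfsClusterStub.startups);
    }
}
```

The caveat, of course, is that shared clusters require tests to avoid leaking state (files, configs) into each other, so this only fits test classes whose cases are already independent of cluster contents.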
-Ayush

[1] https://issues.apache.org/jira/browse/HDFS-14888
[2] https://issues.apache.org/jira/browse/HADOOP-18394

On Fri, 3 Apr 2026 at 10:43, Cheng Pan <[email protected]> wrote:
>
> Thanks for raising this thread, it's really nice that one PMC member is
> taking the lead in improving Hadoop CI!
>
> I can share some of the issues I've observed so far.
>
> 1. The current CI pipeline runs in serial; I recall that, with good luck (no
> crash, no OOM), it could take over 28 hours to complete the entire CI
> pipeline.
> 2. Some jobs run twice: once on the trunk branch (the PR target branch) and
> once on the PR branch. I haven't looked into the underlying reasons, but I
> think we might be able to omit the trunk run.
> 3. Hadoop CI runs on Jenkins servers maintained by ASF, and these machines
> sometimes appear to have stability issues.
>
> I tried to improve it, but I found that I first had to figure out how Yetus
> works; it is a tool mainly composed of shell and scripting languages, which
> is quite a challenge for me. Subsequently, I changed direction and explored
> whether Hadoop CI could be migrated to GitHub Actions (GHA), and found some
> good news, but also some challenges.
>
> 1. We should test Hadoop inside a container instead of a virtual machine,
> for consistent native library installation; that means we should build the
> Dockerfile on the fly, then use the resulting image as the container for
> testing.
> 2. When running GHA jobs inside a container, there are some limitations [1];
> for example, the USER cannot be changed to non-root, which causes some HDFS
> tests, especially those related to permissions, to not work properly.
> 3. The standard GitHub-hosted runner [2] has 4 cores and 16 GB of RAM, which
> is not sufficient for some tests.
>
> Given the current situation, I think we can do the following immediately:
>
> 1. Move some CI jobs (e.g., native compile tests on Debian and Rocky) from
> Jenkins to GHA, and run them in parallel.
> 2. Investigate whether we can skip running CI on the trunk branch for PRs.
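On Cheng Pan's first immediate step, a GHA workflow that builds the CI image on the fly and runs the native compile per distro in parallel might look roughly like the sketch below. The workflow name, Dockerfile paths, and Maven flags here are assumptions for illustration, not our actual build layout:

```yaml
# Hypothetical sketch (names/paths are assumptions): build the CI image
# on the fly, then run the native compile inside it, one job per distro.
name: native-compile
on:
  pull_request:
jobs:
  native:
    strategy:
      matrix:
        distro: [debian, rocky]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build CI image from the Dockerfile
        run: docker build -t hadoop-ci-${{ matrix.distro }}
             -f dev-support/docker/Dockerfile_${{ matrix.distro }} .
      - name: Native compile inside the container
        run: docker run --rm -v "$PWD:/src" -w /src
             hadoop-ci-${{ matrix.distro }} mvn -q -Pnative -DskipTests compile
```

One design note: invoking `docker run` directly, rather than using GHA's `container:` key, sidesteps the root-only USER limitation Cheng Pan mentioned, since `docker run --user` can switch to a non-root user.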
>
> Additionally, I'm also investigating whether we can get rid of some native
> code/dependencies in the future, for example:
>
> - In HADOOP-19839, I found that modern JDKs already provide fast enough
> built-in CRC32/CRC32C implementations; do we still need to maintain Hadoop's
> native CRC32/CRC32C in `libhadoop`?
> - In HADOOP-19855, I'm investigating replacing the native zstd C bindings
> with the zstd-jni library, which is the de facto choice for the Zstandard
> compression algorithm in JVM applications.
>
> [1]
> https://docs.github.com/en/actions/reference/workflows-and-actions/dockerfile-support#user
> [2] https://docs.github.com/en/actions/reference/runners/github-hosted-runners
>
> Thanks,
> Cheng Pan
>
>
>
> > On Apr 3, 2026, at 06:28, Aaron Fabbri <[email protected]> wrote:
> >
> > I'd like to put some effort into improving our CI run time and reliability,
> > but I need your help. I don't know how everything works, and there is too
> > much work to do for one person.
> >
> > Join me in an informal "interest group" of folks that are interested in:
> >
> > - Reducing runtime of existing CI / branch tests.
> > - Eliminating flaky tests.
> > - Improving test coverage and tooling.
> >
> > Please reply to this thread if you are interested in helping, or if you
> > have ideas for specific technical issues to address. We can use this JIRA
> > to track related efforts:
> >
> > https://issues.apache.org/jira/browse/HADOOP-19820
> >
> > You can also tag me in the #hadoop channel on ASF Slack:
> > https://the-asf.slack.com/archives/CDSDT7A0H
> >
> > (I'll volunteer to keep this mailing list updated on any interesting
> > discussions there.)
> >
> > Thanks!
> > Aaron <[email protected]>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
