Re: Unit Test Notes

Andrew Purtell Fri, 22 May 2020 13:55:59 -0700

Thank you for sending these detailed notes. I fear I will be duplicating
these efforts on branch-1 when (eventually) preparing for 1.7.0.


On Fri, May 22, 2020 at 1:32 PM Stack <st...@duboce.net> wrote:

> After a bit of work, there are currently no flakies in branch-2.3 and all
> tests passed over the last ten nightlies (a nightly is a comprehensive
> build that runs the full test suite once for jdk8+hadoop2, again for
> jdk8+hadoop3, and again for jdk11+hadoop3). You can see this by looking at
> our flakies dashboard for branch-2.3 [1][2]. Branch-2 is not too far behind
> with one flaky test and a recent nightly test failure [3].
>
> This 'cleanliness' is a little noteworthy, IMO.
>
> Other branches have not had the same focus so their state varies w/
> attention paid.
>
> Attempts were also recently made at speeding up the Jenkins test builds by
> adjusting the maven forkcount, shrinking test resource usage, and using
> maven -T, which controls the level of maven module build/test parallelism
> (HBASE-24150, HBASE-24072, etc.). There was little yield to be had
> here...perhaps a 20% improvement. Complications included: Jenkins build
> slaves allow two executors/builds to run at the same time, so when an hbase
> build runs, it is sharing the machine w/ another (often another hbase
> build); host and docker resource constraints; and our module
> inter-dependencies, which constrain how much parallelism is possible.
>
> As part of the above work in branch-2/branch-2.3, tests were run locally on
> various hardware. It should come as no surprise that the experience varied
> w/ environment (less so as flakies were addressed). On better hardware,
> tests can be made to run more aggressively so they use the whole machine
> and complete faster.
>
> The settings we have as our defaults are configured to suit the Apache
> Jenkins build environment, which is usually 16CPUs/48G. As said above,
> Jenkins slaves allow two builds per machine, so halve these resources when
> an HBase build runs on Apache Infrastructure. So as to be considerate of
> our companion Apache projects, defaults are relatively 'mild': our
> forkcount is set to 0.25 of all the CPUs in the machine. On Apache Jenkins,
> 0.25*16CPUs == 4 CPUs for an hbase build. We also set -T2, which means up
> to two modules building in parallel where possible (each with the above
> configured forkcount). Our test suites on Jenkins continue to take hours.
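
The forkcount arithmetic above can be sketched roughly as below. The helper
name effective_forks is hypothetical and not part of the HBase build, and the
rounding rule (multiplier times CPUs, truncated, never below one) is an
assumption about how a "<multiplier>C" value is resolved, not something
confirmed in these notes.

```shell
#!/bin/sh
# Hypothetical helper, not part of the HBase build: estimate how many forked
# test JVMs a "<multiplier>C" fork count yields for a given CPU count.
# Assumption: forks = multiplier * CPUs, truncated, with a floor of 1.
effective_forks() {
  spec=$1   # fork count spec, e.g. 0.25C
  cpus=$2   # number of CPUs on the machine, e.g. 16
  # Strip the trailing "C" and do the floating-point multiply in awk.
  awk -v m="${spec%C}" -v c="$cpus" 'BEGIN {
    n = int(m * c)
    if (n < 1) n = 1
    print n
  }'
}

effective_forks 0.25C 16   # Apache Jenkins default from the notes
effective_forks 0.50C 40   # the 40CPU linux machine
effective_forks 1.0C 4     # the 4CPU VM
```

With -T2, up to two modules may be under test at once, each with that many
forks, so the peak number of test JVMs can be about double these figures.
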
>
> On a 40CPU linux machine with the below arguments where we use half the
> CPUs in the machine (and ulimit -u 40960), all tests run in just under an
> hour:
>
>   $ x="0.50C" ; nohup mvn -T2 -Dsurefire.firstPartForkCount=$x -Dsurefire.secondPartForkCount=$x test -PrunAllTests
>
> Upping the forkcount on this machine beyond 0.50C tended to bring a rush of
> tests exiting... (To be investigated). On this machine, tests currently
> pass about 80% of the time. To be improved.
>
> On an anemic 4CPU VM, I can run the below and it passes about 60% of the
> time. It takes ~5 hours:
>
>   $ x="1.0C" ; nohup mvn -T2 -Dsurefire.firstPartForkCount=$x -Dsurefire.secondPartForkCount=$x test -PrunAllTests
>
> On a mac w/ 12CPUs, I can run the same command as above. It passes with
> about the same frequency and takes just over 1 1/2 hours.
>
> On my laptop it is less reliable, passing about a third of the time and
> taking about 2 1/2 hours.
>
> If I use fewer resources (a lower forkcount), the tests complete more often
> (but take correspondingly longer).
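
For trying different forkcounts on a given machine, a small wrapper along
these lines may help. It is only a sketch: the function name hbase_test_cmd is
hypothetical, the 0.25C fallback mirrors the default described in the notes,
and it prints the command rather than running it. The flags are the ones from
the commands quoted above.

```shell
#!/bin/sh
# Hypothetical wrapper: print (do not run) the test command from the notes
# for a given fork count, so it can be inspected before use.
hbase_test_cmd() {
  x=${1:-0.25C}   # assumed fallback: the 'mild' default of 0.25 of the CPUs
  printf 'mvn -T2 -Dsurefire.firstPartForkCount=%s -Dsurefire.secondPartForkCount=%s test -PrunAllTests\n' "$x" "$x"
}

hbase_test_cmd 0.50C   # the setting used on the 40CPU machine
hbase_test_cmd         # falls back to 0.25C
```

Starting low and raising the value until tests begin exiting early seems like
a reasonable way to find what a particular box can sustain.
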
>
> Going forward, we will continue to watch branch-2/branch-2.3. Regarding
> speedup, there is a bunch to do. A large win is to be had by making the
> HDFS mini cluster more configurable (lots of resources, such as thread pool
> sizes, are hard coded to values that are large for a small test run) and by
> speeding up startup times.
>
> oao,
> S
>
> 1. https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.3/lastSuccessfulBuild/artifact/dashboard.html
> 2. Unfortunately, the nightly list shows reds though all tests passed
> because of report assemblage issues being addressed by infra:
> https://issues.apache.org/jira/browse/INFRA-20025
> 3. https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2/lastSuccessfulBuild/artifact/dashboard.html
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk
