Re: What is the situation for our UTs now?

Stack Tue, 03 Mar 2020 22:56:24 -0800

Upstream branch-2 and master nightlies don't look too bad currently. There
are a few bad runs where there were a bunch of hangs, which makes things
look bad. I upped the number of tests we show from 5 to 10 on branch-2 and
master, which means a failed test shows longer in the top half of the
flakies page -- and more flakies are listed. On the bottom half, I'd upped
the ferocity with which we run on GCE to draw out flakies. Needless to say,
they fail more often when resources are contended. I might knock the
ferocity down in the next day or so but am trying to land some patches that
cut down on resource usage and want to see how these do in the flaky runs
first.

Master I haven't looked at much... looks like branch-2? Branch-2.2 and
branch-2.1 look sleepy. Similar amounts of flakies in the nightlies. They
don't have the ferocity upped so the lower-half GCE section looks 'better'.
I can make them look like branch-2 and master if folks want (smile) but it's
probably ok letting the flakies lie in branches that are being bypassed.

Generally, I've been working on unit tests with inspiration and help from
Mark Miller and Nick. Our tests are in a poor state. They take so long
that they don't get run anywhere other than up on jenkins. They rarely
pass, and then only by accident when parallelism and jitter are minimal. On
multi-core machines, they use only 1 to 2 cores -- even if the machine has
tens of them.

I have been trying to burn down the flakies, make the tests complete
successfully in less time with more parallelism, using all of the machine,
and make them pass both on jenkins and locally. Of late, I have been focused
on branch-2 since it is calming down getting ready for a 2.3.0RC0. Having
some success but it's a nasty job where it is hard to claim advances
because the flakies vary w/ the context in which the tests are run.
Hopefully we'll turn a corner on jenkins soon for folks to enjoy.

Shout if you need more detail.
S

On Tue, Mar 3, 2020 at 6:00 PM 张铎(Duo Zhang) <[email protected]> wrote:

> But why are branch-2.2 and branch-2.1 still fine?
>
> Sean Busbey <[email protected]> wrote on Wed, Mar 4, 2020 at 9:24 AM:
>
> > I agree in principle that excluding 100s of UTs isn't good. But we don't
> > really have better options given the state of tests and testing hardware
> > currently available to us.
> >
> > On Tue, Mar 3, 2020, 19:14 张铎(Duo Zhang) <[email protected]> wrote:
> >
> > > I think the problem is all UTs are failing randomly...
> > >
> > > And it is also not a good idea to exclude hundreds of UTs in
> > > pre commit?
> > >
> > > Sean Busbey <[email protected]> wrote on Wed, Mar 4, 2020 at 9:11 AM:
> > >
> > > > Everything in the flake list should be skipped at precommit time.
> > > > Is that not happening?
> > > >
> > > > Are we keeping a shorter flake window so things are bouncing in and
> > > > out of the list?
> > > >
> > > > On Tue, Mar 3, 2020, 18:56 张铎(Duo Zhang) <[email protected]> wrote:
> > > >
> > > > > I see that lots of 'flaky tests' related issues have been resolved
> > > > > recently, but it seems the situation is getting worse? For
> > > > > branch-2.2 the flaky page is fine, but for master it is totally a
> > > > > mess...
> > > > >
> > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/branch-2.2/lastSuccessfulBuild/artifact/dashboard.html
> > > > >
> > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/master/lastSuccessfulBuild/artifact/dashboard.html
> > > > >
> > > > > Lots of UTs are in trouble and it makes it really hard to pass the
> > > > > pre commit check which means it is really hard to contribute to
> > > > > the project...
> > > > >
> > > > > We need to fix this soon...
> > > > >
> > > >
> > >
> >
>
