Re: What is the situation for our UTs now?

Stack Wed, 04 Mar 2020 08:42:22 -0800

On Wed, Mar 4, 2020 at 3:34 AM 张铎(Duo Zhang) <[email protected]> wrote:


> Due to the resource limit I do not think it is a good idea to increase the
> forkCount...
>
>
Which fork count are you referring to? After doing the math, the fork count
is about what it always was; it is just that we now size it based on the
machine's CPU count. This should let the config adapt somewhat to the
hardware the tests are being run on.

There is also the -T argument which I tried to up on general builds but it
was causing too many failures so I reverted; on nightlies and patch builds
we are running the default of one maven thread.

I think you might be referring to the -T I added for re-running flakies: mostly
the second panel in these pages
https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2/lastSuccessfulBuild/artifact/dashboard.html.
I set it to 0.5C instead of the default 1 thread. It makes the flakies fail
more often; it highlights what happens under resource contention, i.e. it makes
the flakies fail more reliably. I set it about a week ago. I've been keeping an
eye on it but was working elsewhere on tests. I was hoping to land patches
in the next days that dealt with resource use that I hoped would put a dent
in the current failure lists. Let me dial this down so we disturb flakies
less and they can hide again (and see if this makes a difference to the
master patch builds).
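
For anyone unfamiliar with the notation: a value such as 0.5C is a
multiplier on the machine's core count, where a plain number like 1 is an
absolute count. Below is a minimal sketch in Python of how such a value
could resolve to a thread count. It is not Maven's actual code; the
function name resolve_count is hypothetical, and the rounding rule
(truncate, never below one) is an assumption made for illustration.

```python
import os


def resolve_count(spec, cores=None):
    """Resolve a count spec such as '1' or '0.5C' to a whole number.

    Hypothetical helper for illustration only; not part of Maven.
    """
    if cores is None:
        # Fall back to 1 if the core count cannot be determined.
        cores = os.cpu_count() or 1
    spec = spec.strip()
    if spec.upper().endswith("C"):
        # 'C' suffix: the number is a multiplier on the core count.
        multiplier = float(spec[:-1])
        # Assumption: truncate the product and never go below one.
        return max(1, int(multiplier * cores))
    # No suffix: the number is an absolute count.
    return max(1, int(spec))
```

Under these assumptions, 0.5C on a 16-core machine resolves to 8 threads,
while the default of 1 stays at 1 whatever the hardware. That difference
is why the flaky re-runs see much more contention than the nightlies do.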


> FWIW, can we do this on a feature branch and move master and branch-2 back?
>

Which aspect? Fixing flakies? Or re-running the flaky list aggressively? I
don't think we want the former out on a feature branch. The latter would
require infra duplication hackery of not only the branch nightlies but also a
complementary flaky rerun duplication.


>
> See here
>
> https://github.com/apache/hbase/pull/1221
>
> We tried several times and always got a large number of failed UTs which
> are not related to the patch. And we even excluded hundreds of UTs due to
> the flaky list!
>
>
I've not been tracking master closely. Is anyone? Let me dial down the
ferocity of the flaky re-runs to see if it makes a difference.


> This makes it almost impossible to contribute to the project. Even if we
> get a green result after several tries, hundreds of UTs are excluded, so
> no one knows if the patch breaks something.
>
>
Yeah, this is a problem. Let me pay more attention here. Let me take a look
at master branch. Patch builds were doing pretty well up until recently.
S


> Thanks.
>
> Stack <[email protected]> wrote on Wed, Mar 4, 2020 at 2:55 PM:
>
> > Upstream branch-2 and master nightlies don't look too bad currently.
> > There are a few bad runs where there were a bunch of hangs, which makes
> > things look bad. I upped the number of tests we show from 5 to 10 on
> > branch-2 and master, which makes it so a failed test shows longer in
> > the top half of the flakies page -- and more flakies are listed. On the
> > bottom half, I'd upped the ferocity with which we run on GCE to draw
> > out flakies. Needless to say, they fail more often when resources are
> > contended. I might knock the ferocity down in the next day or so but am
> > trying to land some patches that cut down on resource usage and want to
> > see how these do in the flakie runs first.
> >
> > Master I haven't looked at much... looks like branch-2? Branch-2.2 and
> > branch-2.1 look sleepy. Similar amounts of flakies in the nightlies.
> > They don't have the ferocity upped so the lower-half GCE section looks
> > 'better'. I can make them look like branch-2 and master if folks want
> > (smile) but it's probably ok letting the flakies lie in branches that
> > are being bypassed.
> >
> > Generally, I've been working on unit tests with inspiration and help
> > from Mark Miller and Nick. Our tests are in a poor state. They take so
> > long, they don't get run anywhere else other than up on jenkins. They
> > rarely pass, and only then by accident if there is minimal parallelism
> > and jitter. On multi-core machines, they use 1 to 2 cores only -- even
> > if the machine has tens of them.
> >
> > I have been trying to burn down the flakies, make the tests complete
> > successfully in less time with more parallelism, using all of the
> > machine, and make them pass both on jenkins and locally. Of late, I
> > have been focused on branch-2 since it is calming down getting ready
> > for a 2.3.0RC0. Having some success but it's a nasty job where it is
> > hard to claim advances because the flakies vary w/ the context in which
> > the tests are run. Hopefully we'll turn a corner on jenkins soon for
> > folks to enjoy.
> >
> > Shout if need more detail.
> > S
> >
> >
> > On Tue, Mar 3, 2020 at 6:00 PM 张铎(Duo Zhang) <[email protected]>
> > wrote:
> >
> > > But why are branch-2.2 and branch-2.1 still fine?
> > >
> > > Sean Busbey <[email protected]> wrote on Wed, Mar 4, 2020 at 9:24 AM:
> > >
> > > > I agree in principle that excluding 100s of UTs isn't good. But we
> > > > don't really have better options given the state of tests and
> > > > testing hardware currently available to us.
> > > >
> > > > On Tue, Mar 3, 2020, 19:14 张铎(Duo Zhang) <[email protected]>
> > > > wrote:
> > > >
> > > > > I think the problem is all UTs are failing randomly...
> > > > >
> > > > > And it is also not a good idea to exclude hundreds of UTs in
> > > > > pre commit?
> > > > >
> > > > > Sean Busbey <[email protected]> wrote on Wed, Mar 4, 2020 at 9:11 AM:
> > > > >
> > > > > > Everything in the flake list should be skipped at precommit time.
> > > > > > Is that not happening?
> > > > > >
> > > > > > Are we keeping a shorter flake window so things are bouncing in
> > > > > > and out of the list?
> > > > > >
> > > > > > On Tue, Mar 3, 2020, 18:56 张铎(Duo Zhang) <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > I see recently there are lots of 'flaky tests' related issues
> > > > > > > that have been resolved, but it seems the situation is getting
> > > > > > > worse? For branch-2.2 the flaky page is fine, but for master it
> > > > > > > is totally a mess...
> > > > > > >
> > > > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/branch-2.2/lastSuccessfulBuild/artifact/dashboard.html
> > > > > > >
> > > > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/master/lastSuccessfulBuild/artifact/dashboard.html
> > > > > > >
> > > > > > > Lots of UTs are in trouble and it makes it really hard to pass
> > > > > > > the pre commit check, which means it is really hard to
> > > > > > > contribute to the project...
> > > > > > >
> > > > > > > We need to fix this soon...
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
