Re: What is the situation for our UTs now?

Guanghao Zhang Wed, 04 Mar 2020 19:22:26 -0800

>
> I took a look at the master branch. It's not in the same state as branch-2. Looking
> at nightlies, it seems a bit worse, I see backup tests failing (we don't
> have this in branch-2).
>
The backup UT failures may be related to HBASE-23912. Let me take a look.


Stack <[email protected]> wrote on Thu, Mar 5, 2020 at 8:06 AM:

> On Wed, Mar 4, 2020 at 3:24 PM 张铎(Duo Zhang) <[email protected]>
> wrote:
>
> > OK, let's keep an eye on the flaky list of master and branch-2 till this
> > weekend.
> >
> > If it is in a bad state then let's discuss again.
> >
> >
> Agree.
>
> On the rerunning of flakies, I downed the ferocity. It was NOT 0.5C as I'd
> thought but 1.0C. I made it 0.25C. After the change, I got a blue dot for
> the first time in a long time (not something to celebrate I'd say since I know
> we've not fixed all flakies). Looking at the machine these tests run on,
> it's a 16-core machine with 512G of RAM, so 0.25C is a forkcount of 4. Before I went
> messing it was hardcoded to 3, so close enough. Let me push this change on
> master too.
>
> Would be good to go back to 1.0C at some time so flakies stay in the flakey
> list until fixed but I can work offline first on knocking down the length
> of the flakey list before hoisting us back to 1.0C.
>
> I'll work at downing the length of the flakey lists over the next day. Let's
> see if it helps w/ patch builds.
>
> I took a look at the master branch. It's not in the same state as branch-2. Looking
> at nightlies, it seems a bit worse, I see backup tests failing (we don't
> have this in branch-2).
>
> Thanks,
> S
>
>
> > Stack <[email protected]> wrote on Thu, Mar 5, 2020 at 12:41 AM:
> >
> > > On Wed, Mar 4, 2020 at 3:42 AM 张铎(Duo Zhang) <[email protected]> wrote:
> > >
> > > > And to say a little more on increasing the forkCount: in fact, the test
> > > > category is not too rough. LargeTests just means the test will run a
> > > > bit long; it does not mean it will consume more resources. Maybe the tests
> > > > just have lots of Thread.sleep calls, so we declare them as LargeTests.
> > > >
> > > >
> > > I've done a few passes on test categorization of late. The notion had
> > > rotted pretty bad but should be cleaned up now.
> > >
> > >
> > > > What I can see is that all the replication-related tests are flaky now.
> > > > This is reasonable. In replication tests, we usually have to set up at
> > > > least two mini clusters, and the replication system itself makes use of
> > > > lots of threads. So if you run several replication-related tests together,
> > > > it is easy to overload the machine and cause the UTs to time out or OOM.
> > > >
> > > >
> > > We have at least one test that makes four clusters inside the one JVM.
> > >
> > > Yeah, the resource usage in general needs weeding.
> > >
> > > Perhaps you are arguing that we just leave the state of the tests as it is?
> > > That we let long tests run in series in case two or more might run together
> > > and fail because they are profligate in their resource use?
> > >
> > I mean increasing the fork count will lead to random test results, as the
> > test category cannot describe the resource usage clearly. You can run
> > maybe 20+ lightweight UTs without problem, but if you run 5 tests which
> > set up 4 mini clusters, the resources will be exhausted and cause the tests
> > to fail, or at least make them really slow and then fail...
> >
> > >
> > >
> > >
> > > > So, again, let's do this on a feature branch. It is fine to mess things up
> > > > on a feature branch. You can do everything you want, as the intermediate
> > > > state does not affect others. On master and branch-2 it is another story. I
> > > > do not think this should be a blocker for 2.3.0 or 3.0.0.
> > > >
> > > > See previous note.
> > >
> > > Thanks,
> > > S
> > >
> > >
> > > > Thanks.
> > > >
> > > > 张铎(Duo Zhang) <[email protected]> wrote on Wed, Mar 4, 2020 at 7:34 PM:
> > > >
> > > > > Due to the resource limit I do not think it is a good idea to increase the
> > > > > forkCount...
> > > > >
> > > > > FWIW, can we do this on a feature branch and move master and branch-2 back?
> > > > >
> > > > > See here
> > > > >
> > > > > https://github.com/apache/hbase/pull/1221
> > > > >
> > > > > We tried several times and always got a large number of failed UTs which
> > > > > are not related to the patch. And we even excluded hundreds of UTs due to
> > > > > the flaky list!
> > > > >
> > > > > This makes it almost impossible to contribute to the project. Even if
> > > > > we get a green result after several tries, no one knows whether the patch
> > > > > breaks something, because hundreds of UTs were excluded.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Stack <[email protected]> wrote on Wed, Mar 4, 2020 at 2:55 PM:
> > > > >
> > > > > >> Upstream branch-2 and master nightlies don't look too bad currently. There
> > > > > >> are a few bad runs where there were a bunch of hangs, which makes things
> > > > > >> look bad. I upped the number of tests we show from 5 to 10 on branch-2 and
> > > > > >> master, which makes it so a failed test shows longer in the top half of the
> > > > > >> flakies page -- and more flakies are listed. On the bottom half, I'd upped
> > > > > >> the ferocity with which we run on GCE to draw out flakies. Needless to say,
> > > > > >> they fail more often when resources are contended. I might knock the ferocity
> > > > > >> down in the next day or so but am trying to land some patches that cut down
> > > > > >> on resource usage and want to see how these do in the flakie runs first.
> > > > >>
> > > > > >> Master I haven't looked at much... looks like branch-2? Branch-2.2 and
> > > > > >> branch-2.1 look sleepy. Similar amounts of flakies in the nightlies. They
> > > > > >> don't have the ferocity upped so the lower-half GCE section looks 'better'.
> > > > > >> I can make them look like branch-2 and master if folks want (smile) but it's
> > > > > >> probably ok letting the flakies lie in branches that are being bypassed.
> > > > >>
> > > > > >> Generally, I've been working on unit tests with inspiration and help from
> > > > > >> Mark Miller and Nick. Our tests are in a poor state. They take so long,
> > > > > >> they don't get run anywhere else other than up on jenkins. They rarely pass,
> > > > > >> and only then by accident if there is minimal parallelism and jitter. On
> > > > > >> multi-core machines, they use 1 to 2 cores only -- even if the machine has
> > > > > >> tens of them.
> > > > >>
> > > > > >> I have been trying to burn down the flakies, make the tests complete
> > > > > >> successfully in less time with more parallelism, using all of the machine,
> > > > > >> and make them pass both on jenkins and locally. Of late, I have been focused
> > > > > >> on branch-2 since it is calming down getting ready for a 2.3.0RC0. Having
> > > > > >> some success, but it's a nasty job where it is hard to claim advances
> > > > > >> because the flakies vary w/ the context in which the tests are run.
> > > > > >> Hopefully we'll turn a corner on jenkins soon for folks to enjoy.
> > > > >>
> > > > >> Shout if need more detail.
> > > > >> S
> > > > >>
> > > > >>
> > > > > >> On Tue, Mar 3, 2020 at 6:00 PM 张铎(Duo Zhang) <[email protected]> wrote:
> > > > >>
> > > > > >> > But why are branch-2.2 and branch-2.1 still fine?
> > > > >> >
> > > > > >> > Sean Busbey <[email protected]> wrote on Wed, Mar 4, 2020 at 9:24 AM:
> > > > >> >
> > > > > >> > > I agree in principle that excluding 100s of UTs isn't good. But we don't
> > > > > >> > > really have better options given the state of tests and testing hardware
> > > > > >> > > currently available to us.
> > > > >> > >
> > > > > >> > > On Tue, Mar 3, 2020, 19:14 张铎(Duo Zhang) <[email protected]> wrote:
> > > > >> > >
> > > > >> > > > I think the problem is all UTs are failing randomly...
> > > > >> > > >
> > > > > >> > > > And it is also not a good idea to exclude hundreds of UTs in pre-commit?
> > > > >> > > >
> > > > > >> > > > Sean Busbey <[email protected]> wrote on Wed, Mar 4, 2020 at 9:11 AM:
> > > > >> > > >
> > > > > >> > > > > Everything in the flake list should be skipped at precommit time. Is that
> > > > > >> > > > > not happening?
> > > > > >> > > > >
> > > > > >> > > > > Are we keeping a shorter flake window so things are bouncing in and out of
> > > > > >> > > > > the list?
> > > > >> > > > >
> > > > > >> > > > > On Tue, Mar 3, 2020, 18:56 张铎(Duo Zhang) <[email protected]> wrote:
> > > > >> > > > >
> > > > > >> > > > > > I see that lots of 'flaky tests' related issues have been resolved
> > > > > >> > > > > > recently, but it seems the situation is getting worse? For branch-2.2 the
> > > > > >> > > > > > flaky page is fine, but for master it is totally a mess...
> > > > > >> > > > > >
> > > > > >> > > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/branch-2.2/lastSuccessfulBuild/artifact/dashboard.html
> > > > > >> > > > > >
> > > > > >> > > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/master/lastSuccessfulBuild/artifact/dashboard.html
> > > > > >> > > > > >
> > > > > >> > > > > > Lots of UTs are in trouble, and it makes it really hard to pass the
> > > > > >> > > > > > pre-commit check, which means it is really hard to contribute to the
> > > > > >> > > > > > project...
> > > > > >> > > > > >
> > > > > >> > > > > > We need to fix this soon...
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>
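
A note on the fork-count arithmetic in the quoted thread (0.25C on a 16-core machine giving a forkcount of 4): the sketch below is only an illustration of how a core-relative setting such as "0.25C" could be turned into a number of forked JVMs. The class and method names are hypothetical, and truncating the product with a floor of one fork is an assumption rather than a statement of what the build tooling does.

```java
// Hypothetical helper, not part of HBase or of any build plugin.
public class ForkCountSketch {

    // Converts a fork-count spec such as "0.25C" or "3" into a number of forks.
    // A trailing "C" means "multiply by the number of cores".
    // Assumption: the product is truncated, with a minimum of one fork
    // whenever the multiplier is positive.
    static int effectiveForks(String spec, int cores) {
        String s = spec.trim();
        if (s.endsWith("C")) {
            double multiplier = Double.parseDouble(s.substring(0, s.length() - 1));
            int forks = (int) (multiplier * cores);
            return multiplier > 0 ? Math.max(forks, 1) : 0;
        }
        // No suffix: the value is taken as a plain, hardcoded fork count.
        return Integer.parseInt(s);
    }

    public static void main(String[] args) {
        int cores = 16; // the machine size mentioned in the thread
        for (String spec : new String[] {"1.0C", "0.5C", "0.25C", "3"}) {
            System.out.println(spec + " on " + cores + " cores -> "
                    + effectiveForks(spec, cores) + " forks");
        }
    }
}
```

Under these assumptions, 1.0C on the 16-core box means 16 concurrent test JVMs, while 0.25C means 4, which is close to the previously hardcoded value of 3 mentioned in the thread.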
