> I took a look at master branch. It's not in same state as branch-2. Looking
> at nightlies, it seems a bit worse, I see backup tests failing (we don't
> have this in branch-2).

The backup UT failures may be related to HBASE-23912. Let me take a look.

Stack <[email protected]> wrote on Thu, Mar 5, 2020 at 8:06 AM:

> On Wed, Mar 4, 2020 at 3:24 PM 张铎(Duo Zhang) <[email protected]>
> wrote:
>
> > OK, let's keep an eye on the flaky list of master and branch-2 till this
> > weekend.
> >
> > If it is in a bad state then let's discuss again.
> >
>
> Agree.
>
> On the rerunning of flakies, I downed the ferocity. It was NOT 0.5C as I'd
> thought but 1.0C. I made it 0.25C. After the change, I got a blue dot for
> first time in a long time (not something to celebrate I'd say since I know
> we've not fixed all flakies). Looking at the machine these tests run on,
> it's a 16-core with 512G of RAM so 0.25C is a forkcount of 4. Before I went
> messing it was hardcoded to 3 so close enough. Let me push this change on
> master too.
>
> Would be good to go back to 1.0C at some time so flakies stay in the flaky
> list until fixed but I can work offline first on knocking down the length
> of the flaky list before hoisting us back to 1.0C.
>
> I'll work at downing the length of the flaky lists over the next day. Let's
> see if it helps w/ patch builds.
>
> I took a look at master branch. It's not in same state as branch-2. Looking
> at nightlies, it seems a bit worse, I see backup tests failing (we don't
> have this in branch-2).
>
> Thanks,
> S
>
> > Stack <[email protected]> wrote on Thu, Mar 5, 2020 at 12:41 AM:
> >
> > > On Wed, Mar 4, 2020 at 3:42 AM 张铎(Duo Zhang) <[email protected]>
> > > wrote:
> > >
> > > > And speak a little more on increasing the forkCount. In fact, the
> > > > test category is not too rough. The LargeTests just means the test
> > > > will run a bit long, does not mean it will consume more resources.
> > > > Maybe the tests just have lots of Thread.sleep so we declare it as
> > > > LargeTests.
> > > >
> > >
> > > I've done a few passes on test categorization of late. The notion had
> > > rotted pretty bad but should be cleaned up now.
> > >
> > > > What I can see is that all the replication related tests are flaky
> > > > now. This is reasonable. In replication tests, usually we have to
> > > > set up at least two mini clusters, and the replication system itself
> > > > will make use of lots of threads. So if you run several replication
> > > > related tests together, it will be easy to overload and cause the
> > > > UTs to timeout or OOM.
> > > >
> > >
> > > We have at least one test that makes four clusters inside the one JVM.
> > >
> > > Yeah, the resource usage in general needs weeding.
> > >
> > > Perhaps you are arguing that we just leave the state of tests as they
> > > are? That we let long tests run in series in case two or more might
> > > run together and fail because they are profligate in their resource
> > > use?
> > >
> > I mean increasing the fork count will lead to a random test result as the
> > test category can not describe the resource usage clearly. You can run
> > maybe 20+ lightweight UTs without problem, but if you run 5 tests which
> > set up 4 mini clusters, the resources will be exhausted and cause the
> > tests to fail, or at least make it really slow and fail the tests...
> >
> > > > So, again, let's do this on a feature branch. It is fine to mess
> > > > things up on a feature branch. You can do everything you want as the
> > > > intermediate state does not affect others. On master and branch-2 it
> > > > is another story. I do not think this should be a blocker for 2.3.0
> > > > or 3.0.0.
> > > >
> > > > See previous note.
> > >
> > > Thanks,
> > > S
> > >
> > > > Thanks.
> > > >
> > > > 张铎(Duo Zhang) <[email protected]> wrote on Wed, Mar 4, 2020 at
> > > > 7:34 PM:
> > > >
> > > > > Due to the resource limit I do not think it is a good idea to
> > > > > increase the forkCount...
> > > > >
> > > > > FWIW, can we do this on a feature branch and move master and
> > > > > branch-2 back?
> > > > >
> > > > > See here
> > > > >
> > > > > https://github.com/apache/hbase/pull/1221
> > > > >
> > > > > We tried several times and always got a large number of failed UTs
> > > > > which are not related to the patch. And we even excluded hundreds
> > > > > of UTs due to the flaky list!
> > > > >
> > > > > This makes it almost impossible to contribute to the project. Even
> > > > > if after several tries we get a green result, due to the hundreds
> > > > > of excluded UTs, no one knows if the patch breaks something.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Stack <[email protected]> wrote on Wed, Mar 4, 2020 at 2:55 PM:
> > > > >
> > > > >> Upstream branch-2 and master nightlies don't look too bad
> > > > >> currently. There are a few bad runs where there were a bunch of
> > > > >> hangs which makes things look bad. I upped the number of tests we
> > > > >> show from 5 to 10 on branch-2 and master which makes it so a
> > > > >> failed test shows longer in the top half of the flakies page --
> > > > >> and more flakies are listed. On the bottom half, I'd upped the
> > > > >> ferocity with which we run on GCE to draw out flakies. Needless
> > > > >> to say, they fail more often when resources are contended. I
> > > > >> might knock the ferocity down in the next day or so but am trying
> > > > >> to land some patches that cut down on resource usage and want to
> > > > >> see how these do in the flaky runs first.
> > > > >>
> > > > >> Master I haven't looked at much... looks like branch-2?
> > > > >> Branch-2.2 and branch-2.1 look sleepy. Similar amounts of flakies
> > > > >> in the nightlies. They don't have the ferocity upped so the
> > > > >> lower-half GCE section looks 'better'.
> > > > >> I can make them look like branch-2 and master if folks want
> > > > >> (smile) but it's probably ok letting the flakies lie in branches
> > > > >> that are being bypassed.
> > > > >>
> > > > >> Generally, I've been working on unit tests with inspiration and
> > > > >> help from Mark Miller and Nick. Our tests are in a poor state.
> > > > >> They take so long, they don't get run anywhere else other than up
> > > > >> on jenkins. They rarely pass and only then by accident, with
> > > > >> minimal parallelism and jitter. On multi-core machines, they use
> > > > >> 1 to 2 cores only -- even if the machine has tens of them.
> > > > >>
> > > > >> I have been trying to burn down the flakies, make the tests
> > > > >> complete successfully in less time with more parallelism, using
> > > > >> all of the machine, and make them pass both on jenkins and
> > > > >> locally. Of late, have been focused on branch-2 since it is
> > > > >> calming down getting ready for a 2.3.0RC0. Having some success
> > > > >> but it's a nasty job where it is hard to claim advances because
> > > > >> the flakies vary w/ the context in which the tests are run.
> > > > >> Hopefully we'll turn a corner on jenkins soon for folks to enjoy.
> > > > >>
> > > > >> Shout if need more detail.
> > > > >> S
> > > > >>
> > > > >> On Tue, Mar 3, 2020 at 6:00 PM 张铎(Duo Zhang)
> > > > >> <[email protected]> wrote:
> > > > >>
> > > > >> > But why are branch-2.2 and branch-2.1 still fine?
> > > > >> >
> > > > >> > Sean Busbey <[email protected]> wrote on Wed, Mar 4, 2020 at
> > > > >> > 9:24 AM:
> > > > >> >
> > > > >> > > I agree in principle that excluding 100s of UTs isn't good.
> > > > >> > > But we don't really have better options given the state of
> > > > >> > > tests and testing hardware currently available to us.
> > > > >> > >
> > > > >> > > On Tue, Mar 3, 2020, 19:14 张铎(Duo Zhang)
> > > > >> > > <[email protected]> wrote:
> > > > >> > >
> > > > >> > > > I think the problem is all UTs are failing randomly...
> > > > >> > > >
> > > > >> > > > And it is also not a good idea to exclude hundreds of UTs
> > > > >> > > > in pre commit?
> > > > >> > > >
> > > > >> > > > Sean Busbey <[email protected]> wrote on Wed, Mar 4, 2020
> > > > >> > > > at 9:11 AM:
> > > > >> > > >
> > > > >> > > > > Everything in the flake list should be skipped at
> > > > >> > > > > precommit time. Is that not happening?
> > > > >> > > > >
> > > > >> > > > > Are we keeping a shorter flake window so things are
> > > > >> > > > > bouncing in and out of the list?
> > > > >> > > > >
> > > > >> > > > > On Tue, Mar 3, 2020, 18:56 张铎(Duo Zhang)
> > > > >> > > > > <[email protected]> wrote:
> > > > >> > > > >
> > > > >> > > > > > I see recently there are lots of 'flaky tests' related
> > > > >> > > > > > issues being resolved, but it seems the situation is
> > > > >> > > > > > getting worse? For branch-2.2 the flaky page is fine,
> > > > >> > > > > > but for master it is totally a mess...
> > > > >> > > > > >
> > > > >> > > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/branch-2.2/lastSuccessfulBuild/artifact/dashboard.html
> > > > >> > > > > >
> > > > >> > > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/master/lastSuccessfulBuild/artifact/dashboard.html
> > > > >> > > > > >
> > > > >> > > > > > Lots of UTs are in trouble and it makes it really hard
> > > > >> > > > > > to pass the pre commit check which means it is really
> > > > >> > > > > > hard to contribute to the project...
> > > > >> > > > > >
> > > > >> > > > > > We need to fix this soon...
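
For readers following the fork-count numbers in this thread (1.0C, 0.5C, 0.25C, and the earlier hardcoded 3 on a 16-core machine), the arithmetic can be sketched roughly as below. This is only an illustration, not the build's actual code: the function name `effective_fork_count` is hypothetical, and it assumes that a trailing `C` means "multiply by the number of cores" and that a fractional result is truncated with a floor of one fork. The rounding rule in particular is an assumption and is not stated anywhere in the thread.

```python
def effective_fork_count(setting: str, cores: int) -> int:
    """Resolve a forkCount-style setting such as "0.25C" or "3" to a number of forked JVMs.

    Hypothetical helper for illustration only.
    """
    value = setting.strip()
    if value.upper().endswith("C"):
        # "C" suffix: scale the multiplier by the core count.
        scaled = float(value[:-1]) * cores
        # Assumption: truncate the result, but keep at least one fork
        # when the multiplier is positive.
        return max(int(scaled), 1) if scaled > 0 else 0
    # No suffix: the setting is an absolute number of forks.
    return int(value)


if __name__ == "__main__":
    # The settings mentioned in the thread, on the 16-core test machine.
    for setting in ("1.0C", "0.5C", "0.25C", "3"):
        print(setting, "->", effective_fork_count(setting, cores=16))
```

Under these assumptions, 0.25C on 16 cores gives 4 forks, which matches the figure quoted above and is close to the previous hardcoded value of 3, while 1.0C gives 16 concurrent forks on the same machine.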
