Internally, we also did some work to reduce the flakiness. Here are the main things we've done:
* Using a retry rule to retry in case the ZK client loses its connection; this can happen if the quorum tests are running in an unstable environment and a leader election takes place. (A sketch of this idea is below.)
* Using random ports instead of sequential ones to avoid port races when running tests concurrently.
* Changing tests to avoid using the same test path when creating/deleting nodes.

These greatly reduced the flakiness internally; we should try them if we're seeing similar issues in Jenkins.

Fangmin
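A minimal sketch of such a retry rule, assuming JUnit 4; the class name and attempt count here are illustrative, not the actual internal rule:

    import org.junit.rules.TestRule;
    import org.junit.runner.Description;
    import org.junit.runners.model.Statement;

    // Hypothetical retry rule: re-runs a test a few times before
    // reporting failure, to absorb transient conditions such as the
    // ZK client losing its connection during a leader election.
    public class RetryRule implements TestRule {
        private final int maxAttempts;

        public RetryRule(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        @Override
        public Statement apply(final Statement base, final Description description) {
            return new Statement() {
                @Override
                public void evaluate() throws Throwable {
                    Throwable last = null;
                    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                        try {
                            base.evaluate();
                            return; // test passed
                        } catch (Throwable t) {
                            last = t;
                            System.err.println(description.getDisplayName()
                                    + " failed, attempt " + attempt + "/" + maxAttempts);
                        }
                    }
                    throw last; // retries exhausted; report the last failure
                }
            };
        }
    }

A test class would then opt in with something like "@Rule public RetryRule retry = new RetryRule(3);". This is only worth doing for failure modes that are genuinely environmental, since retrying can also mask real bugs.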
On Sat, Oct 13, 2018 at 10:48 AM Bogdan Kanivets <[email protected]> wrote:

> I looked into the flakiness a couple of months ago (with special attention
> to testManyChildWatchersAutoReset). In my opinion the problem is a) and c).
> Unfortunately I don't have data to back this claim.
>
> I don't remember seeing many 'port binding' exceptions - unless the 'port
> assignment' issue manifested as some other exception.
>
> Before decreasing the number of threads, I think more data should be
> collected/visualized:
>
> 1) The flaky dashboard is great, but we should add another report that maps
> 'error causes' to builds/tests
> 2) The flaky dashboard can be extended to save more history (for example
> like this: https://www.chromium.org/developers/testing/flakiness-dashboard)
> 3) PreCommit builds should be included in the dashboard
> 4) We should have a common clean benchmark. For example: take an
> AWS t3.xlarge instance with a set Linux distro, JVM and ZK commit sha, and
> run the tests (current 8 threads) for 8 hours with a 1 min cooldown.
>
> Due to a recent employment change I got sidetracked, but I really want to
> get to the bottom of this.
> I'm going to set up 4) and report the results to this mailing list. I'm
> also willing to work on the other items.
>
> On Sat, Oct 13, 2018 at 4:59 AM Enrico Olivelli <[email protected]>
> wrote:
>
> > On Fri, Oct 12, 2018 at 23:17, Benjamin Reed <[email protected]> wrote:
> >
> > > i think the unique port assignment (d) is more problematic than it
> > > appears. there is a race between finding a free port and actually
> > > grabbing it. i think that contributes to the flakiness.
> >
> > This is very hard to solve for our test cases, because we need to build
> > the configs before starting the groups of servers.
> > For tests on a single server it will be easier: you just have to start
> > the server on port zero, get the port and then create the client configs.
> > I don't know how much it will be worth.
> >
> > Enrico
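A rough sketch of the port-zero idea Enrico describes, using plain sockets (illustrative only; in a real test the ZooKeeper server itself would be bound this way, which, as noted above, is hard to do for quorum tests where configs must exist before the servers start):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.ServerSocket;

    public class PortZeroSketch {
        public static void main(String[] args) throws IOException {
            // Racy pattern: probe for a free port, close it, and bind
            // later. Another concurrently running test can grab the
            // port inside that window.
            int guessed;
            try (ServerSocket probe = new ServerSocket(0)) {
                guessed = probe.getLocalPort();
            }
            System.out.println("guessed " + guessed + " (may already be stale)");

            // Race-free pattern: bind to port 0, keep the socket, and
            // build the client connect string from the actual port.
            ServerSocket server = new ServerSocket();
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            int actual = server.getLocalPort();
            String connectString = "127.0.0.1:" + actual;
            System.out.println("serving on " + connectString);
            server.close();
        }
    }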
> > > ben
> > >
> > > On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar <[email protected]> wrote:
> > > >
> > > > That is a completely valid point. I started to investigate flakies
> > > > for exactly the same reason, if you remember the thread that I
> > > > started a while ago. It was later abandoned, unfortunately, because
> > > > I've run into a few issues:
> > > >
> > > > - We nailed down that in order to release 3.5 as stable, we have to
> > > > make sure it's not worse than 3.4 by comparing the builds. But these
> > > > builds are not comparable, because the 3.4 tests run single-threaded
> > > > while the 3.5 tests run multithreaded, showing problems which might
> > > > also exist in 3.4.
> > > >
> > > > - Neither of them runs the C++ tests for some reason, but that's not
> > > > really an issue here.
> > > >
> > > > - It looks like the tests on 3.5 are just as solid as on 3.4,
> > > > because running them in a dedicated, single-threaded environment
> > > > shows almost all tests succeeding.
> > > >
> > > > - I think the root cause of the failing unit tests could be one (or
> > > > more) of the following:
> > > >     a) Environmental: the Jenkins slave gets overloaded with other
> > > > builds, and multithreaded test running makes things even worse:
> > > > starved JDK threads and ZK instances (both clients and servers)
> > > > unable to operate.
> > > >     b) Conceptual: ZK unit tests were not designed to run on
> > > > multiple threads. I investigated the unique port assignment feature,
> > > > which is looking good, but there could be other gaps which make them
> > > > unreliable when running simultaneously.
> > > >     c) Bad testing: testing ZK in the wrong way, making bad
> > > > assumptions (e.g. not syncing clients - see the sketch after the
> > > > thread below), etc.
> > > >     d) A bug in the server.
> > > >
> > > > I feel that finding case d) with these tests is super hard, because
> > > > a test report doesn't give any information on what could go wrong
> > > > with ZooKeeper. More or less, guessing is your only option.
> > > >
> > > > Finding c) is a little bit easier; I'm trying to submit patches for
> > > > them and hopefully making some progress.
> > > >
> > > > The huge pain in the arse, though, is a) and b): people desperately
> > > > keep commenting "please retest this" on GitHub to get a green build,
> > > > while testing is going in a direction that hides real problems. I
> > > > mean people have started not to care about a failing build, because
> > > > "it must be some flaky test unrelated to my patch". Which is bad,
> > > > but the shame is it's true in 90% of cases.
> > > >
> > > > I'm just trying to find some ways - besides fixing the c) and d)
> > > > flakies - to get more reliable and more informative Jenkins builds.
> > > > I don't want to make a huge turnaround, but I think if we can get a
> > > > significantly more reliable build for the price of a slightly longer
> > > > build time, running on 4 threads instead of 8, I say let's do it.
> > > >
> > > > As always, any help from the community is more than welcome and
> > > > appreciated.
> > > >
> > > > Thanks,
> > > > Andor
> > > >
> > > >
> > > > > On 2018. Oct 12., at 16:52, Patrick Hunt <[email protected]> wrote:
> > > > >
> > > > > iirc the number of threads was increased to improve performance.
> > > > > Reducing is fine, but do we understand why it's failing? Perhaps
> > > > > it's finding real issues as a result of the artificial
> > > > > concurrency/load.
> > > > >
> > > > > Patrick
> > > > >
> > > > > On Fri, Oct 12, 2018 at 7:12 AM Andor Molnar <[email protected]>
> > > > > wrote:
> > > > >
> > > > >> Thanks for the feedback.
> > > > >> I'm running a few tests now: branch-3.5 on 2 threads and trunk on
> > > > >> 4 threads to see what the impact on the build time is.
> > > > >>
> > > > >> The GitHub PR job is hard to configure, because its settings are
> > > > >> hard-coded into a shell script in the codebase. I'll have to open
> > > > >> a PR for that.
> > > > >>
> > > > >> Andor
> > > > >>
> > > > >>
> > > > >> On Fri, Oct 12, 2018 at 2:46 PM, Norbert Kalmar <
> > > > >> [email protected]> wrote:
> > > > >>
> > > > >>> +1, running the tests locally with 1 thread always passes (well,
> > > > >>> I only ran it about 5 times, but still).
> > > > >>> On the other hand, running them on 8 threads yields similarly
> > > > >>> flaky results to the Apache runs. (Although it is much faster -
> > > > >>> but not if we sometimes have to run 6-8-10 times to get a green
> > > > >>> run...)
> > > > >>>
> > > > >>> Norbert
> > > > >>>
> > > > >>> On Fri, Oct 12, 2018 at 2:05 PM Enrico Olivelli <
> > > > >>> [email protected]> wrote:
> > > > >>>
> > > > >>>> +1
> > > > >>>>
> > > > >>>> Enrico
> > > > >>>>
> > > > >>>> On Fri, Oct 12, 2018 at 13:52, Andor Molnar <[email protected]>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> Hi,
> > > > >>>>>
> > > > >>>>> What do you think of changing the number of threads running
> > > > >>>>> unit tests in Jenkins from the current 8 to 4 or even 2?
> > > > >>>>>
> > > > >>>>> Running the unit tests inside the Cloudera environment on a
> > > > >>>>> single thread shows the builds to be much more stable. That
> > > > >>>>> would probably be too slow, but maybe running at least fewer
> > > > >>>>> threads would improve the situation.
> > > > >>>>>
> > > > >>>>> It's getting very annoying that I cannot get a green build on
> > > > >>>>> GitHub with only a few retests.
> > > > >>>>>
> > > > >>>>> Regards,
> > > > >>>>> Andor
> > > > >>>>>
> > > > >>>> --
> > > > >>>>
> > > > >>>> -- Enrico Olivelli
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > --
> >
> > -- Enrico Olivelli
> >
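On Andor's point c) above, a sketch of the "sync the client before reading" pattern using the standard ZooKeeper client API (the class and method names I wrap it in are illustrative): a read through one client is only guaranteed to observe a write made through another client after the reading client's server has caught up with the leader.

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.AsyncCallback.VoidCallback;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    public class SyncBeforeRead {
        // clientB may be connected to a follower that has not yet seen
        // a write made through clientA; sync() brings that follower up
        // to the leader's current state before we read.
        static byte[] readAfterSync(ZooKeeper clientB, String path)
                throws KeeperException, InterruptedException {
            final CountDownLatch synced = new CountDownLatch(1);
            clientB.sync(path, new VoidCallback() {
                @Override
                public void processResult(int rc, String p, Object ctx) {
                    synced.countDown(); // sync completed (rc ignored in this sketch)
                }
            }, null);
            synced.await();
            // Only now is it safe to assert on data written via another client.
            return clientB.getData(path, false, null);
        }
    }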
