Thank you, guys, this is great help. I remember your efforts, Bogdan; as far as I remember, you observed thread starvation in multiple runs on Apache Jenkins. Correct me if I’m wrong.
I’ve created an umbrella Jira to capture all flaky test fixing efforts here: https://issues.apache.org/jira/browse/ZOOKEEPER-3170
All previous flaky-related tickets have been converted to sub-tasks. Some of them might not be up to date; please consider reviewing them and closing them if possible. Additionally, feel free to create new sub-tasks to capture your actual work.

I’ve already modified the trunk and branch-3.5 builds to run on 4 threads for initial testing. It resulted in slightly more stable tests:

Trunk (Java 8) - failing 1/4 (since #229) - build time increased by 40-45%
Trunk (Java 9) - failing 0/2 (since #993) - ~40%
Trunk (Java 10) - failing 1/2 (since #280) -
branch-3.5 (Java 8) - failing 0/4 (since #1153) - ~35-45%

However, the sample is not big enough yet and the results are inaccurate, so I need more builds. I also need to fix a bug in SSL to get the Java 9/10 builds working on 3.5. Please let me know if I should revert the changes.

The precommit build is still running on 8 threads, but I’d like to change that one too.

Regards,
Andor

> On 2018. Oct 15., at 9:31, Bogdan Kanivets <bkaniv...@gmail.com> wrote:
>
> Fangmin,
>
> Those are good ideas.
>
> FYI, I’ve started running the tests continuously on an AWS m1.xlarge:
> https://github.com/lavacat/zookeeper-tests-lab
>
> So far, I’ve done ~12 runs of trunk. Same common offenders as on the flaky
> dashboard: testManyChildWatchersAutoReset, testPurgeWhenLogRollingInProgress.
> I’ll do some more runs, then try to come up with a report.
>
> I’m using AWS and not the Apache Jenkins environment because of better
> control/observability.
>
> On Sun, Oct 14, 2018 at 4:58 PM Fangmin Lv <lvfang...@gmail.com> wrote:
>
>> Internally, we also did some work to reduce the flakiness. Here are the
>> main things we’ve done:
>>
>> * using a retry rule to retry in case the ZK client loses its connection,
>> which can happen if the quorum tests are running in an unstable
>> environment and a leader election happens
>> * using random ports instead of sequential ones to avoid port races when
>> running tests concurrently
>> * changing tests to avoid using the same test path when creating/deleting
>> nodes
>>
>> These greatly reduced the flakiness internally; we should try them if
>> we’re seeing similar issues on Jenkins.
>>
>> Fangmin
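As an aside, the retry-rule approach Fangmin describes could look roughly like this as a JUnit 4 TestRule. This is only a minimal sketch - the class name and retry count are illustrative, not the actual internal implementation Fangmin refers to:

    import org.junit.rules.TestRule;
    import org.junit.runner.Description;
    import org.junit.runners.model.Statement;

    // Retries a failing test a fixed number of times before reporting
    // failure, for tests that can fail transiently (e.g. when a ZK client
    // loses its connection during a leader election).
    public class RetryRule implements TestRule {
        private final int maxAttempts; // assumed >= 1

        public RetryRule(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        @Override
        public Statement apply(Statement base, Description description) {
            return new Statement() {
                @Override
                public void evaluate() throws Throwable {
                    Throwable last = null;
                    for (int i = 0; i < maxAttempts; i++) {
                        try {
                            base.evaluate(); // run the test body
                            return;          // passed, no retry needed
                        } catch (Throwable t) {
                            last = t;
                            System.err.println(description.getDisplayName()
                                    + ": attempt " + (i + 1) + " failed, retrying");
                        }
                    }
                    throw last; // all attempts failed
                }
            };
        }
    }

A test class would then opt in with something like "@Rule public RetryRule retry = new RetryRule(3);" so that a transient connection loss triggers a rerun instead of a hard failure.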
>> On Sat, Oct 13, 2018 at 10:48 AM Bogdan Kanivets <bkaniv...@gmail.com>
>> wrote:
>>
>>> I looked into the flakiness a couple of months ago (with special
>>> attention to testManyChildWatchersAutoReset). In my opinion the problem
>>> is a) and c). Unfortunately I don’t have data to back this claim.
>>>
>>> I don’t remember seeing many ‘port binding’ exceptions, unless the
>>> ‘port assignment’ issue manifested as some other exception.
>>>
>>> Before decreasing the number of threads, I think more data should be
>>> collected/visualized:
>>>
>>> 1) The flaky dashboard is great, but we should add another report that
>>> maps ‘error causes’ to builds/tests.
>>> 2) The flaky dashboard can be extended to save more history (for example
>>> like this: https://www.chromium.org/developers/testing/flakiness-dashboard).
>>> 3) PreCommit builds should be included in the dashboard.
>>> 4) We should have a common, clean benchmark. For example: take an AWS
>>> t3.xlarge instance with a fixed Linux distro, JVM and ZK commit SHA, and
>>> run the tests (current 8 threads) for 8 hours with a 1-minute cooldown.
>>>
>>> Due to a recent employment change I got sidetracked, but I really want
>>> to get to the bottom of this. I’m going to set up 4) and report results
>>> to this mailing list. I’m also willing to work on other items.
>>>
>>> On Sat, Oct 13, 2018 at 4:59 AM Enrico Olivelli <eolive...@gmail.com>
>>> wrote:
>>>
>>>> On Fri, Oct 12, 2018 at 23:17 Benjamin Reed <br...@apache.org> wrote:
>>>>
>>>>> I think the unique port assignment (d) is more problematic than it
>>>>> appears. There is a race between finding a free port and actually
>>>>> grabbing it. I think that contributes to the flakiness.
>>>>
>>>> This is very hard to solve for our test cases, because we need to
>>>> build the configs before starting the groups of servers.
>>>> For single-server tests it’s easier: you just start the server on port
>>>> zero, get the actual port, and then create the client configs.
>>>> I don’t know how much it would be worth, though.
>>>>
>>>> Enrico
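Enrico’s port-zero idea, in a minimal sketch using a plain java.net.ServerSocket to stand in for the server (illustrative code, not the actual test harness): binding to port 0 makes the OS pick a free port atomically, so there is no window between "find a free port" and "grab it" - the race Ben describes, and the reason behind Fangmin’s random-port change.

    import java.io.IOException;
    import java.net.ServerSocket;

    public class EphemeralPortSketch {
        public static void main(String[] args) throws IOException {
            // Port 0 asks the OS for any free port; the bind itself claims it.
            try (ServerSocket server = new ServerSocket(0)) {
                int port = server.getLocalPort(); // the port actually bound
                // The client config can now be built from the real port.
                System.out.println("connect string: 127.0.0.1:" + port);
            }
        }
    }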
>>>>> ben
>>>>>
>>>>> On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar <an...@apache.org> wrote:
>>>>>>
>>>>>> That is a completely valid point. I started to investigate the
>>>>>> flakies for exactly the same reason, if you remember the thread that
>>>>>> I started a while ago. It was later abandoned, unfortunately, because
>>>>>> I ran into a few issues:
>>>>>>
>>>>>> - We nailed down that in order to release 3.5 as stable, we have to
>>>>>> make sure it’s not worse than 3.4 by comparing the builds. But these
>>>>>> builds are not comparable, because the 3.4 tests run single-threaded
>>>>>> while 3.5 runs multithreaded, showing problems which might also exist
>>>>>> on 3.4.
>>>>>>
>>>>>> - Neither of them runs the C++ tests for some reason, but that’s not
>>>>>> really an issue here.
>>>>>>
>>>>>> - It looks like the tests on 3.5 are just as solid as on 3.4, because
>>>>>> running them in a dedicated, single-threaded environment shows almost
>>>>>> all tests succeeding.
>>>>>>
>>>>>> - I think the root cause of the failing unit tests could be one (or
>>>>>> more) of the following:
>>>>>>     a) Environmental: the Jenkins slave gets overloaded with other
>>>>>> builds, and multithreaded test running makes things even worse:
>>>>>> starving JDK threads, and ZK instances (both clients and servers)
>>>>>> unable to operate.
>>>>>>     b) Conceptual: ZK unit tests were not designed to run on multiple
>>>>>> threads. I investigated the unique port assignment feature, which is
>>>>>> looking good, but there could be other gaps that make the tests
>>>>>> unreliable when running simultaneously.
>>>>>>     c) Bad testing: testing ZK in the wrong way, making bad
>>>>>> assumptions (e.g. not syncing clients), etc.
>>>>>>     d) Bug in the server.
>>>>>>
>>>>>> I feel that finding case d) with these tests is super hard, because a
>>>>>> test report doesn’t give any information on what could have gone
>>>>>> wrong inside ZooKeeper. Guessing is more or less your only option.
>>>>>>
>>>>>> Finding c) is a little bit easier; I’m trying to submit patches for
>>>>>> those and hopefully making some progress.
>>>>>>
>>>>>> The huge pain in the arse, though, is a) and b): people desperately
>>>>>> keep commenting “please retest this” on GitHub to get a green build,
>>>>>> while testing is going in a direction that hides real problems. I
>>>>>> mean, people have started not to care about a failing build, because
>>>>>> “it must be some flaky test unrelated to my patch”. Which is bad, but
>>>>>> the shame is that it’s true in 90% of cases.
>>>>>>
>>>>>> I’m just trying to find some ways - besides fixing the c) and d)
>>>>>> flakies - to get more reliable and more informative Jenkins builds.
>>>>>> I don’t want to make a huge turnaround, but I think if we can get a
>>>>>> significantly more reliable build for the price of a slightly longer
>>>>>> build time, running on 4 threads instead of 8, I say let’s do it.
>>>>>>
>>>>>> As always, any help from the community is more than welcome and
>>>>>> appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>> Andor
>>>>>>
>>>>>>> On 2018. Oct 12., at 16:52, Patrick Hunt <ph...@apache.org> wrote:
>>>>>>>
>>>>>>> IIRC the number of threads was increased to improve performance.
>>>>>>> Reducing it is fine, but do we understand why it’s failing? Perhaps
>>>>>>> it’s finding real issues as a result of the artificial
>>>>>>> concurrency/load.
>>>>>>>
>>>>>>> Patrick
>>>>>>>
>>>>>>> On Fri, Oct 12, 2018 at 7:12 AM Andor Molnar
>>>>>>> <an...@cloudera.com.invalid> wrote:
>>>>>>>
>>>>>>>> Thanks for the feedback.
>>>>>>>> I’m running a few tests now - branch-3.5 on 2 threads and trunk on
>>>>>>>> 4 threads - to see what the impact on the build time is.
>>>>>>>>
>>>>>>>> The GitHub PR job is hard to configure, because its settings are
>>>>>>>> hard-coded in a shell script in the codebase. I’ll have to open a
>>>>>>>> PR for that.
>>>>>>>>
>>>>>>>> Andor
>>>>>>>>
>>>>>>>> On Fri, Oct 12, 2018 at 2:46 PM, Norbert Kalmar <
>>>>>>>> nkal...@cloudera.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> +1, running the tests locally with 1 thread always passes (well, I
>>>>>>>>> only ran it about 5 times, but still).
>>>>>>>>> On the other hand, running them on 8 threads yields similarly
>>>>>>>>> flaky results as the Apache runs. (Although it is much faster -
>>>>>>>>> but not if we sometimes have to run 6-8-10 times to get a green
>>>>>>>>> run...)
>>>>>>>>>
>>>>>>>>> Norbert
>>>>>>>>>
>>>>>>>>> On Fri, Oct 12, 2018 at 2:05 PM Enrico Olivelli
>>>>>>>>> <eolive...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>> Enrico
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 12, 2018 at 13:52 Andor Molnar <an...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> What do you think of changing the number of threads running the
>>>>>>>>>>> unit tests in Jenkins from the current 8 to 4 or even 2?
>>>>>>>>>>>
>>>>>>>>>>> Running the unit tests inside the Cloudera environment on a
>>>>>>>>>>> single thread shows much more stable builds. That would probably
>>>>>>>>>>> be too slow, but maybe running at least fewer threads would
>>>>>>>>>>> improve the situation.
>>>>>>>>>>>
>>>>>>>>>>> It’s getting very annoying that I cannot get a green build on
>>>>>>>>>>> GitHub with only a few retests.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Andor
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -- Enrico Olivelli
>>>>
>>>> --
>>>> -- Enrico Olivelli