Lots of ConnectionLoss and "Address already in use" failures on branch34_java9. They look specific to Jenkins slave H22.

Andor
On Mon, Oct 15, 2018 at 2:50 PM, Andor Molnar <[email protected]> wrote:

+1

On Mon, Oct 15, 2018 at 1:55 PM, Enrico Olivelli <[email protected]> wrote:

On Mon, Oct 15, 2018 at 12:46, Andor Molnar <[email protected]> wrote:
> Thank you guys. This is great help.
>
> I remember your efforts, Bogdan; as far as I remember, you observed thread starvation in multiple runs on Apache Jenkins. Correct me if I'm wrong.
>
> I've created an umbrella Jira to capture all flaky test fixing efforts here:
> https://issues.apache.org/jira/browse/ZOOKEEPER-3170
>
> All previous flaky-related tickets have been converted to sub-tasks. Some of them might not be up to date, so please consider reviewing them and closing them if possible. Additionally, feel free to create new sub-tasks to capture your actual work.
>
> I've already modified the trunk and branch-3.5 builds to run on 4 threads for initial testing. It resulted in slightly more stable tests:

+1

I have assigned the umbrella issue to you, Andor, as you are driving this important task. Is that OK?

Thank you
Enrico

> Trunk (java 8) - failing 1/4 (since #229) - build time increased by 40-45%
> Trunk (java 9) - failing 0/2 (since #993) - ~40%
> Trunk (java 10) - failing 1/2 (since #280) -
> branch-3.5 (java 8) - failing 0/4 (since #1153) - ~35-45%
>
> However, the sample is not big enough yet and the results are inaccurate, so I need more builds. I also need to fix a bug in SSL to get the java9/10 builds working on 3.5.
>
> Please let me know if I should revert the changes. The precommit build is still running on 8 threads, but I'd like to change that one too.
>
> Regards,
> Andor

On 2018. Oct 15., at 9:31, Bogdan Kanivets <[email protected]> wrote:

Fangmin,

Those are good ideas.

FYI, I've started running tests continuously on AWS m1.xlarge:
https://github.com/lavacat/zookeeper-tests-lab

So far, I've done ~12 runs of trunk. Same common offenders as in the flaky dash: testManyChildWatchersAutoReset, testPurgeWhenLogRollingInProgress. I'll do some more runs, then try to come up with a report.

I'm using AWS and not the Apache Jenkins env because of better control/observability.

On Sun, Oct 14, 2018 at 4:58 PM Fangmin Lv <[email protected]> wrote:

Internally, we also did some work to reduce the flakiness. Here are the main things we've done:

* using a retry rule to retry in case the zk client loses its connection, which can happen if quorum tests run in an unstable environment and a leader election happens
* using random ports instead of sequential ones to avoid port races when running tests concurrently
* changing tests to avoid using the same test path when creating/deleting nodes

These greatly reduced the flakiness internally; we should try them if we're seeing similar issues in Jenkins.

Fangmin
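[For readers who want to experiment with the first of Fangmin's suggestions, here is a minimal sketch of a JUnit 4 retry rule. It is an illustration only - the class name, retry count, and exception handling are assumptions, not ZooKeeper's actual internal code:

    // Illustrative JUnit 4 retry rule (names are assumptions): re-run a
    // flaky test body a few times before reporting failure, to absorb
    // transient client connection loss during quorum tests.
    import org.junit.rules.TestRule;
    import org.junit.runner.Description;
    import org.junit.runners.model.Statement;

    public class RetryRule implements TestRule {
        private final int maxAttempts;

        public RetryRule(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        @Override
        public Statement apply(Statement base, Description description) {
            return new Statement() {
                @Override
                public void evaluate() throws Throwable {
                    Throwable last = null;
                    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                        try {
                            base.evaluate(); // run the actual test body
                            return;          // passed, stop retrying
                        } catch (Throwable t) {
                            last = t;        // e.g. a ConnectionLoss failure
                        }
                    }
                    throw last;              // every attempt failed
                }
            };
        }
    }

A test class would then declare "@Rule public RetryRule retry = new RetryRule(3);". The trade-off: a genuine regression also gets three chances to pass, which can mask real bugs.]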
On Sat, Oct 13, 2018 at 10:48 AM Bogdan Kanivets <[email protected]> wrote:

I looked into the flakiness a couple of months ago (with special attention on testManyChildWatchersAutoReset). In my opinion the problem is a) and c). Unfortunately I don't have data to back this claim.

I don't remember seeing many 'port binding' exceptions, unless the 'port assignment' issue manifested as some other exception.

Before decreasing the number of threads, I think more data should be collected/visualized:

1) The flaky dashboard is great, but we should add another report that maps 'error causes' to builds/tests
2) The flaky dash can be extended to save more history (for example like this: https://www.chromium.org/developers/testing/flakiness-dashboard)
3) PreCommit builds should be included in the dashboard
4) We should have a common clean benchmark. For example: take an AWS t3.xlarge instance with a fixed Linux distro, JVM, and zk commit sha, and run the tests (current 8 threads) for 8 hours with a 1-minute cooldown.

Due to a recent employment change I got sidetracked, but I really want to get to the bottom of this. I'm going to set up 4) and report the results to this mailing list. I'm also willing to work on the other items.

On Sat, Oct 13, 2018 at 4:59 AM Enrico Olivelli <[email protected]> wrote:

On Fri, Oct 12, 2018 at 23:17, Benjamin Reed <[email protected]> wrote:
> I think the unique port assignment (d) is more problematic than it appears. There is a race between finding a free port and actually grabbing it. I think that contributes to the flakiness.
>
> ben

This is very hard to solve for our test cases, because we need to build the configs before starting the groups of servers. For tests with a single server it would be easier: you just start the server on port zero, get the port, and then create the client configs. I don't know how much it would be worth.

Enrico
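[To make the race Ben describes concrete, here is a hedged sketch in plain Java - generic sockets, not actual ZooKeeper test code. The first pattern is the racy one; the second is the port-zero approach Enrico outlines for single-server tests:

    import java.net.ServerSocket;

    public class PortAssignmentSketch {

        // Racy pattern: probe for a free port, close the probe socket,
        // and hope the port is still free when the server binds later.
        // Another concurrently running test can grab it in between.
        static int findFreePort() throws Exception {
            try (ServerSocket probe = new ServerSocket(0)) {
                return probe.getLocalPort(); // free now, maybe not in a moment
            }
        }

        // Race-free pattern for a single server: bind to port 0, let the
        // OS pick the port, and build the client connect string from the
        // port actually assigned. No window for another process to win.
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(0)) {
                String connectString = "127.0.0.1:" + server.getLocalPort();
                System.out.println("clients connect to " + connectString);
                // ... start test clients against connectString ...
            }
        }
    }

As Enrico notes, this only helps single-server tests: quorum tests need every port written into each server's config before any server starts, so the gap between probing and binding remains.]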
On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar <[email protected]> wrote:

That is a completely valid point. I started to investigate flakies for exactly the same reason, if you remember the thread that I started a while ago. It was later abandoned, unfortunately, because I ran into a few issues:

- We nailed down that in order to release 3.5 as stable, we have to make sure it's not worse than 3.4, by comparing the builds. But these builds are not comparable, because the 3.4 tests run single-threaded while 3.5 runs multithreaded, showing problems which might also exist on 3.4.

- Neither of them runs the C++ tests for some reason, but that's not really an issue here.

- It looks like the tests on 3.5 are just as solid as on 3.4, because running them in a dedicated, single-threaded environment shows almost all tests succeeding.

- I think the root cause of the failing unit tests could be one (or more) of the following:
  a) Environmental: the Jenkins slave gets overloaded with other builds, and multithreaded test running makes things even worse: JDK threads are starved, and ZK instances (both clients and servers) are unable to operate.
  b) Conceptual: ZK unit tests were not designed to run on multiple threads. I investigated the unique port assignment feature, which is looking good, but there could be other gaps which make them unreliable when running simultaneously.
  c) Bad testing: testing ZK in the wrong way, making bad assumptions (e.g. not syncing clients), etc.
  d) A bug in the server.

I feel that finding case d) with these tests is super hard, because a test report doesn't give any information on what could have gone wrong inside ZooKeeper. More or less, guessing is your only option.

Finding c) is a little easier; I'm trying to submit patches for those and hopefully making some progress.

The huge pain in the arse, though, is a) and b): people desperately keep commenting "please retest this" on GitHub to get a green build, while testing is going in a direction that hides real problems. I mean, people have started not to care about a failing build, because "it must be some flaky unrelated to my patch". Which is bad, but the shame is that it's true in 90% of cases.

I'm just trying to find some ways - besides fixing the c) and d) flakies - to get more reliable and more informative Jenkins builds. I don't want to make a huge turnaround, but if we can get a significantly more reliable build for the price of a slightly longer build time, running on 4 threads instead of 8, I say let's do it.

As always, any help from the community is more than welcome and appreciated.

Thanks,
Andor
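[As a hedged illustration of the "not syncing clients" pitfall Andor lists under c) - the helper below is a sketch, not a real ZooKeeper test utility. Reads are served by whichever server the client is connected to, so a test that writes through one client and immediately asserts through another can read stale data unless it calls sync() first:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.ZooKeeper;

    class SyncBeforeRead {
        // Bring the reader's server up to date with the leader before
        // asserting on data written through a different client.
        static byte[] readAfterSync(ZooKeeper reader, String path)
                throws Exception {
            CountDownLatch synced = new CountDownLatch(1);
            reader.sync(path, (rc, p, ctx) -> synced.countDown(), null);
            synced.await();
            return reader.getData(path, false, null); // now safe to assert on
        }
    }

Skipping this step is exactly the kind of test bug that only surfaces under load, when followers lag the leader by more than the test's implicit timing assumptions.]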
On 2018. Oct 12., at 16:52, Patrick Hunt <[email protected]> wrote:

IIRC the number of threads was increased to improve performance. Reducing it is fine, but do we understand why it's failing? Perhaps it's finding real issues as a result of the artificial concurrency/load.

Patrick

On Fri, Oct 12, 2018 at 7:12 AM Andor Molnar <[email protected]> wrote:

Thanks for the feedback.

I'm running a few tests now - branch-3.5 on 2 threads and trunk on 4 threads - to see the impact on the build time.

The GitHub PR job is hard to configure, because its settings are hard-coded into a shell script in the codebase. I'll have to open a PR for that.

Andor

On Fri, Oct 12, 2018 at 2:46 PM, Norbert Kalmar <[email protected]> wrote:

+1, running the tests locally with 1 thread always passes (well, I ran it about 5 times, but still). On the other hand, running on 8 threads yields similarly flaky results to the Apache runs. (It is much faster, but not if we sometimes have to run it 6-8-10 times to get a green run...)

Norbert

On Fri, Oct 12, 2018 at 2:05 PM Enrico Olivelli <[email protected]> wrote:

+1

Enrico

On Fri, Oct 12, 2018 at 13:52, Andor Molnar <[email protected]> wrote:
> Hi,
>
> What do you think of changing the number of threads running the unit tests in Jenkins from the current 8 to 4, or even 2?
>
> Running the unit tests inside the Cloudera environment on a single thread shows much more stable builds. That would probably be too slow, but maybe running at least fewer threads would improve the situation.
>
> It's getting very annoying that I cannot get a green build on GitHub with only a few retests.
>
> Regards,
> Andor
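[For anyone wanting to reproduce the thread-count comparison locally: the parallelism discussed above is controlled by an Ant property overridable on the command line. The invocation below is a sketch - the property name test.junit.threads is an assumption here and should be verified against the project's build.xml:

    # Assumed property name - verify against build.xml before relying on it.
    ant -Dtest.junit.threads=4 test

The same override is what the Jenkins job definitions change; the GitHub precommit job, as Andor notes above, hard-codes its value in a shell script in the codebase.]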
