Lots of ConnectionLoss and "Address already in use" failures on branch34_java9. They look specific to Jenkins slave H22.

Andor
On Mon, Oct 15, 2018 at 2:50 PM, Andor Molnar <[email protected]> wrote:

+1

On Mon, Oct 15, 2018 at 1:55 PM, Enrico Olivelli <[email protected]> wrote:

On Mon, Oct 15, 2018 at 12:46, Andor Molnar <[email protected]> wrote:
> Thank you guys. This is great help.
>
> I remember your efforts, Bogdan; as far as I remember, you observed thread starvation in multiple runs on Apache Jenkins. Correct me if I'm wrong.
>
> I've created an umbrella Jira to capture all flaky test fixing efforts here:
> https://issues.apache.org/jira/browse/ZOOKEEPER-3170
>
> All previous flaky-related tickets have been converted to sub-tasks. Some of them might not be up to date, so please consider reviewing them and closing them if possible. Additionally, feel free to create new sub-tasks to capture your actual work.
>
> I've already modified the trunk and branch-3.5 builds to run on 4 threads for initial testing. It resulted in slightly more stable tests:

+1

I have assigned the umbrella issue to you, Andor, as you are driving this important task. Is that OK?

Thank you
Enrico

> Trunk (java 8) - failing 1/4 (since #229) - build time increased by 40-45%
> Trunk (java 9) - failing 0/2 (since #993) - ~40%
> Trunk (java 10) - failing 1/2 (since #280) -
> branch-3.5 (java 8) - failing 0/4 (since #1153) - ~35-45%
>
> However, the sample is not big enough yet and the results are inaccurate, so I need more builds. I also need to fix a bug in SSL to get the java9/10 builds working on 3.5.
>
> Please let me know if I should revert the changes. The precommit build is still running on 8 threads, but I'd like to change that one too.
>
> Regards,
> Andor

On 2018. Oct 15., at 9:31, Bogdan Kanivets <[email protected]> wrote:

Fangmin,

Those are good ideas.

FYI, I've started running tests continuously on AWS m1.xlarge:
https://github.com/lavacat/zookeeper-tests-lab

So far, I've done ~12 runs of trunk. Same common offenders as in the flaky dash: testManyChildWatchersAutoReset, testPurgeWhenLogRollingInProgress. I'll do some more runs, then try to come up with a report.

I'm using AWS and not the Apache Jenkins env because of better control/observability.

On Sun, Oct 14, 2018 at 4:58 PM Fangmin Lv <[email protected]> wrote:

Internally, we also did some work to reduce the flakiness. Here are the main things we've done:

* using a retry rule to retry in case the zk client loses its connection, which can happen if quorum tests run in an unstable environment and a leader election happens
* using random ports instead of sequential ones to avoid port races when running tests concurrently
* changing tests to avoid using the same test path when creating/deleting nodes

These greatly reduced the flakiness internally; we should try them if we're seeing similar issues in Jenkins.

Fangmin
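[For readers who want to experiment with the first of Fangmin's suggestions, here is a minimal sketch of a JUnit 4 retry rule. It is an illustration only - the class name, retry count, and exception handling are assumptions, not ZooKeeper's actual internal code:

    // Illustrative JUnit 4 retry rule (names are assumptions): re-run a
    // flaky test body a few times before reporting failure, to absorb
    // transient client connection loss during quorum tests.
    import org.junit.rules.TestRule;
    import org.junit.runner.Description;
    import org.junit.runners.model.Statement;

    public class RetryRule implements TestRule {
        private final int maxAttempts;

        public RetryRule(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        @Override
        public Statement apply(Statement base, Description description) {
            return new Statement() {
                @Override
                public void evaluate() throws Throwable {
                    Throwable last = null;
                    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                        try {
                            base.evaluate(); // run the actual test body
                            return;          // passed, stop retrying
                        } catch (Throwable t) {
                            last = t;        // e.g. a ConnectionLoss failure
                        }
                    }
                    throw last;              // every attempt failed
                }
            };
        }
    }

A test class would then declare "@Rule public RetryRule retry = new RetryRule(3);". The trade-off: a genuine regression also gets three chances to pass, which can mask real bugs.]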
On Sat, Oct 13, 2018 at 10:48 AM Bogdan Kanivets <[email protected]> wrote:

I looked into the flakiness a couple of months ago (with special attention on testManyChildWatchersAutoReset). In my opinion the problem is a) and c). Unfortunately I don't have data to back this claim.

I don't remember seeing many 'port binding' exceptions, unless the 'port assignment' issue manifested as some other exception.

Before decreasing the number of threads, I think more data should be collected/visualized:

1) The flaky dashboard is great, but we should add another report that maps 'error causes' to builds/tests
2) The flaky dash can be extended to save more history (for example like this: https://www.chromium.org/developers/testing/flakiness-dashboard)
3) PreCommit builds should be included in the dashboard
4) We should have a common clean benchmark. For example: take an AWS t3.xlarge instance with a fixed Linux distro, JVM, and zk commit sha, and run the tests (current 8 threads) for 8 hours with a 1-minute cooldown.

Due to a recent employment change I got sidetracked, but I really want to get to the bottom of this. I'm going to set up 4) and report the results to this mailing list. I'm also willing to work on the other items.

On Sat, Oct 13, 2018 at 4:59 AM Enrico Olivelli <[email protected]> wrote:

On Fri, Oct 12, 2018 at 23:17, Benjamin Reed <[email protected]> wrote:
> I think the unique port assignment (d) is more problematic than it appears. There is a race between finding a free port and actually grabbing it. I think that contributes to the flakiness.
>
> ben

This is very hard to solve for our test cases, because we need to build the configs before starting the groups of servers. For tests with a single server it would be easier: you just start the server on port zero, get the port, and then create the client configs. I don't know how much it would be worth.

Enrico
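[To make the race Ben describes concrete, here is a hedged sketch in plain Java - generic sockets, not actual ZooKeeper test code. The first pattern is the racy one; the second is the port-zero approach Enrico outlines for single-server tests:

    import java.net.ServerSocket;

    public class PortAssignmentSketch {

        // Racy pattern: probe for a free port, close the probe socket,
        // and hope the port is still free when the server binds later.
        // Another concurrently running test can grab it in between.
        static int findFreePort() throws Exception {
            try (ServerSocket probe = new ServerSocket(0)) {
                return probe.getLocalPort(); // free now, maybe not in a moment
            }
        }

        // Race-free pattern for a single server: bind to port 0, let the
        // OS pick the port, and build the client connect string from the
        // port actually assigned. No window for another process to win.
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(0)) {
                String connectString = "127.0.0.1:" + server.getLocalPort();
                System.out.println("clients connect to " + connectString);
                // ... start test clients against connectString ...
            }
        }
    }

As Enrico notes, this only helps single-server tests: quorum tests need every port written into each server's config before any server starts, so the gap between probing and binding remains.]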
On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar <[email protected]> wrote:

That is a completely valid point. I started to investigate flakies for exactly the same reason, if you remember the thread that I started a while ago. It was later abandoned, unfortunately, because I ran into a few issues:

- We nailed down that in order to release 3.5 as stable, we have to make sure it's not worse than 3.4, by comparing the builds. But these builds are not comparable, because the 3.4 tests run single-threaded while 3.5 runs multithreaded, showing problems which might also exist on 3.4.

- Neither of them runs the C++ tests for some reason, but that's not really an issue here.

- It looks like the tests on 3.5 are just as solid as on 3.4, because running them in a dedicated, single-threaded environment shows almost all tests succeeding.

- I think the root cause of the failing unit tests could be one (or more) of the following:
  a) Environmental: the Jenkins slave gets overloaded with other builds, and multithreaded test running makes things even worse: JDK threads are starved, and ZK instances (both clients and servers) are unable to operate.
  b) Conceptual: ZK unit tests were not designed to run on multiple threads. I investigated the unique port assignment feature, which is looking good, but there could be other gaps which make them unreliable when running simultaneously.
  c) Bad testing: testing ZK in the wrong way, making bad assumptions (e.g. not syncing clients), etc.
  d) A bug in the server.

I feel that finding case d) with these tests is super hard, because a test report doesn't give any information on what could have gone wrong inside ZooKeeper. More or less, guessing is your only option.

Finding c) is a little easier; I'm trying to submit patches for those and hopefully making some progress.

The huge pain in the arse, though, is a) and b): people desperately keep commenting "please retest this" on GitHub to get a green build, while testing is going in a direction that hides real problems. I mean, people have started not to care about a failing build, because "it must be some flaky unrelated to my patch". Which is bad, but the shame is that it's true in 90% of cases.

I'm just trying to find some ways - besides fixing the c) and d) flakies - to get more reliable and more informative Jenkins builds. I don't want to make a huge turnaround, but if we can get a significantly more reliable build for the price of a slightly longer build time, running on 4 threads instead of 8, I say let's do it.

As always, any help from the community is more than welcome and appreciated.

Thanks,
Andor
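[As a hedged illustration of the "not syncing clients" pitfall Andor lists under c) - the helper below is a sketch, not a real ZooKeeper test utility. Reads are served by whichever server the client is connected to, so a test that writes through one client and immediately asserts through another can read stale data unless it calls sync() first:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.ZooKeeper;

    class SyncBeforeRead {
        // Bring the reader's server up to date with the leader before
        // asserting on data written through a different client.
        static byte[] readAfterSync(ZooKeeper reader, String path)
                throws Exception {
            CountDownLatch synced = new CountDownLatch(1);
            reader.sync(path, (rc, p, ctx) -> synced.countDown(), null);
            synced.await();
            return reader.getData(path, false, null); // now safe to assert on
        }
    }

Skipping this step is exactly the kind of test bug that only surfaces under load, when followers lag the leader by more than the test's implicit timing assumptions.]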
On 2018. Oct 12., at 16:52, Patrick Hunt <[email protected]> wrote:

IIRC the number of threads was increased to improve performance. Reducing it is fine, but do we understand why it's failing? Perhaps it's finding real issues as a result of the artificial concurrency/load.

Patrick

On Fri, Oct 12, 2018 at 7:12 AM Andor Molnar <[email protected]> wrote:

Thanks for the feedback.

I'm running a few tests now - branch-3.5 on 2 threads and trunk on 4 threads - to see the impact on the build time.

The GitHub PR job is hard to configure, because its settings are hard-coded into a shell script in the codebase. I'll have to open a PR for that.

Andor

On Fri, Oct 12, 2018 at 2:46 PM, Norbert Kalmar <[email protected]> wrote:

+1, running the tests locally with 1 thread always passes (well, I ran it about 5 times, but still). On the other hand, running on 8 threads yields similarly flaky results to the Apache runs. (It is much faster, but not if we sometimes have to run it 6-8-10 times to get a green run...)

Norbert

On Fri, Oct 12, 2018 at 2:05 PM Enrico Olivelli <[email protected]> wrote:

+1

Enrico

On Fri, Oct 12, 2018 at 13:52, Andor Molnar <[email protected]> wrote:
> Hi,
>
> What do you think of changing the number of threads running the unit tests in Jenkins from the current 8 to 4, or even 2?
>
> Running the unit tests inside the Cloudera environment on a single thread shows much more stable builds. That would probably be too slow, but maybe running at least fewer threads would improve the situation.
>
> It's getting very annoying that I cannot get a green build on GitHub with only a few retests.
>
> Regards,
> Andor
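[For anyone wanting to reproduce the thread-count comparison locally: the parallelism discussed above is controlled by an Ant property overridable on the command line. The invocation below is a sketch - the property name test.junit.threads is an assumption here and should be verified against the project's build.xml:

    # Assumed property name - verify against build.xml before relying on it.
    ant -Dtest.junit.threads=4 test

The same override is what the Jenkins job definitions change; the GitHub precommit job, as Andor notes above, hard-codes its value in a shell script in the codebase.]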
