Thank you, guys. This is a great help.

I remember your efforts, Bogdan: as far as I remember, you observed thread 
starvation in multiple runs on Apache Jenkins. Correct me if I’m wrong.

I’ve created an umbrella Jira to capture all flaky test fixing efforts here:
https://issues.apache.org/jira/browse/ZOOKEEPER-3170

All previous flaky-related tickets have been converted to sub-tasks. Some of 
them might not be up to date, so please consider reviewing them and closing 
them if possible. Additionally, feel free to create new sub-tasks to capture 
your actual work.

I’ve already modified the trunk and branch-3.5 builds to run on 4 threads as 
an initial test. This resulted in slightly more stable builds:

Trunk (java 8) - failing 1/4 (since #229) - build time increased by 40-45%
Trunk (java 9) - failing 0/2 (since #993) - ~40%
Trunk (java 10) - failing 1/2 (since #280) - 
branch-3.5 (java 8) - failing 0/4 (since #1153) - ~35-45%

However, the sample is not big enough yet and the results are inconclusive, so 
I need more builds. I also need to fix an SSL bug to get the java9/10 builds 
working on 3.5.

Please let me know if I should revert the changes. The precommit build is 
still running on 8 threads, but I’d like to change that one too.

Regards,
Andor
 


> On 2018. Oct 15., at 9:31, Bogdan Kanivets <bkaniv...@gmail.com> wrote:
> 
> Fangmin,
> 
> Those are good ideas.
> 
> FYI, I've started running tests continuously on an AWS m1.xlarge.
> https://github.com/lavacat/zookeeper-tests-lab
> 
> So far, I've done ~12 runs of trunk. Same common offenders as in the flaky
> dashboard: testManyChildWatchersAutoReset and testPurgeWhenLogRollingInProgress.
> I'll do some more runs, then try to come up with a report.
> 
> I'm using AWS and not the Apache Jenkins environment because of the better
> control and observability it gives.
> 
> 
> 
> 
> On Sun, Oct 14, 2018 at 4:58 PM Fangmin Lv <lvfang...@gmail.com> wrote:
> 
>> Internally, we also did some work to reduce the flakiness; here are the main
>> things we've done:
>> 
>> * using a retry rule to retry when the zk client loses its connection, which
>> can happen if the quorum tests are running in an unstable environment and a
>> leader election takes place (see the sketch below)
>> * using random ports instead of sequential ones to avoid port races when
>> running tests concurrently
>> * changing tests to avoid using the same test path when creating/deleting
>> nodes
>> 
>> These greatly reduced the flakiness internally; we should try them if we're
>> seeing similar issues in Jenkins.
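>> 
>> For the retry rule, a rough sketch of the idea (simplified, not our exact
>> internal code; it assumes JUnit 4, and the class name is illustrative):
>> 
>>     import org.junit.rules.TestRule;
>>     import org.junit.runner.Description;
>>     import org.junit.runners.model.Statement;
>> 
>>     // Re-runs a failed test up to maxAttempts times before giving up.
>>     public class RetryRule implements TestRule {
>>         private final int maxAttempts;
>> 
>>         public RetryRule(int maxAttempts) {
>>             this.maxAttempts = maxAttempts;
>>         }
>> 
>>         @Override
>>         public Statement apply(Statement base, Description description) {
>>             return new Statement() {
>>                 @Override
>>                 public void evaluate() throws Throwable {
>>                     Throwable last = null;
>>                     for (int i = 1; i <= maxAttempts; i++) {
>>                         try {
>>                             base.evaluate();  // run the actual test body
>>                             return;           // passed, stop retrying
>>                         } catch (Throwable t) {
>>                             last = t;         // remember the latest failure
>>                         }
>>                     }
>>                     throw last;               // every attempt failed
>>                 }
>>             };
>>         }
>>     }
>> 
>> A test class then opts in with:
>> 
>>     @Rule
>>     public RetryRule retry = new RetryRule(3);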
>> 
>> Fangmin
>> 
>> On Sat, Oct 13, 2018 at 10:48 AM Bogdan Kanivets <bkaniv...@gmail.com>
>> wrote:
>> 
>>> I looked into the flakiness a couple of months ago (with special attention
>>> to testManyChildWatchersAutoReset). In my opinion the problem is a) and c),
>>> though unfortunately I don't have data to back this claim.
>>> 
>>> I don't remember seeing many 'port binding' exceptions, unless the 'port
>>> assignment' issue manifested as some other exception.
>>> 
>>> Before decreasing the number of threads, I think more data should be
>>> collected and visualized:
>>> 
>>> 1) The flaky dashboard is great, but we should add another report that maps
>>> 'error causes' to builds/tests
>>> 2) The flaky dash can be extended to save more history (for example, like
>>> https://www.chromium.org/developers/testing/flakiness-dashboard)
>>> 3) PreCommit builds should be included in the dashboard
>>> 4) We should have a common, clean benchmark. For example: take an
>>> AWS t3.xlarge instance with a fixed Linux distro, JVM, and ZK commit SHA,
>>> and run the tests (current 8 threads) for 8 hours with a 1 min cooldown.
>>> 
>>> Due to a recent employment change I got sidetracked, but I really want to
>>> get to the bottom of this.
>>> I'm going to set up item 4) and report the results to this mailing list. I'm
>>> also willing to work on the other items.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Sat, Oct 13, 2018 at 4:59 AM Enrico Olivelli <eolive...@gmail.com>
>>> wrote:
>>> 
>>>> On Fri, Oct 12, 2018 at 23:17, Benjamin Reed <br...@apache.org> wrote:
>>>> 
>>>>> i think the unique port assignment (d) is more problematic than it
>>>>> appears. there is a race between finding a free port and actually
>>>>> grabbing it. i think that contributes to the flakiness.
>>>>> 
>>>> 
>>>> This is very hard to solve for our test cases, because we need to build the
>>>> configs before starting the groups of servers.
>>>> For single-server tests it will be easier: you just have to start the server
>>>> on port zero, get the assigned port, and then create the client configs.
>>>> I don't know how much it would be worth.
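>>>> 
>>>> To make the race concrete, here is a minimal sketch with plain sockets (the
>>>> same idea should apply to the server connection factory, which can also be
>>>> bound to port 0):
>>>> 
>>>>     import java.net.ServerSocket;
>>>> 
>>>>     // Racy pattern: probe for a free port, close the socket, bind later.
>>>>     // Another process can grab the port between the probe and the bind.
>>>>     int port;
>>>>     try (ServerSocket probe = new ServerSocket(0)) {
>>>>         port = probe.getLocalPort();
>>>>     }
>>>>     // ...later the server binds "port" and may fail with a bind exception.
>>>> 
>>>>     // Race-free pattern: bind port 0 directly and keep the socket open;
>>>>     // the OS assigns a port that is guaranteed to be ours.
>>>>     ServerSocket server = new ServerSocket(0);
>>>>     String connectString = "127.0.0.1:" + server.getLocalPort();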
>>>> 
>>>> Enrico
>>>> 
>>>> 
>>>>> ben
>>>>> On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar <an...@apache.org>
>> wrote:
>>>>>> 
>>>>>> That is a completely valid point. I started to investigate the flaky tests
>>>>>> for exactly the same reason, if you remember the thread that I started a
>>>>>> while ago. Unfortunately it was later abandoned, because I ran into a few
>>>>>> issues:
>>>>>> 
>>>>>> - We established that in order to release 3.5 as stable, we have to make
>>>>>> sure it’s not worse than 3.4 by comparing the builds. But these builds are
>>>>>> not comparable, because the 3.4 tests run single-threaded while the 3.5
>>>>>> tests run multi-threaded, showing problems which might also exist on 3.4.
>>>>>> 
>>>>>> - Neither of them runs the C++ tests for some reason, but that’s not
>>>>>> really an issue here.
>>>>>> 
>>>>>> - It looks like the tests on 3.5 are just as solid as on 3.4, because
>>>>>> running them in a dedicated, single-threaded environment shows almost all
>>>>>> tests succeeding.
>>>>>> 
>>>>>> - I think the root cause of the failing unit tests could be one (or more)
>>>>>> of the following:
>>>>>>        a) Environmental: the Jenkins slave gets overloaded with other
>>>>>> builds, and multi-threaded test running makes things even worse: JDK
>>>>>> threads starve and ZK instances (both clients and servers) are unable to
>>>>>> operate.
>>>>>>        b) Conceptual: ZK unit tests were not designed to run on multiple
>>>>>> threads. I investigated the unique port assignment feature, which is
>>>>>> looking good, but there could be other gaps which make the tests
>>>>>> unreliable when running simultaneously.
>>>>>>        c) Bad testing: testing ZK in the wrong way, making bad assumptions
>>>>>> (e.g. not syncing clients, see the sketch below), etc.
>>>>>>        d) A bug in the server.
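>>>>>> 
>>>>>> On the client-sync example under c): a read may be served by a follower
>>>>>> that lags behind the leader, so a test should sync before asserting. A
>>>>>> minimal sketch (the path and timeout are illustrative):
>>>>>> 
>>>>>>     import java.util.concurrent.CountDownLatch;
>>>>>>     import java.util.concurrent.TimeUnit;
>>>>>> 
>>>>>>     // Force the client's view to catch up before asserting on the znode.
>>>>>>     CountDownLatch synced = new CountDownLatch(1);
>>>>>>     zk.sync("/test-path", (rc, path, ctx) -> synced.countDown(), null);
>>>>>>     Assert.assertTrue("sync timed out",
>>>>>>                       synced.await(15, TimeUnit.SECONDS));
>>>>>>     // Only now is it safe to assert on the node's state.
>>>>>>     byte[] data = zk.getData("/test-path", false, null);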
>>>>>> 
>>>>>> I feel that finding case d) with these tests is super hard, because a
>>>>>> test report doesn’t give any information on what could have gone wrong
>>>>>> inside ZooKeeper. More or less, guessing is your only option.
>>>>>> 
>>>>>> Finding c) is a little bit easier; I’m trying to submit patches for
>>>>>> those cases and hopefully making some progress.
>>>>>> 
>>>>>> The huge pain in the arse, though, is a) and b): people desperately keep
>>>>>> commenting “please retest this” on GitHub to get a green build, while
>>>>>> testing drifts toward hiding real problems. I mean people have started not
>>>>>> to care about a failing build, because “it must be some flaky test
>>>>>> unrelated to my patch”. Which is bad, but the shame is that it’s true in
>>>>>> 90% of cases.
>>>>>> 
>>>>>> I’m just trying to find some ways, besides fixing the c) and d) flakies,
>>>>>> to get more reliable and more informative Jenkins builds. I don’t want to
>>>>>> make a huge turnaround, but I think if we can get a significantly more
>>>>>> reliable build for the price of a slightly longer build time by running
>>>>>> on 4 threads instead of 8, I say let’s do it.
>>>>>> 
>>>>>> As always, any help from the community is more than welcome and
>>>>>> appreciated.
>>>>>> 
>>>>>> Thanks,
>>>>>> Andor
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 2018. Oct 12., at 16:52, Patrick Hunt <ph...@apache.org> wrote:
>>>>>>> 
>>>>>>> IIRC the number of threads was increased to improve performance. Reducing
>>>>>>> it is fine, but do we understand why it's failing? Perhaps it's finding
>>>>>>> real issues as a result of the artificial concurrency/load.
>>>>>>> 
>>>>>>> Patrick
>>>>>>> 
>>>>>>> On Fri, Oct 12, 2018 at 7:12 AM Andor Molnar <an...@cloudera.com.invalid>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Thanks for the feedback.
>>>>>>>> I'm running a few tests now, branch-3.5 on 2 threads and trunk on 4
>>>>>>>> threads, to see what the impact on the build time is.
>>>>>>>> 
>>>>>>>> The GitHub PR job is hard to configure, because its settings are
>>>>>>>> hard-coded into a shell script in the codebase. I'll have to open a PR
>>>>>>>> for that.
>>>>>>>> 
>>>>>>>> Andor
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Oct 12, 2018 at 2:46 PM, Norbert Kalmar <
>>>>>>>> nkal...@cloudera.com.invalid> wrote:
>>>>>>>> 
>>>>>>>>> +1, running the tests locally on 1 thread always passes (well, I've
>>>>>>>>> only run it about 5 times, but still).
>>>>>>>>> On the other hand, running on 8 threads yields similarly flaky results
>>>>>>>>> to the Apache runs. (It is much faster, but not if we sometimes have to
>>>>>>>>> run it 6-8-10 times to get a green run...)
>>>>>>>>> 
>>>>>>>>> Norbert
>>>>>>>>> 
>>>>>>>>> On Fri, Oct 12, 2018 at 2:05 PM Enrico Olivelli <eolive...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> +1
>>>>>>>>>> 
>>>>>>>>>> Enrico
>>>>>>>>>> 
>>>>>>>>>> On Fri, Oct 12, 2018 at 13:52, Andor Molnar <an...@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> What do you think of changing the number of threads running the unit
>>>>>>>>>>> tests in Jenkins from the current 8 to 4 or even 2?
>>>>>>>>>>> 
>>>>>>>>>>> Running the unit tests inside the Cloudera environment on a single
>>>>>>>>>>> thread shows the builds to be much more stable. That would probably
>>>>>>>>>>> be too slow, but running at least fewer threads would improve the
>>>>>>>>>>> situation.
>>>>>>>>>>> 
>>>>>>>>>>> It's getting very annoying that I cannot get a green build on GitHub
>>>>>>>>>>> with only a few retests.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Andor
>>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -- Enrico Olivelli
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> --
>>>> 
>>>> 
>>>> -- Enrico Olivelli
>>>> 
>>> 
>> 
