Re: Decrease number of threads in Jenkins builds to reduce flakiness

2018-10-26 Thread Andor Molnar
Hi all,

I’ve updated a bunch of old tickets under this umbrella to reflect the most 
up-to-date situation:

https://issues.apache.org/jira/browse/ZOOKEEPER-3170 


Tests that are no longer flaky have been closed, and I will create new
tickets according to the latest flaky-test report.

Please let me know here, or reply in Jira, if you have any concerns.

Thanks,
Andor



Re: Decrease number of threads in Jenkins builds to reduce flakiness

2018-10-22 Thread Andor Molnár
Thanks Bogdan, so far so good.

testNodeDataChanged is an old beast; I have a possible fix for that from
@afine:

https://github.com/apache/zookeeper/pull/300

It would be great if we could review it and get rid of this flaky test.


Andor




Re: Decrease number of threads in Jenkins builds to reduce flakiness

2018-10-19 Thread Bogdan Kanivets
I think the argument for keeping concurrency is that it may surface some
unknown problems in the code.

Maybe a middle ground: move the largest offenders into a separate JUnit tag and
run them after the rest of the tests with threads=1. Hopefully this will make
life better for PRs.
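A sketch of how such a split could look (hypothetical names — `@Flaky` and the reflective filter are illustrative only, not ZooKeeper's actual build wiring): a marker annotation on known-flaky tests lets the build partition methods into a flaky batch, to be run afterwards with threads=1, and a stable batch run concurrently.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class FlakyTagSketch {
    // Hypothetical marker for known flaky tests.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Flaky {}

    // Stand-in test class for the sketch.
    static class SomeTests {
        @Flaky
        public void testNodeDataChanged() {}
        public void testNonFlaky() {}
    }

    // Collect test methods with (flaky=true) or without (flaky=false) the tag.
    static List<String> methodsTagged(Class<?> cls, boolean flaky) {
        List<String> out = new ArrayList<>();
        for (Method m : cls.getDeclaredMethods()) {
            if (m.getName().startsWith("test")
                    && m.isAnnotationPresent(Flaky.class) == flaky) {
                out.add(m.getName());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("flaky batch:  " + methodsTagged(SomeTests.class, true));
        System.out.println("stable batch: " + methodsTagged(SomeTests.class, false));
    }
}
```

In a real build, JUnit's own category/tag mechanism would do the partitioning; the point here is only that the split needs nothing more than a marker on the offending tests.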

On the note of the largest offenders, I've done 44 runs on AWS r3.large with
various thread settings (1, 2, 4, 8).
Failure counts:
  1 testNextConfigAlreadyActive
  1 testNonExistingOpCode
  1 testRaceConditionBetweenLeaderAndAckRequestProcessor
  1 testWatcherDisconnectOnClose
  2 testDoubleElection
  5 testCurrentServersAreObserversInNextConfig
  5 testNormalFollowerRunWithDiff
  7 startSingleServerTest
 18 testNodeDataChanged

Haven't seen testPurgeWhenLogRollingInProgress
or testManyChildWatchersAutoReset failing yet.



Re: Decrease number of threads in Jenkins builds to reduce flakiness

2018-10-18 Thread Michael Han
It's a good idea to reduce the concurrency to eliminate flakiness. It looks
like the single-threaded unit tests on trunk are pretty stable:
https://builds.apache.org/job/zookeeper-trunk-single-thread/ (some failures
are due to C tests). The build time is longer, but not too bad for the
pre-commit build; for the nightly build, build time should not be a concern at
all.


Re: Decrease number of threads in Jenkins builds to reduce flakiness

2018-10-18 Thread Andor Molnar
Lots of ConnectionLoss and "Address already in use" failures on
branch34_java9.
They look specific to Jenkins slave H22.

Andor



Re: Decrease number of threads in Jenkins builds to reduce flakiness

2018-10-15 Thread Andor Molnar
+1



On Mon, Oct 15, 2018 at 1:55 PM, Enrico Olivelli 
wrote:

> On Mon, 15 Oct 2018 at 12:46, Andor Molnar wrote:
> >
> > Thank you guys. This is great help.
> >
> > I remember your efforts Bogdan; as far as I remember you observed thread
> > starvation in multiple runs on Apache Jenkins. Correct me if I'm wrong.
> >
> > I’ve created an umbrella Jira to capture all flaky test fixing efforts
> here:
> > https://issues.apache.org/jira/browse/ZOOKEEPER-3170 <
> https://issues.apache.org/jira/browse/ZOOKEEPER-3170>
> >
> > All previous flaky-related tickets have been converted to sub-tasks.
> Some of them might not be up-to-date, please consider reviewing them and
> close if possible. Additionally feel free to create new sub-tasks to
> capture your actual work.
> >
> > I’ve already modified Trunk and branch-3.5 builds to run on 4 threads
> for testing initially. It resulted in slightly more stable tests:
>
> +1
>
> I have assigned the umbrella issue to you, Andor, as you are driving
> this important task. Is that ok?
>
> thank you
>
> Enrico
>
>
> >
> > Trunk (java 8) - failing 1/4 (since #229) - build time increased by
> 40-45%
> > Trunk (java 9) - failing 0/2 (since #993) - ~40%
> > Trunk (java 10) - failing 1/2 (since #280) -
> > branch-3.5 (java 8) - failing 0/4 (since #1153) - ~35-45%
> >
> > However the sample is not big enough and the results are inaccurate, so I
> > need more builds. I also need to fix a bug in SSL to get java9/10 builds
> > working on 3.5.
> >
> > Please let me know if I should revert the changes. Precommit build is
> still running on 8 threads, but I’d like to change that one too.
> >
> > Regards,
> > Andor
> >
> >
> >
> > > On 2018. Oct 15., at 9:31, Bogdan Kanivets 
> wrote:
> > >
> > > Fangmin,
> > >
> > > Those are good ideas.
> > >
> > > FYI, I've started running tests continuously on AWS m1.xlarge:
> > > https://github.com/lavacat/zookeeper-tests-lab
> > >
> > > So far, I've done ~12 runs of trunk. Same common offenders as in the Flaky
> > > dash: testManyChildWatchersAutoReset, testPurgeWhenLogRollingInProgress.
> > > I'll do some more runs, then try to come up with a report.
> > >
> > > I'm using aws and not Apache Jenkins env because of better
> > > control/observability.
> > >
> > >
> > >
> > >
> > > On Sun, Oct 14, 2018 at 4:58 PM Fangmin Lv 
> wrote:
> > >
> > >> Internally, we also did some work to reduce the flakiness; here are the
> > >> main things we've done:
> > >>
> > >> * using a retry rule to retry in case the zk client lost its connection,
> > >> which can happen if the quorum tests are running in an unstable environment
> > >> and a leader election happened
> > >> * using random ports instead of sequential ones to avoid port races when
> > >> running tests concurrently
> > >> * changing tests to avoid using the same test path when creating/deleting
> > >> nodes
> > >>
> > >> These greatly reduced the flakiness internally; we should try them if we're
> > >> seeing similar issues in Jenkins.
> > >>
> > >> Fangmin
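Fangmin's first point above — a retry rule for connection loss — can be sketched roughly as follows. This is a plain-Java stand-in for a JUnit `TestRule` (hypothetical names; `ConnectionLossException` here is a local placeholder for the real ZooKeeper `KeeperException.ConnectionLossException`): only a connection loss triggers a re-run, while any other failure still fails the test immediately.

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    // Placeholder for KeeperException.ConnectionLossException.
    static class ConnectionLossException extends Exception {}

    // Re-run the body up to maxAttempts times, but only when the failure is a
    // connection loss (e.g. a leader election on an unstable quorum); any
    // other exception propagates immediately and fails the test.
    static <T> T withRetry(int maxAttempts, Callable<T> body) throws Exception {
        ConnectionLossException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return body.call();
            } catch (ConnectionLossException e) {
                last = e; // retry
            }
        }
        throw last; // exhausted all attempts
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated test body: loses the connection twice, then succeeds.
        String result = withRetry(3, () -> {
            if (++calls[0] < 3) throw new ConnectionLossException();
            return "ok";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A real JUnit rule would wrap the test `Statement` the same way; the retry condition is the important part — retrying on every failure would just hide genuine bugs.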
> > >>
> > >> On Sat, Oct 13, 2018 at 10:48 AM Bogdan Kanivets  >
> > >> wrote:
> > >>
> > >>> I've looked into flakiness a couple of months ago (special attention on
> > >>> testManyChildWatchersAutoReset). In my opinion the problem is a) and c).
> > >>> Unfortunately I don't have data to back this claim.
> > >>>
> > >>> I don't remember seeing many 'port binding' exceptions, unless the 'port
> > >>> assignment' issue manifested as some other exception.
> > >>>
> > >>> Before decreasing the number of threads I think more data should be
> > >>> collected/visualized:
> > >>>
> > >>> 1) The flaky dashboard is great, but we should add another report that
> > >>> maps 'error causes' to builds/tests
> > >>> 2) The flaky dash can be extended to save more history (for example like
> > >>> this: https://www.chromium.org/developers/testing/flakiness-dashboard)
> > >>> 3) PreCommit builds should be included in the dashboard
> > >>> 4) We should have a common clean benchmark. For example, take an
> > >>> AWS t3.xlarge instance with a set Linux distro, JVM, and zk commit sha,
> > >>> and run the tests (current 8 threads) for 8 hours with a 1 min cooldown.
> > >>>
> > >>> Due to a recent employment change I got sidetracked, but I really want to
> > >>> get to the bottom of this.
> > >>> I'm going to set up 4) and report results to this mailing list. Also
> > >>> willing to work on other items.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Sat, Oct 13, 2018 at 4:59 AM Enrico Olivelli  >
> > >>> wrote:
> > >>>
> >  On Fri, 12 Oct 2018 at 23:17, Benjamin Reed wrote:
> > 
> > > I think the unique port assignment (d) is more problematic than it
> > > appears. There is a race between finding a free port and actually
> > > grabbing it. I think that contributes to the flakiness.
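A minimal illustration of the race Ben describes, assuming plain JDK sockets rather than ZooKeeper's actual port-assignment helper: `new ServerSocket(0)` hands out a port that is free at that moment, but if the socket is closed so the port number can be written into a server config, another process — or a concurrently running test — can grab the port before the server binds it again. Keeping the socket bound (handing over the socket, not just the number) closes the window.

```java
import java.io.IOException;
import java.net.ServerSocket;

public class PortRaceSketch {
    // Racy pattern: pick a free port, close the socket, and hope the port is
    // still free when the server binds it later. The gap between close() and
    // the eventual bind is the race window.
    static int findFreePortRacy() throws IOException {
        try (ServerSocket s = new ServerSocket(0)) {
            return s.getLocalPort(); // free *now*, not necessarily later
        }
    }

    // Safer pattern: keep the socket bound and pass the socket itself to the
    // code that needs the port, so nothing else can claim it in between.
    static ServerSocket reservePort() throws IOException {
        return new ServerSocket(0);
    }

    public static void main(String[] args) throws IOException {
        int racy = findFreePortRacy();
        try (ServerSocket reserved = reservePort()) {
            System.out.println("racy=" + racy
                    + " reserved=" + reserved.getLocalPort());
        }
    }
}
```

As Enrico notes below, the safer pattern is hard to apply when quorum configs listing every port must be written out before any server starts, which is why the race is difficult to eliminate in these tests.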
> > >
> > 
> >  This is very hard to solve for our test cases, because we need to build
> >  the configs before starting the groups of servers.
> >  For 

> 
> > i think the unique port assignment (d) is more problematic than it
> > appears. there is a race between finding a free port and actually
> > grabbing it. i think that contributes to the flakiness.
> >
> 
>  This is very hard to solve for our test cases, because we need to build
>  configs before starting the groups of servers.
>  For tests in single server it will be easier, you just have to start
> >> the
>  server on port zero, get the port and the create client configs.
>  I don't know how much it will be worth
> 
>  Enrico
> 
> 
> > ben
> > On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar 
> >> wrote:
> >>
> >> That is a 

Re: Decrease number of threads in Jenkins builds to reduce flakyness

2018-10-15 Thread Andor Molnar
Thank you guys. This is great help.

I remember your efforts, Bogdan; as far as I recall, you observed thread
starvation in multiple runs on Apache Jenkins. Correct me if I’m wrong.

I’ve created an umbrella Jira to capture all flaky test fixing efforts here:
https://issues.apache.org/jira/browse/ZOOKEEPER-3170 


All previous flaky-related tickets have been converted to sub-tasks. Some of
them might not be up-to-date; please consider reviewing them and closing them
if possible. Additionally, feel free to create new sub-tasks to capture your
actual work.

I’ve already modified Trunk and branch-3.5 builds to run on 4 threads for 
testing initially. It resulted in slightly more stable tests:

Trunk (java 8) - failing 1/4 (since #229) - build time increased by 40-45%
Trunk (java 9) - failing 0/2 (since #993) - ~40%
Trunk (java 10) - failing 1/2 (since #280) - 
branch-3.5 (java 8) - failing 0/4 (since #1153) - ~35-45%

However, the sample is not big enough yet and the results are inaccurate, so I
need more builds. I also need to fix an SSL bug to get the java9/10 builds
working on 3.5.

Please let me know if I should revert the changes. Precommit build is still 
running on 8 threads, but I’d like to change that one too.

Regards,
Andor
 


> On 2018. Oct 15., at 9:31, Bogdan Kanivets  wrote:
> 
> Fangmin,
> 
> Those are good ideas.
> 
> FYI, I've stated running tests continuously in aws m1.xlarge.
> https://github.com/lavacat/zookeeper-tests-lab
> 
> So far, I've done ~ 12 runs of trunk. Same common offenders as in Flaky
> dash: testManyChildWatchersAutoReset, testPurgeWhenLogRollingInProgress
> I'll do some more runs, then try to come up with report.
> 
> I'm using aws and not Apache Jenkins env because of better
> control/observability.
> 
> 
> 
> 
> On Sun, Oct 14, 2018 at 4:58 PM Fangmin Lv  wrote:
> 
>> Internally, we also did some works to reduce the flaky, here are the main
>> things we've done:
>> 
>> * using retry rule to retry in case the zk client lost it's connection,
>> this could happen if the quorum tests is running on unstable environment
>> and the leader election happened.
>> * using random port instead of sequentially to avoid the port racing when
>> running tests concurrently
>> * changing tests to avoid using the same test path when creating/deleting
>> nodes
>> 
>> These greatly reduced the flaky internally, we should try those if we're
>> seeing similar issues in the Jenkins.
>> 
>> Fangmin
>> 
>> On Sat, Oct 13, 2018 at 10:48 AM Bogdan Kanivets 
>> wrote:
>> 
>>> I've looked into flakiness couple months ago (special attention on
>>> testManyChildWatchersAutoReset). In my opinion the problem is a) and c).
>>> Unfortunately I don't have data to back this claim.
>>> 
>>> I don't remember seeing many 'port binding' exceptions. Unless 'port
>>> assignment' issue manifested as some other exception.
>>> 
>>> Before decreasing number of threads I think more data should be
>>> collected/visualized
>>> 
>>> 1) Flaky dashboard is great, but we should add another report that maps
>>> 'error causes' to builds/tests
>>> 2) Flaky dash can be extended to save more history (for example like this
>>> https://www.chromium.org/developers/testing/flakiness-dashboard)
>>> 3) PreCommit builds should be included in dashboard
>>> 4) We should have a common clean benchmark. For example - take
>>> AWS t3.xlarge instance with set linux distro, jvm, zk commit sha and run
>>> tests (current 8 threads) for 8 hours with 1 min cooldown.
>>> 
>>> Due to recent employment change, I got sidetracked, but I really want to
>>> get to the bottom of this.
>>> I'm going to setup 4) and report results to this mailing list. Also
>> willing
>>> to work on other items.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Sat, Oct 13, 2018 at 4:59 AM Enrico Olivelli 
>>> wrote:
>>> 
 Il ven 12 ott 2018, 23:17 Benjamin Reed  ha scritto:
 
> i think the unique port assignment (d) is more problematic than it
> appears. there is a race between finding a free port and actually
> grabbing it. i think that contributes to the flakiness.
> 
 
 This is very hard to solve for our test cases, because we need to build
 configs before starting the groups of servers.
 For tests in single server it will be easier, you just have to start
>> the
 server on port zero, get the port and the create client configs.
 I don't know how much it will be worth
 
 Enrico
 
 
> ben
> On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar 
>> wrote:
>> 
>> That is a completely valid point. I started to investigate flakies
>>> for
> exactly the same reason, if you remember the thread that I started a
 while
> ago. It was later abandoned unfortunately, because I’ve run into a
>> few
> issues:
>> 
>> - We nailed down that in order to release 3.5 stable, we have to
>> make
> sure it’s not worse than 3.4 by comparing the builds: but these
>> 

Re: Decrease number of threads in Jenkins builds to reduce flakyness

2018-10-15 Thread Bogdan Kanivets
Fangmin,

Those are good ideas.

FYI, I've started running tests continuously on an AWS m1.xlarge.
https://github.com/lavacat/zookeeper-tests-lab

So far, I've done ~12 runs of trunk, with the same common offenders as in the
flaky dash: testManyChildWatchersAutoReset, testPurgeWhenLogRollingInProgress.
I'll do some more runs, then try to come up with a report.

I'm using AWS rather than the Apache Jenkins environment because of better
control/observability.




On Sun, Oct 14, 2018 at 4:58 PM Fangmin Lv  wrote:

> Internally, we also did some works to reduce the flaky, here are the main
> things we've done:
>
> * using retry rule to retry in case the zk client lost it's connection,
> this could happen if the quorum tests is running on unstable environment
> and the leader election happened.
> * using random port instead of sequentially to avoid the port racing when
> running tests concurrently
> * changing tests to avoid using the same test path when creating/deleting
> nodes
>
> These greatly reduced the flaky internally, we should try those if we're
> seeing similar issues in the Jenkins.
>
> Fangmin
>
> On Sat, Oct 13, 2018 at 10:48 AM Bogdan Kanivets 
> wrote:
>
> > I've looked into flakiness couple months ago (special attention on
> > testManyChildWatchersAutoReset). In my opinion the problem is a) and c).
> > Unfortunately I don't have data to back this claim.
> >
> > I don't remember seeing many 'port binding' exceptions. Unless 'port
> > assignment' issue manifested as some other exception.
> >
> > Before decreasing number of threads I think more data should be
> > collected/visualized
> >
> > 1) Flaky dashboard is great, but we should add another report that maps
> > 'error causes' to builds/tests
> > 2) Flaky dash can be extended to save more history (for example like this
> > https://www.chromium.org/developers/testing/flakiness-dashboard)
> > 3) PreCommit builds should be included in dashboard
> > 4) We should have a common clean benchmark. For example - take
> > AWS t3.xlarge instance with set linux distro, jvm, zk commit sha and run
> > tests (current 8 threads) for 8 hours with 1 min cooldown.
> >
> > Due to recent employment change, I got sidetracked, but I really want to
> > get to the bottom of this.
> > I'm going to setup 4) and report results to this mailing list. Also
> willing
> > to work on other items.
> >
> >
> >
> >
> >
> >
> > On Sat, Oct 13, 2018 at 4:59 AM Enrico Olivelli 
> > wrote:
> >
> > > Il ven 12 ott 2018, 23:17 Benjamin Reed  ha scritto:
> > >
> > > > i think the unique port assignment (d) is more problematic than it
> > > > appears. there is a race between finding a free port and actually
> > > > grabbing it. i think that contributes to the flakiness.
> > > >
> > >
> > > This is very hard to solve for our test cases, because we need to build
> > > configs before starting the groups of servers.
> > > For tests in single server it will be easier, you just have to start
> the
> > > server on port zero, get the port and the create client configs.
> > > I don't know how much it will be worth
> > >
> > > Enrico
> > >
> > >
> > > > ben
> > > > On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar 
> wrote:
> > > > >
> > > > > That is a completely valid point. I started to investigate flakies
> > for
> > > > exactly the same reason, if you remember the thread that I started a
> > > while
> > > > ago. It was later abandoned unfortunately, because I’ve run into a
> few
> > > > issues:
> > > > >
> > > > > - We nailed down that in order to release 3.5 stable, we have to
> make
> > > > sure it’s not worse than 3.4 by comparing the builds: but these
> builds
> > > are
> > > > not comparable, because 3.4 tests running single threaded while 3.5
> > > > multithreaded showing problems which might also exist on 3.4,
> > > > >
> > > > > - Neither of them running C++ tests for some reason, but that’s not
> > > > really an issue here,
> > > > >
> > > > > - Looks like tests on 3.5 is just as solid as on 3.4, because
> running
> > > > them on a dedicated, single threaded environment show almost all
> tests
> > > > succeeding,
> > > > >
> > > > > - I think the root cause of failing unit tests could be one (or
> more)
> > > of
> > > > the following:
> > > > > a) Environmental: Jenkins slave gets overloaded with other
> > > > builds and multithreaded test running makes things even worse:
> starving
> > > JDK
> > > > threads and ZK instances (both clients and servers) are unable to
> > operate
> > > > > b) Conceptional: ZK unit tests were not designed to run on
> > > > multiple threads: I investigated the unique port assignment feature
> > which
> > > > is looking good, but there could be other possible gaps which makes
> > them
> > > > unreliable when running simultaneously.
> > > > > c) Bad testing: testing ZK in the wrong way, making bad
> > > > assumption (e.g. not syncing clients), etc.
> > > > > d) Bug in the server.
> > > > >
> > > > > I feel that finding case d) with these tests is super hard,
> 

Re: Decrease number of threads in Jenkins builds to reduce flakyness

2018-10-14 Thread Fangmin Lv
Internally, we also did some work to reduce the flakiness; here are the main
things we've done:

* using a retry rule to retry in case the zk client lost its connection,
which can happen if the quorum tests are running in an unstable environment
and a leader election happened.
* using random ports instead of sequential ones to avoid port races when
running tests concurrently
* changing tests to avoid using the same test path when creating/deleting
nodes

These changes greatly reduced the flakiness internally; we should try them if
we're seeing similar issues in Jenkins.
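The retry idea above, reduced to a minimal self-contained sketch (hypothetical names, not the internal implementation; in JUnit 4 the same loop would live inside a TestRule's Statement so every @Test method is retried transparently):

```java
// Sketch of the retry idea: hypothetical names, not the internal
// implementation mentioned above. In JUnit 4 this loop would sit inside
// a TestRule's Statement so each @Test method is retried transparently.
public class RetryDemo {

    interface FlakyBody { void run() throws Exception; }

    // Re-runs the body up to maxAttempts times; only the last failure
    // propagates, mirroring what a retry rule does per test.
    static void retry(int maxAttempts, FlakyBody body) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                body.run();
                return;
            } catch (Exception e) {
                last = e;  // e.g. a transient connection loss during re-election
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice with a simulated connection loss, then succeeds.
        retry(3, () -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient connection loss");
            }
        });
        System.out.println("succeeded on attempt " + calls[0]);  // prints "succeeded on attempt 3"
    }
}
```

The point of retrying only on connection-type failures (rather than all assertion failures) is that it masks environment noise without hiding genuine regressions.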

Fangmin

On Sat, Oct 13, 2018 at 10:48 AM Bogdan Kanivets 
wrote:

> I've looked into flakiness couple months ago (special attention on
> testManyChildWatchersAutoReset). In my opinion the problem is a) and c).
> Unfortunately I don't have data to back this claim.
>
> I don't remember seeing many 'port binding' exceptions. Unless 'port
> assignment' issue manifested as some other exception.
>
> Before decreasing number of threads I think more data should be
> collected/visualized
>
> 1) Flaky dashboard is great, but we should add another report that maps
> 'error causes' to builds/tests
> 2) Flaky dash can be extended to save more history (for example like this
> https://www.chromium.org/developers/testing/flakiness-dashboard)
> 3) PreCommit builds should be included in dashboard
> 4) We should have a common clean benchmark. For example - take
> AWS t3.xlarge instance with set linux distro, jvm, zk commit sha and run
> tests (current 8 threads) for 8 hours with 1 min cooldown.
>
> Due to recent employment change, I got sidetracked, but I really want to
> get to the bottom of this.
> I'm going to setup 4) and report results to this mailing list. Also willing
> to work on other items.
>
>
>
>
>
>
> On Sat, Oct 13, 2018 at 4:59 AM Enrico Olivelli 
> wrote:
>
> > Il ven 12 ott 2018, 23:17 Benjamin Reed  ha scritto:
> >
> > > i think the unique port assignment (d) is more problematic than it
> > > appears. there is a race between finding a free port and actually
> > > grabbing it. i think that contributes to the flakiness.
> > >
> >
> > This is very hard to solve for our test cases, because we need to build
> > configs before starting the groups of servers.
> > For tests in single server it will be easier, you just have to start the
> > server on port zero, get the port and the create client configs.
> > I don't know how much it will be worth
> >
> > Enrico
> >
> >
> > > ben
> > > On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar  wrote:
> > > >
> > > > That is a completely valid point. I started to investigate flakies
> for
> > > exactly the same reason, if you remember the thread that I started a
> > while
> > > ago. It was later abandoned unfortunately, because I’ve run into a few
> > > issues:
> > > >
> > > > - We nailed down that in order to release 3.5 stable, we have to make
> > > sure it’s not worse than 3.4 by comparing the builds: but these builds
> > are
> > > not comparable, because 3.4 tests running single threaded while 3.5
> > > multithreaded showing problems which might also exist on 3.4,
> > > >
> > > > - Neither of them running C++ tests for some reason, but that’s not
> > > really an issue here,
> > > >
> > > > - Looks like tests on 3.5 is just as solid as on 3.4, because running
> > > them on a dedicated, single threaded environment show almost all tests
> > > succeeding,
> > > >
> > > > - I think the root cause of failing unit tests could be one (or more)
> > of
> > > the following:
> > > > a) Environmental: Jenkins slave gets overloaded with other
> > > builds and multithreaded test running makes things even worse: starving
> > JDK
> > > threads and ZK instances (both clients and servers) are unable to
> operate
> > > > b) Conceptional: ZK unit tests were not designed to run on
> > > multiple threads: I investigated the unique port assignment feature
> which
> > > is looking good, but there could be other possible gaps which makes
> them
> > > unreliable when running simultaneously.
> > > > c) Bad testing: testing ZK in the wrong way, making bad
> > > assumption (e.g. not syncing clients), etc.
> > > > d) Bug in the server.
> > > >
> > > > I feel that finding case d) with these tests is super hard, because a
> > > test report doesn’t give any information on what could go wrong with
> > > ZooKeeper. More or less guessing is your only option.
> > > >
> > > > Finding c) is a little bit easier, I’m trying to submit patches on
> them
> > > and hopefully making some progress.
> > > >
> > > > The huge pain in the arse though are a) and b): people desperately
> keep
> > > commenting “please retest this” on github to get a green build while
> > > testing is going in a direction to hide real problems: I mean people
> > > started not to care about a failing build, because “it must be some
> flaky
> > > unrelated to my patch”. Which is bad, but the shame is it’s true 90%
> > > percent of cases.
> > > >
> > > > I’m 

Re: Decrease number of threads in Jenkins builds to reduce flakyness

2018-10-13 Thread Bogdan Kanivets
I looked into flakiness a couple of months ago (with special attention to
testManyChildWatchersAutoReset). In my opinion the problem is a) and c),
though unfortunately I don't have data to back this claim.

I don't remember seeing many 'port binding' exceptions, unless the 'port
assignment' issue manifested as some other exception.

Before decreasing the number of threads, I think more data should be
collected/visualized:

1) The flaky dashboard is great, but we should add another report that maps
'error causes' to builds/tests
2) The flaky dash can be extended to save more history (for example like this:
https://www.chromium.org/developers/testing/flakiness-dashboard)
3) PreCommit builds should be included in the dashboard
4) We should have a common clean benchmark. For example: take an
AWS t3.xlarge instance with a fixed Linux distro, JVM, and ZK commit SHA, and
run the tests (current 8 threads) for 8 hours with a 1 min cooldown.

Due to a recent employment change I got sidetracked, but I really want to
get to the bottom of this.
I'm going to set up 4) and report results to this mailing list. I'm also
willing to work on other items.
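A minimal sketch of the soak loop in 4), under the assumption that the real run would shell out to `ant test` via ProcessBuilder (the suite is injected here as a callback so the loop itself stays self-contained and testable):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the benchmark loop in 4): repeatedly run the
// test suite for a fixed wall-clock budget with a cooldown in between.
// A real run would launch "ant test" via ProcessBuilder; the suite is
// injected here so the loop stays self-contained.
public class SoakHarness {

    interface TestRun { boolean run() throws Exception; }  // true = green run

    // Runs the suite until the budget is exhausted, sleeping `cooldown`
    // between runs, and returns the number of failed (flaky) runs.
    static int soak(Duration budget, Duration cooldown, TestRun suite)
            throws Exception {
        Instant deadline = Instant.now().plus(budget);
        int failures = 0;
        while (Instant.now().isBefore(deadline)) {
            if (!suite.run()) {
                failures++;  // record the flaky run, keep soaking
            }
            Thread.sleep(cooldown.toMillis());
        }
        return failures;
    }

    public static void main(String[] args) throws Exception {
        // The proposal is an 8h budget with a 1 min cooldown; shortened
        // here so the demo terminates quickly.
        int failures = soak(Duration.ofMillis(50), Duration.ofMillis(10),
                () -> true);
        System.out.println("failed runs: " + failures);  // prints "failed runs: 0"
    }
}
```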






On Sat, Oct 13, 2018 at 4:59 AM Enrico Olivelli  wrote:

> Il ven 12 ott 2018, 23:17 Benjamin Reed  ha scritto:
>
> > i think the unique port assignment (d) is more problematic than it
> > appears. there is a race between finding a free port and actually
> > grabbing it. i think that contributes to the flakiness.
> >
>
> This is very hard to solve for our test cases, because we need to build
> configs before starting the groups of servers.
> For tests in single server it will be easier, you just have to start the
> server on port zero, get the port and the create client configs.
> I don't know how much it will be worth
>
> Enrico
>
>
> > ben
> > On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar  wrote:
> > >
> > > That is a completely valid point. I started to investigate flakies for
> > exactly the same reason, if you remember the thread that I started a
> while
> > ago. It was later abandoned unfortunately, because I’ve run into a few
> > issues:
> > >
> > > - We nailed down that in order to release 3.5 stable, we have to make
> > sure it’s not worse than 3.4 by comparing the builds: but these builds
> are
> > not comparable, because 3.4 tests running single threaded while 3.5
> > multithreaded showing problems which might also exist on 3.4,
> > >
> > > - Neither of them running C++ tests for some reason, but that’s not
> > really an issue here,
> > >
> > > - Looks like tests on 3.5 is just as solid as on 3.4, because running
> > them on a dedicated, single threaded environment show almost all tests
> > succeeding,
> > >
> > > - I think the root cause of failing unit tests could be one (or more)
> of
> > the following:
> > > a) Environmental: Jenkins slave gets overloaded with other
> > builds and multithreaded test running makes things even worse: starving
> JDK
> > threads and ZK instances (both clients and servers) are unable to operate
> > > b) Conceptional: ZK unit tests were not designed to run on
> > multiple threads: I investigated the unique port assignment feature which
> > is looking good, but there could be other possible gaps which makes them
> > unreliable when running simultaneously.
> > > c) Bad testing: testing ZK in the wrong way, making bad
> > assumption (e.g. not syncing clients), etc.
> > > d) Bug in the server.
> > >
> > > I feel that finding case d) with these tests is super hard, because a
> > test report doesn’t give any information on what could go wrong with
> > ZooKeeper. More or less guessing is your only option.
> > >
> > > Finding c) is a little bit easier, I’m trying to submit patches on them
> > and hopefully making some progress.
> > >
> > > The huge pain in the arse though are a) and b): people desperately keep
> > commenting “please retest this” on github to get a green build while
> > testing is going in a direction to hide real problems: I mean people
> > started not to care about a failing build, because “it must be some flaky
> > unrelated to my patch”. Which is bad, but the shame is it’s true 90%
> > percent of cases.
> > >
> > > I’m just trying to find some ways - besides fixing c) and d) flakies -
> > to get more reliable and more informative Jenkins builds. Don’t want to
> > make a huge turnaround, but I think if we can get a significantly more
> > reliable build for the price of slightly longer build time running on 4
> > threads instead of 8, I say let’s do it.
> > >
> > > As always, any help from the community is more than welcome and
> > appreciated.
> > >
> > > Thanks,
> > > Andor
> > >
> > >
> > >
> > >
> > > > On 2018. Oct 12., at 16:52, Patrick Hunt  wrote:
> > > >
> > > > iirc the number of threads was increased to improve performance.
> > Reducing
> > > > is fine, but do we understand why it's failing? Perhaps it's finding
> > real
> > > > issues as a result of the artificial concurrency/load.
> > > >
> > > > Patrick
> > > >
> > > > On Fri, Oct 

Re: Decrease number of threads in Jenkins builds to reduce flakyness

2018-10-13 Thread Enrico Olivelli
Il ven 12 ott 2018, 23:17 Benjamin Reed  ha scritto:

> i think the unique port assignment (d) is more problematic than it
> appears. there is a race between finding a free port and actually
> grabbing it. i think that contributes to the flakiness.
>

This is very hard to solve for our test cases, because we need to build
configs before starting the groups of servers.
For single-server tests it will be easier: you just have to start the
server on port zero, get the port, and then create the client configs.
I don't know how much it would be worth.
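The port-zero approach can be sketched like this (hypothetical names; a real test would bind the ZooKeeper server's connection factory to port 0 rather than a plain ServerSocket):

```java
import java.io.IOException;
import java.net.ServerSocket;

// Sketch of the port-zero approach for single-server tests.
// Hypothetical names: a real test would bind a ZooKeeper server's
// connection factory to port 0, not a plain ServerSocket.
public class PortZeroDemo {

    // Bind to port 0: the kernel picks a free port and holds it for us,
    // so there is no find-then-grab window at all.
    static ServerSocket startServer() throws IOException {
        return new ServerSocket(0);
    }

    static String clientConnectString(ServerSocket server) {
        // Build the client config *after* the server is up, from the
        // port the kernel actually assigned.
        return "127.0.0.1:" + server.getLocalPort();
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = startServer()) {
            System.out.println("connect to " + clientConnectString(server));
        }
    }
}
```

This only works when the client config can be built after the server starts, which is why it fits single-server tests but not quorum tests, where each server's config must list the other servers' ports up front.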

Enrico


> ben
> On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar  wrote:
> >
> > That is a completely valid point. I started to investigate flakies for
> exactly the same reason, if you remember the thread that I started a while
> ago. It was later abandoned unfortunately, because I’ve run into a few
> issues:
> >
> > - We nailed down that in order to release 3.5 stable, we have to make
> sure it’s not worse than 3.4 by comparing the builds: but these builds are
> not comparable, because 3.4 tests running single threaded while 3.5
> multithreaded showing problems which might also exist on 3.4,
> >
> > - Neither of them running C++ tests for some reason, but that’s not
> really an issue here,
> >
> > - Looks like tests on 3.5 is just as solid as on 3.4, because running
> them on a dedicated, single threaded environment show almost all tests
> succeeding,
> >
> > - I think the root cause of failing unit tests could be one (or more) of
> the following:
> > a) Environmental: Jenkins slave gets overloaded with other
> builds and multithreaded test running makes things even worse: starving JDK
> threads and ZK instances (both clients and servers) are unable to operate
> > b) Conceptional: ZK unit tests were not designed to run on
> multiple threads: I investigated the unique port assignment feature which
> is looking good, but there could be other possible gaps which makes them
> unreliable when running simultaneously.
> > c) Bad testing: testing ZK in the wrong way, making bad
> assumption (e.g. not syncing clients), etc.
> > d) Bug in the server.
> >
> > I feel that finding case d) with these tests is super hard, because a
> test report doesn’t give any information on what could go wrong with
> ZooKeeper. More or less guessing is your only option.
> >
> > Finding c) is a little bit easier, I’m trying to submit patches on them
> and hopefully making some progress.
> >
> > The huge pain in the arse though are a) and b): people desperately keep
> commenting “please retest this” on github to get a green build while
> testing is going in a direction to hide real problems: I mean people
> started not to care about a failing build, because “it must be some flaky
> unrelated to my patch”. Which is bad, but the shame is it’s true 90%
> percent of cases.
> >
> > I’m just trying to find some ways - besides fixing c) and d) flakies -
> to get more reliable and more informative Jenkins builds. Don’t want to
> make a huge turnaround, but I think if we can get a significantly more
> reliable build for the price of slightly longer build time running on 4
> threads instead of 8, I say let’s do it.
> >
> > As always, any help from the community is more than welcome and
> appreciated.
> >
> > Thanks,
> > Andor
> >
> >
> >
> >
> > > On 2018. Oct 12., at 16:52, Patrick Hunt  wrote:
> > >
> > > iirc the number of threads was increased to improve performance.
> Reducing
> > > is fine, but do we understand why it's failing? Perhaps it's finding
> real
> > > issues as a result of the artificial concurrency/load.
> > >
> > > Patrick
> > >
> > > On Fri, Oct 12, 2018 at 7:12 AM Andor Molnar
> 
> > > wrote:
> > >
> > >> Thanks for the feedback.
> > >> I'm running a few tests now: branch-3.5 on 2 threads and trunk on 4
> threads
> > >> to see what's the impact on the build time.
> > >>
> > >> Github PR job is hard to configure, because its settings are hard
> coded
> > >> into a shell script in the codebase. I have to open PR for that.
> > >>
> > >> Andor
> > >>
> > >>
> > >>
> > >> On Fri, Oct 12, 2018 at 2:46 PM, Norbert Kalmar <
> > >> nkal...@cloudera.com.invalid> wrote:
> > >>
> > >>> +1, running the tests locally with 1 thread always passes (well, I
> run it
> > >>> about 5 times, but still)
> > >>> On the other hand, running it on 8 threads yields similarly flaky
> results
> > >>> as Apache runs. (Although it is much faster, but if we have to run
> 6-8-10
> > >>> times sometimes to get a green run...)
> > >>>
> > >>> Norbert
> > >>>
> > >>> On Fri, Oct 12, 2018 at 2:05 PM Enrico Olivelli  >
> > >>> wrote:
> > >>>
> >  +1
> > 
> >  Enrico
> > 
> >  Il ven 12 ott 2018, 13:52 Andor Molnar  ha
> scritto:
> > 
> > > Hi,
> > >
> > > What do you think of changing number of threads running unit tests
> in
> > > Jenkins from current 8 to 4 or even 2?
> > >
> > > Running unit tests inside Cloudera environment on a single thread

Re: Decrease number of threads in Jenkins builds to reduce flakyness

2018-10-12 Thread Benjamin Reed
i think the unique port assignment (b) is more problematic than it
appears. there is a race between finding a free port and actually
grabbing it. i think that contributes to the flakiness.
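The find-then-grab race can be sketched like this (hypothetical helper names, not ZooKeeper's actual port-assignment code): a probe socket finds a free port, but nothing reserves it between the probe closing and the real bind.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Illustrates the find-then-grab race: hypothetical names, not
// ZooKeeper's actual port-assignment implementation.
public class PortRaceDemo {

    // Probes for a free port by binding to port 0 and closing again.
    // The port is free *now*, but nothing holds it for the caller.
    static int findFreePort() throws IOException {
        try (ServerSocket probe = new ServerSocket(0)) {
            return probe.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        int port = findFreePort();
        // Window: between findFreePort() returning and the bind below,
        // another test's socket may grab the same port.
        try (ServerSocket intruder = new ServerSocket(port)) {  // simulated rival
            try (ServerSocket server = new ServerSocket(port)) {
                System.out.println("no clash (unexpected on most platforms)");
            } catch (IOException alreadyBound) {
                System.out.println("port " + port + " was taken in the window");
            }
        }
    }
}
```

Under parallel test threads (or other builds on the same Jenkins slave), the window between probe and bind is exactly where the flaky BindExceptions come from.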

ben
On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar  wrote:
>
> That is a completely valid point. I started to investigate flakies for 
> exactly the same reason, if you remember the thread that I started a while 
> ago. It was later abandoned unfortunately, because I’ve run into a few issues:
>
> - We nailed down that in order to release 3.5 stable, we have to make sure 
> it’s not worse than 3.4 by comparing the builds: but these builds are not 
> comparable, because 3.4 tests running single threaded while 3.5 multithreaded 
> showing problems which might also exist on 3.4,
>
> - Neither of them running C++ tests for some reason, but that’s not really an 
> issue here,
>
> - Looks like tests on 3.5 is just as solid as on 3.4, because running them on 
> a dedicated, single threaded environment show almost all tests succeeding,
>
> - I think the root cause of failing unit tests could be one (or more) of the 
> following:
> a) Environmental: Jenkins slave gets overloaded with other builds and 
> multithreaded test running makes things even worse: starving JDK threads and 
> ZK instances (both clients and servers) are unable to operate
> b) Conceptional: ZK unit tests were not designed to run on multiple 
> threads: I investigated the unique port assignment feature which is looking 
> good, but there could be other possible gaps which makes them unreliable when 
> running simultaneously.
> c) Bad testing: testing ZK in the wrong way, making bad assumption 
> (e.g. not syncing clients), etc.
> d) Bug in the server.
>
> I feel that finding case d) with these tests is super hard, because a test 
> report doesn’t give any information on what could go wrong with ZooKeeper. 
> More or less guessing is your only option.
>
> Finding c) is a little bit easier, I’m trying to submit patches on them and 
> hopefully making some progress.
>


Re: Decrease number of threads in Jenkins builds to reduce flakyness

2018-10-12 Thread Andor Molnar
That is a completely valid point. I started to investigate flakies for exactly 
the same reason; you may remember the thread that I started a while ago. It was 
unfortunately abandoned later, because I ran into a few issues:

- We established that in order to release 3.5 as stable, we have to make sure 
it’s not worse than 3.4 by comparing the builds. But these builds are not 
comparable, because the 3.4 tests run single-threaded while 3.5 runs 
multithreaded, which surfaces problems that might also exist on 3.4.

- Neither of them runs the C++ tests for some reason, but that’s not really an 
issue here.

- Tests on 3.5 look just as solid as on 3.4: running them in a dedicated, 
single-threaded environment shows almost all tests succeeding.

- I think the root cause of the failing unit tests could be one (or more) of 
the following:
        a) Environmental: the Jenkins slave gets overloaded with other builds, 
and multithreaded test runs make things even worse: JDK threads are starved and 
ZK instances (both clients and servers) are unable to operate.
        b) Conceptual: ZK unit tests were not designed to run on multiple 
threads. I investigated the unique port assignment feature, which looks good, 
but there could be other gaps that make the tests unreliable when running 
simultaneously.
        c) Bad testing: testing ZK in the wrong way, making bad assumptions 
(e.g. not syncing clients), etc.
        d) A bug in the server.
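For context on the unique port assignment mentioned in b): one common way to 
avoid port collisions between tests running in parallel is to let the OS hand 
out a free ephemeral port by binding to port 0. A minimal sketch of that idea 
follows; this is a hypothetical helper, not ZooKeeper's actual PortAssignment 
implementation:

```java
import java.io.IOException;
import java.net.ServerSocket;

// Minimal sketch of OS-assigned unique ports for parallel tests.
// NOT ZooKeeper's actual PortAssignment class; just one common way
// to avoid port collisions across concurrent test threads.
public class FreePort {

    // Ask the OS for a currently free ephemeral port by binding to port 0;
    // the socket is closed immediately, freeing the port for the test to use.
    public static int next() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        int port = next();
        // Ephemeral ports are always above the well-known range (0-1023).
        System.out.println(port > 1023 && port <= 65535); // prints: true
    }
}
```

Note the race this sketch tolerates: another process could grab the port 
between close and reuse, which is why counter-based schemes like ZooKeeper's 
also exist.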

I feel that finding case d) with these tests is super hard, because a test 
report doesn’t give any information on what could have gone wrong inside 
ZooKeeper; more or less, guessing is your only option.

Finding c) is a little easier; I’m trying to submit patches for those and 
hopefully making some progress.
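To illustrate the c) category (a made-up example, not taken from the ZooKeeper 
test suite): asserting on state that another thread updates asynchronously, 
without waiting for the event first, is a classic source of flakiness. Waiting 
on a latch, analogous to syncing the client or waiting for the watcher 
callback, removes the race:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only (not an actual ZooKeeper test): a flaky test often
// asserts immediately, or after a fixed sleep, on state updated by another
// thread. Awaiting an explicit completion signal removes the timing assumption.
public class LatchWaitExample {

    // Returns true once the simulated async event has fired and the updated
    // state has been observed; awaiting the latch replaces a fragile sleep.
    static boolean waitForEventThenCheck() throws InterruptedException {
        CountDownLatch eventFired = new CountDownLatch(1);
        StringBuilder state = new StringBuilder();

        Thread worker = new Thread(() -> {
            state.append("updated");  // simulated async update (e.g. a watcher firing)
            eventFired.countDown();   // signal completion explicitly
        });
        worker.start();

        // Good pattern: block until the event (with a timeout), then check.
        boolean completed = eventFired.await(5, TimeUnit.SECONDS);
        worker.join();
        return completed && state.toString().equals("updated");
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(waitForEventThenCheck()); // prints: true
    }
}
```

The countDown/await pair also gives the happens-before edge that makes the 
worker's write to `state` safely visible to the asserting thread.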

The huge pain in the arse, though, are a) and b): people desperately keep 
commenting “please retest this” on GitHub to get a green build, while testing 
is drifting toward hiding real problems. I mean, people have started not to 
care about a failing build, because “it must be some flaky test unrelated to my 
patch”. Which is bad, but the shame is that it’s true in 90% of cases.

I’m just trying to find some ways, besides fixing the c) and d) flakies, to get 
more reliable and more informative Jenkins builds. I don’t want to make a huge 
turnaround, but I think if we can get a significantly more reliable build for 
the price of a slightly longer build time, running on 4 threads instead of 8, I 
say let’s do it.

As always, any help from the community is more than welcome and appreciated.

Thanks,
Andor






Re: Decrease number of threads in Jenkins builds to reduce flakyness

2018-10-12 Thread Patrick Hunt
IIRC the number of threads was increased to improve performance. Reducing it is
fine, but do we understand why it's failing? Perhaps it's finding real issues
as a result of the artificial concurrency/load.

Patrick



Re: Decrease number of threads in Jenkins builds to reduce flakyness

2018-10-12 Thread Andor Molnar
Thanks for the feedback.
I'm running a few tests now, branch-3.5 on 2 threads and trunk on 4 threads,
to see what the impact on the build time is.

The GitHub PR job is hard to configure, because its settings are hard-coded
into a shell script in the codebase. I'll have to open a PR for that.

Andor





Re: Decrease number of threads in Jenkins builds to reduce flakyness

2018-10-12 Thread Norbert Kalmar
+1. Running the tests locally with 1 thread always passes (well, I only ran it
about 5 times, but still).
On the other hand, running them on 8 threads yields similarly flaky results to
the Apache runs. (It is much faster, but not if we sometimes have to run it
6-8-10 times to get a green run...)

Norbert



Re: Decrease number of threads in Jenkins builds to reduce flakyness

2018-10-12 Thread Enrico Olivelli
+1

Enrico

-- 
Enrico Olivelli


Decrease number of threads in Jenkins builds to reduce flakyness

2018-10-12 Thread Andor Molnar
Hi,

What do you think of changing the number of threads running the unit tests in
Jenkins from the current 8 to 4, or even 2?

Running the unit tests on a single thread inside the Cloudera environment
shows the builds to be much more stable. That would probably be too slow, but
maybe at least running fewer threads would improve the situation.

It's getting very annoying that I cannot get a green build on GitHub with
only a few retests.

Regards,
Andor