I think NetVirt can sign off on the clustering issues. Jamo can take a look when he is up and also sign off. The two jobs [2] and [3] are showing better results.
Sometimes there are random failures where a node does not come back properly, such as in job [4]. We try to bring ODL1 back into the cluster, but it fails to come back within 5 minutes. Then we move on to the next tests and they fail. That ODL1 is hitting the issue below. Is there anything we can do to get past that? We can increase the timeout, but why is the cluster in a bad shape? I don't think the infra is loaded, since everything else is moving along properly - the Robot VM is driving the other two nodes. We can also see ODL1 restarting but taking its time in the failing case.

NetVirt is hitting the issue in bug 9006. The NetVirt tests copied the openflowplugin test pattern: take a node down, bring it back, then wait 5 minutes. What I don't understand is why taking 1 node down out of the three leads to instability. We have three nodes in the cluster. Take 1 down, leave the other 2 alone, attempt to bring back the 1 node, wait 5 minutes; that fails, and now the cluster is in a bad state, causing the further tests to fail.

2017-08-25 02:02:38,430 | WARN | saction-32-34'}} | DeadlockMonitor | 126 - org.opendaylight.controller.config-manager - 0.6.2.SNAPSHOT | ModuleIdentifier{factoryName='runtime-generated-mapping', instanceName='runtime-mapping-singleton'} did not finish after 284864 ms

[2] https://jenkins.opendaylight.org/releng/user/shague/my-views/view/3node/job/netvirt-csit-3node-openstack-ocata-gate-stateful-carbon/25/
[3] https://jenkins.opendaylight.org/releng/user/shague/my-views/view/3node/job/netvirt-csit-3node-openstack-ocata-upstream-stateful-carbon/
[4] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-gate-stateful-carbon/24/log.html.gz#s1-s1-t13-k2-k2-k8
[5] https://git.opendaylight.org/gerrit/62256

On Thu, Aug 24, 2017 at 7:38 PM, Sam Hague <sha...@redhat.com> wrote:

> I am running some more tests now with the NetVirt CSIT that look promising,
> so it might not be a blocker for NetVirt.
> I am running a few more iterations now to know better.
>
> I had reduced the number of test suites so that we could capture just
> clustering issues. Doing so added a bug in the test code that was causing
> some extra failures. I have that fixed. If the next runs show we are back
> down to just a few clustering bugs, then we don't need a blocker from the
> NetVirt side. If we are lucky, the remaining issues are what is in this
> openflowplugin bug here.
>
> For reference, the last run [1] is looking better. It has a different test
> code bug in it, so ignore those failures, but please check the karaf.logs
> and see if you see any clustering issues. I don't think you will, since
> the killing of the ODL nodes is broken in this job. 21 and 21 should have
> that fixed.
>
> [1] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-gate-stateful-carbon/20/
>
> On Thu, Aug 24, 2017 at 7:22 PM, Robert Varga <n...@hq.sk> wrote:
>
>> On 24/08/17 22:07, bugzilla-dae...@bugs.opendaylight.org wrote:
>> > *Comment # 14 <https://bugs.opendaylight.org/show_bug.cgi?id=9006#c14>
>> > on bug 9006 <https://bugs.opendaylight.org/show_bug.cgi?id=9006> from
>> > Luis Gomez <mailto:ece...@gmail.com> *
>> >
>> > OK, I think as a next step I can try to see if this reproduces outside CI.
>>
>> +infrastructure
>>
>> Guys, we are dealing with an issue which was first reported on
>> 8/17/2017, is blocking Carbon SR2 (due to NetVirt CSIT failing) and can
>> be either an infra or a code problem.
>>
>> The suspected trigger is the fix for
>> https://bugs.opendaylight.org/show_bug.cgi?id=8941 (merged on
>> 8/12/2017), which is a Carbon -> Carbon SR1 memory leak regression. If
>> that is the case, we need to identify and fix it, as a revert is not
>> really an option.
>>
>> Carbon/Nitrogen are synced up w.r.t. CDS, so this also impacts Nitrogen
>> (where it is a Carbon -> Nitrogen regression).
>>
>> Can you check with RS whether there are any issues and/or whether the
>> public cloud is experiencing issues?
>>
>> Given that inter-node network stability is in question, can we get a
>> limited-use set of slaves in the private cloud? Whatever is needed for
>> NetVirt CSIT is sufficient, and we only need to spin it up when we need
>> a really predictable environment... Should I file a helpdesk ticket?
>>
>> Thanks,
>> Robert
>>
>> _______________________________________________
>> infrastructure mailing list
>> infrastructure@lists.opendaylight.org
>> https://lists.opendaylight.org/mailman/listinfo/infrastructure
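For context on the pattern being debated in the thread: the openflowplugin step that NetVirt copied (take one node down, bring it back, wait up to 5 minutes for it to rejoin) is essentially a bounded rejoin poll. Below is a minimal shell sketch of that loop; the `node_is_up` probe is a hypothetical stand-in for the suites' actual cluster-status check, and none of the names here are taken from the real Robot code.

```shell
#!/bin/sh
# Sketch of the kill-and-rejoin wait described in the thread: after a
# controller is restarted, poll it until it rejoins the cluster, bounded
# by the same 5-minute window the CSIT suites use.

REJOIN_TIMEOUT=300   # the 5-minute window mentioned in the thread
POLL_INTERVAL=5      # seconds between liveness checks

# Hypothetical liveness probe; the real suites query cluster/shard status
# over REST rather than pinging the host.
node_is_up() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

# Poll until the node reports up, or give up after REJOIN_TIMEOUT seconds.
wait_for_rejoin() {
    node="$1"
    elapsed=0
    while [ "$elapsed" -lt "$REJOIN_TIMEOUT" ]; do
        if node_is_up "$node"; then
            echo "node $node rejoined after ${elapsed}s"
            return 0
        fi
        sleep "$POLL_INTERVAL"
        elapsed=$((elapsed + POLL_INTERVAL))
    done
    echo "node $node did not rejoin within ${REJOIN_TIMEOUT}s" >&2
    return 1
}
```

Note that when this poll times out, the suite simply moves on to the next tests, which is exactly the scenario described above: the cluster is left with a half-returned member, and the later failures cascade from there.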