On Fri, Feb 16, 2018 at 11:48 AM, Tom Pantelis <tompante...@gmail.com>
wrote:

>
>
> On Fri, Feb 16, 2018 at 2:44 PM, Ajay Lele <ajaysl...@gmail.com> wrote:
>
>>
>>
>> On Fri, Feb 16, 2018 at 11:38 AM, Tom Pantelis <tompante...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Feb 16, 2018 at 2:35 PM, Jamo Luhrsen <jluhr...@gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On 2/16/18 11:33 AM, Tom Pantelis wrote:
>>>> >
>>>> >
>>>> > On Fri, Feb 16, 2018 at 2:26 PM, Jamo Luhrsen <jluhr...@gmail.com> wrote:
>>>> >
>>>> >     I'm analyzing CSIT failures for our Carbon SR3 candidate.
>>>> >
>>>> >     Something nasty went wrong in a netvirt CSIT job in the middle of
>>>> >     the robot tests. Seems like all functionality is probably broken
>>>> >     after that.
>>>> >
>>>> >     In the karaf.log [0] I see a message about an akka circuit breaker
>>>> >     timing out, then a bunch of RuntimeExceptions: "Transaction
>>>> >     aborted due to shutdown".
>>>> >
>>>> >
>>>> > yeah, that means akka persistence failed, i.e., it timed out waiting
>>>> > for data to be written to disk. That kills the shard actor with no
>>>> > recovery. This can happen if there's slow disk access/contention in the
>>>> > env - we've seen this happen with the internal CSIT env before the disk
>>>> > issue was resolved.
>>>>
>>>> Thanks. I'll report to the infra guys that we are likely still seeing
>>>> some high disk I/O latency. There was another job with similar issues.
>>>>
>>>
>>> The timeout(s) can be increased in the akka.conf (I'd have to look it
>>> up) if it's really problematic, although that's really just a band-aid.
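>>>
>>> For reference, a minimal sketch of what that tuning might look like,
>>> assuming ODL's odl-cluster-data wrapper and the default LevelDB journal
>>> plugin (the path and values below are illustrative, not recommendations):
>>>
>>> odl-cluster-data {
>>>   akka.persistence.journal.leveldb {
>>>     # Per-plugin circuit breaker guarding journal writes; call-timeout
>>>     # is the setting that trips "Circuit Breaker Timed out"
>>>     # (Akka default: 10s).
>>>     circuit-breaker {
>>>       max-failures = 10
>>>       call-timeout = 60s   # tolerate slower disk I/O before tripping
>>>       reset-timeout = 30s
>>>     }
>>>   }
>>> }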
>>>
>>
>> Tom - can some logic be put in place to recover from this failure (e.g.,
>> eventually restarting the stopped shard)? Otherwise the controller has to
>> be restarted to get it out of this state. We have also seen this in live
>> envs under high load and reported it [0].
>>
>> [0] https://jira.opendaylight.org/browse/CONTROLLER-1789
>>
>
>
> Not sure - maybe.
>

Thanks, Tom. I inadvertently unicasted - adding the DLs back.


>
>
>>
>>
>>>
>>>
>>>>
>>>> JamO
>>>>
>>>> >     Any ideas what's happening here?
>>>> >
>>>> >     Thanks,
>>>> >     JamO
>>>> >
>>>> >     [0] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-pike-upstream-stateful-snat-conntrack-carbon/200/odl_1/odl1_karaf.log.gz
_______________________________________________
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev
