Re: [controller-dev] ODL abrupt restart - System.exit() via QuarantinedMonitorActorPropsFactory ?
Hi Michael,

Quarantine is the state in which Akka system-level messages can no longer be exchanged between nodes. These include, but are not limited to, heartbeats, remote deathwatch and node state updates.

This article gives a fair idea:
https://livingston.io/understanding-akkas-quarantine-state/

Some pointers on what could cause this are discussed here:
https://groups.google.com/forum/#!searchin/akka-user/quarantine|sort:date/akka-user/6cmA1RzE4-s/IaHxhxLhEgAJ

We have seen this self-initiated shutdown in the past during long stop-the-world GCs, as well as during *deliberate* (for testing purposes) interface down/up for 2550.

I haven't tested this behavior on master yet.

Regards,
Muthu

From: controller-dev-boun...@lists.opendaylight.org [mailto:controller-dev-boun...@lists.opendaylight.org] On Behalf Of Michael Vorburger
Sent: Thursday, July 05, 2018 11:12 PM
To: Tom Pantelis
Cc: Sridhar Gaddam; Kitt, Stephen; controller-dev
Subject: Re: [controller-dev] ODL abrupt restart - System.exit() via QuarantinedMonitorActorPropsFactory ?

On Thu, Jul 5, 2018 at 7:39 PM, Tom Pantelis <tompante...@gmail.com> wrote:
On Thu, Jul 5, 2018 at 1:35 PM, Michael Vorburger <vorbur...@redhat.com> wrote:

Tom, or Robert, or anyone else having hit this themselves,

would you be able to remind us what in clustering can cause an ODL abrupt restart - System.exit() via bundleContext.getBundle(0).stop(); from https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/akka/osgi/impl/QuarantinedMonitorActorPropsFactory.java ?

I do vaguely recall an "inconsistent cluster" leading to this - could you clarify exactly what situation leads to that? Loss of leader? Loss of majority?

Asking for https://bugzilla.redhat.com/show_bug.cgi?id=1597304 ...

That happens when Akka quarantines a node - it can no longer rejoin the majority cluster unless the actor system is restarted, hence we restart the whole JVM.

and what can cause Akka to have to quarantine a node?

___
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev
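To make the long-GC and interface-down scenarios above a bit more concrete, here is a toy Java model of a node that gives up on a peer once system-level exchanges (such as heartbeats) have been silent for too long. This is only a sketch and not Akka's implementation: the class QuarantineModel, its method names and the single "give up after N milliseconds" threshold are all hypothetical, and real Akka decides this through its own failure detection and system-message delivery logic. The one property it tries to illustrate is that quarantine is sticky: later messages from the same peer do not undo it.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only (hypothetical names); this is NOT how Akka is implemented. */
final class QuarantineModel {
    enum State { REACHABLE, QUARANTINED }

    // Hypothetical threshold: how long system messages may stay unacknowledged.
    private final long giveUpAfterMillis;
    private final Map<String, Long> lastAckMillis = new HashMap<>();
    private final Map<String, State> states = new HashMap<>();

    QuarantineModel(long giveUpAfterMillis) {
        this.giveUpAfterMillis = giveUpAfterMillis;
    }

    /** Record a successful system-level exchange (e.g. a heartbeat) with a peer. */
    void ack(String peer, long nowMillis) {
        if (states.get(peer) == State.QUARANTINED) {
            return; // sticky: a quarantined peer cannot simply come back
        }
        lastAckMillis.put(peer, nowMillis);
        states.put(peer, State.REACHABLE);
    }

    /** Re-evaluate a peer; a long silence (GC pause, interface down) quarantines it. */
    State check(String peer, long nowMillis) {
        Long last = lastAckMillis.get(peer);
        if (last != null && nowMillis - last > giveUpAfterMillis) {
            states.put(peer, State.QUARANTINED);
        }
        return states.getOrDefault(peer, State.REACHABLE);
    }
}
```

With a threshold of 5000 ms, for example, a peer acknowledged at time 0 is still REACHABLE when checked at 3000 ms, becomes QUARANTINED when checked at 9000 ms, and stays QUARANTINED even if an acknowledgement arrives afterwards. The timestamps are passed in explicitly so that the pause can be simulated without sleeping.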
Re: [controller-dev] ODL abrupt restart - System.exit() via QuarantinedMonitorActorPropsFactory ?
On Thu, Jul 5, 2018 at 10:45 AM, Tom Pantelis wrote:
> On Thu, Jul 5, 2018 at 1:42 PM, Michael Vorburger wrote:
>> On Thu, Jul 5, 2018 at 7:39 PM, Tom Pantelis wrote:
>>> On Thu, Jul 5, 2018 at 1:35 PM, Michael Vorburger wrote:
>>>> Tom, or Robert, or anyone else having hit this themselves,
>>>>
>>>> would you be able to remind us what in clustering can cause an ODL abrupt restart - System.exit() via bundleContext.getBundle(0).stop(); from https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/akka/osgi/impl/QuarantinedMonitorActorPropsFactory.java ?
>>>>
>>>> I do vaguely recall an "inconsistent cluster" leading to this - could you clarify exactly what situation leads to that? Loss of leader? Loss of majority?
>>>>
>>>> Asking for https://bugzilla.redhat.com/show_bug.cgi?id=1597304 ...
>>>
>>> That happens when Akka quarantines a node - it can no longer rejoin the majority cluster unless the actor system is restarted, hence we restart the whole JVM.
>>
>> and what can cause Akka to have to quarantine a node?
>
> An unrecoverable failure state - see https://livingston.io/understanding-akkas-quarantine-state/ for more detail.

The most common cause is nodes getting isolated for a considerable amount of time.
Re: [controller-dev] ODL abrupt restart - System.exit() via QuarantinedMonitorActorPropsFactory ?
On Thu, Jul 5, 2018 at 1:42 PM, Michael Vorburger wrote:
> On Thu, Jul 5, 2018 at 7:39 PM, Tom Pantelis wrote:
>> On Thu, Jul 5, 2018 at 1:35 PM, Michael Vorburger wrote:
>>> Tom, or Robert, or anyone else having hit this themselves,
>>>
>>> would you be able to remind us what in clustering can cause an ODL abrupt restart - System.exit() via bundleContext.getBundle(0).stop(); from https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/akka/osgi/impl/QuarantinedMonitorActorPropsFactory.java ?
>>>
>>> I do vaguely recall an "inconsistent cluster" leading to this - could you clarify exactly what situation leads to that? Loss of leader? Loss of majority?
>>>
>>> Asking for https://bugzilla.redhat.com/show_bug.cgi?id=1597304 ...
>>
>> That happens when Akka quarantines a node - it can no longer rejoin the majority cluster unless the actor system is restarted, hence we restart the whole JVM.
>
> and what can cause Akka to have to quarantine a node?

An unrecoverable failure state - see https://livingston.io/understanding-akkas-quarantine-state/ for more detail.
Re: [controller-dev] ODL abrupt restart - System.exit() via QuarantinedMonitorActorPropsFactory ?
On Thu, Jul 5, 2018 at 7:39 PM, Tom Pantelis wrote:
> On Thu, Jul 5, 2018 at 1:35 PM, Michael Vorburger wrote:
>> Tom, or Robert, or anyone else having hit this themselves,
>>
>> would you be able to remind us what in clustering can cause an ODL abrupt restart - System.exit() via bundleContext.getBundle(0).stop(); from https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/akka/osgi/impl/QuarantinedMonitorActorPropsFactory.java ?
>>
>> I do vaguely recall an "inconsistent cluster" leading to this - could you clarify exactly what situation leads to that? Loss of leader? Loss of majority?
>>
>> Asking for https://bugzilla.redhat.com/show_bug.cgi?id=1597304 ...
>
> That happens when Akka quarantines a node - it can no longer rejoin the majority cluster unless the actor system is restarted, hence we restart the whole JVM.

and what can cause Akka to have to quarantine a node?
Re: [controller-dev] ODL abrupt restart - System.exit() via QuarantinedMonitorActorPropsFactory ?
On Thu, Jul 5, 2018 at 1:35 PM, Michael Vorburger wrote:
> Tom, or Robert, or anyone else having hit this themselves,
>
> would you be able to remind us what in clustering can cause an ODL abrupt restart - System.exit() via bundleContext.getBundle(0).stop(); from https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/akka/osgi/impl/QuarantinedMonitorActorPropsFactory.java ?
>
> I do vaguely recall an "inconsistent cluster" leading to this - could you clarify exactly what situation leads to that? Loss of leader? Loss of majority?
>
> Asking for https://bugzilla.redhat.com/show_bug.cgi?id=1597304 ...

That happens when Akka quarantines a node - it can no longer rejoin the majority cluster unless the actor system is restarted, hence we restart the whole JVM.

> Tx,
> M.
> --
> Michael Vorburger, Red Hat
> vorbur...@redhat.com | IRC: vorburger @freenode | ~ = http://vorburger.ch
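For readers following along, the "quarantined, therefore restart" reaction described above can be sketched roughly as follows. This is a simplified stand-in, not the code in QuarantinedMonitorActorPropsFactory: the class QuarantineRestartSketch and its methods are hypothetical, and a plain Runnable takes the place of whatever actually stops the container (the thread mentions bundleContext.getBundle(0).stop()). Running the callback at most once is a design choice of this sketch, not something confirmed for the ODL code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; a Runnable stands in for the real restart action. */
final class QuarantineRestartSketch {
    private final Runnable restartCallback;
    private final AtomicBoolean triggered = new AtomicBoolean(false);
    private volatile String quarantinedBy; // peer that reported the quarantine

    QuarantineRestartSketch(Runnable restartCallback) {
        this.restartCallback = restartCallback;
    }

    /**
     * Called when this node learns that a peer has quarantined it.
     * Returns true only for the call that actually triggered the restart.
     */
    boolean onQuarantined(String peerAddress) {
        // Assumption of this sketch: repeated notifications restart only once.
        if (triggered.compareAndSet(false, true)) {
            quarantinedBy = peerAddress;
            restartCallback.run();
            return true;
        }
        return false;
    }

    /** The peer whose notification triggered the restart, or null if none yet. */
    String quarantinedBy() {
        return quarantinedBy;
    }
}
```

In a real deployment the callback would be the piece that brings the JVM down so that a fresh actor system can rejoin the cluster; here it can be any Runnable, which keeps the sketch testable without OSGi or Akka on the classpath.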
[controller-dev] Understanding CDS
Hello Josh, everyone,

when trying to understand what CDS does and how it does it, there are concepts and technologies that must be understood -- all relating to distributed systems and state management theory.

Specific topics:
- Actor systems, with Akka being an implementation
- Akka Clustering
- Akka Persistence
- The RAFT algorithm (and distributed consensus in general, like 3PC)
- Multiversion Concurrency Control (as a solution to the problem of concurrency control)

All of these are things that cannot be explained in minutes and all have bearing on architecture of CDS as well as trade-offs taken in its design and implementation.

If we try to have a conversation about the CDS without sharing this common knowledge, that conversation will be utterly inefficient with frequent and long digressions into those topics -- which is something I (and I suspect Tom) can ill afford.

Regards,
Robert
[controller-dev] ODL abrupt restart - System.exit() via QuarantinedMonitorActorPropsFactory ?
Tom, or Robert, or anyone else having hit this themselves,

would you be able to remind us what in clustering can cause an ODL abrupt restart - System.exit() via bundleContext.getBundle(0).stop(); from https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/akka/osgi/impl/QuarantinedMonitorActorPropsFactory.java ?

I do vaguely recall an "inconsistent cluster" leading to this - could you clarify exactly what situation leads to that? Loss of leader? Loss of majority?

Asking for https://bugzilla.redhat.com/show_bug.cgi?id=1597304 ...

Tx,
M.
--
Michael Vorburger, Red Hat
vorbur...@redhat.com | IRC: vorburger @freenode | ~ = http://vorburger.ch