Re: [controller-dev] [mdsal-dev] cluster - recovery from dual failure

Tom Pantelis Sun, 22 Jan 2017 08:51:24 -0800

Nope. I initially thought it would, but as I found out, they implemented
it very conservatively, at least originally.


On Sun, Jan 22, 2017 at 11:00 AM, Sela, Guy <guy.s...@hpe.com> wrote:

> It won’t solve Shlomi’s problem, as described here in a post originated by
> Tom P. :)
>
> https://groups.google.com/forum/#!topic/akka-user/506ErDM_KA4
>
>
>
>
>
> *From:* mdsal-dev-boun...@lists.opendaylight.org [mailto:
> mdsal-dev-boun...@lists.opendaylight.org] *On Behalf Of *Sela, Guy
> *Sent:* Sunday, January 22, 2017 5:28 PM
> *To:* Tom Pantelis <tompante...@gmail.com>
> *Cc:* controller-dev@lists.opendaylight.org;
> mdsal-dev@lists.opendaylight.org; Alfasi, Shlomi <shlomi.alf...@hpe.com>
>
> *Subject:* Re: [mdsal-dev] [controller-dev] cluster - recovery from dual
> failure
>
>
>
> I’m sorry, I just read the weakly-up documentation.
>
> Sounds like it will solve Shlomi’s problem.
>
> What did you mean by it getting “partly” to the way we want it? What’s
> missing?
>
>
>
>
>
> *From:* Tom Pantelis [mailto:tompante...@gmail.com <tompante...@gmail.com>]
>
> *Sent:* Sunday, January 22, 2017 5:08 PM
> *To:* Sela, Guy <guy.s...@hpe.com>
> *Cc:* Alfasi, Shlomi <shlomi.alf...@hpe.com>;
> controller-dev@lists.opendaylight.org; mdsal-...@lists.opendaylight.org
> *Subject:* Re: [mdsal-dev] [controller-dev] cluster - recovery from dual
> failure
>
>
>
> That's the way it works and the akka designers have reasons for it. They
> added "weakly-up" which gets it partly to the way we would want it to work
> and they've said they may add more options to better control the behavior.
>
>
>
> You can enable auto-down in your setup. Alternatively, an external script
> can monitor the process and, if it goes down, send a "down" request (via
> jolokia) to the cluster leader.
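
A minimal sketch of the "down" request such a script might send, assuming Akka's cluster JMX MBean is registered under its default name `akka:type=Cluster` and exposes a `down(address)` operation, and that Jolokia accepts JSON `exec` requests over HTTP POST. The member address in the usage line, the endpoint URL and the credentials are hypothetical placeholders, not values from this thread; process monitoring and leader discovery are left out.

```python
import base64
import json
import urllib.request


def build_down_request(member_address):
    """Build a Jolokia 'exec' payload that calls down(address) on the
    Akka cluster MBean (assumed to be registered as akka:type=Cluster)."""
    return {
        "type": "exec",
        "mbean": "akka:type=Cluster",
        "operation": "down",
        "arguments": [member_address],
    }


def send_down(jolokia_url, member_address, user, password, timeout=5.0):
    """POST the request to a Jolokia endpoint using HTTP basic auth and
    return the decoded JSON reply. URL and credentials are supplied by
    the caller; nothing here is specific to a particular deployment."""
    body = json.dumps(build_down_request(member_address)).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        jolokia_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical address of the member that went down (placeholder IP).
payload = build_down_request("akka.tcp://opendaylight-cluster-data@192.0.2.10:2550")
print(json.dumps(payload, indent=2))
```

A monitoring script would call `send_down(...)` against the Jolokia endpoint of the current cluster leader once it has decided the member's process is really gone; marking a node Down while it is still running risks a split cluster, which is the same concern behind the caution about auto-downing.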
>
>
>
> On Sun, Jan 22, 2017 at 9:37 AM, Sela, Guy <guy.s...@hpe.com> wrote:
>
> Hi,
>
> Just read the documentation, very interesting.
>
> So that means that ODL Cluster can’t automatically recover from more than
> a single concurrent failure.
>
> Even in a cluster of 10 nodes, if one becomes unreachable, none of the
> others can restart and rejoin until the first one is reachable again.
>
> Sounds like a serious restriction for production.
>
> Are there any best practices for dealing with this situation? (Without
> manual intervention)
>
>
>
> *From:* mdsal-dev-boun...@lists.opendaylight.org [mailto:
> mdsal-dev-boun...@lists.opendaylight.org] *On Behalf Of *Tom Pantelis
> *Sent:* Sunday, January 22, 2017 4:30 PM
> *To:* Alfasi, Shlomi <shlomi.alf...@hpe.com>
> *Cc:* controller-dev@lists.opendaylight.org;
> mdsal-dev@lists.opendaylight.org
> *Subject:* Re: [mdsal-dev] [controller-dev] cluster - recovery from dual
> failure
>
>
>
> This is a side effect of how akka clustering works. Before a node can
> join, all unreachable nodes must first become reachable again, or the
> status of the unreachable nodes must be changed to 'Down', either manually
> or by auto-downing.  You can enable auto-downing but akka doesn't recommend
> it in production (
> http://doc.akka.io/docs/akka/current/java/cluster-usage.html).
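
The rule described above can be illustrated with a small toy model. This is a simplified sketch and not Akka's actual implementation: the `Member` type, the function names and the three-node example are made up for illustration, and only the 'Down' status is treated as an exception.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    address: str
    status: str      # e.g. "Up", "Joining", "Down"
    reachable: bool


def blocking_members(members):
    """Members that are unreachable and have not been marked Down."""
    return [m for m in members if not m.reachable and m.status != "Down"]


def can_admit_joining(members):
    """Simplified rule: a joining node is admitted only when no
    unreachable member is left in a non-Down state."""
    return not blocking_members(members)


# One member has crashed (unreachable, still listed as Up) and another
# node has restarted and is trying to join.
cluster = [
    Member("member-1", "Up", reachable=True),
    Member("member-2", "Up", reachable=False),
    Member("member-3", "Joining", reachable=True),
]
print(can_admit_joining(cluster))      # False: member-2 blocks the join

# After member-2 is marked Down (manually or by auto-downing).
after_down = [
    Member("member-1", "Up", reachable=True),
    Member("member-2", "Down", reachable=False),
    Member("member-3", "Joining", reachable=True),
]
print(can_admit_joining(after_down))   # True
```

In this model the restarted node stays stuck for as long as the crashed member is neither reachable nor Down, which matches the symptoms in the original report.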
>
>
>
> On Sun, Jan 22, 2017 at 8:53 AM, Alfasi, Shlomi <shlomi.alf...@hpe.com>
> wrote:
>
> Hi All,
>
>
>
> I configured a clustered setup with 3 nodes (attached the akka.conf of one
> of the nodes).
>
> At some point one of the members in the cluster went down, and then I
> restarted another node.
>
> In the restarted node I see that it fails to read information from the
> datastore and repeatedly throws exceptions [1]
>
> In the node that was always up, every 10 seconds there is a log message
> that implies that the restarted node doesn’t manage to join [2]
>
>
>
> What is the expected behavior in this case? Is this state recoverable?
>
>
>
> Shlomi
>
>
>
> [1]
>
> WARN  | ult-dispatcher-2 | DataStoreAppConfigMetadata       | 153 -
> org.opendaylight.controller.blueprint - 0.5.2.SNAPSHOT |
> org.opendaylight.netvirt.elanmanager-impl (elanConfig): Read of app
> config
> org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.elan.config.rev150710.ElanConfig
> failed - retrying
>
> ReadFailedException{message=Error executeRead ReadData for path
> /(urn:opendaylight:netvirt:elan:config?revision=2015-07-10)elan-config,
> errorList=[RpcError [message=Error executeRead ReadData for path
> /(urn:opendaylight:netvirt:elan:config?revision=2015-07-10)elan-config,
> severity=ERROR, errorType=APPLICATION, tag=operation-failed,
> applicationTag=null, info=null,
> cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException:
> Shard member-3-shard-default-config currently has no leader. Try again later.]]}
>
>
>
> [2]
>
> 2017-01-22 15:19:56,290 | INFO  | lt-dispatcher-22 |
> kka://opendaylight-cluster-data) | 159 - com.typesafe.akka.slf4j - 2.4.7
> | Cluster Node [akka.tcp://opendaylight-cluster-data@10.0.77.33:2550] -
> New incarnation of existing member
> [Member(address = akka.tcp://opendaylight-cluster-data@10.0.97.128:2550,
> status = Down)] is trying to join. Existing will be removed from the
> cluster and then new member will be allowed to join.
>
>
>
>
> _______________________________________________
> controller-dev mailing list
> controller-dev@lists.opendaylight.org
> https://lists.opendaylight.org/mailman/listinfo/controller-dev
>
>
>
>
>
