Re: [akka-user] Akka Cluster (with Sharding) not working without auto-down-unreachable-after

2016-08-05 Thread Justin du coeur
On Fri, Aug 5, 2016 at 2:02 PM, Endre Varga wrote:

> No, you are doing a great job explaining these! Maybe a guest blog post?
> (wink, wink, nudge, nudge ;) )
>

Thanks.  Everything's on an "as time permits" basis -- most of my attention
has to be on Querki -- but I'll consider writing something up.

There *is* a big blog post in the plans sometime in the next few weeks,
though, a case study on using Kryo for Akka Persistence.  (Which, knock on
wood, seems to be going decently well, but I have a *lot* of opinions on
the subject, some of them a tad idiosyncratic.)

FYI, while my blog is pretty diverse, this tag covers the programming topics
...



Re: [akka-user] Akka Cluster (with Sharding) not working without auto-down-unreachable-after

2016-08-05 Thread Endre Varga
On Fri, Aug 5, 2016 at 12:04 AM, Justin du coeur  wrote:

> It does do reassignment -- but it has to know to do that.  Keep in mind
> that "down" is the master switch here: until the node is downed, the rest
> of the system doesn't *know* that NodeA should be avoided.  I haven't dug
> into that particular code, but I assume from what you're saying that the
> allocation algorithm doesn't take unreachability into account when choosing
> where to allocate the shard, just up/down.  I suspect that unreachability
> is too local and transient to use as the basis for these allocations.
>
> Keep in mind that you're looking at this from a relatively all-knowing
> global perspective, but each node is working from a very localized and
> imperfect one.  All it knows is "I can't currently reach NodeA".  It has no
> a priori way of knowing whether NodeA has been taken offline (so it should
> be avoided), or there's simply been a transient network glitch between here
> and there (so things are *mostly* business as usual).  Downing is how you
> tell it, "No, really, stop using this node"; until then, most of the code
> assumes that the more-common transient situation is the case.  It's
> *probably* possible to take unreachability into account in the case you're
> describing, but it doesn't surprise me if that's not true.
>
> Also, keep in mind that, IIRC, there are a few cluster singletons involved
> here, at least behind the scenes.  If NodeA currently owns one of the key
> singletons (such as the ShardCoordinator), and it hasn't been downed, I
> imagine the rest of the cluster is going to *quickly* lock up, because the
> result is that nobody is authorized to make these sorts of allocation
> decisions.
>
> All that said -- keep in mind, I'm just a user of this stuff, and am
> talking at the edges of my knowledge.  Konrad's the actual expert...
>

No, you are doing a great job explaining these! Maybe a guest blog post?
(wink, wink, nudge, nudge ;) )

-Endre



>
> On Thu, Aug 4, 2016 at 4:59 PM, Eric Swenson  wrote:
>
>> While I'm in the process of implementing your proposed solution, I did
>> want to make sure I understood why I'm seeing the failures I'm seeing when
>> a node is taken offline, auto-down is disabled, and no one is handling the
>> UnreachableNode message.  Let me try to explain what I think is happening
>> and perhaps you (or someone else who knows more about this than I) can
>> confirm or refute.
>>
>> In the case of akka-cluster-sharding, a shard might exist on the
>> unreachable node.  Since the node is not yet marked as "down", the cluster
>> simply cannot handle an incoming message for that shard.  To create another
>> sharded actor on an available cluster node might duplicate the unreachable
>> node state.  In the case of akka-persistence actors, even though a new
>> shard actor could resurrect any journaled state, we cannot be sure that the
>> old unreachable node might not at any time, add other events to the
>> journal, or come online and try to continue operating on the shard.
>>
>> Is that the reason why I see the following behavior:  NodeA is online.
>> NodeB comes online and joins the cluster.  A request comes in from
>> akka-http and is sent to the shard region.  It goes to NodeA which creates
>> an actor to handle the sharded object.  NodeA is taken offline (unbeknownst
>> to the akka-cluster).  Another message for the above-mentioned shard comes
>> in from akka-http and is sent to the shard region. The shard region can't
>> reach NodeA.  NodeA isn't marked as down.  So the shard region cannot
>> create another actor (on an available Node). It can only wait (until
>> timeout) for NodeA to become reachable.  Since, in my scenario, NodeA will
>> never become reachable and NodeB is the only one online, all requests for
>> old shards timeout.
>>
>> If the above logic is true, I have one last issue:  In the above
>> scenario, if a message comes into the shard region for a shard that WOULD
>> HAVE BEEN allocated to NodeA but has never yet been assigned to an actor on
>> NodeA, and NodeA is unreachable, why can't it simply be assigned to another
>> Node?  Is it because the shard-to-node algorithm is fixed (by default) and
>> there is no dynamic ability to "reassign" to an available Node?
>>
>> Thanks again.  -- Eric
>>
>> On Wednesday, August 3, 2016 at 7:00:42 PM UTC-7, Justin du coeur wrote:
>>>
>>> The keyword here is "auto".  Autodowning is an *incredibly braindead*
>>> algorithm for dealing with nodes coming out of service, and if you use it
>>> in production you more or less guarantee disaster, because that algorithm
>>> can't cope with cluster partition.  You *do* need to deal with downing, but
>>> you have to get something smarter than that.
>>>
>>> Frankly, if you're already hooking into AWS, I *suspect* the best
>>> approach is to leverage that -- when a node goes offline, you have some
>>> code to detect that through the ECS APIs, react to it, and 

Re: [akka-user] Akka Cluster (with Sharding) not working without auto-down-unreachable-after

2016-08-05 Thread Eric Swenson
Yes, thanks. I'll take a look at the default implementation and explore other 
possible implementations. I suspect, however, that the solution I've now 
implemented, which "downs" unreachable nodes when the AWS/ECS cluster says they 
are no longer there, will address my issues.
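
Roughly, that kind of hook can look like the following -- a minimal sketch only, 
where the EcsMembership object is a stand-in for whatever AWS/ECS lookup is 
actually used, and error handling is left out:

import akka.actor.{Actor, ActorLogging, Address, Props}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{InitialStateAsEvents, MemberRemoved, UnreachableMember}

// Stand-in for the real AWS/ECS lookup -- replace with an SDK call.
object EcsMembership {
  def isStillRegistered(address: Address): Boolean = true
}

// Sketch: subscribe to reachability events and down an unreachable member
// only when the external source of truth says the instance is really gone.
class EcsAwareDowner extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, InitialStateAsEvents,
      classOf[UnreachableMember], classOf[MemberRemoved])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case UnreachableMember(member) =>
      if (!EcsMembership.isStillRegistered(member.address)) {
        log.info("Downing {}: no longer registered with ECS", member.address)
        cluster.down(member.address)
      }
    case MemberRemoved(member, _) =>
      log.info("{} removed from the cluster", member.address)
  }
}

object EcsAwareDowner {
  val props: Props = Props[EcsAwareDowner]
}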

On Friday, August 5, 2016 at 2:38:54 AM UTC-7, Johan Andrén wrote:
>
> You can however implement your own 
> akka.cluster.sharding.ShardCoordinator.ShardAllocationStrategy which will 
> allow you to take whatever you want into account, and deal with the 
> consequences thereof ofc ;) 
>
> --
> Johan
>
> On Friday, August 5, 2016 at 12:04:12 AM UTC+2, Justin du coeur wrote:
>>
>> It does do reassignment -- but it has to know to do that.  Keep in mind 
>> that "down" is the master switch here: until the node is downed, the rest 
>> of the system doesn't *know* that NodeA should be avoided.  I haven't dug 
>> into that particular code, but I assume from what you're saying that the 
>> allocation algorithm doesn't take unreachability into account when choosing 
>> where to allocate the shard, just up/down.  I suspect that unreachability 
>> is too local and transient to use as the basis for these allocations.
>>
>> Keep in mind that you're looking at this from a relatively all-knowing 
>> global perspective, but each node is working from a very localized and 
>> imperfect one.  All it knows is "I can't currently reach NodeA".  It has no 
>> a priori way of knowing whether NodeA has been taken offline (so it should 
>> be avoided), or there's simply been a transient network glitch between here 
>> and there (so things are *mostly* business as usual).  Downing is how you 
>> tell it, "No, really, stop using this node"; until then, most of the code 
>> assumes that the more-common transient situation is the case.  It's 
>> *probably* possible to take unreachability into account in the case you're 
>> describing, but it doesn't surprise me if that's not true.
>>
>> Also, keep in mind that, IIRC, there are a few cluster singletons 
>> involved here, at least behind the scenes.  If NodeA currently owns one of 
>> the key singletons (such as the ShardCoordinator), and it hasn't been 
>> downed, I imagine the rest of the cluster is going to *quickly* lock up, 
>> because the result is that nobody is authorized to make these sorts of 
>> allocation decisions.
>>
>> All that said -- keep in mind, I'm just a user of this stuff, and am 
>> talking at the edges of my knowledge.  Konrad's the actual expert...
>>
>> On Thu, Aug 4, 2016 at 4:59 PM, Eric Swenson wrote:
>>
>>> While I'm in the process of implementing your proposed solution, I did 
>>> want to make sure I understood why I'm seeing the failures I'm seeing when 
>>> a node is taken offline, auto-down is disabled, and no one is handling the 
>>> UnreachableNode message.  Let me try to explain what I think is happening 
>>> and perhaps you (or someone else who knows more about this than I) can 
>>> confirm or refute.
>>>
>>> In the case of akka-cluster-sharding, a shard might exist on the 
>>> unreachable node.  Since the node is not yet marked as "down", the cluster 
>>> simply cannot handle an incoming message for that shard.  To create another 
>>> sharded actor on an available cluster node might duplicate the unreachable 
>>> node state.  In the case of akka-persistence actors, even though a new 
>>> shard actor could resurrect any journaled state, we cannot be sure that the 
>>> old unreachable node might not at any time, add other events to the 
>>> journal, or come online and try to continue operating on the shard.
>>>
>>> Is that the reason why I see the following behavior:  NodeA is online.  
>>> NodeB comes online and joins the cluster.  A request comes in from 
>>> akka-http and is sent to the shard region.  It goes to NodeA which creates 
>>> an actor to handle the sharded object.  NodeA is taken offline (unbeknownst 
>>> to the akka-cluster).  Another message for the above-mentioned shard comes 
>>> in from akka-http and is sent to the shard region. The shard region can't 
>>> reach NodeA.  NodeA isn't marked as down.  So the shard region cannot 
>>> create another actor (on an available Node). It can only wait (until 
>>> timeout) for NodeA to become reachable.  Since, in my scenario, NodeA will 
>>> never become reachable and NodeB is the only one online, all requests for 
>>> old shards timeout.
>>>
>>> If the above logic is true, I have one last issue:  In the above 
>>> scenario, if a message comes into the shard region for a shard that WOULD 
>>> HAVE BEEN allocated to NodeA but has never yet been assigned to an actor on 
>>> NodeA, and NodeA is unreachable, why can't it simply be assigned to another 
>>> Node?  Is it because the shard-to-node algorithm is fixed (by default) and 
>>> there is no dynamic ability to "reassign" to an available Node? 
>>>
>>> Thanks again.  -- Eric
>>>
>>> On Wednesday, August 3, 2016 at 7:00:42 PM UTC-7, Justin du coeur 

Re: [akka-user] Akka Cluster (with Sharding) not working without auto-down-unreachable-after

2016-08-05 Thread Eric Swenson
Thanks Justin.  Makes good sense.  And I believe you've explained the lock 
up -- the shard coordinator singleton was probably on the unreachable node 
(actually, it must have been, because prior to the second node's coming up, 
the first node was the only node) and since the cluster-of-two determined 
the first node to be unreachable, there was no shard coordinator accessible 
until the first node was "downed" and the shard coordinator moved to the 
second node.  

I appreciate all your insight. -- Eric

On Thursday, August 4, 2016 at 3:04:12 PM UTC-7, Justin du coeur wrote:
>
> It does do reassignment -- but it has to know to do that.  Keep in mind 
> that "down" is the master switch here: until the node is downed, the rest 
> of the system doesn't *know* that NodeA should be avoided.  I haven't dug 
> into that particular code, but I assume from what you're saying that the 
> allocation algorithm doesn't take unreachability into account when choosing 
> where to allocate the shard, just up/down.  I suspect that unreachability 
> is too local and transient to use as the basis for these allocations.
>
> Keep in mind that you're looking at this from a relatively all-knowing 
> global perspective, but each node is working from a very localized and 
> imperfect one.  All it knows is "I can't currently reach NodeA".  It has no 
> a priori way of knowing whether NodeA has been taken offline (so it should 
> be avoided), or there's simply been a transient network glitch between here 
> and there (so things are *mostly* business as usual).  Downing is how you 
> tell it, "No, really, stop using this node"; until then, most of the code 
> assumes that the more-common transient situation is the case.  It's 
> *probably* possible to take unreachability into account in the case you're 
> describing, but it doesn't surprise me if that's not true.
>
> Also, keep in mind that, IIRC, there are a few cluster singletons involved 
> here, at least behind the scenes.  If NodeA currently owns one of the key 
> singletons (such as the ShardCoordinator), and it hasn't been downed, I 
> imagine the rest of the cluster is going to *quickly* lock up, because the 
> result is that nobody is authorized to make these sorts of allocation 
> decisions.
>
> All that said -- keep in mind, I'm just a user of this stuff, and am 
> talking at the edges of my knowledge.  Konrad's the actual expert...
>
> On Thu, Aug 4, 2016 at 4:59 PM, Eric Swenson wrote:
>
>> While I'm in the process of implementing your proposed solution, I did 
>> want to make sure I understood why I'm seeing the failures I'm seeing when 
>> a node is taken offline, auto-down is disabled, and no one is handling the 
>> UnreachableNode message.  Let me try to explain what I think is happening 
>> and perhaps you (or someone else who knows more about this than I) can 
>> confirm or refute.
>>
>> In the case of akka-cluster-sharding, a shard might exist on the 
>> unreachable node.  Since the node is not yet marked as "down", the cluster 
>> simply cannot handle an incoming message for that shard.  To create another 
>> sharded actor on an available cluster node might duplicate the unreachable 
>> node state.  In the case of akka-persistence actors, even though a new 
>> shard actor could resurrect any journaled state, we cannot be sure that the 
>> old unreachable node might not at any time, add other events to the 
>> journal, or come online and try to continue operating on the shard.
>>
>> Is that the reason why I see the following behavior:  NodeA is online.  
>> NodeB comes online and joins the cluster.  A request comes in from 
>> akka-http and is sent to the shard region.  It goes to NodeA which creates 
>> an actor to handle the sharded object.  NodeA is taken offline (unbeknownst 
>> to the akka-cluster).  Another message for the above-mentioned shard comes 
>> in from akka-http and is sent to the shard region. The shard region can't 
>> reach NodeA.  NodeA isn't marked as down.  So the shard region cannot 
>> create another actor (on an available Node). It can only wait (until 
>> timeout) for NodeA to become reachable.  Since, in my scenario, NodeA will 
>> never become reachable and NodeB is the only one online, all requests for 
>> old shards timeout.
>>
>> If the above logic is true, I have one last issue:  In the above 
>> scenario, if a message comes into the shard region for a shard that WOULD 
>> HAVE BEEN allocated to NodeA but has never yet been assigned to an actor on 
>> NodeA, and NodeA is unreachable, why can't it simply be assigned to another 
>> Node?  Is it because the shard-to-node algorithm is fixed (by default) and 
>> there is no dynamic ability to "reassign" to an available Node? 
>>
>> Thanks again.  -- Eric
>>
>> On Wednesday, August 3, 2016 at 7:00:42 PM UTC-7, Justin du coeur wrote:
>>>
>>> The keyword here is "auto".  Autodowning is an *incredibly braindead* 
>>> algorithm for dealing with nodes coming out of 

Re: [akka-user] Akka Cluster (with Sharding) not working without auto-down-unreachable-after

2016-08-05 Thread Johan Andrén
You can however implement your own 
akka.cluster.sharding.ShardCoordinator.ShardAllocationStrategy which will 
allow you to take whatever you want into account, and deal with the 
consequences thereof ofc ;) 
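
A skeleton of such a strategy, just to show the shape of the API -- this sketch 
simply picks the region with the fewest shards (roughly what the default 
LeastShardAllocationStrategy already does) and never rebalances; a real one 
could take reachability or external state into account in allocateShard:

import scala.collection.immutable
import scala.concurrent.Future
import akka.actor.ActorRef
import akka.cluster.sharding.{ShardCoordinator, ShardRegion}

class FewestShardsAllocationStrategy extends ShardCoordinator.ShardAllocationStrategy {

  // Called when a message arrives for a shard that has no home yet.
  override def allocateShard(
      requester: ActorRef,
      shardId: ShardRegion.ShardId,
      currentShardAllocations: Map[ActorRef, immutable.IndexedSeq[ShardRegion.ShardId]]): Future[ActorRef] = {
    val region =
      if (currentShardAllocations.isEmpty) requester
      else currentShardAllocations.minBy { case (_, shards) => shards.size }._1
    Future.successful(region)
  }

  // Never move already-allocated shards in this sketch.
  override def rebalance(
      currentShardAllocations: Map[ActorRef, immutable.IndexedSeq[ShardRegion.ShardId]],
      rebalanceInProgress: Set[ShardRegion.ShardId]): Future[Set[ShardRegion.ShardId]] =
    Future.successful(Set.empty)
}

It gets passed to the overload of ClusterSharding.start that takes an 
allocationStrategy parameter.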

--
Johan

On Friday, August 5, 2016 at 12:04:12 AM UTC+2, Justin du coeur wrote:
>
> It does do reassignment -- but it has to know to do that.  Keep in mind 
> that "down" is the master switch here: until the node is downed, the rest 
> of the system doesn't *know* that NodeA should be avoided.  I haven't dug 
> into that particular code, but I assume from what you're saying that the 
> allocation algorithm doesn't take unreachability into account when choosing 
> where to allocate the shard, just up/down.  I suspect that unreachability 
> is too local and transient to use as the basis for these allocations.
>
> Keep in mind that you're looking at this from a relatively all-knowing 
> global perspective, but each node is working from a very localized and 
> imperfect one.  All it knows is "I can't currently reach NodeA".  It has no 
> a priori way of knowing whether NodeA has been taken offline (so it should 
> be avoided), or there's simply been a transient network glitch between here 
> and there (so things are *mostly* business as usual).  Downing is how you 
> tell it, "No, really, stop using this node"; until then, most of the code 
> assumes that the more-common transient situation is the case.  It's 
> *probably* possible to take unreachability into account in the case you're 
> describing, but it doesn't surprise me if that's not true.
>
> Also, keep in mind that, IIRC, there are a few cluster singletons involved 
> here, at least behind the scenes.  If NodeA currently owns one of the key 
> singletons (such as the ShardCoordinator), and it hasn't been downed, I 
> imagine the rest of the cluster is going to *quickly* lock up, because the 
> result is that nobody is authorized to make these sorts of allocation 
> decisions.
>
> All that said -- keep in mind, I'm just a user of this stuff, and am 
> talking at the edges of my knowledge.  Konrad's the actual expert...
>
> On Thu, Aug 4, 2016 at 4:59 PM, Eric Swenson  wrote:
>
>> While I'm in the process of implementing your proposed solution, I did 
>> want to make sure I understood why I'm seeing the failures I'm seeing when 
>> a node is taken offline, auto-down is disabled, and no one is handling the 
>> UnreachableNode message.  Let me try to explain what I think is happening 
>> and perhaps you (or someone else who knows more about this than I) can 
>> confirm or refute.
>>
>> In the case of akka-cluster-sharding, a shard might exist on the 
>> unreachable node.  Since the node is not yet marked as "down", the cluster 
>> simply cannot handle an incoming message for that shard.  To create another 
>> sharded actor on an available cluster node might duplicate the unreachable 
>> node state.  In the case of akka-persistence actors, even though a new 
>> shard actor could resurrect any journaled state, we cannot be sure that the 
>> old unreachable node might not at any time, add other events to the 
>> journal, or come online and try to continue operating on the shard.
>>
>> Is that the reason why I see the following behavior:  NodeA is online.  
>> NodeB comes online and joins the cluster.  A request comes in from 
>> akka-http and is sent to the shard region.  It goes to NodeA which creates 
>> an actor to handle the sharded object.  NodeA is taken offline (unbeknownst 
>> to the akka-cluster).  Another message for the above-mentioned shard comes 
>> in from akka-http and is sent to the shard region. The shard region can't 
>> reach NodeA.  NodeA isn't marked as down.  So the shard region cannot 
>> create another actor (on an available Node). It can only wait (until 
>> timeout) for NodeA to become reachable.  Since, in my scenario, NodeA will 
>> never become reachable and NodeB is the only one online, all requests for 
>> old shards timeout.
>>
>> If the above logic is true, I have one last issue:  In the above 
>> scenario, if a message comes into the shard region for a shard that WOULD 
>> HAVE BEEN allocated to NodeA but has never yet been assigned to an actor on 
>> NodeA, and NodeA is unreachable, why can't it simply be assigned to another 
>> Node?  Is it because the shard-to-node algorithm is fixed (by default) and 
>> there is no dynamic ability to "reassign" to an available Node? 
>>
>> Thanks again.  -- Eric
>>
>> On Wednesday, August 3, 2016 at 7:00:42 PM UTC-7, Justin du coeur wrote:
>>>
>>> The keyword here is "auto".  Autodowning is an *incredibly braindead* 
>>> algorithm for dealing with nodes coming out of service, and if you use it 
>>> in production you more or less guarantee disaster, because that algorithm 
>>> can't cope with cluster partition.  You *do* need to deal with downing, but 
>>> you have to get something smarter than that.
>>>
>>> Frankly, if you're already hooking into AWS, I *suspect* 

Re: [akka-user] Akka Cluster (with Sharding) not working without auto-down-unreachable-after

2016-08-04 Thread Justin du coeur
It does do reassignment -- but it has to know to do that.  Keep in mind
that "down" is the master switch here: until the node is downed, the rest
of the system doesn't *know* that NodeA should be avoided.  I haven't dug
into that particular code, but I assume from what you're saying that the
allocation algorithm doesn't take unreachability into account when choosing
where to allocate the shard, just up/down.  I suspect that unreachability
is too local and transient to use as the basis for these allocations.

Keep in mind that you're looking at this from a relatively all-knowing
global perspective, but each node is working from a very localized and
imperfect one.  All it knows is "I can't currently reach NodeA".  It has no
a priori way of knowing whether NodeA has been taken offline (so it should
be avoided), or there's simply been a transient network glitch between here
and there (so things are *mostly* business as usual).  Downing is how you
tell it, "No, really, stop using this node"; until then, most of the code
assumes that the more-common transient situation is the case.  It's
*probably* possible to take unreachability into account in the case you're
describing, but it doesn't surprise me if that's not true.

Also, keep in mind that, IIRC, there are a few cluster singletons involved
here, at least behind the scenes.  If NodeA currently owns one of the key
singletons (such as the ShardCoordinator), and it hasn't been downed, I
imagine the rest of the cluster is going to *quickly* lock up, because the
result is that nobody is authorized to make these sorts of allocation
decisions.

All that said -- keep in mind, I'm just a user of this stuff, and am
talking at the edges of my knowledge.  Konrad's the actual expert...

On Thu, Aug 4, 2016 at 4:59 PM, Eric Swenson  wrote:

> While I'm in the process of implementing your proposed solution, I did
> want to make sure I understood why I'm seeing the failures I'm seeing when
> a node is taken offline, auto-down is disabled, and no one is handling the
> UnreachableNode message.  Let me try to explain what I think is happening
> and perhaps you (or someone else who knows more about this than I) can
> confirm or refute.
>
> In the case of akka-cluster-sharding, a shard might exist on the
> unreachable node.  Since the node is not yet marked as "down", the cluster
> simply cannot handle an incoming message for that shard.  To create another
> sharded actor on an available cluster node might duplicate the unreachable
> node state.  In the case of akka-persistence actors, even though a new
> shard actor could resurrect any journaled state, we cannot be sure that the
> old unreachable node might not at any time, add other events to the
> journal, or come online and try to continue operating on the shard.
>
> Is that the reason why I see the following behavior:  NodeA is online.
> NodeB comes online and joins the cluster.  A request comes in from
> akka-http and is sent to the shard region.  It goes to NodeA which creates
> an actor to handle the sharded object.  NodeA is taken offline (unbeknownst
> to the akka-cluster).  Another message for the above-mentioned shard comes
> in from akka-http and is sent to the shard region. The shard region can't
> reach NodeA.  NodeA isn't marked as down.  So the shard region cannot
> create another actor (on an available Node). It can only wait (until
> timeout) for NodeA to become reachable.  Since, in my scenario, NodeA will
> never become reachable and NodeB is the only one online, all requests for
> old shards timeout.
>
> If the above logic is true, I have one last issue:  In the above scenario,
> if a message comes into the shard region for a shard that WOULD HAVE BEEN
> allocated to NodeA but has never yet been assigned to an actor on NodeA,
> and NodeA is unreachable, why can't it simply be assigned to another Node?
>  Is it because the shard-to-node algorithm is fixed (by default) and there
> is no dynamic ability to "reassign" to an available Node?
>
> Thanks again.  -- Eric
>
> On Wednesday, August 3, 2016 at 7:00:42 PM UTC-7, Justin du coeur wrote:
>>
>> The keyword here is "auto".  Autodowning is an *incredibly braindead*
>> algorithm for dealing with nodes coming out of service, and if you use it
>> in production you more or less guarantee disaster, because that algorithm
>> can't cope with cluster partition.  You *do* need to deal with downing, but
>> you have to get something smarter than that.
>>
>> Frankly, if you're already hooking into AWS, I *suspect* the best
>> approach is to leverage that -- when a node goes offline, you have some
>> code to detect that through the ECS APIs, react to it, and manually down
>> that node.  (I'm planning on something along those lines for my system, but
>> haven't actually tried yet.)  But whether you do that or something else,
>> you've got to add *something* that does downing.
>>
>> I believe the official party line is "Buy a Lightbend Subscription",
>> 

Re: [akka-user] Akka Cluster (with Sharding) not working without auto-down-unreachable-after

2016-08-04 Thread Eric Swenson
While I'm in the process of implementing your proposed solution, I did want 
to make sure I understood why I'm seeing the failures I'm seeing when a 
node is taken offline, auto-down is disabled, and no one is handling the 
UnreachableNode message.  Let me try to explain what I think is happening 
and perhaps you (or someone else who knows more about this than I) can 
confirm or refute.

In the case of akka-cluster-sharding, a shard might exist on the 
unreachable node.  Since the node is not yet marked as "down", the cluster 
simply cannot handle an incoming message for that shard.  To create another 
sharded actor on an available cluster node might duplicate the unreachable 
node state.  In the case of akka-persistence actors, even though a new 
shard actor could resurrect any journaled state, we cannot be sure that the 
old unreachable node might not at any time, add other events to the 
journal, or come online and try to continue operating on the shard.

Is that the reason why I see the following behavior:  NodeA is online. 
 NodeB comes online and joins the cluster.  A request comes in from 
akka-http and is sent to the shard region.  It goes to NodeA which creates 
an actor to handle the sharded object.  NodeA is taken offline (unbeknownst 
to the akka-cluster).  Another message for the above-mentioned shard comes 
in from akka-http and is sent to the shard region. The shard region can't 
reach NodeA.  NodeA isn't marked as down.  So the shard region cannot 
create another actor (on an available Node). It can only wait (until 
timeout) for NodeA to become reachable.  Since, in my scenario, NodeA will 
never become reachable and NodeB is the only one online, all requests for 
old shards timeout.

If the above logic is true, I have one last issue:  In the above scenario, 
if a message comes into the shard region for a shard that WOULD HAVE BEEN 
allocated to NodeA but has never yet been assigned to an actor on NodeA, 
and NodeA is unreachable, why can't it simply be assigned to another Node? 
Is it because the shard-to-node algorithm is fixed (by default) and there 
is no dynamic ability to "reassign" to an available Node? 

Thanks again.  -- Eric
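
P.S. For context, this is roughly how the pieces fit together (the names below 
are illustrative, not my actual application code): the shard id is a fixed 
function of the message, while which region ends up hosting a shard is decided 
by the coordinator's allocation strategy, which can be overridden when calling 
ClusterSharding.start:

import akka.actor.{Actor, ActorSystem, PoisonPill, Props}
import akka.cluster.sharding.{ClusterSharding, ClusterShardingSettings, ShardCoordinator, ShardRegion}

// Illustrative envelope and entity -- placeholders only.
final case class Envelope(entityId: String, payload: Any)

class ExperimentInstance extends Actor {
  // Placeholder entity behaviour.
  def receive = { case _ => () }
}

object ShardingSetup {
  val numberOfShards = 100

  val extractEntityId: ShardRegion.ExtractEntityId = {
    case Envelope(id, payload) => (id, payload)
  }

  // The shard id is derived purely from the message, so a given entity always
  // maps to the same shard; the coordinator then decides which region (node)
  // hosts that shard, using the allocation strategy passed below.
  val extractShardId: ShardRegion.ExtractShardId = {
    case Envelope(id, _) => (math.abs(id.hashCode) % numberOfShards).toString
  }

  def start(system: ActorSystem) =
    ClusterSharding(system).start(
      typeName = "ExperimentInstance",
      entityProps = Props[ExperimentInstance],
      settings = ClusterShardingSettings(system),
      extractEntityId = extractEntityId,
      extractShardId = extractShardId,
      // Mirrors the built-in least-shard strategy; a custom one can go here.
      allocationStrategy = new ShardCoordinator.LeastShardAllocationStrategy(
        rebalanceThreshold = 10, maxSimultaneousRebalance = 3),
      handOffStopMessage = PoisonPill)
}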

On Wednesday, August 3, 2016 at 7:00:42 PM UTC-7, Justin du coeur wrote:
>
> The keyword here is "auto".  Autodowning is an *incredibly braindead* 
> algorithm for dealing with nodes coming out of service, and if you use it 
> in production you more or less guarantee disaster, because that algorithm 
> can't cope with cluster partition.  You *do* need to deal with downing, but 
> you have to get something smarter than that.
>
> Frankly, if you're already hooking into AWS, I *suspect* the best approach 
> is to leverage that -- when a node goes offline, you have some code to 
> detect that through the ECS APIs, react to it, and manually down that node. 
>  (I'm planning on something along those lines for my system, but haven't 
> actually tried yet.)  But whether you do that or something else, you've got 
> to add *something* that does downing.
>
> I believe the official party line is "Buy a Lightbend Subscription", 
> through which you can get their Split Brain Resolver, which is a fairly 
> battle-hardened module for dealing with this problem.  That's not strictly 
> necessary, but you *do* need to have a reliable solution...
>
> On Wed, Aug 3, 2016 at 8:42 PM, Eric Swenson wrote:
>
>> We have an akka-cluster/sharding application deployed on AWS/ECS, where 
>> each instance of the application is a Docker container.  An ECS service 
>> launches N instances of the application based on configuration data.  It is 
>> not possible to know, for certain, the IP addresses of the cluster 
>> members.  Upon startup, before the AkkaSystem is created, the code 
>> currently polls AWS and determines the IP addresses of all the Docker hosts 
>> (which potentially could run the akka application).  It sets these IP 
>> addresses as the seed nodes before bringing up the akka cluster system. The 
>> configuration for these has, up until yesterday always included the 
>> akka.cluster.auto-down-unreachable-after configuration setting.  And it has 
>> always worked.  Furthermore, it supports two very critical requirements:
>>
>> a) an instance of the application can be removed at any time, due to 
>> scaling or rolling updates
>> b) an instance of the application can be added at any time, due to 
>> scaling or rolling updates
>>
>> On the advice of an Akka expert on the Gitter channel, I removed the 
>> auto-down-unreachable-after setting, which, as documented, is dangerous for 
>> production.  As a result the system no longer supports rolling updates.  A 
>> rolling update occurs thus:  a new version of the application is deployed 
>> (a new ECS task definition is created with a new Docker image).  The ECS 
>> service launches a new task (Docker container running on an available host) 
>> and once that container becomes stable, it kills one of the 

Re: [akka-user] Akka Cluster (with Sharding) not working without auto-down-unreachable-after

2016-08-04 Thread Eric Swenson
Thanks, Konrad. I will replace the use of auto-down with a scheme such as 
that proposed by Justin. I have also reached out to Lightbend (and received 
a call back already) regarding subscription services from Lightbend.  -- 
Eric

On Thursday, August 4, 2016 at 1:26:35 AM UTC-7, Konrad Malawski wrote:
>
> Just to re-affirm what Justin wrote there.
>
> Auto downing is "auto". It's dumb. That's why it's not safe.
> The safer automatic downing modes are in 
> doc.akka.io/docs/akka/rp-16s01p05/scala/split-brain-resolver.html
> Yes, that's a commercial thing.
>
> If you don't want to use these, use EC2's APIs - they have APIs from which 
> you can get information about state like that.
>
> -- 
> Konrad `ktoso` Malawski
> Akka  @ Lightbend 
>
> On 4 August 2016 at 04:00:34, Justin du coeur (jduc...@gmail.com) wrote:
>
> The keyword here is "auto".  Autodowning is an *incredibly braindead* 
> algorithm for dealing with nodes coming out of service, and if you use it 
> in production you more or less guarantee disaster, because that algorithm 
> can't cope with cluster partition.  You *do* need to deal with downing, but 
> you have to get something smarter than that. 
>
> Frankly, if you're already hooking into AWS, I *suspect* the best approach 
> is to leverage that -- when a node goes offline, you have some code to 
> detect that through the ECS APIs, react to it, and manually down that node. 
>  (I'm planning on something along those lines for my system, but haven't 
> actually tried yet.)  But whether you do that or something else, you've got 
> to add *something* that does downing.
>
> I believe the official party line is "Buy a Lightbend Subscription", 
> through which you can get their Split Brain Resolver, which is a fairly 
> battle-hardened module for dealing with this problem.  That's not strictly 
> necessary, but you *do* need to have a reliable solution...
>
> On Wed, Aug 3, 2016 at 8:42 PM, Eric Swenson wrote:
>
>> We have an akka-cluster/sharding application deployed on AWS/ECS, where 
>> each instance of the application is a Docker container.  An ECS service 
>> launches N instances of the application based on configuration data.  It is 
>> not possible to know, for certain, the IP addresses of the cluster 
>> members.  Upon startup, before the AkkaSystem is created, the code 
>> currently polls AWS and determines the IP addresses of all the Docker hosts 
>> (which potentially could run the akka application).  It sets these IP 
>> addresses as the seed nodes before bringing up the akka cluster system. The 
>> configuration for these has, up until yesterday always included the 
>> akka.cluster.auto-down-unreachable-after configuration setting.  And it has 
>> always worked.  Furthermore, it supports two very critical requirements: 
>>
>> a) an instance of the application can be removed at any time, due to 
>> scaling or rolling updates
>> b) an instance of the application can be added at any time, due to 
>> scaling or rolling updates
>>
>> On the advice of an Akka expert on the Gitter channel, I removed the 
>> auto-down-unreachable-after setting, which, as documented, is dangerous for 
>> production.  As a result the system no longer supports rolling updates.  A 
>> rolling update occurs thus:  a new version of the application is deployed 
>> (a new ECS task definition is created with a new Docker image).  The ECS 
>> service launches a new task (Docker container running on an available host) 
>> and once that container becomes stable, it kills one of the remaining 
>> instances (cluster members) to bring the number of instances to some 
>> configured value.  
>>
>> When this happens, akka-cluster becomes very unhappy and becomes 
>> unresponsive.  Without the auto-down-unreachable-after setting, it keeps 
>> trying to talk to the old cluster members, which are no longer present.  It 
>> appears to NOT recover from this.  There is a constant barrage of messages 
>> of the form:
>>
>> [DEBUG] [08/04/2016 00:19:27.126] 
>> [ClusterSystem-cassandra-plugin-default-dispatcher-27] 
>> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path 
>> sequence [/system/sharding/ExperimentInstance#-389574371] failed
>> [DEBUG] [08/04/2016 00:19:27.140] 
>> [ClusterSystem-cassandra-plugin-default-dispatcher-27] 
>> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path 
>> sequence [/system/sharding/ExperimentInstance#-389574371] failed
>> [DEBUG] [08/04/2016 00:19:27.142] 
>> [ClusterSystem-cassandra-plugin-default-dispatcher-27] 
>> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path 
>> sequence [/system/sharding/ExperimentInstance#-389574371] failed
>> [DEBUG] [08/04/2016 00:19:27.143] 
>> [ClusterSystem-cassandra-plugin-default-dispatcher-27] 
>> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path 
>> sequence [/system/sharding/ExperimentInstance#-389574371] 

Re: [akka-user] Akka Cluster (with Sharding) not working without auto-down-unreachable-after

2016-08-04 Thread Eric Swenson
Thanks very much, Justin.  I appreciate your suggested approach and will 
implement something along those lines.  In summary, I believe I should do 
the following:

1) handle notifications of nodes going offline in my application
2) query AWS/ECS to see if the node is *really* supposed to be offline 
(meaning that it has been removed for autoscaling or replacement reasons),
3) if yes, then manually down the node

This makes perfect sense. The philosophy of going to the single source of 
truth about the state of the cluster (in this case, AWS/ECS) seems to be 
apt here.
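
For step 2, a rough sketch of the AWS-side check, using the Java SDK's EC2 
DescribeInstances call (credentials/region setup, the ECS-specific lookup and 
error handling are all omitted -- this is an outline, not the code I ended up 
with):

import scala.collection.JavaConverters._
import com.amazonaws.services.ec2.AmazonEC2Client
import com.amazonaws.services.ec2.model.{DescribeInstancesRequest, Filter}

// Sketch: true if an EC2 instance with the given private IP (the IP embedded
// in the Akka member address) still exists and is running.
object Ec2Check {
  private val ec2 = new AmazonEC2Client()

  def isInstanceRunning(privateIp: String): Boolean = {
    val request = new DescribeInstancesRequest()
      .withFilters(new Filter("private-ip-address").withValues(privateIp))
    ec2.describeInstances(request).getReservations.asScala
      .flatMap(_.getInstances.asScala)
      .exists(_.getState.getName == "running")
  }
}

// Step 3, inside the UnreachableMember handler:
//   if (!Ec2Check.isInstanceRunning(member.address.host.getOrElse("")))
//     Cluster(system).down(member.address)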

Thanks again.  -- Eric

On Wednesday, August 3, 2016 at 7:00:42 PM UTC-7, Justin du coeur wrote:
>
> The keyword here is "auto".  Autodowning is an *incredibly braindead* 
> algorithm for dealing with nodes coming out of service, and if you use it 
> in production you more or less guarantee disaster, because that algorithm 
> can't cope with cluster partition.  You *do* need to deal with downing, but 
> you have to get something smarter than that.
>
> Frankly, if you're already hooking into AWS, I *suspect* the best approach 
> is to leverage that -- when a node goes offline, you have some code to 
> detect that through the ECS APIs, react to it, and manually down that node. 
>  (I'm planning on something along those lines for my system, but haven't 
> actually tried yet.)  But whether you do that or something else, you've got 
> to add *something* that does downing.
>
> I believe the official party line is "Buy a Lightbend Subscription", 
> through which you can get their Split Brain Resolver, which is a fairly 
> battle-hardened module for dealing with this problem.  That's not strictly 
> necessary, but you *do* need to have a reliable solution...
>
> On Wed, Aug 3, 2016 at 8:42 PM, Eric Swenson wrote:
>
>> We have an akka-cluster/sharding application deployed on AWS/ECS, where 
>> each instance of the application is a Docker container.  An ECS service 
>> launches N instances of the application based on configuration data.  It is 
>> not possible to know, for certain, the IP addresses of the cluster 
>> members.  Upon startup, before the AkkaSystem is created, the code 
>> currently polls AWS and determines the IP addresses of all the Docker hosts 
>> (which potentially could run the akka application).  It sets these IP 
>> addresses as the seed nodes before bringing up the akka cluster system. The 
>> configuration for these has, up until yesterday always included the 
>> akka.cluster.auto-down-unreachable-after configuration setting.  And it has 
>> always worked.  Furthermore, it supports two very critical requirements:
>>
>> a) an instance of the application can be removed at any time, due to 
>> scaling or rolling updates
>> b) an instance of the application can be added at any time, due to 
>> scaling or rolling updates
>>
>> On the advice of an Akka expert on the Gitter channel, I removed the 
>> auto-down-unreachable-after setting, which, as documented, is dangerous for 
>> production.  As a result the system no longer supports rolling updates.  A 
>> rolling update occurs thus:  a new version of the application is deployed 
>> (a new ECS task definition is created with a new Docker image).  The ECS 
>> service launches a new task (Docker container running on an available host) 
>> and once that container becomes stable, it kills one of the remaining 
>> instances (cluster members) to bring the number of instances to some 
>> configured value.  
>>
>> When this happens, akka-cluster becomes very unhappy and becomes 
>> unresponsive.  Without the auto-down-unreachable-after setting, it keeps 
>> trying to talk to the old cluster members, which are no longer present.  It 
>> appears to NOT recover from this.  There is a constant barrage of messages 
>> of the form:
>>
>> [DEBUG] [08/04/2016 00:19:27.126] 
>> [ClusterSystem-cassandra-plugin-default-dispatcher-27] 
>> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path 
>> sequence [/system/sharding/ExperimentInstance#-389574371] failed
>> [DEBUG] [08/04/2016 00:19:27.140] 
>> [ClusterSystem-cassandra-plugin-default-dispatcher-27] 
>> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path 
>> sequence [/system/sharding/ExperimentInstance#-389574371] failed
>> [DEBUG] [08/04/2016 00:19:27.142] 
>> [ClusterSystem-cassandra-plugin-default-dispatcher-27] 
>> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path 
>> sequence [/system/sharding/ExperimentInstance#-389574371] failed
>> [DEBUG] [08/04/2016 00:19:27.143] 
>> [ClusterSystem-cassandra-plugin-default-dispatcher-27] 
>> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path 
>> sequence [/system/sharding/ExperimentInstance#-389574371] failed
>> [DEBUG] [08/04/2016 00:19:27.143] 
>> [ClusterSystem-cassandra-plugin-default-dispatcher-27] 
>> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path 
>> sequence 

Re: [akka-user] Akka Cluster (with Sharding) not working without auto-down-unreachable-after

2016-08-04 Thread Konrad Malawski
Just to re-affirm what Justin wrote there.

Auto downing is "auto". It's dumb. That's why it's not safe.
The safer automatic downing modes are in
doc.akka.io/docs/akka/rp-16s01p05/scala/split-brain-resolver.html
Yes, that's a commercial thing.

If you don't want to use these, use EC2's APIs - they have APIs from which
you can get information about state like that.
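
For the do-it-yourself route, the extension point to plug such logic in is
akka.cluster.downing-provider-class. A minimal sketch -- class and package names
are placeholders, and the actual downing actor is left empty here:

import scala.concurrent.duration._
import akka.actor.{Actor, ActorSystem, Props}
import akka.cluster.DowningProvider

// Placeholder: a real downing actor would watch reachability and consult
// EC2/ECS before calling Cluster(context.system).down(...).
class EcsDowningActor extends Actor {
  def receive = Actor.emptyBehavior
}

// Enabled with: akka.cluster.downing-provider-class = "com.example.EcsDowningProvider"
class EcsDowningProvider(system: ActorSystem) extends DowningProvider {
  // Margin that sharding/singletons wait after a downed node is removed
  // before taking over its responsibilities.
  override def downRemovalMargin: FiniteDuration = 10.seconds
  override def downingActorProps: Option[Props] = Some(Props[EcsDowningActor])
}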

-- 
Konrad `ktoso` Malawski
Akka  @ Lightbend 

On 4 August 2016 at 04:00:34, Justin du coeur (jduco...@gmail.com) wrote:

The keyword here is "auto".  Autodowning is an *incredibly braindead*
algorithm for dealing with nodes coming out of service, and if you use it
in production you more or less guarantee disaster, because that algorithm
can't cope with cluster partition.  You *do* need to deal with downing, but
you have to get something smarter than that.

Frankly, if you're already hooking into AWS, I *suspect* the best approach
is to leverage that -- when a node goes offline, you have some code to
detect that through the ECS APIs, react to it, and manually down that node.
 (I'm planning on something along those lines for my system, but haven't
actually tried yet.)  But whether you do that or something else, you've got
to add *something* that does downing.

I believe the official party line is "Buy a Lightbend Subscription",
through which you can get their Split Brain Resolver, which is a fairly
battle-hardened module for dealing with this problem.  That's not strictly
necessary, but you *do* need to have a reliable solution...

On Wed, Aug 3, 2016 at 8:42 PM, Eric Swenson  wrote:

> We have an akka-cluster/sharding application deployed on AWS/ECS, where
> each instance of the application is a Docker container.  An ECS service
> launches N instances of the application based on configuration data.  It is
> not possible to know, for certain, the IP addresses of the cluster
> members.  Upon startup, before the AkkaSystem is created, the code
> currently polls AWS and determines the IP addresses of all the Docker hosts
> (which potentially could run the akka application).  It sets these IP
> addresses as the seed nodes before bringing up the akka cluster system. The
> configuration for these has, up until yesterday always included the
> akka.cluster.auto-down-unreachable-after configuration setting.  And it has
> always worked.  Furthermore, it supports two very critical requirements:
>
> a) an instance of the application can be removed at any time, due to
> scaling or rolling updates
> b) an instance of the application can be added at any time, due to scaling
> or rolling updates
>
> On the advice of an Akka expert on the Gitter channel, I removed the
> auto-down-unreachable-after setting, which, as documented, is dangerous for
> production.  As a result the system no longer supports rolling updates.  A
> rolling update occurs thus:  a new version of the application is deployed
> (a new ECS task definition is created with a new Docker image).  The ECS
> service launches a new task (Docker container running on an available host)
> and once that container becomes stable, it kills one of the remaining
> instances (cluster members) to bring the number of instances to some
> configured value.
>
> When this happens, akka-cluster becomes very unhappy and becomes
> unresponsive.  Without the auto-down-unreachable-after setting, it keeps
> trying to talk to the old cluster members, which are no longer present.  It
> appears to NOT recover from this.  There is a constant barrage of messages
> of the form:
>
> [DEBUG] [08/04/2016 00:19:27.126]
> [ClusterSystem-cassandra-plugin-default-dispatcher-27]
> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path
> sequence [/system/sharding/ExperimentInstance#-389574371] failed
> [DEBUG] [08/04/2016 00:19:27.140]
> [ClusterSystem-cassandra-plugin-default-dispatcher-27]
> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path
> sequence [/system/sharding/ExperimentInstance#-389574371] failed
> [DEBUG] [08/04/2016 00:19:27.142]
> [ClusterSystem-cassandra-plugin-default-dispatcher-27]
> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path
> sequence [/system/sharding/ExperimentInstance#-389574371] failed
> [DEBUG] [08/04/2016 00:19:27.143]
> [ClusterSystem-cassandra-plugin-default-dispatcher-27]
> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path
> sequence [/system/sharding/ExperimentInstance#-389574371] failed
> [DEBUG] [08/04/2016 00:19:27.143]
> [ClusterSystem-cassandra-plugin-default-dispatcher-27]
> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path
> sequence [/system/sharding/ExperimentInstance#-389574371] failed
>
> and of the form:
>
> [WARN] [08/04/2016 00:19:16.787]
> [ClusterSystem-akka.actor.default-dispatcher-9] [akka.tcp://
> ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstance] Retry
> request for shard [5] homes from 

Re: [akka-user] Akka Cluster (with Sharding) not working without auto-down-unreachable-after

2016-08-03 Thread Justin du coeur
The keyword here is "auto".  Autodowning is an *incredibly braindead*
algorithm for dealing with nodes coming out of service, and if you use it
in production you more or less guarantee disaster, because that algorithm
can't cope with cluster partition.  You *do* need to deal with downing, but
you have to get something smarter than that.

Frankly, if you're already hooking into AWS, I *suspect* the best approach
is to leverage that -- when a node goes offline, you have some code to
detect that through the ECS APIs, react to it, and manually down that node.
 (I'm planning on something along those lines for my system, but haven't
actually tried yet.)  But whether you do that or something else, you've got
to add *something* that does downing.

I believe the official party line is "Buy a Lightbend Subscription",
through which you can get their Split Brain Resolver, which is a fairly
battle-hardened module for dealing with this problem.  That's not strictly
necessary, but you *do* need to have a reliable solution...

On Wed, Aug 3, 2016 at 8:42 PM, Eric Swenson  wrote:

> We have an akka-cluster/sharding application deployed on AWS/ECS, where
> each instance of the application is a Docker container.  An ECS service
> launches N instances of the application based on configuration data.  It is
> not possible to know, for certain, the IP addresses of the cluster
> members.  Upon startup, before the AkkaSystem is created, the code
> currently polls AWS and determines the IP addresses of all the Docker hosts
> (which potentially could run the akka application).  It sets these IP
> addresses as the seed nodes before bringing up the akka cluster system. The
> configuration for these has, up until yesterday always included the
> akka.cluster.auto-down-unreachable-after configuration setting.  And it has
> always worked.  Furthermore, it supports two very critical requirements:
>
> a) an instance of the application can be removed at any time, due to
> scaling or rolling updates
> b) an instance of the application can be added at any time, due to scaling
> or rolling updates
>
> On the advice of an Akka expert on the Gitter channel, I removed the
> auto-down-unreachable-after setting, which, as documented, is dangerous for
> production.  As a result the system no longer supports rolling updates.  A
> rolling update occurs thus:  a new version of the application is deployed
> (a new ECS task definition is created with a new Docker image).  The ECS
> service launches a new task (Docker container running on an available host)
> and once that container becomes stable, it kills one of the remaining
> instances (cluster members) to bring the number of instances to some
> configured value.
>
> When this happens, akka-cluster becomes very unhappy and becomes
> unresponsive.  Without the auto-down-unreachable-after setting, it keeps
> trying to talk to the old cluster members, which are no longer present.  It
> appears to NOT recover from this.  There is a constant barrage of messages
> of the form:
>
> [DEBUG] [08/04/2016 00:19:27.126]
> [ClusterSystem-cassandra-plugin-default-dispatcher-27]
> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path
> sequence [/system/sharding/ExperimentInstance#-389574371] failed
> [DEBUG] [08/04/2016 00:19:27.140]
> [ClusterSystem-cassandra-plugin-default-dispatcher-27]
> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path
> sequence [/system/sharding/ExperimentInstance#-389574371] failed
> [DEBUG] [08/04/2016 00:19:27.142]
> [ClusterSystem-cassandra-plugin-default-dispatcher-27]
> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path
> sequence [/system/sharding/ExperimentInstance#-389574371] failed
> [DEBUG] [08/04/2016 00:19:27.143]
> [ClusterSystem-cassandra-plugin-default-dispatcher-27]
> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path
> sequence [/system/sharding/ExperimentInstance#-389574371] failed
> [DEBUG] [08/04/2016 00:19:27.143]
> [ClusterSystem-cassandra-plugin-default-dispatcher-27]
> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path
> sequence [/system/sharding/ExperimentInstance#-389574371] failed
>
> and of the form:
>
> [WARN] [08/04/2016 00:19:16.787]
> [ClusterSystem-akka.actor.default-dispatcher-9] [akka.tcp://
> ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstance] Retry
> request for shard [5] homes from coordinator at [Actor[akka.tcp://
> ClusterSystem@10.0.3.100:2552/system/sharding/ExperimentInstanceCoordinator/singleton/coordinator#1679517511]].
> [1] buffered messages.
> [WARN] [08/04/2016 00:19:18.787]
> [ClusterSystem-akka.actor.default-dispatcher-9] [akka.tcp://
> ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstance] Retry
> request for shard [23] homes from coordinator at [Actor[akka.tcp://
> ClusterSystem@10.0.3.100:2552/system/sharding/ExperimentInstanceCoordinator/singleton/coordinator#1679517511]].
> [1] buffered