On Thu, Nov 19, 2015 at 8:06 PM, Jim Hazen <[email protected]> wrote:

> Wait, what?  So cluster sharding depends on shared mutable state across
> your cluster to work?  AFAIK the independent local nodes managed their
> state internally, communicated/coordinated via network protocols, and
> delegated to a master only when a shard owner had to be determined for
> the first time.  All of this allows for local state, mutated locally with
> discovered cluster information.  Is this not the case?  If so, why?  This
> seems contrary to the Akka/Actor model and many other clustering strategies.
>

That is correct, but the decisions taken by the coordinator must be
consistent, even when the coordinator crashes and fails over to another
node. Therefore, the state of the new coordinator instance must be
recovered to exactly the same state as the previous instance. By default
the coordinator uses Akka Persistence to store and recover this state.
Distributed Data can be used as an alternative for storing this state.
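
For reference, selecting the coordinator state store is a single setting.
A minimal sketch (note that the ddata mode is new in 2.4 and still marked
experimental):

    # application.conf
    # "persistence" (the default) stores coordinator state via Akka
    # Persistence; "ddata" uses Distributed Data instead (experimental)
    akka.cluster.sharding.state-store-mode = "ddata"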


>
> My product is using a distributed journal, since we also use persistence
> along with cluster sharding.  However, we're plagued with clustering issues
> when rolling new code and ripple-restarting nodes in the cluster.  I was
> hoping this would go away in 2.4, when I could go back to a local journal
> for sharding state.
>

That will not work, because when the coordinator fails over to another node
it will not have access to the leveldb journal used by the previous
coordinator. It would then recover to the wrong state and you would end up
with entity actors with the same id running on several nodes. Not an option.
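
If you stay with the persistence mode, the coordinator's journal has to be
reachable from every node. A sketch of pointing only the sharding
coordinator at a distributed journal (the plugin ids below are the ones
used by akka-persistence-cassandra; substitute the ids of whatever
distributed journal you actually run):

    # application.conf
    # Give the sharding coordinator its own journal/snapshot store, so the
    # rest of the application can keep its existing persistence setup.
    akka.cluster.sharding.journal-plugin-id = "cassandra-journal"
    akka.cluster.sharding.snapshot-plugin-id = "cassandra-snapshot-store"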


>  The assumption being that an inter-cluster network partition was
> corrupting the shared state in the distributed journal.  Currently the only
> way to recover my cluster in these situations is to shut down all nodes,
> remove all shard entries from Dynamo and restart the cluster nodes, one by
> one.  This is on Akka 2.3.10.
>

We have seen two reasons for this issue.
1) Bugs in the journals, e.g. replaying events in the wrong order.
2) Split brain scenarios (including network partitions, long GC pauses, and
system overload) splitting the cluster into two separate clusters when
auto-downing is used. That results in two coordinators (one in each
cluster) writing to the same database and thereby corrupting the event
sequence. We recommend manual downing or the Split Brain Resolver
<http://doc.akka.io/docs/akka/rp-15v09p02/scala/split-brain-resolver.html>
instead of auto-downing.
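
If you are on auto-downing today, turning it off is one line; unreachable
nodes must then be downed manually, e.g. via JMX or
Cluster(system).down(address):

    # application.conf
    # Never auto-down: a network partition plus auto-down is exactly what
    # produces two independent clusters writing to the same journal.
    akka.cluster.auto-down-unreachable-after = off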

By the way, use the latest version, i.e. 2.3.14 or 2.4.0.

Regards,
Patrik


>
> java.lang.IllegalArgumentException: requirement failed: Region
> Actor[akka://User/user/sharding/TokenOAuthState#1273636297] not registered:
> State(Map(
>   -63 -> Actor[akka://User/user/sharding/TokenOAuthState#-1778736418],
>   23 -> Actor[akka://User/user/sharding/TokenOAuthState#-1778736418],
>   40 -> Actor[akka.tcp://[email protected]:8115/user/sharding/TokenOAuthState#1939420601],
>   33 -> Actor[akka.tcp://[email protected]:8115/user/sharding/TokenOAuthState#-1679280759],
>   50 -> Actor[akka://User/user/sharding/TokenOAuthState#-1778736418],
>   -58 -> Actor[akka.tcp://[email protected]:8115/user/sharding/TokenOAuthState#-1679280759],
>   35 -> Actor[akka.tcp://[email protected]:8115/user/sharding/TokenOAuthState#-1679280759],
>   -66 -> Actor[akka.tcp://[email protected]:8115/user/sharding/TokenOAuthState#-1679280759],
>   -23 -> Actor[akka.tcp://[email protected]:8115/user/sharding/TokenOAuthState#1939420601],
>   -11 -> Actor[akka.tcp://[email protected]:8115/user/sharding/TokenOAuthState#1939420601]
> ), Map(
>   Actor[akka.tcp://[email protected]:8115/user/sharding/TokenOAuthState#-1679280759] -> Vector(35, -58, -66, 33),
>   Actor[akka://User/user/sharding/TokenOAuthState#-1778736418] -> Vector(-63, 23, 50),
>   Actor[akka.tcp://[email protected]:8115/user/sharding/TokenOAuthState#1939420601] -> Vector(-23, 40, -11)
> ), Set())
>
> On Wednesday, November 18, 2015 at 12:23:28 PM UTC-8, Patrik Nordwall
> wrote:
>>
>> Leveldb can't be used for cluster sharding, since that is a local
>> journal. The persistence documentation has links to distributed
>> journals.
>



-- 

Patrik Nordwall
Typesafe <http://typesafe.com/> -  Reactive apps on the JVM
Twitter: @patriknw
