On Tue, 2015-02-24 at 23:06 -0800, Johannes Berg wrote:
> Hi!
> 
> We recently made a mistake by having two separate clusters use the
> same journal DB, which obviously doesn't work, but it led to some
> questions about how the cluster sharding works.
> 
> We detected our problem by seeing a stack trace containing lines of
> code that didn't exist anywhere in the code deployed in that cluster.
> After triple-checking that we were running the correct code in all
> places in the cluster, we finally concluded that it must in fact be
> talking to an actor outside of the cluster. Looking at what the
> cluster sharding coordinator persists in the journal, we found the
> IP address of a node in the other cluster.

What a coincidence! I observed the exact same issue a couple of days
after you.
See here:
https://groups.google.com/d/msg/akka-user/OJKGEgKBn2g/BQLRWkoK8zcJ

So I tested a bit more and found that the system would heal by itself a
bit later: if there's no ActorSystem running on those other nodes, the
ShardCoordinator eventually receives a Terminated message (at least
under Akka 2.3.9) and allocates the region somewhere else. Granted, if
the node had moved to another cluster using the same ActorSystem name,
it might be more problematic.
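For what it's worth, the healing behaviour I observed can be modelled in
a few lines. This is a plain-Python toy, not Akka's actual
ShardCoordinator code (the function name and the least-shard-count
strategy are my own simplification): when a region terminates, its
shards get spread over the surviving regions.

```python
# Toy model of shard reallocation after a region's Terminated message.
# Illustration only -- not Akka's real ShardCoordinator implementation.

def reallocate_on_terminated(allocations, dead_region):
    """allocations: dict mapping region address -> set of shard ids.
    Returns a new allocation with dead_region's shards spread over
    the surviving regions."""
    survivors = {r: set(s) for r, s in allocations.items() if r != dead_region}
    orphans = sorted(allocations.get(dead_region, set()))
    if not survivors:
        raise RuntimeError("no region left to host the shards")
    # Hand each orphaned shard to the currently least-loaded survivor,
    # mimicking a least-shard-count allocation strategy.
    for shard in orphans:
        target = min(survivors, key=lambda r: len(survivors[r]))
        survivors[target].add(shard)
    return survivors
```

So once the dead node's actors are confirmed gone, nothing keeps
pointing at the stale address; the problem is only the window before
Terminated arrives (or the case where the address is reachable but
belongs to another cluster).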

Still, this looks a bit strange to me, like something that was
overlooked during the design.

> Granted, the data in the journal table is pretty invalid, but this
> still raises the question of why the shard coordinator could start
> using nodes that aren't part of the cluster, based only on some
> historical persisted info in the journal table. For example, what
> happens if you run sharding over two nodes and decide to move the
> second node to another cluster? You take down the systems, change the
> config files, and then start the two systems again. Could the
> sharding coordinator on the first node, under any circumstances,
> still use the second node based on the info in the journal table,
> even though it's no longer part of the same cluster?

Yes, and this isn't practical at all in an elastic environment. I would
have preferred a mechanism where the recovering ShardCoordinator
allocates the regions locally first, then rebalances the shards across
the regions actually discovered by the end of the recovery.
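Concretely, the recovery I have in mind would look roughly like this.
Again a plain-Python sketch under my own assumptions, not a patch
against Akka: persisted allocations whose region address is not a
current cluster member are ignored, their shards are parked on the
local region, and a rebalance runs at the end of recovery.

```python
# Sketch of the proposed recovery step: drop persisted region addresses
# that are not members of the current cluster, park their shards on the
# local region, then rebalance. Illustration only, not Akka code.

def recover_allocations(persisted, current_members, local_region):
    """persisted: dict region address -> set of shards, as replayed
    from the journal. current_members: addresses actually in this
    cluster right now."""
    recovered = {}
    homeless = set()
    for region, shards in persisted.items():
        if region in current_members:
            recovered[region] = set(shards)
        else:
            homeless |= set(shards)  # stale address from an old incarnation
    # Allocate the unplaced shards to the local region first ...
    recovered.setdefault(local_region, set()).update(homeless)
    # ... then even things out across the live regions.
    return rebalance(recovered)

def rebalance(allocations):
    """Move shards from the most- to the least-loaded region until the
    difference in shard count is at most one."""
    regions = {r: sorted(s) for r, s in allocations.items()}
    while True:
        big = max(regions, key=lambda r: len(regions[r]))
        small = min(regions, key=lambda r: len(regions[r]))
        if len(regions[big]) - len(regions[small]) <= 1:
            return {r: set(s) for r, s in regions.items()}
        regions[small].append(regions[big].pop())
```

That way a journal entry pointing at a node from another cluster (or a
decommissioned one) could never cause messages to leave the cluster.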

I think I'll file a bug report for this problem.

-- 
Brice Figureau <[email protected]>
