Greetings, 

We are having some problems with our cluster configuration that manifest 
themselves in the following log lines (redacted for confidentiality 
reasons):

Sep 09 00:58:10 host1.mycompany.com application-9001.log: 2016-09-09 05:58:10 +0000 - [WARN] - [OrdersActor] akka://myCompany/user/OrdersActor/291 - (291) #recordTxns, sending 54 txns to UserActor took 0.0044229 seconds

Sep 09 00:58:19 host1.mycompany.com application-9001.log: 2016-09-09 05:58:19 +0000 - [WARN] - [ShardRegion] akka.tcp://[email protected]:2551/system/sharding/UserActor - Trying to register to coordinator at [None], but no acknowledgement. Total [54] buffered messages.

I have traced this to the configuration of the cluster. We are running on 
AWS, and the code uses Hazelcast to find the IPs of the other nodes 
(mostly because we have already solved discovery for Hazelcast in our 
dynamic-IP cluster). We retrieve the IPs of the other nodes in the cluster 
from Hazelcast and use them to create the Address objects for the seed 
nodes. Once we have the seed nodes, we have tried two mechanisms.

The first is to take the full list of seed nodes and join the cluster with 
cluster.joinSeedNodes(). Of course, not all machines come up and are 
discovered by Hazelcast at exactly the same instant, so the first 3 nodes 
might come up first and use each other to join, whereas by the time the 
9th node comes up there are 9 seed nodes. When we start sending messages 
to cluster-sharded actors, we get the errors above. Also, when a node goes 
down the system complains constantly that a seed node is gone.

So I changed the code to pick a node at random and do a cluster.join() 
with that node instead. However, we see the same problem as above. When 
we instead bring up one node first and then bring up the others one at a 
time, the problem goes away. Another symptom: if we have the problem 
above and terminate host1, other nodes start propagating this behavior, 
probably all the nodes that were connected to host1. Apparently they 
can't heal and connect to another node. This lends evidence to the 
multiple-split-brain theory.
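For reference, the two join mechanisms we tried look roughly like this. This is a simplified sketch, not our actual code: the Hazelcast lookup is stubbed out, and the system name, host names, and port are taken from the redacted logs above.

```scala
import akka.actor.{ActorSystem, Address}
import akka.cluster.Cluster
import scala.util.Random

// Sketch of the two join mechanisms described above (Akka classic cluster API).
val system  = ActorSystem("myCompany")
val cluster = Cluster(system)

// IPs of the other nodes as discovered via Hazelcast at startup (stubbed).
def discoveredHosts: List[String] = ???

val seedAddresses: List[Address] =
  discoveredHosts.map(host => Address("akka.tcp", "myCompany", host, 2551))

// Mechanism 1: hand every discovered node to joinSeedNodes().
cluster.joinSeedNodes(seedAddresses)

// Mechanism 2: pick one discovered node at random and join it directly.
cluster.join(seedAddresses(Random.nextInt(seedAddresses.size)))
```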

My theory is that by using all these seed nodes I am creating multiple 
split brains. If you have 5 nodes A, B, C, D, E, and A connects to B, B 
to A, C to E, E to D, and D to E, then we have two clusters running that 
know nothing about each other. For some reason the coordinators then get 
confused about what is going on.
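To make the theory concrete, here is a small self-contained sketch (plain Scala, no Akka; the object name and union-find helper are made up for illustration) that treats each node's join target as an edge and computes the resulting islands:

```scala
// Illustration of the split-brain theory: if each node joins only the one
// node it happened to pick, the "cluster" is whatever the join graph connects.
object SplitBrainSketch {
  // Merge join edges into connected components via a simple union-find.
  def clusters(nodes: Set[String], joins: Seq[(String, String)]): Set[Set[String]] = {
    val parent = scala.collection.mutable.Map.from(nodes.map(n => n -> n))
    def find(n: String): String =
      if (parent(n) == n) n else { val r = find(parent(n)); parent(n) = r; r }
    for ((a, b) <- joins) parent(find(a)) = find(b)
    nodes.groupBy(find).values.toSet
  }

  def main(args: Array[String]): Unit = {
    // The scenario from the text: A<->B, C->E, E->D, D->E.
    val joins = Seq("A" -> "B", "B" -> "A", "C" -> "E", "E" -> "D", "D" -> "E")
    // Yields two disjoint islands, {A, B} and {C, D, E}, that know
    // nothing about each other.
    println(clusters(Set("A", "B", "C", "D", "E"), joins))
  }
}
```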

Essentially the problem domain is this:
1. We don't know what ANY of the IPs are ahead of time.
2. We want the cluster to be whole (a single cluster).
3. If a single node leaves the cluster, we would like the remaining nodes to recover.

I would appreciate any insight anyone could provide on this, especially 
on what the problem may be (I could be wrong) and how we can accomplish 
our goals. Note that I am not committed to using Hazelcast to find other 
nodes.

 Thanks in advance.


--- 
You received this message because you are subscribed to the Google Groups "Akka 
User List" group.
