Hi, 

I am working with a relatively large codebase (a distributed Raft protocol 
stack and datastore built on top of basic Akka clustering) which uses the 
ask pattern for some of the key messages exchanged across the nodes. We 
frequently hit AskTimeoutException (our timeout is typically 30000 ms), 
particularly when the network partitions and then heals. 

Basics of my environment 
========================
Akka version    : 2.4.7
Environment     : OSGi
Clustering used : Yes
Sharding used   : No
Cluster size    : 3 nodes in 3 VMs
Special features such as Distributed Pub Sub used : No
Ask timeouts    : up to 30000 ms
Remoting        : Netty TCP
Auto-down-after : Disabled


Observations :
===========
I ran the following checks to troubleshoot the issue, starting from the 
Akka cluster layer:

a) Checked the Cluster MBean on all cluster nodes two or three times to see 
whether the cluster recovers properly after a network split and heal (we do 
NOT use the auto-down-after configuration) --> the Akka layer heals properly 

b) Checked whether there is heavy message exchange between cluster nodes by 
monitoring 'netstat -anp | grep 2550' soon after the network heals (watching 
Send-Q and Recv-Q for port 2550; we use the Netty transport, and a move to 
Artery is planned but not yet) --> connections are stable and messages are 
being exchanged, but not at a volume that could cause head-of-line blocking 
or similar side effects 

c) Checked for Unreachable messages after the network is restored --> no 
such conditions seen AFTER the network is restored 
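
The Send-Q / Recv-Q check in (b) could be sampled repeatedly instead of 
being read by eye. The following is only a sketch, assuming the Linux 
net-tools column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign 
Address, State). The class name NetstatQueues, the method backedUp and the 
byte threshold are made up for illustration and are not part of any tool. 

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: filters captured 'netstat -an' text for backed-up connections. */
final class NetstatQueues {

    /**
     * Returns the ESTABLISHED TCP rows that involve the given port and whose
     * Recv-Q or Send-Q (in bytes) exceeds thresholdBytes.
     */
    static List<String> backedUp(String netstatOutput, int port, long thresholdBytes) {
        List<String> hits = new ArrayList<>();
        String suffix = ":" + port;
        for (String line : netstatOutput.split("\\R")) {
            String[] c = line.trim().split("\\s+");
            // Assumed layout: Proto Recv-Q Send-Q Local Foreign State [PID/Program]
            if (c.length < 6 || !c[0].startsWith("tcp")) continue;
            if (!"ESTABLISHED".equals(c[5])) continue;
            if (!c[3].endsWith(suffix) && !c[4].endsWith(suffix)) continue;
            try {
                long recvQ = Long.parseLong(c[1]);
                long sendQ = Long.parseLong(c[2]);
                if (recvQ > thresholdBytes || sendQ > thresholdBytes) {
                    hits.add(line.trim());
                }
            } catch (NumberFormatException e) {
                // Row does not carry numeric queue sizes; ignore it.
            }
        }
        return hits;
    }
}
```

A Send-Q that stays high across several samples would suggest the peer is 
not draining the socket, whereas a high Recv-Q would point at the local 
process not reading fast enough. 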

Since these basic checks do not indicate any issue at the Akka layer, I am 
now trying to troubleshoot the application / stack built on top of Akka. 
The key areas I suspect are: 

a) Message processing latency in the stack/applications on top of Akka
b) Mailbox build-up resulting from that latency
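
To put numbers on (a), one option is to time every handler invocation. 
Below is a minimal sketch in plain Java with no Akka dependency. 
TimedHandler, its accessor names and the threshold are hypothetical; in a 
real actor the wrapper would have to be called from the receive logic, and 
it measures only processing time, not the time a message waits in the 
mailbox. 

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical wrapper that records how long each handler invocation takes. */
final class TimedHandler<T> implements Consumer<T> {
    private final Consumer<T> delegate;
    private final long slowThresholdNanos;
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong slowCount = new AtomicLong();
    private final AtomicLong maxNanos = new AtomicLong();

    TimedHandler(Consumer<T> delegate, long slowThresholdMillis) {
        this.delegate = delegate;
        this.slowThresholdNanos = slowThresholdMillis * 1_000_000L;
    }

    @Override
    public void accept(T message) {
        long start = System.nanoTime();
        try {
            delegate.accept(message);
        } finally {
            // Record the timing even if the delegate throws.
            long elapsed = System.nanoTime() - start;
            count.incrementAndGet();
            if (elapsed > slowThresholdNanos) {
                slowCount.incrementAndGet();
            }
            maxNanos.accumulateAndGet(elapsed, Math::max);
        }
    }

    long count()     { return count.get(); }
    long slowCount() { return slowCount.get(); }
    long maxMillis() { return maxNanos.get() / 1_000_000L; }
}
```

If slowCount grows after a partition heals, that would support the 
suspicion that processing latency, rather than remoting, is eating into the 
ask timeout. 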

So I think I can effectively start with mailbox monitoring on the cluster 
nodes that receive the messages via remoting. 

Now, specific clarifications : 
=====================
a) I looked at a few options such as Kamon and Cinnamon, but getting them 
to work in my OSGi environment is a nightmare because of classloading 
complexity. So I wanted to check whether a good old tool like a jmap 
histogram would be useful for monitoring mailbox build-up, given that I 
know the message class names. 

b) Apart from mailbox build-up on remote nodes, is any other monitoring 
recommended? 

c) Apart from the receive-side mailbox, should I monitor anything 
specifically on the sender side? I assume this is not required, because 
netstat on port 2550 does not indicate any issues with remoting per se. 
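
To make the jmap idea in (a) concrete, here is a rough sketch of filtering 
captured 'jmap -histo' output for a set of known message classes. It 
assumes the usual row layout of rank, #instances, #bytes and class name. 
HistoFilter and instanceCounts are names invented for this example, and 
the message class names passed in are whatever the application defines. 

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: extracts instance counts for watched classes from 'jmap -histo' text. */
final class HistoFilter {

    /** Returns class name -> instance count for each watched class present in the output. */
    static Map<String, Long> instanceCounts(String histoOutput, Set<String> watched) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : histoOutput.split("\\R")) {
            String[] cols = line.trim().split("\\s+");
            // Assumed data row: "<rank>:  <#instances>  <#bytes>  <class name> [module]"
            if (cols.length < 4 || !cols[0].endsWith(":")) continue;
            if (!watched.contains(cols[3])) continue;
            try {
                counts.put(cols[3], Long.parseLong(cols[1]));
            } catch (NumberFormatException e) {
                // Header or malformed row; ignore it.
            }
        }
        return counts;
    }
}
```

The text would come from something like 'jmap -histo:live <pid>'. Note that 
the :live option forces a full GC, and that without it the counts also 
include garbage that has not been collected yet, so the numbers are only a 
rough indicator. As far as I understand, queued messages are wrapped in 
akka.dispatch.Envelope, so watching that class as well might give an 
approximate total across all mailboxes. 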

Please let me know if I have left out any information that makes the 
observations or questions vague, so that I can follow up. 


Thanks in advance 

Regards
Muthu