Hi Austin,

The etcd processes on all your nodes are continually restarting. This is 
because monit is marking each etcd process as unresponsive and then killing 
it.

The monit check for responsiveness used to write to etcd. We’ve since realised 
that this is a bad check, as it relies on the etcd leader being up. It can 
lead to a situation where monit continually kills all the etcd processes:


·         Something initially goes wrong and the current etcd leader fails

·         The surviving etcd processes attempt to elect a new leader

·         While the election is in progress, they can’t write to etcd

·         The monit check therefore fails, and monit kills the rest of the 
etcd processes

·         The cluster now can’t recover, as the etcd processes won’t stay up 
long enough to elect a leader
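
To make the loop above concrete, here is a toy Python simulation. It is only a 
sketch, not Clearwater or monit code: the function name simulate, the tick 
counts and the "leader-independent" check are assumptions made for 
illustration, and it does not describe how the fixed check is implemented.

```python
# Toy model of the monit/etcd restart loop (illustrative only).

ELECTION_TICKS = 3  # hypothetical: monit cycles of uptime needed to elect a leader


def simulate(check_needs_leader, ticks=10):
    """Return True if the cluster has a leader after `ticks` monit cycles."""
    has_leader = False  # the original leader has just failed
    uptime = 0          # cycles the surviving processes have stayed up
    for _ in range(ticks):
        if not has_leader:
            uptime += 1
            if uptime >= ELECTION_TICKS:
                has_leader = True
        # A write-based check only passes once a leader exists; a
        # leader-independent check passes whenever the process is running.
        check_passes = has_leader if check_needs_leader else True
        if not check_passes:
            uptime = 0  # monit kills etcd, so the election starts over
    return has_leader


print("write-based check recovers:", simulate(check_needs_leader=True))
print("leader-independent check recovers:", simulate(check_needs_leader=False))
```

With the write-based check the survivors never build up enough uptime to 
finish an election, so the model never recovers. With a check that does not 
need a leader, they elect one after three cycles.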

We’ve fixed this in the latest Project Clearwater release. Can you try 
upgrading your system and running stress again?

Thanks,

Ellie

From: Clearwater [mailto:[email protected]] On 
Behalf Of Austin Marston
Sent: 29 October 2015 10:15
To: [email protected]
Subject: [Clearwater] Fwd: SIP stress not working

Hello,

Thanks. You'll find attached the logs of sprout etc and monit from a brand new 
run of the SIP testing.
While that test was running, the status on sprout of etcd_process was 
"Does not exist" and that of clearwater_cluster_manager was "Execution failed".

Thanks a lot,
Austin

2015-10-23 21:42 GMT+02:00 Eleanor Merry <[email protected]>:
Hi Austin,

It sounds like your etcd process could be regularly restarting. Can you send me 
the etcd logs (in /var/log/clearwater-etcd) and monit logs (/var/log/monit.log) 
from your Sprout node?
Thanks,

Ellie

From: Clearwater [mailto:[email protected]] On Behalf Of Austin Marston
Sent: 23 October 2015 07:22
To: [email protected]
Subject: [Clearwater] SIP stress not working

Hi all,

I manually deployed a Clearwater infrastructure with one each of bono, sprout, 
ellis, ralf, homer, and hs.
My SIP testing seems to be working fine, but my stress testing is not working 
at all.

I created a new node for SIP testing, following 
https://github.com/Metaswitch/crest/blob/dev/docs/Bulk-Provisioning%20Numbers.md

Note:
Whenever I check what might be wrong, I get a different status for my nodes.
The clearwater cluster_manager seems to fail most of the time on bono and 
sprout, and when I check the cluster health the results are always different.
For instance, when I run clearwater-etcdctl cluster-health, I see that the 
cluster is healthy, but sometimes my bono node (172-16-1-20) or sprout node 
(172-16-1-20) is reported as unhealthy.

[17:07:14][sprout]user@cw-002:/var/log/sprout$ clearwater-etcdctl cluster-health
cluster is healthy
member 27a940d2104e9692 is unhealthy
member 2ea8f3a5eea05584 is healthy
member 5fdc25bd4ae527c0 is healthy
member d26088cb54745bbc is healthy
member e525c6a4ed161686 is healthy
member f5765a98a56e9c4a is healthy
[17:07:24]user@cw-002:/var/log/sprout$ clearwater-etcdctl member list
27a940d2104e9692: name=172-16-1-20 peerURLs=http://172.16.1.20:2380 clientURLs=http://172.16.1.20:4000
2ea8f3a5eea05584: name=172-16-1-22 peerURLs=http://172.16.1.22:2380 clientURLs=http://172.16.1.22:4000
5fdc25bd4ae527c0: name=172-16-1-25 peerURLs=http://172.16.1.25:2380 clientURLs=http://172.16.1.25:4000
d26088cb54745bbc: name=172-16-1-24 peerURLs=http://172.16.1.24:2380 clientURLs=http://172.16.1.24:4000
e525c6a4ed161686: name=172-16-1-21 peerURLs=http://172.16.1.21:2380 clientURLs=http://172.16.1.21:4000
f5765a98a56e9c4a: name=172-16-1-23 peerURLs=http://172.16.1.23:2380 clientURLs=http://172.16.1.23:4000
[17:10:00][sprout]user@cw-002:/var/log/sprout$ clearwater-etcdctl cluster-health
cluster is healthy
member 27a940d2104e9692 is healthy
member 2ea8f3a5eea05584 is healthy
member 5fdc25bd4ae527c0 is healthy
member d26088cb54745bbc is healthy
member e525c6a4ed161686 is healthy
member f5765a98a56e9c4a is healthy
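
In case it helps to track which member is flapping between runs, here is a 
minimal Python sketch that picks the unhealthy member IDs out of 
cluster-health output. It assumes the output always has the 
"member <id> is healthy/unhealthy" shape shown above. The function name 
unhealthy_members is made up for illustration, and the sample text is copied 
from the first run above.

```python
import re

# Sample copied from the first cluster-health run in this message.
SAMPLE = """\
cluster is healthy
member 27a940d2104e9692 is unhealthy
member 2ea8f3a5eea05584 is healthy
member 5fdc25bd4ae527c0 is healthy
member d26088cb54745bbc is healthy
member e525c6a4ed161686 is healthy
member f5765a98a56e9c4a is healthy
"""

# Assumed line shape: "member <id> is healthy" or "member <id> is unhealthy".
MEMBER_LINE = re.compile(r"^member (\w+) is (healthy|unhealthy)$")


def unhealthy_members(output):
    """Return the IDs of members reported as unhealthy, in output order."""
    ids = []
    for line in output.splitlines():
        match = MEMBER_LINE.match(line.strip())
        if match and match.group(2) == "unhealthy":
            ids.append(match.group(1))
    return ids


print(unhealthy_members(SAMPLE))
```

The IDs it returns can then be matched by eye against the member list output 
to find the node name.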

I attach my SIP stress logs and my sprout logs. I was running the test between 
13:57 and 14:05 on the 22nd of October.
Please let me know if you have any idea why this could be going wrong. I have 
probably forgotten something obvious, but I cannot spot it!

Thanks,
Austin

clearwater-etcd.log.bckup <https://drive.google.com/file/d/0BwD2rKlmArODcmRDZlZSWWRBcC1LUnBEYm9vMHhoaUczbGQw/view?usp=drive_web>
monit.log.bckup <https://drive.google.com/file/d/0BwD2rKlmArODM3ZTQ1J6OGZ6eDJNZHk3N1pSQUhaM1BuRU84/view?usp=drive_web>
_______________________________________________
Clearwater mailing list
[email protected]
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
