Hi Austin,

The etcd processes (on all your nodes) are continually restarting. This is because monit is marking the etcd process as unresponsive, and so killing the etcd process.

The monit check for responsiveness used to write to etcd. We’ve since realised that this is a bad check to do, as it relies on the etcd leader being up. This can lead to the situation where monit continually kills all etcd processes, as:

· Something initially goes wrong and the current etcd leader fails
· The surviving etcd processes attempt to elect a new leader
· While the election is in progress, they can’t write to etcd
· Monit kills the rest of the etcd processes
· This now can’t recover, as the etcd processes won’t stay up long enough to elect a leader

We’ve fixed this in the latest Project Clearwater release. Can you try upgrading your system and running stress again?

Thanks,
Ellie

From: Clearwater [mailto:[email protected]] On Behalf Of Austin Marston
Sent: 29 October 2015 10:15
To: [email protected]
Subject: [Clearwater] Fwd: SIP stress not working

Hello,

Thanks, you'll find attached the logs of sprout etc and monit from a brand new run of the sip testing.

While I tested that, the sprout status of etcd_process was "Does not exist" and clearwater_cluster_manager was "Execution failed".

Thanks a lot,
Austin

2015-10-23 21:42 GMT+02:00 Eleanor Merry <[email protected]>:

Hi Austin,

It sounds like your etcd process could be regularly restarting. Can you send me the etcd logs (in /var/log/clearwater-etcd) and monit logs (/var/log/monit.log) from your Sprout node?

Thanks,
Ellie

From: Clearwater [mailto:[email protected]] On Behalf Of Austin Marston
Sent: 23 October 2015 07:22
To: [email protected]
Subject: [Clearwater] SIP stress not working

Hi all,

I manually deployed the clearwater infra with one bono, sprout, ellis, ralf, homer, and hs. My sip testing seems to be working fine but my stress testing is not working at all.
I created a new node for sip testing, following https://github.com/Metaswitch/crest/blob/dev/docs/Bulk-Provisioning%20Numbers.md

Note: Whenever I want to check what might be wrong I always get a different status for my nodes. The clearwater cluster_manager seems to fail most of the time on bono and sprout, and when I check the cluster health the results are always different. For instance, when I run clearwater-etcdctl cluster-health I see that the cluster is healthy, but sometimes my bono node (172-16-1-20) or sprout node (172-16-1-20) is reported as not healthy.

[17:07:14][sprout]user@cw-002:/var/log/sprout$ clearwater-etcdctl cluster-health
cluster is healthy
member 27a940d2104e9692 is unhealthy
member 2ea8f3a5eea05584 is healthy
member 5fdc25bd4ae527c0 is healthy
member d26088cb54745bbc is healthy
member e525c6a4ed161686 is healthy
member f5765a98a56e9c4a is healthy

[17:07:24]user@cw-002:/var/log/sprout$ clearwater-etcdctl member list
27a940d2104e9692: name=172-16-1-20 peerURLs=http://172.16.1.20:2380 clientURLs=http://172.16.1.20:4000
2ea8f3a5eea05584: name=172-16-1-22 peerURLs=http://172.16.1.22:2380 clientURLs=http://172.16.1.22:4000
5fdc25bd4ae527c0: name=172-16-1-25 peerURLs=http://172.16.1.25:2380 clientURLs=http://172.16.1.25:4000
d26088cb54745bbc: name=172-16-1-24 peerURLs=http://172.16.1.24:2380 clientURLs=http://172.16.1.24:4000
e525c6a4ed161686: name=172-16-1-21 peerURLs=http://172.16.1.21:2380 clientURLs=http://172.16.1.21:4000
f5765a98a56e9c4a: name=172-16-1-23 peerURLs=http://172.16.1.23:2380 clientURLs=http://172.16.1.23:4000

[17:10:00][sprout]user@cw-002:/var/log/sprout$ clearwater-etcdctl cluster-health
cluster is healthy
member 27a940d2104e9692 is healthy
member 2ea8f3a5eea05584 is healthy
member 5fdc25bd4ae527c0 is healthy
member d26088cb54745bbc is healthy
member e525c6a4ed161686 is healthy
member f5765a98a56e9c4a is healthy

I attach my sip stress logs and my sprout logs. I was running the test between 13:57 and 14:05 on the 22nd of October. If you have any idea why this could be going wrong, please let me know. I certainly forgot something that might be obvious, but I cannot catch it!

Thanks,
Austin

clearwater-etcd.log.bckup <https://drive.google.com/file/d/0BwD2rKlmArODcmRDZlZSWWRBcC1LUnBEYm9vMHhoaUczbGQw/view?usp=drive_web>
monit.log.bckup <https://drive.google.com/file/d/0BwD2rKlmArODM3ZTQ1J6OGZ6eDJNZHk3N1pSQUhaM1BuRU84/view?usp=drive_web>
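
The cluster-health output in the original message changes from one run to the next, which is easy to miss by eye. As a rough sketch (not part of Clearwater), the Python below parses several captured outputs and lists the members whose status changed between them. It assumes the "member <id> is healthy|unhealthy" wording seen in this thread; the function names and the two abbreviated samples are made up for illustration.

```python
import re

# Matches entries such as "member 27a940d2104e9692 is unhealthy".
# Assumption: the output uses the wording pasted in this thread.
MEMBER_RE = re.compile(r"member\s+([0-9a-f]+)\s+is\s+(healthy|unhealthy)")


def parse_cluster_health(output):
    """Map each etcd member ID to True (healthy) or False (unhealthy)."""
    return {m.group(1): m.group(2) == "healthy" for m in MEMBER_RE.finditer(output)}


def flapping_members(samples):
    """Return the member IDs whose health differs between any two samples."""
    seen = {}
    for output in samples:
        for member, healthy in parse_cluster_health(output).items():
            seen.setdefault(member, set()).add(healthy)
    return sorted(member for member, states in seen.items() if len(states) > 1)


# Two samples abbreviated from the runs at 17:07:14 and 17:10:00.
SAMPLE_1 = """cluster is healthy
member 27a940d2104e9692 is unhealthy
member 2ea8f3a5eea05584 is healthy
"""
SAMPLE_2 = """cluster is healthy
member 27a940d2104e9692 is healthy
member 2ea8f3a5eea05584 is healthy
"""

if __name__ == "__main__":
    print(flapping_members([SAMPLE_1, SAMPLE_2]))  # ['27a940d2104e9692']
```

A member that flips between healthy and unhealthy across runs, as 27a940d2104e9692 does here, is consistent with an etcd process that keeps being restarted.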
_______________________________________________ Clearwater mailing list [email protected] http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
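
The restart loop described in the most recent reply of this thread can be illustrated with a small toy model. This is only a sketch under stated assumptions, not how monit or etcd are implemented: time is counted in abstract ticks, an election is assumed to need the surviving processes to stay up for election_ticks consecutive ticks, and monit is assumed to restart them on every failed check. The function name and all numbers are hypothetical.

```python
from typing import Optional


def ticks_until_leader(check_needs_leader: bool,
                       election_ticks: int = 3,
                       monit_period: int = 2,
                       max_ticks: int = 30) -> Optional[int]:
    """Return the tick at which a new leader is elected, or None if never.

    Toy model: the old leader has just failed. The survivors must stay up
    for `election_ticks` consecutive ticks to elect a new one. Every
    `monit_period` ticks monit runs its check.
    """
    stable_ticks = 0  # consecutive ticks the survivors have been running
    for tick in range(1, max_ticks + 1):
        stable_ticks += 1
        if stable_ticks >= election_ticks:
            return tick  # election completed, cluster recovers
        if tick % monit_period == 0 and check_needs_leader:
            # A write-based check fails while there is no leader, so monit
            # restarts the processes and the election starts over.
            stable_ticks = 0
    return None


if __name__ == "__main__":
    print(ticks_until_leader(check_needs_leader=True))   # None
    print(ticks_until_leader(check_needs_leader=False))  # 3
```

In this model, a check that depends on the leader never lets the election finish whenever monit's period is shorter than the time the election needs, while a check that does not depend on the leader lets the cluster recover.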
