Re: [Project Clearwater] etcd_process execution failed on each node.

Jace.Liang Mon, 24 Oct 2016 18:46:51 -0700

Hi Richard,

I don't think my problem is the 
issue <https://github.com/Metaswitch/clearwater-etcd/issues/320> you mentioned 
in your last mail: my clearwater-etcd.log looks quite different from the one in 
that issue, and my Clearwater version is already the latest, Onix (release 108).
PS: I followed the manual install instructions and tried the upgrade procedure, 
but nothing needed to be updated.


In my case, etcd_process seems unable to use its socket to start up and 
communicate with the other nodes.
As boot.log shows:
cat /var/log/boot.log
………..
………..
zmq_msg_recv: Resource temporarily unavailable
Configuring monit for only localhost access
Error:  dial tcp 192.168.2.205:4000: getsockopt: no route to host
Rejoining cluster...
Etcd failed to come up – exiting

I think the root cause is related to “zmq_msg_recv: Resource temporarily 
unavailable”, which looks like a socket error.
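
For context, "Resource temporarily unavailable" is the standard text for the 
EAGAIN error, which a non-blocking (or timed-out) receive reports when no 
message is waiting, so on its own it may not mean the socket is broken. A 
minimal Python sketch of that behaviour, using a plain UDP socket rather than 
ZeroMQ (the function name is made up for illustration):

```python
import errno
import os
import socket


def recv_nonblocking_errno() -> int:
    """Receive on an idle non-blocking UDP socket and return the errno raised.

    Nothing ever sends to this socket, so the receive cannot succeed and the
    OS reports EAGAIN/EWOULDBLOCK instead of blocking.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind(("127.0.0.1", 0))  # any free local port
        sock.setblocking(False)
        try:
            sock.recv(1024)
        except BlockingIOError as exc:
            return exc.errno
        return 0  # only reached if data unexpectedly arrived
    finally:
        sock.close()


code = recv_nonblocking_errno()
# Prints the OS message for the error; on Linux this is the same text
# that appears in boot.log.
print(os.strerror(code))
```

If this reading is right, the later "no route to host" line in boot.log may be 
the more direct symptom to chase.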

I ran tests on two deployments.
-------------------First   deployment ---------------
(1) Six new VMs with the official Ubuntu 14.04.02 64-bit server image.
(2) Rebooted each system once the OS installation had finished
(there was no “zmq_msg_recv: Resource temporarily unavailable” in boot.log).
(3) Ran the manual installation of Clearwater, then restarted the VMs.
The zmq_msg_recv message appeared and etcd_process failed to start.
(4) Tried to upgrade to release Onix 108, but everything was already up to date.
(5) etcd_process is still not working.

------------------Second  deployment-----------------
(1) In this deployment, I used my 6 VM images of Clearwater nodes that I set up 
around October 2015. This deployment had been running smoothly.
(2) Ran the upgrade procedure to the latest version, Onix 108.
(3) “zmq_msg_recv: Resource temporarily unavailable” appeared during the 
upgrade process.
(4) Rebooted after the upgrade finished.
(5) “zmq_msg_recv: Resource temporarily unavailable” appeared again during 
boot, and etcd_process stopped working.

Could you tell me which version of Python you are using? zmq_msg_recv seems to 
come from a Python binary.
Also, how do I check my Clearwater release version, and how do I install a new 
machine with a specific version?
I think the old version might work fine for me.

Thank you very much!


Jace.


From: Richard Whitehouse [mailto:[email protected]]
Sent: Friday, October 21, 2016 5:43 PM
To: [email protected]; 梁維恩 <[email protected]>
Subject: RE: etcd_process execution failed on each node.

Jace Liang,

What version of Project Clearwater are you running?

From your problem description, you might be hitting 
https://github.com/Metaswitch/clearwater-etcd/issues/320 which we’ve fixed in 
the latest release, Onix, release 108.

This can cause the etcd cluster to lose quorum and thus prevent it starting up 
correctly.
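
As a rough illustration of what losing quorum means: etcd is built on Raft, 
which needs a strict majority of the cluster members to be reachable. The 
sketch below is just that majority rule, not Clearwater or etcd code, and the 
function names are made up for the example. The four-member case matches the 
four peers listed in the "newRaft" line of the log quoted further down.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def has_quorum(members: int, reachable: int) -> bool:
    """True if enough members (counting the local one) are reachable."""
    return reachable >= quorum(members)


# A four-member cluster needs three members up, so it tolerates only one
# failure; with two peers refusing connections it cannot make progress.
print(quorum(4))         # 3
print(has_quorum(4, 2))  # False
print(has_quorum(4, 3))  # True
```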

If this is the case, you’ve got two options:

1) You can delete your existing installation, and install release-108 instead. 
That’s probably the simplest solution, but you’ll lose all of your data.

2) Alternatively, you can upgrade your nodes to release-108, and then restore 
the cluster. As it’s lost quorum, you’ll need to follow the instructions for 
multiple node recovery, which are documented in the docs at 
http://clearwater.readthedocs.io/en/stable/Handling_Failed_Nodes.html#multiple-failed-nodes

Hope this helps,


Richard

From: Clearwater [mailto:[email protected]] On 
Behalf Of [email protected]<mailto:[email protected]>
Sent: 21 October 2016 03:23
To: 
[email protected]<mailto:[email protected]>
Subject: [Project Clearwater] etcd_process execution failed on each node.

Dear All,

Recently I found that all 6 of my VMs (one per node) have trouble running 
etcd_process.
I think my config files are right, because at first etcd_process was running 
well on them and “clearwater-etcdctl cluster-health” showed everything healthy.
But somehow they suddenly

I tried “monit restart etcd_process”, but it still failed.

Here is some command output for more information (from the ellis node).

root@ellis1:/var/log# clearwater-etcdctl cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured
error #0: dial tcp 192.168.2.206:4000: getsockopt: connection refused  (.206 is 
homestead’s ip)

cat /var/log/boot.log
………..
………..
………..
zmq_msg_recv: Resource temporarily unavailable
Configuring monit for only localhost access
Error:  dial tcp 192.168.2.205:4000: getsockopt: no route to host
Rejoining cluster...
Etcd failed to come up - exiting

root@ellis1:/var/log/clearwater-etcd# cat clearwater-etcd.log
………….
………….
…………
2016-10-21 10:11:46.686827 I | etcdmain: etcd Version: 2.2.5
2016-10-21 10:11:46.686888 I | etcdmain: Git SHA: bc9ddf2
2016-10-21 10:11:46.686895 I | etcdmain: Go Version: go1.5.3
2016-10-21 10:11:46.686902 I | etcdmain: Go OS/Arch: linux/amd64
2016-10-21 10:11:46.686913 I | etcdmain: setting maximum number of CPUs to 4, 
total number of available CPUs is 4
2016-10-21 10:11:46.686953 N | etcdmain: the server is already initialized as 
member before, starting as etcd member...
2016-10-21 10:11:46.687015 I | etcdmain: listening for peers on 
http://192.168.2.206:2380
2016-10-21 10:11:46.687039 I | etcdmain: listening for client requests on 
http://192.168.2.206:4000
2016-10-21 10:11:46.689639 I | etcdserver: recovered store from snapshot at 
index 10001
2016-10-21 10:11:46.689654 I | etcdserver: name = 192-168-2-206
2016-10-21 10:11:46.689660 I | etcdserver: data dir = 
/var/lib/clearwater-etcd/192.168.2.206
2016-10-21 10:11:46.689668 I | etcdserver: member dir = 
/var/lib/clearwater-etcd/192.168.2.206/member
2016-10-21 10:11:46.689674 I | etcdserver: heartbeat = 100ms
2016-10-21 10:11:46.689680 I | etcdserver: election = 1000ms
2016-10-21 10:11:46.689686 I | etcdserver: snapshot count = 10000
2016-10-21 10:11:46.689696 I | etcdserver: advertise client URLs = 
http://192.168.2.206:4000
2016-10-21 10:11:46.689717 I | etcdserver: loaded cluster information from 
store: <nil>
2016-10-21 10:11:46.726159 I | etcdserver: restarting member 4cb5fd19beaa1750 
in cluster 877b90a46cdaaa83 at commit index 14044
2016-10-21 10:11:46.727646 I | raft: 4cb5fd19beaa1750 became follower at term 
814
2016-10-21 10:11:46.727690 I | raft: newRaft 4cb5fd19beaa1750 [peers: 
[1226bb321c91a88e,4cb5fd19beaa1750,8ac8820f24de7303,a4a5d4f826d5740a], term: 
814, commit: 14044, applied: 10001, lastindex: 14045, lastterm: 814]
2016-10-21 10:11:46.734589 I | rafthttp: the connection with 1226bb321c91a88e 
became active
2016-10-21 10:11:46.739284 E | rafthttp: failed to dial 8ac8820f24de7303 on 
stream Message (dial tcp 192.168.2.202:2380: getsockopt: connection refused)
2016-10-21 10:11:46.740156 E | rafthttp: failed to dial 8ac8820f24de7303 on 
stream MsgApp v2 (dial tcp 192.168.2.202:2380: getsockopt: connection refused)
2016-10-21 10:11:46.745962 I | etcdserver: starting server... [version: 2.2.5, 
cluster version: 2.2]
2016-10-21 10:11:46.747252 E | rafthttp: failed to dial a4a5d4f826d5740a on 
stream Message (dial tcp 192.168.2.203:2380: getsockopt: connection refused)
2016-10-21 10:11:46.747394 E | rafthttp: failed to dial a4a5d4f826d5740a on 
stream MsgApp v2 (dial tcp 192.168.2.203:2380: getsockopt: connection refused)
2016-10-21 10:11:46.756637 I | rafthttp: the connection with 1226bb321c91a88e 
became inactive
2016-10-21 10:11:46.756660 E | rafthttp: failed to read 1226bb321c91a88e on 
stream Message (net/http: request canceled)
2016-10-21 10:11:46.756687 N | etcdserver: removed member 1226bb321c91a88e from 
cluster 877b90a46cdaaa83
2016-10-21 10:11:46.756766 D | etcdserver: skipped updating attributes of 
removed member 1226bb321c91a88e
2016-10-21 10:11:46.756853 C | etcdserver: nodeToMember should never fail: 
raftAttributes key doesn't exist
panic: nodeToMember should never fail: raftAttributes key doesn't exist


It seems the nodes cannot connect to each other, but when I test with ping 
they can still reach each other.
Can anyone give us some advice or a solution?
Thank you.
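
One note on the ping test: ping uses ICMP, so a successful ping does not show 
that a TCP port is accepting connections. A minimal Python sketch for checking 
the ports themselves is below; the helper name check_tcp and the 2-second 
timeout are assumptions for illustration, while ports 4000 (client) and 2380 
(peer) are the ones that appear in the logs above.

```python
import socket


def check_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Try a TCP connection and classify the result as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host is reachable but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all, e.g. packets dropped by a firewall.
        return "timeout"
    except OSError as exc:
        # Covers errors such as "No route to host".
        return f"error: {exc.strerror or exc}"
```

For example, calling check_tcp("192.168.2.205", 4000) and 
check_tcp("192.168.2.202", 2380) from the ellis node would show whether those 
ports are refused, unreachable, or open, which ping alone cannot tell you.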


--
本信件可能包含工研院機密資訊，非指定之收件者，請勿使用或揭露本信件內容，並請銷毀此信件。 This email may contain 
confidential information. Please do not use or disclose it in any way and 
delete it if you are not the intended recipient.



_______________________________________________
Clearwater mailing list
[email protected]
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
