Hi Lianjie,

Your configuration files aren't quite right.
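As a rough illustration of the layout described in this reply, here is a minimal Python sketch that derives both files from one ordered list of Sprout addresses, so the order stays identical on every node. The function names (cluster_settings_line, chronos_cluster_section) are made up for this example and are not part of Clearwater; the addresses and the memcached port 11211 are the ones from your mail.

```python
# Hypothetical helper, not part of Clearwater: builds the two config
# fragments from a single ordered list of Sprout IP addresses.

SPROUT_IPS = ["192.168.1.21", "192.168.1.22"]  # same order on every node
MEMCACHED_PORT = 11211  # port used in the cluster_settings file


def cluster_settings_line(ips, port=MEMCACHED_PORT):
    """Return the single servers= line; identical on every Sprout node."""
    return "servers=" + ",".join("%s:%d" % (ip, port) for ip in ips)


def chronos_cluster_section(local_ip, ips):
    """Return the [cluster] section of chronos.conf for one node."""
    if local_ip not in ips:
        raise ValueError("local node %s is not in the cluster list" % local_ip)
    lines = ["[cluster]", "localhost = %s" % local_ip]
    lines += ["node = %s" % ip for ip in ips]  # every node, in cluster order
    return "\n".join(lines)


if __name__ == "__main__":
    print(cluster_settings_line(SPROUT_IPS))
    for ip in SPROUT_IPS:
        print()
        print("# chronos.conf [cluster] section on %s" % ip)
        print(chronos_cluster_section(ip, SPROUT_IPS))
```

Only the localhost entry differs between nodes; the servers= line and the node entries come from the same list, which is what keeps them consistent.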
The cluster_settings file should have the form servers=<address>,<address> - so in your case it would be "servers=192.168.1.21:11211,192.168.1.22:11211". This file should be identical on each Sprout node (so the Sprouts must be in the same order on each node).

The chronos.conf file should have one localhost entry, which is set to the IP address of the local node, and multiple node entries, which are set to the IP addresses of each node in the cluster. In your case, this would be (on sprout 1):

    [cluster]
    localhost = 192.168.1.21
    node = 192.168.1.21
    node = 192.168.1.22

The order of the nodes must be the same on each node - so the file on sprout 2 should be:

    [cluster]
    localhost = 192.168.1.22
    node = 192.168.1.21
    node = 192.168.1.22

Can you make these changes to the config files, and then reload Sprout and Chronos (sudo service <service> reload)?

In the logs below, Homestead has stopped because it couldn't contact Cassandra:

    04-02-2015 18:42:19.616 UTC Error cassandra_store.cpp:207: Cache caught TTransportException: connect() failed: Connection refused
    04-02-2015 18:42:19.616 UTC Error main.cpp:550: Failed to initialize cache - rc 3
    04-02-2015 18:42:19.616 UTC Status cassandra_store.cpp:185: Stopping cache

Can you check whether Cassandra is running reliably on the Homestead node? Does /var/monit/monit.log show that monit is restarting it, and are there any logs in /var/log/cassandra?

Ellie

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Lianjie Cao
Sent: 04 February 2015 19:37
To: [email protected]
Subject: [Clearwater] Problems with Sprout clustering and Homestead failure

Hi,

We recently built a Clearwater deployment with one Bono node, two Sprout nodes, one Homestead node, one Homer node and one Ralf node. However, we ran into some problems related to Homestead start failure and Sprout clustering.

*Sprout clustering:*

The manual installation instructions show that, for the latest version, Sprout clustering is done by Chronos.
To add or remove a Sprout node, /etc/chronos/chronos.conf needs to be modified correspondingly. However, we found that when we don't have a chronos.conf file, the two Sprout nodes seem to work fine if we add the IPs of the two Sprout nodes to /etc/clearwater/cluster_settings.

    [sprout]cw@sprout-2:~$ cat /etc/clearwater/cluster_settings
    servers=192.168.1.21:11211
    servers=192.168.1.22:11211

But if we do add /etc/chronos/chronos.conf with the information of the two Sprout nodes as below, Chronos fails and no new log files are found under /var/log/chronos.

    [sprout]cw@sprout-1:/var/log/chronos$ cat /etc/chronos/chronos.conf
    [http]
    bind-address = 0.0.0.0
    bind-port = 7253

    [logging]
    folder = /var/log/chronos
    level = 5

    [cluster]
    localhost = 192.168.1.21
    node = localhost

    sprout-2 = 192.168.1.22
    node = sprout-2

    [alarms]
    enabled = true

    [sprout]cw@sprout-1:~$ sudo monit status
    The Monit daemon 5.8.1 uptime: 0m

    Program 'poll_sprout'
      status                  Status ok
      monitoring status       Monitored
      last started            Wed, 04 Feb 2015 11:20:36
      last exit value         0
      data collected          Wed, 04 Feb 2015 11:20:36

    Process 'sprout'
      status                  Running
      monitoring status       Monitored
      pid                     1157
      parent pid              1
      uid                     999
      effective uid           999
      gid                     999
      uptime                  1m
      children                0
      memory kilobytes        42412
      memory kilobytes total  42412
      memory percent          1.0%
      memory percent total    1.0%
      cpu percent             0.4%
      cpu percent total       0.4%
      data collected          Wed, 04 Feb 2015 11:20:36

    Program 'poll_memcached'
      status                  Status ok
      monitoring status       Monitored
      last started            Wed, 04 Feb 2015 11:20:36
      last exit value         0
      data collected          Wed, 04 Feb 2015 11:20:36

    Process 'memcached'
      status                  Running
      monitoring status       Monitored
      pid                     1092
      parent pid              1
      uid                     108
      effective uid           108
      gid                     114
      uptime                  1m
      children                0
      memory kilobytes        1180
      memory kilobytes total  1180
      memory percent          0.0%
      memory percent total    0.0%
      cpu percent             0.0%
      cpu percent total       0.0%
      data collected          Wed, 04 Feb 2015 11:20:36

    Process 'clearwater_diags_monitor'
      status                  Running
      monitoring status       Monitored
      pid                     1072
      parent pid              1
      uid                     0
      effective uid           0
      gid                     0
      uptime                  1m
      children                1
      memory kilobytes        1796
      memory kilobytes total  2172
      memory percent          0.0%
      memory percent total    0.0%
      cpu percent             0.0%
      cpu percent total       0.0%
      data collected          Wed, 04 Feb 2015 11:20:36

    Process 'chronos'
      status                  Execution failed
      monitoring status       Monitored
      data collected          Wed, 04 Feb 2015 11:20:26

    System 'sprout-1'
      status                  Running
      monitoring status       Monitored
      load average            [0.20] [0.09] [0.04]
      cpu                     6.8%us 1.1%sy 0.0%wa
      memory usage            116944 kB [2.8%]
      swap usage              0 kB [0.0%]
      data collected          Wed, 04 Feb 2015 11:20:26

Is it because we are not using Chronos in the right way, or are there other settings we need to configure?

*Homestead Failure:*

When we use SIPp to perform user registration tests, we receive a "403 Forbidden" response and we observe errors on both Sprout nodes.

    [sprout]cw@sprout-1:~$ cat /var/log/sprout/sprout_current.txt
    04-02-2015 18:54:50.884 UTC Warning acr.cpp:627: Failed to send Ralf ACR message (0x7fce241cd780), rc = 400
    04-02-2015 18:54:51.083 UTC Error httpconnection.cpp:573: http://hs.hp-clearwater.com:8888/impi/6500000008%40hp-clearwater.com/av?impu=sip%3A6500000008%40hp-clearwater.com failed at server 192.168.1.31 : Timeout was reached (28) : fatal
    04-02-2015 18:54:51.083 UTC Error httpconnection.cpp:688: cURL failure with cURL error code 28 (see man 3 libcurl-errors) and HTTP error code 500
    04-02-2015 18:54:51.083 UTC Error hssconnection.cpp:145: Failed to get Authentication Vector for [email protected]
    04-02-2015 18:54:51.086 UTC Error httpconnection.cpp:688: cURL failure with cURL error code 0 (see man 3 libcurl-errors) and HTTP error code 400
    04-02-2015 18:54:51.086 UTC Warning acr.cpp:627: Failed to send Ralf ACR message (0x14322c0), rc = 400
    04-02-2015 18:54:51.282 UTC Error httpconnection.cpp:573: http://hs.hp-clearwater.com:8888/impi/6500000009%40hp-clearwater.com/av?impu=sip%3A6500000009%40hp-clearwater.com failed at server 192.168.1.31 : Timeout was reached (28) : fatal
    04-02-2015 18:54:51.283 UTC Error httpconnection.cpp:688: cURL failure with cURL error code 28 (see man 3 libcurl-errors) and HTTP error code 500
    04-02-2015 18:54:51.283 UTC Error hssconnection.cpp:145: Failed to get Authentication Vector for [email protected]
    04-02-2015 18:54:51.286 UTC Error httpconnection.cpp:688: cURL failure with cURL error code 0 (see man 3 libcurl-errors) and HTTP error code 400
    04-02-2015 18:54:51.286 UTC Warning acr.cpp:627: Failed to send Ralf ACR message (0x7fce1c1fdef0), rc = 400
    ....

It seems like Homestead is unreachable. Then, on the Homestead node, if we check the status using monit:

    [homestead]cw@homestead-1:~$ sudo monit status
    The Monit daemon 5.8.1 uptime: 15m

    Process 'nginx'
      status                  Running
      monitoring status       Monitored
      pid                     1044
      parent pid              1
      uid                     0
      effective uid           0
      gid                     0
      uptime                  15m
      children                4
      memory kilobytes        1240
      memory kilobytes total  8448
      memory percent          0.0%
      memory percent total    0.2%
      cpu percent             0.0%
      cpu percent total       0.0%
      port response time      0.000s to 127.0.0.1:80/ping [HTTP via TCP]
      data collected          Wed, 04 Feb 2015 10:58:02

    Program 'poll_homestead'
      status                  Status failed
      monitoring status       Monitored
      last started            Wed, 04 Feb 2015 10:58:02
      last exit value         1
      data collected          Wed, 04 Feb 2015 10:58:02

    Process 'homestead'
      status                  Does not exist
      monitoring status       Monitored
      data collected          Wed, 04 Feb 2015 10:58:02

    Program 'poll_homestead-prov'
      status                  Status ok
      monitoring status       Monitored
      last started            Wed, 04 Feb 2015 10:58:02
      last exit value         0
      data collected          Wed, 04 Feb 2015 10:58:02

    Process 'homestead-prov'
      status                  Execution failed
      monitoring status       Monitored
      data collected          Wed, 04 Feb 2015 10:58:32

    Process 'clearwater_diags_monitor'
      status                  Running
      monitoring status       Monitored
      pid                     1027
      parent pid              1
      uid                     0
      effective uid           0
      gid                     0
      uptime                  16m
      children                1
      memory kilobytes        1664
      memory kilobytes total  2040
      memory percent          0.0%
      memory percent total    0.0%
      cpu percent             0.0%
      cpu percent total       0.0%
      data collected          Wed, 04 Feb 2015 10:58:32

    Program 'poll_cassandra_ring'
      status                  Status ok
      monitoring status       Monitored
      last started            Wed, 04 Feb 2015 10:58:32
      last exit value         0
      data collected          Wed, 04 Feb 2015 10:58:32

    Process 'cassandra'
      status                  Running
      monitoring status       Monitored
      pid                     1280
      parent pid              1277
      uid                     106
      effective uid           106
      gid                     113
      uptime                  16m
      children                0
      memory kilobytes        1388648
      memory kilobytes total  1388648
      memory percent          34.3%
      memory percent total    34.3%
      cpu percent             0.4%
      cpu percent total       0.4%
      data collected          Wed, 04 Feb 2015 10:58:32

    System 'homestead-1'
      status                  Running
      monitoring status       Monitored
      load average            [0.00] [0.04] [0.05]
      cpu                     3.0%us 0.8%sy 0.0%wa
      memory usage            1505324 kB [37.1%]
      swap usage              0 kB [0.0%]
      data collected          Wed, 04 Feb 2015 10:58:32

And the log file shows:

    [homestead]cw@homestead-1:~$ cat /var/log/homestead-prov/homestead-prov-err.log
    Traceback (most recent call last):
      File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
        "__main__", fname, loader, pkg_name)
      File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
        exec code in run_globals
      File "/usr/share/clearwater/homestead/env/lib/python2.7/site-packages/crest-0.1-py2.7.egg/metaswitch/crest/main.py", line 156, in <module>
        standalone()
      File "/usr/share/clearwater/homestead/env/lib/python2.7/site-packages/crest-0.1-py2.7.egg/metaswitch/crest/main.py", line 119, in standalone
        reactor.listenUNIX(unix_sock_name, application)
      File "/usr/share/clearwater/homestead/env/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/internet/posixbase.py", line 413, in listenUNIX
        p.startListening()
      File "/usr/share/clearwater/homestead/env/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/internet/unix.py", line 293, in startListening
        raise CannotListenError, (None, self.port, le)
    twisted.internet.error.CannotListenError: Couldn't listen on any:/tmp/.homestead-prov-sock-0: [Errno 98] Address already in use.
    ......
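For context on the [Errno 98] in the traceback above: binding a UNIX socket to a path that already exists on disk fails with "Address already in use", even when nothing is listening on it any more. The Python sketch below reproduces that behaviour in a temporary directory using only the standard library. The socket path is a stand-in for /tmp/.homestead-prov-sock-0, and whether a leftover socket file or a second running instance is the cause here is an assumption, not something these logs confirm.

```python
import errno
import os
import socket
import tempfile


def try_bind(path):
    """Try to bind a UNIX socket to path; return None, or the errno on failure."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
        return None
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()  # closing the socket does not remove the socket file


workdir = tempfile.mkdtemp()
sock_path = os.path.join(workdir, "demo-sock-0")  # stand-in path for the demo

first = try_bind(sock_path)   # succeeds and creates the socket file
second = try_bind(sock_path)  # file is still on disk, so bind fails
os.unlink(sock_path)          # remove the stale socket file
third = try_bind(sock_path)   # bind succeeds again

print(first, errno.errorcode.get(second), third)

# Clean up the temporary files created by the demo.
os.unlink(sock_path)
os.rmdir(workdir)
```

If the same pattern applies on the Homestead node, checking which process (if any) still holds the socket before removing the file would be the cautious order of operations.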
    [homestead]cw@homestead-1:~$ cat /var/log/homestead-prov/homestead-prov-0.log
    2015-02-04 18:42:23,476 UTC INFO main:118 Going to listen for HTTP on UNIX socket /tmp/.homestead-prov-sock-0
    2015-02-04 18:42:24,087 UTC INFO main:118 Going to listen for HTTP on UNIX socket /tmp/.homestead-prov-sock-0
    2015-02-04 18:42:35,826 UTC INFO main:118 Going to listen for HTTP on UNIX socket /tmp/.homestead-prov-sock-0
    2015-02-04 18:43:16,205 UTC INFO main:118 Going to listen for HTTP on UNIX socket /tmp/.homestead-prov-sock-0
    ......

    homestead_20150204T180000Z.txt  homestead_current.txt

    [homestead]cw@homestead-1:~$ cat /var/log/homestead/homestead_current.txt
    04-02-2015 18:42:19.586 UTC Status main.cpp:468: Log level set to 2
    04-02-2015 18:42:19.602 UTC Status main.cpp:489: Access logging enabled to /var/log/homestead
    04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:93: Constructing LoadMonitor
    04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:94: Target latency (usecs)   : 100000
    04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:95: Max bucket size          : 20
    04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:96: Initial token fill rate/s: 10.000000
    04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:97: Min token fill rate/s    : 10.000000
    04-02-2015 18:42:19.614 UTC Status dnscachedresolver.cpp:90: Creating Cached Resolver using server 127.0.0.1
    04-02-2015 18:42:19.614 UTC Status httpresolver.cpp:50: Created HTTP resolver
    04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:145: Configuring store
    04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:146: Hostname: localhost
    04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:147: Port: 9160
    04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:148: Threads: 10
    04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:149: Max Queue: 0
    04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:199: Starting store
    04-02-2015 18:42:19.616 UTC Error cassandra_store.cpp:207: Cache caught TTransportException: connect() failed: Connection refused
    04-02-2015 18:42:19.616 UTC Error main.cpp:550: Failed to initialize cache - rc 3
    04-02-2015 18:42:19.616 UTC Status cassandra_store.cpp:185: Stopping cache
    04-02-2015 18:42:19.616 UTC Status cassandra_store.cpp:226: Waiting for cache to stop
    ......

And the port usage is:

    [homestead]cw@homestead-1:~$ sudo netstat -tulpn
    Active Internet connections (only servers)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp        0      0 127.0.0.1:9042          0.0.0.0:*               LISTEN      1280/jsvc.exec
    tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN      952/dnsmasq
    tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      827/sshd
    tcp        0      0 127.0.0.1:7000          0.0.0.0:*               LISTEN      1280/jsvc.exec
    tcp        0      0 127.0.0.1:2812          0.0.0.0:*               LISTEN      1036/monit
    tcp        0      0 0.0.0.0:37791           0.0.0.0:*               LISTEN      1280/jsvc.exec
    tcp        0      0 0.0.0.0:7199            0.0.0.0:*               LISTEN      1280/jsvc.exec
    tcp        0      0 0.0.0.0:53313           0.0.0.0:*               LISTEN      1280/jsvc.exec
    tcp        0      0 127.0.0.1:9160          0.0.0.0:*               LISTEN      1280/jsvc.exec
    tcp6       0      0 :::53                   :::*                    LISTEN      952/dnsmasq
    tcp6       0      0 :::22                   :::*                    LISTEN      827/sshd
    tcp6       0      0 :::8889                 :::*                    LISTEN      1044/nginx
    tcp6       0      0 :::80                   :::*                    LISTEN      1044/nginx
    udp        0      0 0.0.0.0:13344           0.0.0.0:*                           952/dnsmasq
    udp        0      0 0.0.0.0:48567           0.0.0.0:*                           952/dnsmasq
    udp        0      0 0.0.0.0:53              0.0.0.0:*                           952/dnsmasq
    udp        0      0 0.0.0.0:41016           0.0.0.0:*                           952/dnsmasq
    udp        0      0 0.0.0.0:68              0.0.0.0:*                           634/dhclient3
    udp        0      0 192.168.1.31:123        0.0.0.0:*                           791/ntpd
    udp        0      0 127.0.0.1:123           0.0.0.0:*                           791/ntpd
    udp        0      0 0.0.0.0:123             0.0.0.0:*                           791/ntpd
    udp6       0      0 :::53                   :::*                                952/dnsmasq
    udp6       0      0 fe80::f816:3eff:fe7:123 :::*                                791/ntpd
    udp6       0      0 ::1:123                 :::*                                791/ntpd
    udp6       0      0 :::123                  :::*                                791/ntpd

So, how should we fix the problems with Homestead and Homestead-prov?

Best regards,
Lianjie

_______________________________________________
Clearwater mailing list
[email protected]
http://lists.projectclearwater.org/listinfo/clearwater
