Hi Abdul,

The etcd output you’ve provided lists two members as unreachable (and the remaining four as unhealthy), which is likely to cause problems:

root@dime-0:/home/ubuntu# clearwater-etcdctl cluster-health
member 5cd7042180fbb2a is unhealthy: got unhealthy result from http://192.168.0.6:4000
failed to check the health of member 208dd0fbcefb149c on http://192.168.0.8:4000: Get http://192.168.0.8:4000/health: dial tcp 192.168.0.8:4000: getsockopt: connection refused
member 208dd0fbcefb149c is unreachable: [http://192.168.0.8:4000] are all unreachable
member 2457d2c8a20fc738 is unhealthy: got unhealthy result from http://192.168.0.5:4000
member 48fa49be2b2ae2c2 is unhealthy: got unhealthy result from http://192.168.0.3:4000
failed to check the health of member a48134f48b185ad9 on http://192.168.0.7:4000: Get http://192.168.0.7:4000/health: dial tcp 192.168.0.7:4000: getsockopt: connection refused
member a48134f48b185ad9 is unreachable: [http://192.168.0.7:4000] are all unreachable
member e7db4eebbdb94a11 is unhealthy: got unhealthy result from http://192.168.0.4:4000
cluster is unhealthy

Are you expecting 192.168.0.8 and 192.168.0.7 to be reachable? Which nodes are they?

Also, the Cassandra logs you’ve sent over have quite a few gaps of up to ten minutes in them like this, which is interesting:

INFO [main] 2017-08-27 04:47:56,225 GossipingPropertyFileSnitch.java:64 - Loaded cassandra-topology.properties for compatibility
INFO [ScheduledTasks:1] 2017-08-27 04:48:11,720 TokenMetadata.java:433 - Updating topology for all endpoints that have changed
INFO [main] 2017-08-27 04:56:10,295 CassandraDaemon.java:155 - Hostname: vellum-0.test.com
INFO [main] 2017-08-27 04:56:17,803 YamlConfigurationLoader.java:92 - Loading settings from file:/etc/cassandra/cassandra.yaml

Could you send over /var/log/monit.log (and possibly a complete diagnostics package, created by sudo cw-gather_diags)? That should give more information on when and why Cassandra is being restarted.
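
In case it is useful for spotting the other gaps, here is a rough Python sketch that scans Cassandra log lines for long silences between consecutive timestamps. The function name find_gaps, the five-minute threshold, and the SAMPLE list (copied from the excerpt above) are illustrative assumptions, not part of any Clearwater or Cassandra tooling.

```python
import re
from datetime import datetime, timedelta

# Matches the "2017-08-27 04:47:56,225" style timestamp seen in the log excerpt.
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}")


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (previous, current) timestamp pairs separated by at least threshold.

    The threshold is an arbitrary illustrative value; lines without a
    recognisable timestamp (for example stack traces) are skipped.
    """
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if previous is not None and current - previous >= threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


# Sample lines copied from the log excerpt quoted in this mail.
SAMPLE = [
    "INFO [main] 2017-08-27 04:47:56,225 GossipingPropertyFileSnitch.java:64 - Loaded cassandra-topology.properties for compatibility",
    "INFO [ScheduledTasks:1] 2017-08-27 04:48:11,720 TokenMetadata.java:433 - Updating topology for all endpoints that have changed",
    "INFO [main] 2017-08-27 04:56:10,295 CassandraDaemon.java:155 - Hostname: vellum-0.test.com",
    "INFO [main] 2017-08-27 04:56:17,803 YamlConfigurationLoader.java:92 - Loading settings from file:/etc/cassandra/cassandra.yaml",
]

if __name__ == "__main__":
    for start, end in find_gaps(SAMPLE):
        print(f"gap of {end - start} between {start} and {end}")
```

To run it against a real log, pass the open file object instead of SAMPLE, since find_gaps only needs an iterable of lines. A gap that is immediately followed by a "[main]" startup line, as in the excerpt, suggests a restart rather than an idle period, and the times can then be compared against monit.log.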

Thanks,
Rob

From: Clearwater [mailto:[email protected]] On Behalf Of Abdul Basit Alvi
Sent: 27 August 2017 06:24
To: [email protected]
Subject: [Project Clearwater] Cassandra and etcd clustering problem

Hi,

I have been trying to get the IMS working via a manual install. I have followed all the instructions to the letter and have tried starting from scratch multiple times, but I can't figure out why Cassandra and etcd clustering are not working properly.

On the Dime node the homestead process is not running; I know this is because it can't connect to Cassandra on the Vellum node via the Thrift port 9160. On the Vellum node the cassandra process is running but not working at all. I have attached the system log files as well as cassandra.yaml and the cassandra-env.sh file. For test purposes I have allowed all incoming TCP/UDP traffic to and from all nodes.

Could you kindly look at the logs and outputs and point out what I am doing wrong?

[Monit summary of dime]
root@dime-0:/home/ubuntu# monit summary
Monit 5.18.1 uptime: 6h 27m
Service Name                      Status                      Type
node-dime-0.test.com              Running                     System
snmpd_process                     Running                     Process
ralf_process                      Running                     Process
ntp_process                       Running                     Process
nginx_process                     Running                     Process
homestead_process                 Does not exist              Process
homestead-prov_process            Running                     Process
clearwater_queue_manager_pro...   Running                     Process
etcd_process                      Running                     Process
clearwater_diags_monitor_pro...   Running                     Process
clearwater_config_manager_pr...   Running                     Process
clearwater_cluster_manager_p...   Running                     Process
ralf_uptime                       Status ok                   Program
poll_ralf                         Status ok                   Program
nginx_ping                        Status ok                   Program
nginx_uptime                      Status ok                   Program
monit_uptime                      Status ok                   Program
homestead_uptime                  Wait parent                 Program
poll_homestead                    Wait parent                 Program
check_cx_health                   Wait parent                 Program
poll_homestead-prov               Status failed               Program
clearwater_queue_manager_uptime   Status ok                   Program
etcd_uptime                       Status ok                   Program
poll_etcd_cluster                 Status failed               Program
poll_etcd                         Status ok                   Program

[Dime Local config]

[etcd cluster health dime]
root@dime-0:/home/ubuntu# clearwater-etcdctl cluster-health
member 5cd7042180fbb2a is unhealthy: got unhealthy result from http://192.168.0.6:4000
failed to check the health of member 208dd0fbcefb149c on http://192.168.0.8:4000: Get http://192.168.0.8:4000/health: dial tcp 192.168.0.8:4000: getsockopt: connection refused
member 208dd0fbcefb149c is unreachable: [http://192.168.0.8:4000] are all unreachable
member 2457d2c8a20fc738 is unhealthy: got unhealthy result from http://192.168.0.5:4000
member 48fa49be2b2ae2c2 is unhealthy: got unhealthy result from http://192.168.0.3:4000
failed to check the health of member a48134f48b185ad9 on http://192.168.0.7:4000: Get http://192.168.0.7:4000/health: dial tcp 192.168.0.7:4000: getsockopt: connection refused
member a48134f48b185ad9 is unreachable: [http://192.168.0.7:4000] are all unreachable
member e7db4eebbdb94a11 is unhealthy: got unhealthy result from http://192.168.0.4:4000
cluster is unhealthy

[local config file dime]
root@dime-0:/home/ubuntu# cat /etc/clearwater/local_config
local_ip=192.168.0.4
public_ip=10.1.10.192
public_hostname=dime-0.test.com
etcd_cluster=192.168.0.3,192.168.0.4,192.168.0.5,192.168.0.6,192.168.0.7,192.168.0.8

[shared config file dime]
root@vellum-0:/home/ubuntu# cat /etc/clearwater/shared_config
# Deployment definitions
home_domain=test.com
sprout_hostname=sprout.test.com
hs_hostname=hs.test.com:8888
hs_provisioning_hostname=hs-prov.test.com:8889
dime_hostname=dime.test.com:10888
xdms_hostname=homer.test.com:7888
sprout_impi_store=vellum.test.com
sprout_registration_store=vellum.test.com
cassandra_hostname=vellum.test.com
chronos_hostname=vellum.test.com
dime_session_store=vellum.test.com
upstream_port=0
# Email server configuration
smtp_smarthost=localhost
smtp_username=username
smtp_password=password
[email protected]
# Keys
signup_key=secret
turn_workaround=secret
ellis_api_key=secret
ellis_cookie_key=secret

[Error Log Homestead]
Thrift: Sun Aug 27 05:04:01 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:04:02 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:04:44 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:04:44 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:05:08 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:05:08 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:05:30 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:05:30 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:05:38 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:05:38 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused

[Running cqlsh on vellum]
root@vellum-0:/home/ubuntu# cqlsh
Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")}

[Monit summary of vellum]
root@vellum-0:/home/ubuntu# monit summary
Monit 5.18.1 uptime: 6h 39m
Service Name                      Status                      Type
node-vellum-0.test.com            Running                     System
snmpd_process                     Running                     Process
ntp_process                       Running                     Process
memcached_process                 Running                     Process
clearwater_queue_manager_pro...   Running                     Process
etcd_process                      Execution failed | Does...  Process
clearwater_diags_monitor_pro...   Running                     Process
clearwater_config_manager_pr...   Running                     Process
clearwater_cluster_manager_p...   Running                     Process
cassandra_process                 Running                     Process
chronos_process                   Running                     Process
astaire_process                   Running                     Process
monit_uptime                      Status ok                   Program
memcached_uptime                  Status ok                   Program
poll_memcached                    Status ok                   Program
clearwater_queue_manager_uptime   Status ok                   Program
etcd_uptime                       Wait parent                 Program
poll_etcd_cluster                 Wait parent                 Program
poll_etcd                         Wait parent                 Program
cassandra_uptime                  Status ok                   Program
poll_cassandra                    Status ok                   Program
poll_cqlsh                        Status ok                   Program
chronos_uptime                    Status ok                   Program
poll_chronos                      Status failed               Program
astaire_uptime                    Status ok                   Program

[local config file vellum]
root@vellum-0:/home/ubuntu# cat /etc/clearwater/local_config
local_ip=192.168.0.8
public_ip=10.1.10.204
public_hostname=vellum-0.test.com
etcd_cluster=192.168.0.3,192.168.0.4,192.168.0.5,192.168.0.6,192.168.0.7,192.168.0.8

[shared config file vellum]
root@vellum-0:/home/ubuntu# cat /etc/clearwater/shared_config
# Deployment definitions
home_domain=test.com
sprout_hostname=sprout.test.com
hs_hostname=hs.test.com:8888
hs_provisioning_hostname=hs-prov.test.com:8889
dime_hostname=dime.test.com:10888
xdms_hostname=homer.test.com:7888
sprout_impi_store=vellum.test.com
sprout_registration_store=vellum.test.com
cassandra_hostname=vellum.test.com
chronos_hostname=vellum.test.com
dime_session_store=vellum.test.com
upstream_port=0
# Email server configuration
smtp_smarthost=localhost
smtp_username=username
smtp_password=password
[email protected]
# Keys
signup_key=secret
turn_workaround=secret
ellis_api_key=secret
ellis_cookie_key=secret

[netstat on vellum]
root@vellum-0:/home/ubuntu# netstat -tulnap
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address            Foreign Address          State        PID/Program name
tcp        0      0 192.168.0.8:11211        0.0.0.0:*                LISTEN       27005/memcached
tcp        0      0 127.0.0.1:7253           0.0.0.0:*                LISTEN       27610/chronos
tcp        0      0 255.255.255.255:7253     0.0.0.0:*                LISTEN       27610/chronos
tcp        0      0 127.0.0.1:53             0.0.0.0:*                LISTEN       7718/dnsmasq
tcp        0      0 0.0.0.0:22               0.0.0.0:*                LISTEN       1189/sshd
tcp        0      0 127.0.0.1:2812           0.0.0.0:*                LISTEN       7833/monit
tcp        0      0 192.168.0.8:44035        192.168.0.8:11211        TIME_WAIT    -
tcp        0      0 127.0.0.1:54026          127.0.0.1:7253           TIME_WAIT    -
tcp        0      0 192.168.0.8:44064        192.168.0.8:11211        TIME_WAIT    -
tcp        0      0 192.168.0.8:44081        192.168.0.8:11211        TIME_WAIT    -
tcp        0      0 192.168.0.8:44053        192.168.0.8:11211        TIME_WAIT    -
tcp        0      0 192.168.0.8:44054        192.168.0.8:11211        TIME_WAIT    -
tcp        0      0 192.168.0.8:44048        192.168.0.8:11211        TIME_WAIT    -
tcp        0      0 192.168.0.8:22           10.1.10.112:51998        ESTABLISHED  26454/sshd: ubuntu
tcp        0    268 192.168.0.8:22           10.1.10.112:51933        ESTABLISHED  22779/sshd: ubuntu
tcp6       0      0 :::11311                 :::*                     LISTEN       26657/astaire
tcp6       0      0 ::1:53                   :::*                     LISTEN       7718/dnsmasq
tcp6       0      0 :::22                    :::*                     LISTEN       1189/sshd
udp        0      0 127.0.0.1:53             0.0.0.0:*                             7718/dnsmasq
udp        0      0 0.0.0.0:68               0.0.0.0:*                             601/dhclient
udp        0      0 192.168.0.8:123          0.0.0.0:*                             7362/ntpd
udp        0      0 127.0.0.1:123            0.0.0.0:*                             7362/ntpd
udp        0      0 0.0.0.0:123              0.0.0.0:*                             7362/ntpd
udp        0      0 0.0.0.0:161              0.0.0.0:*                             7472/snmpd
udp        0      0 0.0.0.0:55423            0.0.0.0:*                             601/dhclient
udp6       0      0 :::23767                 :::*                                  601/dhclient
udp6       0      0 ::1:53                   :::*                                  7718/dnsmasq
udp6       0      0 ::1:123                  :::*                                  7362/ntpd
udp6       0      0 fe80::f816:3eff:fe3:123  :::*                                  7362/ntpd
udp6       0      0 :::123                   :::*                                  7362/ntpd
udp6       0      0 :::161                   :::*                                  7472/snmpd

[Ping results]
root@vellum-0:/home/ubuntu# ping hs-prov.test.com
PING hs-prov.test.com (192.168.0.4) 56(84) bytes of data.
64 bytes from 192.168.0.4: icmp_seq=1 ttl=64 time=10.3 ms
64 bytes from 192.168.0.4: icmp_seq=2 ttl=64 time=12.1 ms
64 bytes from 192.168.0.4: icmp_seq=3 ttl=64 time=9.06 ms
^C
--- hs-prov.test.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2032ms
rtt min/avg/max/mdev = 9.063/10.529/12.151/1.270 ms
root@vellum-0:/home/ubuntu#
root@vellum-0:/home/ubuntu# ping hs.test.com
PING hs.test.com (192.168.0.4) 56(84) bytes of data.
64 bytes from 192.168.0.4: icmp_seq=1 ttl=64 time=21.3 ms
64 bytes from 192.168.0.4: icmp_seq=2 ttl=64 time=4.58 ms
64 bytes from 192.168.0.4: icmp_seq=3 ttl=64 time=20.6 ms
^C
--- hs.test.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2070ms
rtt min/avg/max/mdev = 4.584/15.523/21.363/7.741 ms

root@vellum-0:/home/ubuntu# ping dime-0.test.com
PING dime-0.test.com (10.1.10.192) 56(84) bytes of data.
64 bytes from 10.1.10.192: icmp_seq=1 ttl=63 time=25.9 ms
64 bytes from 10.1.10.192: icmp_seq=2 ttl=63 time=58.9 ms
^C
--- dime-0.test.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1029ms
rtt min/avg/max/mdev = 25.954/42.459/58.964/16.505 ms

root@bono-0:/home/ubuntu# ping vellum-0.test.com
PING vellum-0.test.com (10.1.10.204) 56(84) bytes of data.
64 bytes from 10.1.10.204: icmp_seq=1 ttl=63 time=36.7 ms
64 bytes from 10.1.10.204: icmp_seq=2 ttl=63 time=25.2 ms
^C
--- vellum-0.test.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 25.240/30.985/36.730/5.745 ms

root@bono-0:/home/ubuntu# ping vellum.test.com
PING vellum.test.com (192.168.0.8) 56(84) bytes of data.
64 bytes from 192.168.0.8: icmp_seq=1 ttl=64 time=51.4 ms
64 bytes from 192.168.0.8: icmp_seq=2 ttl=64 time=3.93 ms
64 bytes from 192.168.0.8: icmp_seq=3 ttl=64 time=4.22 ms

Regards,
Abdul Basit Alvi
_______________________________________________ Clearwater mailing list [email protected] http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
