Re: [Project Clearwater] Cassandra and etcd clustering problem

Robert Day Wed, 30 Aug 2017 02:50:54 -0700

Hi Abdul,

The etcd output you’ve provided lists a few nodes as unreachable, which is 
likely to cause problems:


root@dime-0:/home/ubuntu# clearwater-etcdctl cluster-health
member 5cd7042180fbb2a is unhealthy: got unhealthy result from http://192.168.0.6:4000
failed to check the health of member 208dd0fbcefb149c on http://192.168.0.8:4000: Get http://192.168.0.8:4000/health: dial tcp 192.168.0.8:4000: getsockopt: connection refused
member 208dd0fbcefb149c is unreachable: [http://192.168.0.8:4000] are all unreachable
member 2457d2c8a20fc738 is unhealthy: got unhealthy result from http://192.168.0.5:4000
member 48fa49be2b2ae2c2 is unhealthy: got unhealthy result from http://192.168.0.3:4000
failed to check the health of member a48134f48b185ad9 on http://192.168.0.7:4000: Get http://192.168.0.7:4000/health: dial tcp 192.168.0.7:4000: getsockopt: connection refused
member a48134f48b185ad9 is unreachable: [http://192.168.0.7:4000] are all unreachable
member e7db4eebbdb94a11 is unhealthy: got unhealthy result from http://192.168.0.4:4000
cluster is unhealthy

Are you expecting 192.168.0.8 and 192.168.0.7 to be reachable? Which nodes are 
they?
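
If it helps to see the pattern at a glance, here is a rough Python sketch that
maps each member's IP address to the state etcd reported for it. The names
summarize_health and SAMPLE are made up for illustration and are not part of
Clearwater or etcd; the only assumption is the "member <id> is <state>: ...
<url>" line shape seen in the output above.

```python
import re

# Matches one member report, e.g.
# "member 208dd0fbcefb149c is unreachable: [http://192.168.0.8:4000] are all unreachable"
# The "failed to check the health of member ..." lines are skipped on purpose,
# because they do not contain " is <state>:".
MEMBER = re.compile(
    r"member (\w+) is (healthy|unhealthy|unreachable):.*?(\d+\.\d+\.\d+\.\d+)"
)


def summarize_health(output):
    """Map each member's IP address to the state reported for it."""
    # Collapse whitespace so reports wrapped across lines still match.
    text = " ".join(output.split())
    return {ip: state for _member, state, ip in MEMBER.findall(text)}


# Hypothetical sample: three reports copied from the output above.
SAMPLE = """\
member 5cd7042180fbb2a is unhealthy: got unhealthy result from
http://192.168.0.6:4000
member 208dd0fbcefb149c is unreachable: [http://192.168.0.8:4000] are all
unreachable
member a48134f48b185ad9 is unreachable: [http://192.168.0.7:4000] are all
unreachable
"""

if __name__ == "__main__":
    for ip, state in sorted(summarize_health(SAMPLE).items()):
        print(ip, state)
```

Comparing the addresses it returns against the etcd_cluster line in
/etc/clearwater/local_config should show which configured nodes are not
answering at all, as opposed to answering but unhealthy.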

Also, the Cassandra logs you’ve sent over have quite a few ten-minute gaps in 
them like this, which is interesting:

INFO  [main] 2017-08-27 04:47:56,225 GossipingPropertyFileSnitch.java:64 - Loaded cassandra-topology.properties for compatibility
INFO  [ScheduledTasks:1] 2017-08-27 04:48:11,720 TokenMetadata.java:433 - Updating topology for all endpoints that have changed
INFO  [main] 2017-08-27 04:56:10,295 CassandraDaemon.java:155 - Hostname: vellum-0.test.com
INFO  [main] 2017-08-27 04:56:17,803 YamlConfigurationLoader.java:92 - Loading settings from file:/etc/cassandra/cassandra.yaml

Could you send over /var/log/monit.log (and possibly a complete diagnostics 
package, created by sudo cw-gather_diags)? That should give more information on 
when and why Cassandra is being restarted.
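
In the meantime, a quick way to list those gaps is to scan the log timestamps
for large jumps. The following is only a sketch: the function name find_gaps
and the five-minute threshold are arbitrary choices for illustration, and it
assumes the "YYYY-MM-DD HH:MM:SS,mmm" timestamp format seen in the excerpt
above.

```python
import re
from datetime import datetime, timedelta

# Timestamp as it appears in the Cassandra log excerpt, e.g. "2017-08-27 04:47:56,225".
STAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (previous, current) timestamp pairs that are further apart than threshold."""
    gaps = []
    previous = None
    for line in lines:
        match = STAMP.search(line)
        if not match:
            continue  # wrapped lines and stack traces carry no timestamp
        current = datetime.strptime(match.group(), "%Y-%m-%d %H:%M:%S,%f")
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


# The four log lines quoted above, shortened to the parts that matter here.
SAMPLE = [
    "INFO  [main] 2017-08-27 04:47:56,225 GossipingPropertyFileSnitch.java:64 - ...",
    "INFO  [ScheduledTasks:1] 2017-08-27 04:48:11,720 TokenMetadata.java:433 - ...",
    "INFO  [main] 2017-08-27 04:56:10,295 CassandraDaemon.java:155 - ...",
    "INFO  [main] 2017-08-27 04:56:17,803 YamlConfigurationLoader.java:92 - ...",
]

if __name__ == "__main__":
    for before, after in find_gaps(SAMPLE):
        print("gap of", after - before, "ending at", after)
```

A gap that ends with a CassandraDaemon "Hostname:" line, as in the excerpt,
suggests the process was started again at that point, and those times can then
be matched against entries in monit.log.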

Thanks,
Rob

From: Clearwater [mailto:[email protected]] On 
Behalf Of Abdul Basit Alvi
Sent: 27 August 2017 06:24
To: [email protected]
Subject: [Project Clearwater] Cassandra and etcd clustering problem

Hi,
I have been trying to get the IMS working via a manual install. I have followed 
all the instructions to the letter and have tried starting from scratch 
multiple times, but I can't figure out why Cassandra and etcd clustering are 
not working properly.

On the Dime node the homestead process is not running. I know this is because 
it can't connect to Cassandra on the Vellum node via the Thrift port, 9160.
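
A connection like this can be checked independently of homestead with a plain
TCP connect. This is a minimal sketch, not a Clearwater tool: the helper name
port_open and the three-second timeout are assumptions made for illustration.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and raises OSError on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from the Dime node, port_open("192.168.0.8", 9160) tests the Thrift port
named in the homestead log, and port_open("127.0.0.1", 9042) on the Vellum node
tests the port that cqlsh tries. A False result only shows that nothing accepted
the connection; it does not say whether Cassandra is down or simply not
listening on that address.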

Next, on the Vellum node the cassandra process is running but not working at 
all. I have attached the system log files as well as cassandra.yaml and the 
cassandra-env.sh file. For test purposes I have allowed all incoming TCP/UDP 
traffic to and from all nodes.

Could you kindly look at the logs and outputs and point out what I am doing wrong?

[Monit summary of dime]
root@dime-0:/home/ubuntu# monit summary
Monit 5.18.1 uptime: 6h 27m
 Service Name                     Status                      Type
 node-dime-0.test.com             Running                     System
 snmpd_process                    Running                     Process
 ralf_process                     Running                     Process
 ntp_process                      Running                     Process
 nginx_process                    Running                     Process
 homestead_process                Does not exist              Process
 homestead-prov_process           Running                     Process
 clearwater_queue_manager_pro...  Running                     Process
 etcd_process                     Running                     Process
 clearwater_diags_monitor_pro...  Running                     Process
 clearwater_config_manager_pr...  Running                     Process
 clearwater_cluster_manager_p...  Running                     Process
 ralf_uptime                      Status ok                   Program
 poll_ralf                        Status ok                   Program
 nginx_ping                       Status ok                   Program
 nginx_uptime                     Status ok                   Program
 monit_uptime                     Status ok                   Program
 homestead_uptime                 Wait parent                 Program
 poll_homestead                   Wait parent                 Program
 check_cx_health                  Wait parent                 Program
 poll_homestead-prov              Status failed               Program
 clearwater_queue_manager_uptime  Status ok                   Program
 etcd_uptime                      Status ok                   Program
 poll_etcd_cluster                Status failed               Program
 poll_etcd                        Status ok                   Program
[Dime Local config]

[etcd cluster health dime]
root@dime-0:/home/ubuntu# clearwater-etcdctl cluster-health
member 5cd7042180fbb2a is unhealthy: got unhealthy result from http://192.168.0.6:4000
failed to check the health of member 208dd0fbcefb149c on http://192.168.0.8:4000: Get http://192.168.0.8:4000/health: dial tcp 192.168.0.8:4000: getsockopt: connection refused
member 208dd0fbcefb149c is unreachable: [http://192.168.0.8:4000] are all unreachable
member 2457d2c8a20fc738 is unhealthy: got unhealthy result from http://192.168.0.5:4000
member 48fa49be2b2ae2c2 is unhealthy: got unhealthy result from http://192.168.0.3:4000
failed to check the health of member a48134f48b185ad9 on http://192.168.0.7:4000: Get http://192.168.0.7:4000/health: dial tcp 192.168.0.7:4000: getsockopt: connection refused
member a48134f48b185ad9 is unreachable: [http://192.168.0.7:4000] are all unreachable
member e7db4eebbdb94a11 is unhealthy: got unhealthy result from http://192.168.0.4:4000
cluster is unhealthy

[local config file dime]
root@dime-0:/home/ubuntu# cat /etc/clearwater/local_config
local_ip=192.168.0.4
public_ip=10.1.10.192
public_hostname=dime-0.test.com
etcd_cluster=192.168.0.3,192.168.0.4,192.168.0.5,192.168.0.6,192.168.0.7,192.168.0.8

[shared config file dime]
root@vellum-0:/home/ubuntu# cat /etc/clearwater/shared_config
# Deployment definitions
home_domain=test.com
sprout_hostname=sprout.test.com
hs_hostname=hs.test.com:8888
hs_provisioning_hostname=hs-prov.test.com:8889
dime_hostname=dime.test.com:10888
xdms_hostname=homer.test.com:7888
sprout_impi_store=vellum.test.com
sprout_registration_store=vellum.test.com
cassandra_hostname=vellum.test.com
chronos_hostname=vellum.test.com
dime_session_store=vellum.test.com

upstream_port=0

# Email server configuration
smtp_smarthost=localhost
smtp_username=username
smtp_password=password
[email protected]

# Keys
signup_key=secret
turn_workaround=secret
ellis_api_key=secret
ellis_cookie_key=secret

[Error Log Homestead]
Thrift: Sun Aug 27 05:04:01 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:04:02 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:04:44 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:04:44 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:05:08 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:05:08 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:05:30 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:05:30 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:05:38 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused
Thrift: Sun Aug 27 05:05:38 2017 TSocket::open() error on socket (after THRIFT_POLL) <Host: 192.168.0.8 Port: 9160>Connection refused

[Running cqlsh on vellum]
root@vellum-0:/home/ubuntu# cqlsh
Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")}

[Monit summary of vellum]
root@vellum-0:/home/ubuntu# monit summary
Monit 5.18.1 uptime: 6h 39m
 Service Name                     Status                      Type
 node-vellum-0.test.com           Running                     System
 snmpd_process                    Running                     Process
 ntp_process                      Running                     Process
 memcached_process                Running                     Process
 clearwater_queue_manager_pro...  Running                     Process
 etcd_process                     Execution failed | Does...  Process
 clearwater_diags_monitor_pro...  Running                     Process
 clearwater_config_manager_pr...  Running                     Process
 clearwater_cluster_manager_p...  Running                     Process
 cassandra_process                Running                     Process
 chronos_process                  Running                     Process
 astaire_process                  Running                     Process
 monit_uptime                     Status ok                   Program
 memcached_uptime                 Status ok                   Program
 poll_memcached                   Status ok                   Program
 clearwater_queue_manager_uptime  Status ok                   Program
 etcd_uptime                      Wait parent                 Program
 poll_etcd_cluster                Wait parent                 Program
 poll_etcd                        Wait parent                 Program
 cassandra_uptime                 Status ok                   Program
 poll_cassandra                   Status ok                   Program
 poll_cqlsh                       Status ok                   Program
 chronos_uptime                   Status ok                   Program
 poll_chronos                     Status failed               Program
 astaire_uptime                   Status ok                   Program

[local config file vellum]
root@vellum-0:/home/ubuntu# cat /etc/clearwater/local_config
local_ip=192.168.0.8
public_ip=10.1.10.204
public_hostname=vellum-0.test.com
etcd_cluster=192.168.0.3,192.168.0.4,192.168.0.5,192.168.0.6,192.168.0.7,192.168.0.8


[shared config file vellum]
root@vellum-0:/home/ubuntu# cat /etc/clearwater/shared_config
# Deployment definitions
home_domain=test.com
sprout_hostname=sprout.test.com
hs_hostname=hs.test.com:8888
hs_provisioning_hostname=hs-prov.test.com:8889
dime_hostname=dime.test.com:10888
xdms_hostname=homer.test.com:7888
sprout_impi_store=vellum.test.com
sprout_registration_store=vellum.test.com
cassandra_hostname=vellum.test.com
chronos_hostname=vellum.test.com
dime_session_store=vellum.test.com

upstream_port=0

# Email server configuration
smtp_smarthost=localhost
smtp_username=username
smtp_password=password
[email protected]

# Keys
signup_key=secret
turn_workaround=secret
ellis_api_key=secret
ellis_cookie_key=secret

[netstat on vellum]
root@vellum-0:/home/ubuntu# netstat -tulnap
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 192.168.0.8:11211       0.0.0.0:*               LISTEN      27005/memcached
tcp        0      0 127.0.0.1:7253          0.0.0.0:*               LISTEN      27610/chronos
tcp        0      0 255.255.255.255:7253    0.0.0.0:*               LISTEN      27610/chronos
tcp        0      0 127.0.0.1:53            0.0.0.0:*               LISTEN      7718/dnsmasq
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1189/sshd
tcp        0      0 127.0.0.1:2812          0.0.0.0:*               LISTEN      7833/monit
tcp        0      0 192.168.0.8:44035       192.168.0.8:11211       TIME_WAIT   -
tcp        0      0 127.0.0.1:54026         127.0.0.1:7253          TIME_WAIT   -
tcp        0      0 192.168.0.8:44064       192.168.0.8:11211       TIME_WAIT   -
tcp        0      0 192.168.0.8:44081       192.168.0.8:11211       TIME_WAIT   -
tcp        0      0 192.168.0.8:44053       192.168.0.8:11211       TIME_WAIT   -
tcp        0      0 192.168.0.8:44054       192.168.0.8:11211       TIME_WAIT   -
tcp        0      0 192.168.0.8:44048       192.168.0.8:11211       TIME_WAIT   -
tcp        0      0 192.168.0.8:22          10.1.10.112:51998       ESTABLISHED 26454/sshd: ubuntu
tcp        0    268 192.168.0.8:22          10.1.10.112:51933       ESTABLISHED 22779/sshd: ubuntu
tcp6       0      0 :::11311                :::*                    LISTEN      26657/astaire
tcp6       0      0 ::1:53                  :::*                    LISTEN      7718/dnsmasq
tcp6       0      0 :::22                   :::*                    LISTEN      1189/sshd
udp        0      0 127.0.0.1:53            0.0.0.0:*                           7718/dnsmasq
udp        0      0 0.0.0.0:68              0.0.0.0:*                           601/dhclient
udp        0      0 192.168.0.8:123         0.0.0.0:*                           7362/ntpd
udp        0      0 127.0.0.1:123           0.0.0.0:*                           7362/ntpd
udp        0      0 0.0.0.0:123             0.0.0.0:*                           7362/ntpd
udp        0      0 0.0.0.0:161             0.0.0.0:*                           7472/snmpd
udp        0      0 0.0.0.0:55423           0.0.0.0:*                           601/dhclient
udp6       0      0 :::23767                :::*                                601/dhclient
udp6       0      0 ::1:53                  :::*                                7718/dnsmasq
udp6       0      0 ::1:123                 :::*                                7362/ntpd
udp6       0      0 fe80::f816:3eff:fe3:123 :::*                                7362/ntpd
udp6       0      0 :::123                  :::*                                7362/ntpd
udp6       0      0 :::161                  :::*                                7472/snmpd

[Ping results]
root@vellum-0:/home/ubuntu# ping hs-prov.test.com
PING hs-prov.test.com (192.168.0.4) 56(84) bytes of data.
64 bytes from 192.168.0.4: icmp_seq=1 ttl=64 time=10.3 ms
64 bytes from 192.168.0.4: icmp_seq=2 ttl=64 time=12.1 ms
64 bytes from 192.168.0.4: icmp_seq=3 ttl=64 time=9.06 ms
^C
--- hs-prov.test.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2032ms
rtt min/avg/max/mdev = 9.063/10.529/12.151/1.270 ms
root@vellum-0:/home/ubuntu#
root@vellum-0:/home/ubuntu# ping hs.test.com
PING hs.test.com (192.168.0.4) 56(84) bytes of data.
64 bytes from 192.168.0.4: icmp_seq=1 ttl=64 time=21.3 ms
64 bytes from 192.168.0.4: icmp_seq=2 ttl=64 time=4.58 ms
64 bytes from 192.168.0.4: icmp_seq=3 ttl=64 time=20.6 ms
^C
--- hs.test.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2070ms
rtt min/avg/max/mdev = 4.584/15.523/21.363/7.741 ms
root@vellum-0:/home/ubuntu# ping dime-0.test.com
PING dime-0.test.com (10.1.10.192) 56(84) bytes of data.
64 bytes from 10.1.10.192: icmp_seq=1 ttl=63 time=25.9 ms
64 bytes from 10.1.10.192: icmp_seq=2 ttl=63 time=58.9 ms
^C
--- dime-0.test.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1029ms
rtt min/avg/max/mdev = 25.954/42.459/58.964/16.505 ms

root@bono-0:/home/ubuntu# ping vellum-0.test.com
PING vellum-0.test.com (10.1.10.204) 56(84) bytes of data.
64 bytes from 10.1.10.204: icmp_seq=1 ttl=63 time=36.7 ms
64 bytes from 10.1.10.204: icmp_seq=2 ttl=63 time=25.2 ms
^C
--- vellum-0.test.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 25.240/30.985/36.730/5.745 ms
root@bono-0:/home/ubuntu# ping vellum.test.com
PING vellum.test.com (192.168.0.8) 56(84) bytes of data.
64 bytes from 192.168.0.8: icmp_seq=1 ttl=64 time=51.4 ms
64 bytes from 192.168.0.8: icmp_seq=2 ttl=64 time=3.93 ms
64 bytes from 192.168.0.8: icmp_seq=3 ttl=64 time=4.22 ms

Regards,

Abdul Basit Alvi

_______________________________________________
Clearwater mailing list
[email protected]
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
