I'm running into a quirky issue with Brisk 1.0 Beta 2 (w/ Cassandra 0.8.1).

I think the last node in our cluster (10.201.x.x) is having problems.
OpsCenter shows it as unresponsive and nodetool can't even connect to it
locally, but the rest of the cluster sees it as up.

If I run nodetool ring from one of the first 11 nodes, I get this output...
everything is up:

ubuntu@ip-10-85-x-x:~/brisk/resources/cassandra$ bin/nodetool -h localhost ring
Address         DC          Rack        Status State   Load            Owns    Token
                                                                               148873535527910577765226390751398592512
10.2.x.x        DC1         RAC1        Up     Normal  901.57 GB       12.50%  0
10.116.x.x      DC2         RAC1        Up     Normal  258.22 GB       6.25%   10633823966279326983230456482242756608
10.110.x.x      DC1         RAC1        Up     Normal  129.07 GB       6.25%   21267647932558653966460912964485513216
10.2.x.x        DC1         RAC1        Up     Normal  128.5 GB        12.50%  42535295865117307932921825928971026432
10.114.x.x      DC2         RAC1        Up     Normal  257.31 GB       6.25%   53169119831396634916152282411213783040
10.210.x.x      DC1         RAC1        Up     Normal  128.66 GB       6.25%   63802943797675961899382738893456539648
10.207.x.x      DC1         RAC2        Up     Normal  643.12 GB       12.50%  85070591730234615865843651857942052864
10.85.x.x       DC2         RAC1        Up     Normal  256.76 GB       6.25%   95704415696513942849074108340184809472
10.2.x.x        DC1         RAC2        Up     Normal  128.95 GB       6.25%   106338239662793269832304564822427566080
10.96.x.x       DC1         RAC2        Up     Normal  128.29 GB       12.50%  127605887595351923798765477786913079296
10.194.x.x      DC2         RAC1        Up     Normal  257.14 GB       6.25%   138239711561631250781995934269155835904
10.201.x.x      DC1         RAC2        Up     Normal  129.45 GB       6.25%   148873535527910577765226390751398592512

However, OpsCenter shows the last node (10.201.x.x) as unresponsive:
http://blueplastic.com/accenture/unresponsive.PNG

And if I try to run nodetool ring from the 10.201.x.x node, I get connection
errors like this:

ubuntu@ip-10-194-x-x:~/brisk/resources/cassandra$ bin/nodetool -h localhost ring
Error connection to remote JMX agent!
java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
        java.net.SocketTimeoutException: Read timed out]
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:338)
        at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
        at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:141)
        at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:111)
        at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:559)
Caused by: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
        java.net.SocketTimeoutException: Read timed out]
        at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:101)
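
Since the failure is a read timeout during the JRMP handshake rather than a
refused connection, my next sanity check is whether port 7199 even accepts a
plain TCP connect from the node itself; roughly this (nc flags as on the
Ubuntu netcat, adjust if yours differs):

nc -zv -w 5 localhost 7199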


The tpstats command fails the same way:

ubuntu@ip-10-194-x-x:~/brisk/resources/cassandra$ bin/nodetool -h localhost tpstats
Error connection to remote JMX agent!
java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
        java.net.SocketTimeoutException: Read timed out]
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:338)


Looking at port 7199 on the node shows the LISTEN socket plus a pile of connections stuck in CLOSE_WAIT:

ubuntu@ip-10-194-x-x:~/brisk/resources/cassandra$ sudo netstat -anp | grep 7199
tcp        0      0 0.0.0.0:7199            0.0.0.0:*               LISTEN      1459/java
tcp        8      0 10.194.x.x:7199         10.2.x.x:40135          CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199          127.0.0.1:49835         CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199          127.0.0.1:55087         CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199          127.0.0.1:49837         CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199          127.0.0.1:55647         CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199          127.0.0.1:49833         CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199          127.0.0.1:52935         CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199          127.0.0.1:52940         CLOSE_WAIT  -
tcp        8      0 10.194.x.x:7199         10.2.x.x:40141          CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199          127.0.0.1:52936         CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199          127.0.0.1:55646         CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199          127.0.0.1:39098         CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199          127.0.0.1:39095         CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199          127.0.0.1:55086         CLOSE_WAIT  -
tcp        8      0 127.0.0.1:7199          127.0.0.1:50575         CLOSE_WAIT  -
[list truncated, there are about 20 more lines]
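
If it's useful, this is roughly how I plan to tally the stuck sockets (plain
netstat/awk; the count will only be approximate since I truncated the output
above):

sudo netstat -an | awk '$4 ~ /:7199$/ && $6 == "CLOSE_WAIT"' | wc -l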


None of the system.log files in /var/log/cassandra have been touched in the
last two days. The tail of system.log.1 shows:

FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,120 Configuration.java (line 1256) error parsing conf file: java.io.FileNotFoundException: /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,121 Configuration.java (line 1256) error parsing conf file: java.io.FileNotFoundException: /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,123 Configuration.java (line 1256) error parsing conf file: java.io.FileNotFoundException: /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,124 Configuration.java (line 1256) error parsing conf file: java.io.FileNotFoundException: /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
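
Given the "Too many open files" errors, my next step is to compare the
process's file descriptor limit against what it actually has open; something
like this (assuming the java process is still PID 1459, as in the netstat
output above):

sudo grep 'open files' /proc/1459/limits
sudo ls /proc/1459/fd | wc -l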

Also, this node might have a memory leak or a spinning thread. Check out the
top output from it (note the CPU & MEM columns):
http://blueplastic.com/accenture/top.PNG
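
To check for a spinning thread I figure I can grab a couple of thread dumps
from the JVM and compare them; something like this (again assuming PID 1459;
jstack may not be on the box if only a JRE is installed, and kill -3 just asks
the JVM to dump its threads to its stdout log without stopping it):

sudo jstack 1459 > /tmp/brisk-threads-1.txt
sudo kill -3 1459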


Anything else I can do to troubleshoot this? Is this a known issue that I
can just ignore and reboot the node?

- Sameer
