Single faulty node brings down entire cluster. No reads/writes possible
-----------------------------------------------------------------------
Key: CASSANDRA-3226
URL: https://issues.apache.org/jira/browse/CASSANDRA-3226
Project: Cassandra
Issue Type: Bug
Components: Core
Affects Versions: 0.8.5
Environment: linux
Reporter: Thibaut
No client is able to read anything from the entire cluster anymore. This
occured a few times so far, but I can't reproduce it.
Looks like an OOM directly after starting up the node? Restarting the node
"solves" the issue. I also have to kill the node with -9 because normal kill
won't kill the node.
Healty nodes:
*.13:
Mode: Normal
Not sending any streams.
Not receiving any streams.
Pool Name Active Pending Completed
Commands n/a 15237 416868
Responses n/a 0 126721
*.14.
Mode: Normal
Not sending any streams.
Not receiving any streams.
Pool Name Active Pending Completed
Commands n/a 15387 437325
Responses n/a 0 131066
*.15:
Mode: Normal
Not sending any streams.
Not receiving any streams.
Pool Name Active Pending Completed
Commands n/a 15530 368771
Responses n/a 0 145168
etc... The pending commands at the healty nodes all increase.
Faulty node before restart:
/software/cassandra/bin/nodetool -h localhost info
Token : f33
Gossip active : true
Load : 130.67 GB
Generation No : 1316197687
Uptime (seconds) : 137051
Heap Memory (MB) : 3580.22 / 3614.00
Data Center : datacenter1
Rack : rack1
Exceptions : 108
/software/cassandra/bin/nodetool -h localhost netstats
Mode: Normal
Not sending any streams.
Not receiving any streams.
Pool Name Active Pending Completed
Commands n/a 0 29696566
Responses n/a 560 26650981
Log excerpt:
INFO [GossipStage:3] 2011-09-18 09:16:46,254 Gossiper.java (line 713) Node
/192.168.0.11 has restarted, now UP again
INFO [GossipStage:3] 2011-09-18 09:16:46,255 Gossiper.java (line 681)
InetAddress /192.168.0.11 is now UP
INFO [GossipStage:3] 2011-09-18 09:16:46,255 StorageService.java (line 815)
Node /192.168.0.11 state jump to normal
INFO [GossipStage:3] 2011-09-18 09:16:46,257 StorageService.java (line 815)
Node /192.168.0.11 state jump to normal
INFO [GossipStage:3] 2011-09-18 09:16:54,984 StorageService.java (line 815)
Node /192.168.0.6 state jump to normal
INFO [GossipStage:3] 2011-09-18 09:16:54,984 Gossiper.java (line 681)
InetAddress /192.168.0.6 is now UP
INFO [GossipStage:3] 2011-09-18 09:16:56,262 StorageService.java (line 815)
Node /192.168.0.18 state jump to normal
INFO [GossipStage:3] 2011-09-18 09:16:56,263 Gossiper.java (line 681)
InetAddress /192.168.0.18 is now UP
INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 713) Node
/192.168.0.1 has restarted, now UP again
INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 681)
InetAddress /192.168.0.1 is now UP
INFO [GossipStage:3] 2011-09-18 09:17:06,272 StorageService.java (line 815)
Node /192.168.0.1 state jump to normal
INFO [HintedHandoff:1] 2011-09-18 09:20:49,846 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.8
INFO [HintedHandoff:1] 2011-09-18 09:20:49,847 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.8
INFO [HintedHandoff:1] 2011-09-18 09:21:45,430 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.7
INFO [HintedHandoff:1] 2011-09-18 09:21:45,696 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.7
INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.20
INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.20
INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.9
INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.9
INFO [HintedHandoff:1] 2011-09-18 09:23:05,202 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.3
INFO [HintedHandoff:1] 2011-09-18 09:23:05,203 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.3
INFO [HintedHandoff:1] 2011-09-18 09:23:08,611 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.17
INFO [HintedHandoff:1] 2011-09-18 09:23:08,612 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.17
INFO [HintedHandoff:1] 2011-09-18 09:23:22,687 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.11
INFO [HintedHandoff:1] 2011-09-18 09:23:22,688 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.11
INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.6
INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.6
ERROR [Thread-549] 2011-09-18 09:24:26,806 AbstractCassandraDaemon.java (line
139) Fatal exception in thread Thread[Thread-549,5,main]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut
down
at
org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
at
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
at
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
at
org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
at
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
INFO [HintedHandoff:1] 2011-09-18 09:23:05,202 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.3
INFO [HintedHandoff:1] 2011-09-18 09:23:05,203 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.3
INFO [HintedHandoff:1] 2011-09-18 09:23:08,611 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.17
INFO [HintedHandoff:1] 2011-09-18 09:23:08,612 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.17
INFO [HintedHandoff:1] 2011-09-18 09:23:22,687 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.11
INFO [HintedHandoff:1] 2011-09-18 09:23:22,688 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.11
INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.6
INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.6
ERROR [Thread-549] 2011-09-18 09:24:26,806 AbstractCassandraDaemon.java (line
139) Fatal exception in thread Thread[Thread-549,5,main]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut
down
at
org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
at
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
at
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
at
org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
at
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.18
INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.18
ERROR [Thread-554] 2011-09-18 09:24:43,853 AbstractCassandraDaemon.java (line
139) Fatal exception in thread Thread[Thread-554,5,main]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut
down
at
org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
at
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
at
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
at
org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
at
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
ERROR [Thread-542] 2011-09-18 09:24:43,924 AbstractCassandraDaemon.java (line
139) Fatal exception in thread Thread[Thread-542,5,main]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut
down
at
org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
at
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
at
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
at
org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
at
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java (line
323) Started hinted handoff for endpoint /192.168.0.1
INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java (line
379) Finished hinted handoff of 0 rows to endpoint /192.168.0.1
ERROR [Thread-535] 2011-09-18 09:25:03,995 AbstractCassandraDaemon.java (line
139) Fatal exception in thread Thread[Thread-535,5,main]
If I restart the node (kill -9), everything works fine again. No OOM!.
/software/scripts# /software/cassandra/bin/nodetool -h localhost netstats
Mode: Normal
Not sending any streams.
Not receiving any streams.
Pool Name Active Pending Completed
Commands n/a 0 131351
Responses n/a 0 190697
/software/scripts# /software/cassandra/bin/nodetool -h localhost info
Token : f33
Gossip active : true
Load : 88.54 GB
Generation No : 1316334645
Uptime (seconds) : 494
Heap Memory (MB) : 1582.95 / 3614.00
Data Center : datacenter1
Rack : rack1
Exceptions : 0
/software/scripts# /software/cassandra/bin/nodetool -h localhost ring
Address DC Rack Status State Load Owns
Token
ffffffffffffffff
192.168.0.1 datacenter1 rack1 Up Normal 85.2 GB 0.02%
0cc
192.168.0.2 datacenter1 rack1 Up Normal 86.94 GB 0.02%
199
192.168.0.3 datacenter1 rack1 Up Normal 85.24 GB 0.02%
266
192.168.0.4 datacenter1 rack1 Up Normal 86.38 GB 0.02%
333
192.168.0.5 datacenter1 rack1 Up Normal 86.96 GB 0.02%
400
192.168.0.6 datacenter1 rack1 Up Normal 86.17 GB 0.02%
4cc
192.168.0.7 datacenter1 rack1 Up Normal 83.88 GB 0.02%
599
192.168.0.8 datacenter1 rack1 Up Normal 84.42 GB 0.02%
666
192.168.0.9 datacenter1 rack1 Up Normal 85.06 GB 0.02%
733
192.168.0.10 datacenter1 rack1 Up Normal 83.08 GB 0.02%
7ff
192.168.0.11 datacenter1 rack1 Up Normal 86.22 GB 0.02%
8cc
192.168.0.12 datacenter1 rack1 Up Normal 85.94 GB 0.02%
999
192.168.0.13 datacenter1 rack1 Up Normal 85.01 GB 0.02%
a66
192.168.0.14 datacenter1 rack1 Up Normal 86.5 GB 0.02%
b33
192.168.0.15 datacenter1 rack1 Up Normal 83.33 GB 0.02%
c00
192.168.0.16 datacenter1 rack1 Up Normal 84.41 GB 0.02%
ccc
192.168.0.17 datacenter1 rack1 Up Normal 86.97 GB 28.51%
d99
192.168.0.18 datacenter1 rack1 Up Normal 112.63 GB 41.88%
e66
192.168.0.19 datacenter1 rack1 Up Normal 88.56 GB 29.27%
f33
192.168.0.20 datacenter1 rack1 Up Normal 85.83 GB 0.02%
ffffffffffffffff
Interestingly, after restart, the node load (from nodetool info) is reduced).
Any ideas? The node doesn't seem to have any hardware memory issues.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira