[
https://issues.apache.org/jira/browse/CASSANDRA-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107824#comment-13107824
]
Thibaut commented on CASSANDRA-3226:
------------------------------------
Do I have to enable read repair for the failover to work? Aren't host scores
being updated if a command/request doesn't get a valid response?
I checked the mbean of the dynamic snitch and all hosts have sound values (now
where everything is working). Stopping one node also doesn't trigger this
problem, and requests are being routed to different hosts. The problem only
occurs e.g. if the node is in above state, and does seem to sent back invalid
responses? (The handled requests counter increased very fast).
> Single faulty node brings down entire cluster. No reads/writes possible
> -----------------------------------------------------------------------
>
> Key: CASSANDRA-3226
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3226
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.8.5
> Environment: linux
> Reporter: Thibaut
> Attachments: jstack
>
>
> No client is able to read anything from the entire cluster anymore. This
> occured a few times so far, but I can't reproduce it.
> Looks like an OOM directly after starting up the node? Restarting the node
> "solves" the issue. I also have to kill the node with -9 because normal kill
> won't kill the node.
> Healty nodes:
> *.13:
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name Active Pending Completed
> Commands n/a 15237 416868
> Responses n/a 0 126721
> *.14.
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name Active Pending Completed
> Commands n/a 15387 437325
> Responses n/a 0 131066
> *.15:
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name Active Pending Completed
> Commands n/a 15530 368771
> Responses n/a 0 145168
> etc... The pending commands at the healty nodes all increase.
> Faulty node before restart:
> /software/cassandra/bin/nodetool -h localhost info
> Token : f33
> Gossip active : true
> Load : 130.67 GB
> Generation No : 1316197687
> Uptime (seconds) : 137051
> Heap Memory (MB) : 3580.22 / 3614.00
> Data Center : datacenter1
> Rack : rack1
> Exceptions : 108
> /software/cassandra/bin/nodetool -h localhost netstats
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name Active Pending Completed
> Commands n/a 0 29696566
> Responses n/a 560 26650981
> Log excerpt:
> INFO [GossipStage:3] 2011-09-18 09:16:46,254 Gossiper.java (line 713) Node
> /192.168.0.11 has restarted, now UP again
> INFO [GossipStage:3] 2011-09-18 09:16:46,255 Gossiper.java (line 681)
> InetAddress /192.168.0.11 is now UP
> INFO [GossipStage:3] 2011-09-18 09:16:46,255 StorageService.java (line 815)
> Node /192.168.0.11 state jump to normal
> INFO [GossipStage:3] 2011-09-18 09:16:46,257 StorageService.java (line 815)
> Node /192.168.0.11 state jump to normal
> INFO [GossipStage:3] 2011-09-18 09:16:54,984 StorageService.java (line 815)
> Node /192.168.0.6 state jump to normal
> INFO [GossipStage:3] 2011-09-18 09:16:54,984 Gossiper.java (line 681)
> InetAddress /192.168.0.6 is now UP
> INFO [GossipStage:3] 2011-09-18 09:16:56,262 StorageService.java (line 815)
> Node /192.168.0.18 state jump to normal
> INFO [GossipStage:3] 2011-09-18 09:16:56,263 Gossiper.java (line 681)
> InetAddress /192.168.0.18 is now UP
> INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 713) Node
> /192.168.0.1 has restarted, now UP again
> INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 681)
> InetAddress /192.168.0.1 is now UP
> INFO [GossipStage:3] 2011-09-18 09:17:06,272 StorageService.java (line 815)
> Node /192.168.0.1 state jump to normal
> INFO [HintedHandoff:1] 2011-09-18 09:20:49,846 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.8
> INFO [HintedHandoff:1] 2011-09-18 09:20:49,847 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.8
> INFO [HintedHandoff:1] 2011-09-18 09:21:45,430 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.7
> INFO [HintedHandoff:1] 2011-09-18 09:21:45,696 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.7
> INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.20
> INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.20
> INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.9
> INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.9
> INFO [HintedHandoff:1] 2011-09-18 09:23:05,202 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.3
> INFO [HintedHandoff:1] 2011-09-18 09:23:05,203 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.3
> INFO [HintedHandoff:1] 2011-09-18 09:23:08,611 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.17
> INFO [HintedHandoff:1] 2011-09-18 09:23:08,612 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.17
> INFO [HintedHandoff:1] 2011-09-18 09:23:22,687 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.11
> INFO [HintedHandoff:1] 2011-09-18 09:23:22,688 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.11
> INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.6
> INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.6
> ERROR [Thread-549] 2011-09-18 09:24:26,806 AbstractCassandraDaemon.java (line
> 139) Fatal exception in thread Thread[Thread-549,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut
> down
> at
> org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
> at
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
> at
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
> at
> org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
> at
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
> INFO [HintedHandoff:1] 2011-09-18 09:23:05,202 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.3
> INFO [HintedHandoff:1] 2011-09-18 09:23:05,203 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.3
> INFO [HintedHandoff:1] 2011-09-18 09:23:08,611 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.17
> INFO [HintedHandoff:1] 2011-09-18 09:23:08,612 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.17
> INFO [HintedHandoff:1] 2011-09-18 09:23:22,687 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.11
> INFO [HintedHandoff:1] 2011-09-18 09:23:22,688 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.11
> INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.6
> INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.6
> ERROR [Thread-549] 2011-09-18 09:24:26,806 AbstractCassandraDaemon.java (line
> 139) Fatal exception in thread Thread[Thread-549,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut
> down
> at
> org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
> at
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
> at
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
> at
> org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
> at
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
> INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.18
> INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.18
> ERROR [Thread-554] 2011-09-18 09:24:43,853 AbstractCassandraDaemon.java (line
> 139) Fatal exception in thread Thread[Thread-554,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut
> down
> at
> org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
> at
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
> at
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
> at
> org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
> at
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
> ERROR [Thread-542] 2011-09-18 09:24:43,924 AbstractCassandraDaemon.java (line
> 139) Fatal exception in thread Thread[Thread-542,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut
> down
> at
> org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
> at
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
> at
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
> at
> org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
> at
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
> INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java
> (line 323) Started hinted handoff for endpoint /192.168.0.1
> INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java
> (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.1
> ERROR [Thread-535] 2011-09-18 09:25:03,995 AbstractCassandraDaemon.java (line
> 139) Fatal exception in thread Thread[Thread-535,5,main]
> If I restart the node (kill -9), everything works fine again. No OOM!.
> /software/scripts# /software/cassandra/bin/nodetool -h localhost netstats
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name Active Pending Completed
> Commands n/a 0 131351
> Responses n/a 0 190697
> /software/scripts# /software/cassandra/bin/nodetool -h localhost info
> Token : f33
> Gossip active : true
> Load : 88.54 GB
> Generation No : 1316334645
> Uptime (seconds) : 494
> Heap Memory (MB) : 1582.95 / 3614.00
> Data Center : datacenter1
> Rack : rack1
> Exceptions : 0
> /software/scripts# /software/cassandra/bin/nodetool -h localhost ring
> Address DC Rack Status State Load Owns
> Token
>
> ffffffffffffffff
> 192.168.0.1 datacenter1 rack1 Up Normal 85.2 GB 0.02%
> 0cc
> 192.168.0.2 datacenter1 rack1 Up Normal 86.94 GB 0.02%
> 199
> 192.168.0.3 datacenter1 rack1 Up Normal 85.24 GB 0.02%
> 266
> 192.168.0.4 datacenter1 rack1 Up Normal 86.38 GB 0.02%
> 333
> 192.168.0.5 datacenter1 rack1 Up Normal 86.96 GB 0.02%
> 400
> 192.168.0.6 datacenter1 rack1 Up Normal 86.17 GB 0.02%
> 4cc
> 192.168.0.7 datacenter1 rack1 Up Normal 83.88 GB 0.02%
> 599
> 192.168.0.8 datacenter1 rack1 Up Normal 84.42 GB 0.02%
> 666
> 192.168.0.9 datacenter1 rack1 Up Normal 85.06 GB 0.02%
> 733
> 192.168.0.10 datacenter1 rack1 Up Normal 83.08 GB 0.02%
> 7ff
> 192.168.0.11 datacenter1 rack1 Up Normal 86.22 GB 0.02%
> 8cc
> 192.168.0.12 datacenter1 rack1 Up Normal 85.94 GB 0.02%
> 999
> 192.168.0.13 datacenter1 rack1 Up Normal 85.01 GB 0.02%
> a66
> 192.168.0.14 datacenter1 rack1 Up Normal 86.5 GB 0.02%
> b33
> 192.168.0.15 datacenter1 rack1 Up Normal 83.33 GB 0.02%
> c00
> 192.168.0.16 datacenter1 rack1 Up Normal 84.41 GB 0.02%
> ccc
> 192.168.0.17 datacenter1 rack1 Up Normal 86.97 GB 28.51%
> d99
> 192.168.0.18 datacenter1 rack1 Up Normal 112.63 GB 41.88%
> e66
> 192.168.0.19 datacenter1 rack1 Up Normal 88.56 GB 29.27%
> f33
> 192.168.0.20 datacenter1 rack1 Up Normal 85.83 GB 0.02%
> ffffffffffffffff
> Interestingly, after restart, the node load (from nodetool info) is reduced).
> Any ideas? The node doesn't seem to have any hardware memory issues.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira