Sebastian Estevez created CASSANDRA-9279:
--------------------------------------------
Summary: Gossip locks up on startup
Key: CASSANDRA-9279
URL: https://issues.apache.org/jira/browse/CASSANDRA-9279
Project: Cassandra
Issue Type: Bug
Reporter: Sebastian Estevez
Attachments: Screen Shot 2015-04-30 at 4.41.57 PM.png
Cluster running 2.0.14.352 on EC2 c3.4xl instances.
When starting up the node, we noticed it was gray in OpsCenter, while another
monitoring tool showed it as up.
It turned out gossip tasks were piling up, and we could see the following in the
system.log:
{code}
WARN [GossipTasks:1] 2015-04-30 20:22:29,512 Gossiper.java (line 671) Gossip stage has 4270 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2015-04-30 20:22:30,612 Gossiper.java (line 671) Gossip stage has 4272 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2015-04-30 20:22:31,713 Gossiper.java (line 671) Gossip stage has 4273 pending tasks; skipping status check (no nodes will be marked down)
...
{code}
and tpstats shows blocked tasks for both gossip and mutations:
{code}
GossipStage 1 3904 29384 0 0
{code}
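To quantify how fast the gossip backlog grows, the warnings above can be parsed for their timestamps and pending-task counts. The following is only a rough Python sketch: the regular expression is an assumption derived from the three warning lines quoted in this report, and the function names `pending_samples` and `growth_per_second` are hypothetical helpers, not part of Cassandra or nodetool.

```python
import re
from datetime import datetime

# Pattern assumed from the Gossiper warnings quoted in this report.
# \s+ between words tolerates lines that were hard-wrapped by a mailer.
PENDING_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})[^\n]*?"
    r"Gossip\s+stage\s+has\s+(\d+)\s+pending\s+tasks"
)


def pending_samples(log_text):
    """Return (timestamp, pending_count) pairs found in the log text."""
    samples = []
    for stamp, count in PENDING_RE.findall(log_text):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
        samples.append((when, int(count)))
    return samples


def growth_per_second(samples):
    """Average change in pending tasks per second between first and last sample."""
    if len(samples) < 2:
        return 0.0
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed = (t1 - t0).total_seconds()
    return (c1 - c0) / elapsed if elapsed > 0 else 0.0


# The three warnings from this report, used as sample input.
SAMPLE_LOG = """\
WARN [GossipTasks:1] 2015-04-30 20:22:29,512 Gossiper.java (line 671) Gossip stage has 4270 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2015-04-30 20:22:30,612 Gossiper.java (line 671) Gossip stage has 4272 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2015-04-30 20:22:31,713 Gossiper.java (line 671) Gossip stage has 4273 pending tasks; skipping status check (no nodes will be marked down)
"""

if __name__ == "__main__":
    samples = pending_samples(SAMPLE_LOG)
    for when, count in samples:
        print(when.time(), count)
    print("pending tasks gained per second:", round(growth_per_second(samples), 2))
```

A rate that stays positive over a longer stretch of the log would suggest the stage is not draining at all, rather than merely being slow.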
The CPUs are inactive (see attachment), and here is the dstat output:
{code}
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  2   0  97   0   0   0|1324k 1381k|   0     0 |   0     0 |6252  5548
  0   0 100   0   0   0|   0    64k|  42k 1017k|   0     0 |3075  2537
  0   0  99   0   0   0|   0  8192B|  39k  794k|   0     0 |6999  7039
  0   0 100   0   0   0|   0     0 |  39k  759k|   0     0 |3067  2726
  0   0  99   0   0   0|   0   184k|  48k 1086k|   0     0 |4829  4178
  0   0  99   0   0   0|   0  8192B|  34k  802k|   0     0 |1671  1240
  0   0 100   0   0   0|   0  8192B|  48k 1067k|   0     0 |1878  1193
{code}
I managed to grab a thread dump:
https://gist.githubusercontent.com/anonymous/3b7b4698c32032603493/raw/read.md
and dmesg:
https://gist.githubusercontent.com/anonymous/5982b15337c9afbd5d49/raw/f3c2e4411b9d59e90f4615d93c7c1ad25922e170/read.md
Restarting the node solved the issue (it came up normally). We don't know what
is causing it, but per the thread dump, gossip threads appear to be blocked
writing to the system keyspace, and those writes are waiting on the commitlog.
Gossip:
{code}
"GossipTasks:1" daemon prio=10 tid=0x00007ffa2368f800 nid=0xa140e runnable [0x00007ffc16fb2000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000005d4378b88> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
{code}
Mutation:
{code}
"MutationStage:32" daemon prio=10 tid=0x00007ffa2339c800 nid=0xa1399 waiting on condition [0x00007ff9cd6c8000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000005d486a888> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
    at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:349)
{code}
{code}
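To see at a glance how many threads of each pool are parked, a full thread dump can be tallied by pool name and thread state. This is a minimal Python sketch, assuming the usual jstack layout where each thread starts with a quoted name and is followed by a `java.lang.Thread.State:` line; `states_by_pool` is a hypothetical helper, and the sample input is abbreviated from the two excerpts above.

```python
import re
from collections import Counter

# A thread entry is assumed to start with its quoted name at the beginning of a line.
THREAD_RE = re.compile(r'^"([^"]+)"', re.MULTILINE)
STATE_RE = re.compile(r"java\.lang\.Thread\.State:\s+(\w+)")


def states_by_pool(dump_text):
    """Count threads per (pool name, thread state) in a jstack-style dump."""
    counts = Counter()
    # re.split with one capture group yields [preamble, name1, body1, name2, body2, ...]
    parts = THREAD_RE.split(dump_text)
    for name, body in zip(parts[1::2], parts[2::2]):
        match = STATE_RE.search(body)
        if match:
            pool = name.split(":")[0]  # "MutationStage:32" -> "MutationStage"
            counts[(pool, match.group(1))] += 1
    return counts


# Abbreviated from the excerpts in this report.
SAMPLE_DUMP = '''\
"GossipTasks:1" daemon prio=10 tid=0x00007ffa2368f800 nid=0xa140e runnable [0x00007ffc16fb2000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)

"MutationStage:32" daemon prio=10 tid=0x00007ffa2339c800 nid=0xa1399 waiting on condition [0x00007ff9cd6c8000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
'''

if __name__ == "__main__":
    for (pool, state), n in sorted(states_by_pool(SAMPLE_DUMP).items()):
        print(pool, state, n)
```

Run against the complete dump linked above, a large count of MutationStage threads in WAITING would be consistent with writes queuing behind the commitlog, though it would not by itself show why the queue is full.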