[jira] [Comment Edited] (CASSANDRA-9279) Gossip (and mutations) lock up on startup

Brandon Williams (JIRA) Mon, 04 May 2015 14:50:50 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527378#comment-14527378
 ]


Brandon Williams edited comment on CASSANDRA-9279 at 5/4/15 9:50 PM:
---------------------------------------------------------------------

Here's another example:

{noformat}
ERROR [COMMIT-LOG-ALLOCATOR] 2015-05-02 04:38:54,102 CommitLog.java (line 420) 
Failed to allocate new commit log segments. Commit disk failure policy is stop; 
terminating thread
FSWriteError in /var/lib/cassandra/commitlog/CommitLog-3-1430566697635.log
        at 
org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:143)
        at 
org.apache.cassandra.db.commitlog.CommitLogAllocator$3.run(CommitLogAllocator.java:208)
        at 
org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:99)
        at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: 
/var/lib/cassandra/commitlog/CommitLog-3-1430566697635.log (Permission denied)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
        at 
org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:125)
        ... 4 more
 INFO [CompactionExecutor:4] 2015-05-02 04:38:54,130 CompactionTask.java (line 
120) Compacting 
[SSTableReader(path='/var/lib/cassandra/data/system/sstable_activity/system-sstable_activity-jb-500-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/system/sstable_activity/system-sstable_activity-jb-501-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/system/sstable_activity/system-sstable_activity-jb-499-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/system/sstable_activity/system-sstable_activity-jb-502-Data.db')]
 INFO [CompactionExecutor:12] 2015-05-02 04:38:54,140 ColumnFamilyStore.java 
(line 795) Enqueuing flush of 
Memtable-compactions_in_progress@1153169288(419/4190 serialized/live bytes, 21 
ops)
 INFO [FlushWriter:1] 2015-05-02 04:38:54,141 Memtable.java (line 358) Writing 
Memtable-compactions_in_progress@1153169288(419/4190 serialized/live bytes, 21 
ops)
 INFO [CompactionExecutor:25] 2015-05-02 04:38:54,144 ColumnFamilyStore.java 
(line 795) Enqueuing flush of 
Memtable-compactions_in_progress@1811681988(441/4410 serialized/live bytes, 20 
ops)
 INFO [FlushWriter:2] 2015-05-02 04:38:54,144 Memtable.java (line 358) Writing 
Memtable-compactions_in_progress@1811681988(441/4410 serialized/live bytes, 20 
ops)
 INFO [FlushWriter:1] 2015-05-02 04:38:54,152 Memtable.java (line 398) 
Completed flushing 
/var/lib/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-jb-138175-Data.db
 (245 bytes) for commitlog position ReplayPosition(segmentId=1430566697633, 
position=1621)

 INFO [main] 2015-05-02 04:38:54,889 StorageService.java (line 515) Cassandra 
version: 2.0.14.352
 INFO [main] 2015-05-02 04:38:54,889 StorageService.java (line 516) Thrift API 
version: 19.39.0
 INFO [main] 2015-05-02 04:38:54,893 StorageService.java (line 517) CQL 
supported versions: 2.0.0,3.1.7 (default: 3.1.7)
 INFO [main] 2015-05-02 04:38:54,911 StorageService.java (line 540) Loading 
persisted ring state
 INFO [main] 2015-05-02 04:38:55,404 StorageService.java (line 678) Starting up 
server gossip
{noformat}

As you can see, the CLA died, effectively making the machine 
useless/problematic, but everything else started up and adversely affected the 
cluster.  We should instead exit when the CLA dies.


was (Author: brandon.williams):
Here's another example:

{noformat}
ERROR [COMMIT-LOG-ALLOCATOR] 2015-05-02 04:38:54,102 CommitLog.java (line 420) 
Failed to allocate new commit log segments. Commit disk failure policy is stop; 
terminating thread
FSWriteError in /var/lib/cassandra/commitlog/CommitLog-3-1430566697635.log
        at 
org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:143)
        at 
org.apache.cassandra.db.commitlog.CommitLogAllocator$3.run(CommitLogAllocator.java:208)
        at 
org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:99)
        at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: 
/var/lib/cassandra/commitlog/CommitLog-3-1430566697635.log (Permission denied)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
        at 
org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:125)
        ... 4 more
 INFO [CompactionExecutor:4] 2015-05-02 04:38:54,130 CompactionTask.java (line 
120) Compacting 
[SSTableReader(path='/var/lib/cassandra/data/system/sstable_activity/system-sstable_activity-jb-500-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/system/sstable_activity/system-sstable_activity-jb-501-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/system/sstable_activity/system-sstable_activity-jb-499-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/system/sstable_activity/system-sstable_activity-jb-502-Data.db')]
 INFO [CompactionExecutor:12] 2015-05-02 04:38:54,140 ColumnFamilyStore.java 
(line 795) Enqueuing flush of 
Memtable-compactions_in_progress@1153169288(419/4190 serialized/live bytes, 21 
ops)
 INFO [FlushWriter:1] 2015-05-02 04:38:54,141 Memtable.java (line 358) Writing 
Memtable-compactions_in_progress@1153169288(419/4190 serialized/live bytes, 21 
ops)
 INFO [CompactionExecutor:25] 2015-05-02 04:38:54,144 ColumnFamilyStore.java 
(line 795) Enqueuing flush of 
Memtable-compactions_in_progress@1811681988(441/4410 serialized/live bytes, 20 
ops)
 INFO [FlushWriter:2] 2015-05-02 04:38:54,144 Memtable.java (line 358) Writing 
Memtable-compactions_in_progress@1811681988(441/4410 serialized/live bytes, 20 
ops)
 INFO [FlushWriter:1] 2015-05-02 04:38:54,152 Memtable.java (line 398) 
Completed flushing 
/var/lib/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-jb-138175-Data.db
 (245 bytes) for commitlog position ReplayPosition(segmentId=1430566697633, 
position=1621)
 INFO [CompactionExecutor:12] 2015-05-02 04:38:54,153 CompactionTask.java (line 
120) Compacting 
[SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles_indexRecords/UPS_v3_NAM-Profiles_indexRecords-jb-22-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles_indexRecords/UPS_v3_NAM-Profiles_indexRecords-jb-23-Data.db')]
 INFO [CompactionExecutor:28] 2015-05-02 04:38:54,153 CompactionTask.java (line 
120) Compacting 
[SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles_indexes/UPS_v3_NAM-Profiles_indexes-jb-14-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles_indexes/UPS_v3_NAM-Profiles_indexes-jb-15-Data.db')]
 INFO [FlushWriter:2] 2015-05-02 04:38:54,153 Memtable.java (line 398) 
Completed flushing 
/var/lib/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-jb-138176-Data.db
 (221 bytes) for commitlog position ReplayPosition(segmentId=1430566697633, 
position=2301)
 INFO [CompactionExecutor:20] 2015-05-02 04:38:54,153 CompactionTask.java (line 
120) Compacting 
[SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles_indexCache/UPS_v3_NAM-Profiles_indexCache-jb-20-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles_indexCache/UPS_v3_NAM-Profiles_indexCache-jb-19-Data.db')]
 INFO [CompactionExecutor:25] 2015-05-02 04:38:54,449 CompactionTask.java (line 
120) Compacting 
[SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-29968-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-30074-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-29970-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-29969-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-29976-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-29974-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-29977-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-29978-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-30077-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-30063-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-29967-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-29971-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-29972-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-30068-Data.db'),
 
SSTableReader(path='/var/lib/cassandra/data/UPS_v3_NAM/Profiles/UPS_v3_NAM-Profiles-jb-30062-Data.db')]
 INFO [main] 2015-05-02 04:38:54,889 StorageService.java (line 515) Cassandra 
version: 2.0.14.352
 INFO [main] 2015-05-02 04:38:54,889 StorageService.java (line 516) Thrift API 
version: 19.39.0
 INFO [main] 2015-05-02 04:38:54,893 StorageService.java (line 517) CQL 
supported versions: 2.0.0,3.1.7 (default: 3.1.7)
 INFO [main] 2015-05-02 04:38:54,911 StorageService.java (line 540) Loading 
persisted ring state
 INFO [main] 2015-05-02 04:38:55,404 StorageService.java (line 678) Starting up 
server gossip
{noformat}

As you can see, the CLA died, effectively making the machine 
useless/problematic, but everything else started up and adversely affected the 
cluster.  We should instead exit when the CLA dies.

> Gossip (and mutations) lock up on startup
> -----------------------------------------
>
>                 Key: CASSANDRA-9279
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9279
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sebastian Estevez
>             Fix For: 2.0.x
>
>         Attachments: Screen Shot 2015-04-30 at 4.41.57 PM.png
>
>
> Cluster running 2.0.14.352 on EC2 - c3.4xl's
> 2 nodes out of 8 exhibited the following behavior
> When starting up the node we noticed it was gray in OpsCenter. Other 
> monitoring tool showed it as up. 
> Turned out gossip tasks were piling up and we could see the following in the 
> system.log:
> {code}
>  WARN [GossipTasks:1] 2015-04-30 20:22:29,512 Gossiper.java (line 671) Gossip 
> stage has 4270 pending tasks; skipping status check (no nodes will be marked 
> down)
>  WARN [GossipTasks:1] 2015-04-30 20:22:30,612 Gossiper.java (line 671) Gossip 
> stage has 4272 pending tasks; skipping status check (no nodes will be marked 
> down)
>  WARN [GossipTasks:1] 2015-04-30 20:22:31,713 Gossiper.java (line 671) Gossip 
> stage has 4273 pending tasks; skipping status check (no nodes will be marked 
> down)
> ...
> {code}
> and tpstats shows blocked tasks--gossip and mutations:
> {code}
> GossipStage                       1      3904          29384         0        
>          0
> {code}
> the CPU's are inactive (See attachment)
> and dstat output:
> {code}
> You did not select any stats, using -cdngy by default.
> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
> usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
>   2   0  97   0   0   0|1324k 1381k|   0     0 |   0     0 |6252  5548
>   0   0 100   0   0   0|   0    64k|  42k 1017k|   0     0 |3075  2537
>   0   0  99   0   0   0|   0  8192B|  39k  794k|   0     0 |6999  7039
>   0   0 100   0   0   0|   0     0 |  39k  759k|   0     0 |3067  2726
>   0   0  99   0   0   0|   0   184k|  48k 1086k|   0     0 |4829  4178
>   0   0  99   0   0   0|   0  8192B|  34k  802k|   0     0 |1671  1240
>   0   0 100   0   0   0|   0  8192B|  48k 1067k|   0     0 |1878  1193
> {code}
> I managed to grab a thread dump:
> https://gist.githubusercontent.com/anonymous/3b7b4698c32032603493/raw/read.md
> and dmesg:
> https://gist.githubusercontent.com/anonymous/5982b15337c9afbd5d49/raw/f3c2e4411b9d59e90f4615d93c7c1ad25922e170/read.md
> Restarting the node solved the issue (it came up normally), we don't know 
> what is causing it but apparently (per the thread dump) gossip threads are 
> blocked writing the system keyspace and the writes waiting on the commitlog.
> Gossip:
> {code}
> "GossipStage:1" daemon prio=10 tid=0x00007ffa23471800 nid=0xa13fa waiting on 
> condition [0x00007ff9cbe26000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00000005d3f50960> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
>       at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
>       at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:351)
>       at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:336)
>       at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:211)
>       at 
> org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:709)
>       at 
> org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:208)
>       at 
> org.apache.cassandra.db.SystemKeyspace.updatePeerInfo(SystemKeyspace.java:379)
>       - locked <0x00000005d3f41ed8> (a java.lang.Class for 
> org.apache.cassandra.db.SystemKeyspace)
>       at 
> org.apache.cassandra.service.StorageService.updatePeerInfo(StorageService.java:1414)
>       at 
> org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1524)
>       at 
> org.apache.cassandra.service.StorageService.onChange(StorageService.java:1350)
>       at 
> org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1083)
>       at org.apache.cassandra.gms.Gossiper.applyNewStates(Gossiper.java:1065)
>       at 
> org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1023)
>       at 
> org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:58)
>       at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
>    Locked ownable synchronizers:
>       - <0x0000000609517438> (a 
> java.util.concurrent.ThreadPoolExecutor$Worker)
> ...
> {code}
> Mutation:
> {code}
> "MutationStage:32" daemon prio=10 tid=0x00007ffa2339c800 nid=0xa1399 waiting 
> on condition [0x00007ff9cd6c8000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00000005d486a888> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>       at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:349)
>       at 
> org.apache.cassandra.db.commitlog.PeriodicCommitLogExecutorService.add(PeriodicCommitLogExecutorService.java:106)
>       at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:206)
>       at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:357)
>       at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:336)
>       at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:211)
>       at 
> org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:56)
>       at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (CASSANDRA-9279) Gossip (and mutations) lock up on startup

Reply via email to