[ https://issues.apache.org/jira/browse/CASSANDRA-12655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15499243#comment-15499243 ]

Wei Deng commented on CASSANDRA-12655:
--------------------------------------

BTW, in addition to anti-compaction getting blocked by other regular 
compactions, you may have run into a separate compaction hang issue with the 
regular compactions themselves (especially since you say "nodetool 
compactionstats" also hangs forever), as Marcus pointed out. When you run into 
this issue again, if the CPU is completely idle, with no core working on any 
regular compaction thread while anti-compaction is still blocked by all the 
pending regular compactions, then you've likely hit one of the compaction hang 
bugs in earlier 2.2.x versions.
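
One quick way to check this (a rough sketch only; it assumes nodetool and 
jstack are available on the affected node, and <cassandra_pid> is a placeholder 
for the actual Cassandra process id) is to sample the CompactionExecutor 
counters twice and grab a thread dump in between:

{code}
# If Active/Pending stay non-zero but Completed does not advance
# between the two samples, the compactions are likely hung.
nodetool tpstats | grep CompactionExecutor
sleep 60
nodetool tpstats | grep CompactionExecutor

# Thread dump to see what the compaction threads are blocked on
# (<cassandra_pid> is a placeholder for the real process id).
jstack <cassandra_pid> | grep -A 15 "CompactionExecutor"
{code}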

Moving to the latest 2.2.x version will definitely help you avoid those known 
and already-fixed compaction hang problems, and that should be a required first 
step. Then you will need to wait for the improvement in CASSANDRA-12200 to 
completely prevent trivial repairs from being blocked by backlogged 
compactions. As CASSANDRA-12200 is an improvement rather than a bug fix and 
likely will not go into 2.2, you will probably need to plan to cherry-pick the 
change and back-port it to your own 2.2 build, if you don't plan to move to 3.x 
shortly.
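
If you do end up back-porting, the workflow would look roughly like this (a 
sketch only; <commit-sha> is a placeholder for whatever commit(s) eventually 
land for CASSANDRA-12200, which do not exist yet at this point):

{code}
# Hypothetical back-port workflow; <commit-sha> stands for the
# CASSANDRA-12200 commit(s) once they land upstream.
git clone https://github.com/apache/cassandra.git
cd cassandra
git checkout cassandra-2.2
git cherry-pick <commit-sha>   # resolve any conflicts against 2.2
ant artifacts                  # build your patched 2.2 artifacts
{code}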

> Incremental repair & compaction hang on random nodes
> ----------------------------------------------------
>
>                 Key: CASSANDRA-12655
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12655
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction
>         Environment: CentOS Linux release 7.1.1503 (Core)
> RAM - 64GB
> HEAP - 16GB
> Load on each node - ~5GB
> Cassandra Version - 2.2.5
>            Reporter: Navjyot Nishant
>            Priority: Blocker
>
> Hi, we are setting up incremental repair on our 18-node cluster. The average 
> load on each node is ~5GB. The repair runs fine on a couple of nodes and then 
> suddenly gets stuck on random nodes. Upon checking the system.log of an 
> impacted node we don't see much information.
> The following are the lines we see in system.log, and nothing further is 
> logged from the point the repair stops making progress:
> {code}
> INFO  [CompactionExecutor:3490] 2016-09-16 11:14:44,236 CompactionManager.java:1221 - Anticompacting [BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30832-big-Data.db'), BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30811-big-Data.db')]
> INFO  [IndexSummaryManager:1] 2016-09-16 11:14:49,954 IndexSummaryRedistribution.java:74 - Redistributing index summaries
> INFO  [IndexSummaryManager:1] 2016-09-16 12:14:49,961 IndexSummaryRedistribution.java:74 - Redistributing index summaries
> {code}
> When we try to see pending compactions by executing {code}nodetool 
> compactionstats{code} it hangs as well and doesn't return anything. However, 
> {code}nodetool tpstats{code} shows active and pending compactions which never 
> come down and keep increasing.
> {code}
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> MutationStage                     0         0         221208         0                 0
> ReadStage                         0         0        1288839         0                 0
> RequestResponseStage              0         0         104356         0                 0
> ReadRepairStage                   0         0             72         0                 0
> CounterMutationStage              0         0              0         0                 0
> HintedHandoff                     0         0             46         0                 0
> MiscStage                         0         0              0         0                 0
> CompactionExecutor                8        66          68124         0                 0
> MemtableReclaimMemory             0         0            166         0                 0
> PendingRangeCalculator            0         0             38         0                 0
> GossipStage                       0         0         242455         0                 0
> MigrationStage                    0         0              0         0                 0
> MemtablePostFlush                 0         0           3682         0                 0
> ValidationExecutor                0         0           2246         0                 0
> Sampler                           0         0              0         0                 0
> MemtableFlushWriter               0         0            166         0                 0
> InternalResponseStage             0         0           8866         0                 0
> AntiEntropyStage                  0         0          15417         0                 0
> Repair#7                          0         0            160         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> Native-Transport-Requests         0         0         327334         0                 0
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> {code}nodetool netstats{code} shows some pending messages which never get 
> processed, and nothing in progress:
> {code}
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 15585
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool Name                    Active   Pending      Completed
> Large messages                  n/a        12            562
> Small messages                  n/a         0         999779
> Gossip messages                 n/a         0         264394
> {code}
> The only solution we have is to bounce the node; after that all the pending 
> compactions start getting processed immediately and finish within 5-10 minutes.
> This is a road-blocker issue for us and any help in this matter would be 
> highly appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
