[
https://issues.apache.org/jira/browse/CASSANDRA-12655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15500435#comment-15500435
]
Navjyot Nishant commented on CASSANDRA-12655:
---------------------------------------------
Hello Wei, thanks for responding. Actually, the issue is with compaction getting
blocked; anticompaction itself is moving through without any issue.
Let me explain in detail -
1. We run incremental repair on one node at a time (see the invocation sketch
after this list).
2. When repair starts it shows completion progress, and for a large keyspace,
after showing 100%, it takes some time (a couple of minutes) to move forward
with the next keyspace. When we verified, it is actually waiting for
anticompaction to complete on all the relevant replicas; the moment
anticompaction completes on all replicas it moves forward with the next
keyspace.
3. Then compaction starts, following the anticompaction, and it sometimes hangs
on random replicas. That replica becomes unresponsive, which impacts the repair
running on the next keyspace/node, so the repair becomes unresponsive as well.
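For reference, this is roughly how we invoke the per-node repair; the keyspace
name below is just a placeholder (on 2.2 repairs are incremental by default):
{code}
# Run on one node at a time. The keyspace name is a placeholder.
# -dcpar runs the repair in parallel across datacenters; this is the
# option we later dropped as a workaround (see below).
nodetool repair -dcpar my_keyspace

# Equivalent explicit form on 2.2, where incremental is the default:
nodetool repair -inc -dcpar my_keyspace
{code}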
I am able to avoid this blocking behavior if I disable autocompaction before
starting the repair. But post repair, when I re-enable autocompaction, it gets
blocked on a random node, and the only way to resolve it is to bounce the node,
which doesn't seem practical.
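To be concrete, the workaround is just the standard autocompaction toggles
around the repair (keyspace name is a placeholder):
{code}
# Before starting the repair on the node (keyspace name is a placeholder):
nodetool disableautocompaction my_keyspace

# ... run the repair ...

# After the repair completes; this is the step that sometimes blocks
# on a random node for us:
nodetool enableautocompaction my_keyspace
{code}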
For now I am able to work around this issue by not using -dcpar. I had been
using -dcpar to speed up the repair, but the moment I removed it the repair
stopped complaining and compaction is also going through. This buys us some
time to plan for the upgrade directly to 3.x early next year.
-dcpar is working fine in other non-prod environments, but it seems to have a
problem with one of the largest keyspaces, which has tables of 3-4GB in size?
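For completeness, the variant that is currently going through cleanly for us is
simply the same per-node repair without -dcpar:
{code}
# Same per-node repair without -dcpar (keyspace name is a placeholder);
# this is the form that has not hung so far.
nodetool repair my_keyspace
{code}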
If you guys can relate to the above issues & resolution, that would be great.
Thanks!
> Incremental repair & compaction hang on random nodes
> ----------------------------------------------------
>
> Key: CASSANDRA-12655
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12655
> Project: Cassandra
> Issue Type: Bug
> Components: Compaction
> Environment: CentOS Linux release 7.1.1503 (Core)
> RAM - 64GB
> HEAP - 16GB
> Load on each node - ~5GB
> Cassandra Version - 2.2.5
> Reporter: Navjyot Nishant
> Priority: Blocker
>
> Hi, we are setting up incremental repair on our 18-node cluster. Avg load on
> each node is ~5GB. The repair runs fine on a couple of nodes and then suddenly
> gets stuck on random nodes. Upon checking the system.log of the impacted node
> we don't see much information.
> Following are the lines we see in system.log; they have been there from the
> point the repair stopped making progress -
> {code}
> INFO [CompactionExecutor:3490] 2016-09-16 11:14:44,236
> CompactionManager.java:1221 - Anticompacting
> [BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30832-big-Data.db'),
>  BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30811-big-Data.db')]
> INFO [IndexSummaryManager:1] 2016-09-16 11:14:49,954
> IndexSummaryRedistribution.java:74 - Redistributing index summaries
> INFO [IndexSummaryManager:1] 2016-09-16 12:14:49,961
> IndexSummaryRedistribution.java:74 - Redistributing index summaries
> {code}
> When we try to see pending compactions by executing {code}nodetool
> compactionstats{code}, it hangs as well and doesn't return anything. However,
> {code}nodetool tpstats{code} shows active and pending compactions which never
> come down and keep increasing.
> {code}
> Pool Name                    Active   Pending   Completed   Blocked   All time blocked
> MutationStage                     0         0      221208         0                  0
> ReadStage                         0         0     1288839         0                  0
> RequestResponseStage              0         0      104356         0                  0
> ReadRepairStage                   0         0          72         0                  0
> CounterMutationStage              0         0           0         0                  0
> HintedHandoff                     0         0          46         0                  0
> MiscStage                         0         0           0         0                  0
> CompactionExecutor                8        66       68124         0                  0
> MemtableReclaimMemory             0         0         166         0                  0
> PendingRangeCalculator            0         0          38         0                  0
> GossipStage                       0         0      242455         0                  0
> MigrationStage                    0         0           0         0                  0
> MemtablePostFlush                 0         0        3682         0                  0
> ValidationExecutor                0         0        2246         0                  0
> Sampler                           0         0           0         0                  0
> MemtableFlushWriter               0         0         166         0                  0
> InternalResponseStage             0         0        8866         0                  0
> AntiEntropyStage                  0         0       15417         0                  0
> Repair#7                          0         0         160         0                  0
> CacheCleanupExecutor              0         0           0         0                  0
> Native-Transport-Requests         0         0      327334         0                  0
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
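> (A sketch of one way to look at where the CompactionExecutor threads are
> stuck when a node gets into this state; the pid lookup and output path below
> are illustrative, not from our actual runbook:)
> {code}
> # Take a thread dump of the Cassandra JVM and inspect the compaction threads.
> CASSANDRA_PID=$(pgrep -f CassandraDaemon)
> jstack "$CASSANDRA_PID" > /tmp/cassandra-threads.txt
> grep -A 20 'CompactionExecutor' /tmp/cassandra-threads.txt
> {code}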
> {code}nodetool netstats{code} shows some pending messages which never get
> processed and nothing in progress -
> {code}
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 15585
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool Name         Active   Pending   Completed
> Large messages       n/a        12         562
> Small messages       n/a         0      999779
> Gossip messages      n/a         0      264394
> {code}
> The only solution we have is to bounce the node, after which all the pending
> compactions start getting processed immediately and complete in 5-10 minutes.
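> (The bounce itself is just a drain followed by a service restart; the service
> name varies by install and is illustrative here:)
> {code}
> # Flush memtables and stop accepting traffic, then restart the process.
> nodetool drain
> sudo systemctl restart cassandra   # service name depends on the install
> {code}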
> This is a roadblocker issue for us and any help in this matter would be
> highly appreciated.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)