[
https://issues.apache.org/jira/browse/CASSANDRA-12655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15500435#comment-15500435
]
Navjyot Nishant commented on CASSANDRA-12655:
---------------------------------------------
Hello Wei, thanks for responding. Actually, the issue is with compaction getting
blocked; anticompaction itself is moving through without any issue.
Let me explain in detail -
1. We run incremental repair on one node at a time (see the invocation sketch
after this list).
2. When repair starts it shows completion progress, and for a large keyspace,
after showing 100%, it takes some time (a couple of minutes) to move forward
with the next keyspace. When we verified, it is actually waiting for
anticompaction to complete on all the relevant replicas; the moment
anticompaction completes on all replicas it moves forward with the next
keyspace.
3. Then compaction starts, following the anticompaction, and it sometimes hangs
on random replicas. That replica becomes unresponsive, which impacts the repair
running on the next keyspace/node, so the repair becomes unresponsive as well.
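For reference, this is roughly how we invoke the per-node repair; the keyspace
name below is just a placeholder (on 2.2 repairs are incremental by default):
{code}
# Run on one node at a time. The keyspace name is a placeholder.
# -dcpar runs the repair in parallel across datacenters; this is the
# option we later dropped as a workaround (see below).
nodetool repair -dcpar my_keyspace

# Equivalent explicit form on 2.2, where incremental is the default:
nodetool repair -inc -dcpar my_keyspace
{code}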
I am able to avoid this blocking behavior if I disable autocompaction before
starting the repair. But post repair, when I re-enable autocompaction, it gets
blocked on a random node, and the only way to resolve it is to bounce the node,
which doesn't seem practical.
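To be concrete, the workaround is just the standard autocompaction toggles
around the repair (keyspace name is a placeholder):
{code}
# Before starting the repair on the node (keyspace name is a placeholder):
nodetool disableautocompaction my_keyspace

# ... run the repair ...

# After the repair completes; this is the step that sometimes blocks
# on a random node for us:
nodetool enableautocompaction my_keyspace
{code}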
For now I am able to work around this issue by not using -dcpar. I had been
using -dcpar to speed up the repair, but the moment I removed it the repair
stopped complaining and compaction is also going through. This buys us some
time to plan for the upgrade directly to 3.x early next year.
-dcpar is working fine in other non-prod environments, but it seems to have a
problem with one of the largest keyspaces, which has tables of 3-4GB in size?
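For completeness, the variant that is currently going through cleanly for us is
simply the same per-node repair without -dcpar:
{code}
# Same per-node repair without -dcpar (keyspace name is a placeholder);
# this is the form that has not hung so far.
nodetool repair my_keyspace
{code}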
If you guys can relate to the above issues & resolution, that would be great.
Thanks!
> Incremental repair & compaction hang on random nodes
> ----------------------------------------------------
>
> Key: CASSANDRA-12655
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12655
> Project: Cassandra
> Issue Type: Bug
> Components: Compaction
> Environment: CentOS Linux release 7.1.1503 (Core)
> RAM - 64GB
> HEAP - 16GB
> Load on each node - ~5GB
> Cassandra Version - 2.2.5
> Reporter: Navjyot Nishant
> Priority: Blocker
>
> Hi, we are setting up incremental repair on our 18-node cluster. Avg load on
> each node is ~5GB. The repair runs fine on a couple of nodes and then suddenly
> gets stuck on random nodes. Upon checking the system.log of the impacted node
> we don't see much information.
> Following are the lines we see in system.log; they have been there from the
> point the repair stopped making progress -
> {code}
> INFO [CompactionExecutor:3490] 2016-09-16 11:14:44,236
> CompactionManager.java:1221 - Anticompacting
> [BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30832-big-Data.db'),
>  BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30811-big-Data.db')]
> INFO [IndexSummaryManager:1] 2016-09-16 11:14:49,954
> IndexSummaryRedistribution.java:74 - Redistributing index summaries
> INFO [IndexSummaryManager:1] 2016-09-16 12:14:49,961
> IndexSummaryRedistribution.java:74 - Redistributing index summaries
> {code}
> When we try to see pending compactions by executing {code}nodetool
> compactionstats{code}, it hangs as well and doesn't return anything. However,
> {code}nodetool tpstats{code} shows active and pending compactions which never
> come down and keep increasing.
> {code}
> Pool Name                    Active   Pending   Completed   Blocked   All time blocked
> MutationStage                     0         0      221208         0                  0
> ReadStage                         0         0     1288839         0                  0
> RequestResponseStage              0         0      104356         0                  0
> ReadRepairStage                   0         0          72         0                  0
> CounterMutationStage              0         0           0         0                  0
> HintedHandoff                     0         0          46         0                  0
> MiscStage                         0         0           0         0                  0
> CompactionExecutor                8        66       68124         0                  0
> MemtableReclaimMemory             0         0         166         0                  0
> PendingRangeCalculator            0         0          38         0                  0
> GossipStage                       0         0      242455         0                  0
> MigrationStage                    0         0           0         0                  0
> MemtablePostFlush                 0         0        3682         0                  0
> ValidationExecutor                0         0        2246         0                  0
> Sampler                           0         0           0         0                  0
> MemtableFlushWriter               0         0         166         0                  0
> InternalResponseStage             0         0        8866         0                  0
> AntiEntropyStage                  0         0       15417         0                  0
> Repair#7                          0         0         160         0                  0
> CacheCleanupExecutor              0         0           0         0                  0
> Native-Transport-Requests         0         0      327334         0                  0
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
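> (A sketch of one way to look at where the CompactionExecutor threads are
> stuck when a node gets into this state; the pid lookup and output path below
> are illustrative, not from our actual runbook:)
> {code}
> # Take a thread dump of the Cassandra JVM and inspect the compaction threads.
> CASSANDRA_PID=$(pgrep -f CassandraDaemon)
> jstack "$CASSANDRA_PID" > /tmp/cassandra-threads.txt
> grep -A 20 'CompactionExecutor' /tmp/cassandra-threads.txt
> {code}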
> {code}nodetool netstats{code} shows some pending messages which never get
> processed and nothing in progress -
> {code}
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 15585
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool Name         Active   Pending   Completed
> Large messages       n/a        12         562
> Small messages       n/a         0      999779
> Gossip messages      n/a         0      264394
> {code}
> The only solution we have is to bounce the node, after which all the pending
> compactions start getting processed immediately and complete in 5-10 minutes.
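> (The bounce itself is just a drain followed by a service restart; the service
> name varies by install and is illustrative here:)
> {code}
> # Flush memtables and stop accepting traffic, then restart the process.
> nodetool drain
> sudo systemctl restart cassandra   # service name depends on the install
> {code}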
> This is a roadblocker issue for us and any help in this matter would be
> highly appreciated.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)