[ https://issues.apache.org/jira/browse/CASSANDRA-12655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Navjyot Nishant updated CASSANDRA-12655:
----------------------------------------
    Description:
Hi,

We are setting up incremental repair on our 18-node cluster. The average load on each node is ~5GB. The repair runs fine on a couple of nodes and then suddenly gets stuck on random nodes. Upon checking the system.log of the impacted node, we don't see much information.

The following are the lines we see in system.log; they appear from the point where the repair stops making progress -
{code}
INFO  [CompactionExecutor:3490] 2016-09-16 11:14:44,236 CompactionManager.java:1221 - Anticompacting [BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30832-big-Data.db'), BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30811-big-Data.db')]
INFO  [IndexSummaryManager:1] 2016-09-16 11:14:49,954 IndexSummaryRedistribution.java:74 - Redistributing index summaries
INFO  [IndexSummaryManager:1] 2016-09-16 12:14:49,961 IndexSummaryRedistribution.java:74 - Redistributing index summaries
{code}
When we try to see pending compactions by executing {code}nodetool compactionstats{code} it hangs as well and doesn't return anything. However, {code}nodetool tpstats{code} shows active and pending compactions which never come down and keep increasing.
{code}
Pool Name                   Active   Pending   Completed   Blocked  All time blocked
MutationStage                    0         0      221208         0                 0
ReadStage                        0         0     1288839         0                 0
RequestResponseStage             0         0      104356         0                 0
ReadRepairStage                  0         0          72         0                 0
CounterMutationStage             0         0           0         0                 0
HintedHandoff                    0         0          46         0                 0
MiscStage                        0         0           0         0                 0
CompactionExecutor               8        66       68124         0                 0
MemtableReclaimMemory            0         0         166         0                 0
PendingRangeCalculator           0         0          38         0                 0
GossipStage                      0         0      242455         0                 0
MigrationStage                   0         0           0         0                 0
MemtablePostFlush                0         0        3682         0                 0
ValidationExecutor               0         0        2246         0                 0
Sampler                          0         0           0         0                 0
MemtableFlushWriter              0         0         166         0                 0
InternalResponseStage            0         0        8866         0                 0
AntiEntropyStage                 0         0       15417         0                 0
Repair#7                         0         0         160         0                 0
CacheCleanupExecutor             0         0           0         0                 0
Native-Transport-Requests        0         0      327334         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
MUTATION                     0
COUNTER_MUTATION             0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0
{code}
{code}nodetool netstats{code} shows some pending messages which never get processed and nothing in progress -
{code}
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 15585
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                   Active   Pending   Completed
Large messages                 n/a        12         562
Small messages                 n/a         0      999779
Gossip messages                n/a         0      264394
{code}
The only solution we have is to bounce the node, after which all the pending compactions start getting processed immediately and finish in 5 - 10 minutes.

This is a blocker issue for us, and any help in this matter would be highly appreciated.

  was:
Hi,

We are setting up incremental repair on our 18-node cluster. The average load on each node is ~5GB. The repair runs fine on a couple of nodes and then suddenly gets stuck on random nodes. Upon checking the system.log of the impacted node, we don't see much information.

The following are the lines we see in system.log; they appear from the point where the repair stops making progress -
{code}
INFO  [CompactionExecutor:3490] 2016-09-16 11:14:44,236 CompactionManager.java:1221 - Anticompacting [BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30832-big-Data.db'), BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30811-big-Data.db')]
INFO  [IndexSummaryManager:1] 2016-09-16 11:14:49,954 IndexSummaryRedistribution.java:74 - Redistributing index summaries
INFO  [IndexSummaryManager:1] 2016-09-16 12:14:49,961 IndexSummaryRedistribution.java:74 - Redistributing index summaries
{code}
When we try to see pending compactions by executing {code}nodetool compactionstats{code} it hangs as well and doesn't return anything. However, {code}nodetool tpstats{code} shows active and pending compactions which never come down and keep increasing.
{code}
Pool Name                   Active   Pending   Completed   Blocked  All time blocked
MutationStage                    0         0      221208         0                 0
ReadStage                        0         0     1288839         0                 0
RequestResponseStage             0         0      104356         0                 0
ReadRepairStage                  0         0          72         0                 0
CounterMutationStage             0         0           0         0                 0
HintedHandoff                    0         0          46         0                 0
MiscStage                        0         0           0         0                 0
CompactionExecutor               8        66       68124         0                 0
MemtableReclaimMemory            0         0         166         0                 0
PendingRangeCalculator           0         0          38         0                 0
GossipStage                      0         0      242455         0                 0
MigrationStage                   0         0           0         0                 0
MemtablePostFlush                0         0        3682         0                 0
ValidationExecutor               0         0        2246         0                 0
Sampler                          0         0           0         0                 0
MemtableFlushWriter              0         0         166         0                 0
InternalResponseStage            0         0        8866         0                 0
AntiEntropyStage                 0         0       15417         0                 0
Repair#7                         0         0         160         0                 0
CacheCleanupExecutor             0         0           0         0                 0
Native-Transport-Requests        0         0      327334         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
MUTATION                     0
COUNTER_MUTATION             0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0
{code}
The only solution we have is to bounce the node, after which all the pending compactions start getting processed immediately and finish in 5 - 10 minutes.

This is a blocker issue for us, and any help in this matter would be highly appreciated.


> Incremental repair & compaction hang on random nodes
> ----------------------------------------------------
>
>                 Key: CASSANDRA-12655
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12655
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction
>         Environment: CentOS Linux release 7.1.1503 (Core)
> RAM - 64GB
> HEAP - 16GB
> Load on each node - ~5GB
> Cassandra Version - 2.2.5
>            Reporter: Navjyot Nishant
>            Priority: Blocker
>
> Hi,
>
> We are setting up incremental repair on our 18-node cluster. The average load on each node is ~5GB. The repair runs fine on a couple of nodes and then suddenly gets stuck on random nodes. Upon checking the system.log of the impacted node, we don't see much information.
>
> The following are the lines we see in system.log; they appear from the point where the repair stops making progress -
> {code}
> INFO  [CompactionExecutor:3490] 2016-09-16 11:14:44,236 CompactionManager.java:1221 - Anticompacting [BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30832-big-Data.db'), BigTableReader(path='/cassandra/data/gccatlgsvcks/message_backup-cab0485008ed11e5bfed452cdd54652d/la-30811-big-Data.db')]
> INFO  [IndexSummaryManager:1] 2016-09-16 11:14:49,954 IndexSummaryRedistribution.java:74 - Redistributing index summaries
> INFO  [IndexSummaryManager:1] 2016-09-16 12:14:49,961 IndexSummaryRedistribution.java:74 - Redistributing index summaries
> {code}
> When we try to see pending compactions by executing {code}nodetool compactionstats{code} it hangs as well and doesn't return anything. However, {code}nodetool tpstats{code} shows active and pending compactions which never come down and keep increasing.
> {code}
> Pool Name                   Active   Pending   Completed   Blocked  All time blocked
> MutationStage                    0         0      221208         0                 0
> ReadStage                        0         0     1288839         0                 0
> RequestResponseStage             0         0      104356         0                 0
> ReadRepairStage                  0         0          72         0                 0
> CounterMutationStage             0         0           0         0                 0
> HintedHandoff                    0         0          46         0                 0
> MiscStage                        0         0           0         0                 0
> CompactionExecutor               8        66       68124         0                 0
> MemtableReclaimMemory            0         0         166         0                 0
> PendingRangeCalculator           0         0          38         0                 0
> GossipStage                      0         0      242455         0                 0
> MigrationStage                   0         0           0         0                 0
> MemtablePostFlush                0         0        3682         0                 0
> ValidationExecutor               0         0        2246         0                 0
> Sampler                          0         0           0         0                 0
> MemtableFlushWriter              0         0         166         0                 0
> InternalResponseStage            0         0        8866         0                 0
> AntiEntropyStage                 0         0       15417         0                 0
> Repair#7                         0         0         160         0                 0
> CacheCleanupExecutor             0         0           0         0                 0
> Native-Transport-Requests        0         0      327334         0                 0
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> {code}nodetool netstats{code} shows some pending messages which never get processed and nothing in progress -
> {code}
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 15585
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool Name                   Active   Pending   Completed
> Large messages                 n/a        12         562
> Small messages                 n/a         0      999779
> Gossip messages                n/a         0      264394
> {code}
> The only solution we have is to bounce the node, after which all the pending compactions start getting processed immediately and finish in 5 - 10 minutes.
>
> This is a blocker issue for us, and any help in this matter would be highly appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)