[ 
https://issues.apache.org/jira/browse/CASSANDRA-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116877#comment-14116877
 ] 

Duncan Sands commented on CASSANDRA-6651:
-----------------------------------------

I'm on 2.0.10, and seem to have hit this while running repair this weekend.

The repair #e3b1fc60-3010-11e4-bd56-390059926170 on node 192.168.21.13 got 
stuck (see log snippet below).  Summary of the log snippet: it requested a 
merkle tree for "trades" from nodes 172.18.68.138 and 192.168.21.13; it almost 
immediately got a merkle tree back from 192.168.21.13 (itself!) but never got 
anything back from 172.18.68.138.  The log snippet:

 INFO [AntiEntropyStage:1] 2014-08-30 08:43:22,823 Validator.java (line 254) 
[repair #d1667400-3010-11e4-847a-71045e056b2b] Sending completed merkle tree to 
/192.168.60.141 for tick_data/historical_prices
 INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,382 Validator.java (line 254) 
[repair #e0c033a0-3010-11e4-b654-51c077eaf311] Sending completed merkle tree to 
/172.18.68.138 for all_production/table_metadata
 INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,517 RepairSession.java (line 
166) [repair #e3b1fc60-3010-11e4-bd56-390059926170] Received merkle tree for 
swxess_connections from /172.18.68.138
 INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,517 RepairJob.java (line 143) 
[repair #e3b1fc60-3010-11e4-bd56-390059926170] requesting merkle trees for 
trades (to [/172.18.68.138, /192.168.21.13])
 INFO [RepairJobTask:1] 2014-08-30 08:43:25,521 Differencer.java (line 67) 
[repair #e3b1fc60-3010-11e4-bd56-390059926170] Endpoints /192.168.21.13 and 
/172.18.68.138 are consistent for swxess_connections
 INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,521 RepairSession.java (line 
223) [repair #e3b1fc60-3010-11e4-bd56-390059926170] swxess_connections is fully 
synced
 INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,551 RepairSession.java (line 
166) [repair #e3b1fc60-3010-11e4-bd56-390059926170] Received merkle tree for 
trades from /192.168.21.13

It is now 36 hours later and it never got a merkle tree back from 172.18.68.138.
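
For anyone who wants to check their own logs for the same pattern, here is a 
rough sketch that pairs each "requesting merkle trees" line with the "Received 
merkle tree" lines for the same repair session and table, and reports the 
endpoints that never answered.  The regular expressions are derived only from 
the 2.0.10 log lines quoted above; the function name is made up for this 
example, and the script assumes each log entry is on a single line (the 
snippets in this mail were wrapped by the mail client).

```python
import re

# Patterns derived from the log lines quoted in this comment (Cassandra 2.0.10).
# Other versions may word these messages differently.
REQUEST_RE = re.compile(
    r"\[repair #(?P<session>[0-9a-f-]+)\] requesting merkle trees for "
    r"(?P<cf>\S+) \(to \[(?P<endpoints>[^\]]*)\]\)"
)
RECEIVED_RE = re.compile(
    r"\[repair #(?P<session>[0-9a-f-]+)\] Received merkle tree for "
    r"(?P<cf>\S+) from (?P<endpoint>\S+)"
)


def pending_merkle_trees(lines):
    """Return {(session, table): endpoints that never sent a merkle tree}.

    `lines` is any iterable of unwrapped log lines, e.g. an open log file.
    """
    pending = {}
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            endpoints = {
                e.strip() for e in match.group("endpoints").split(",") if e.strip()
            }
            pending[(match.group("session"), match.group("cf"))] = endpoints
            continue
        match = RECEIVED_RE.search(line)
        if match:
            key = (match.group("session"), match.group("cf"))
            if key in pending:
                # This endpoint answered, so it is no longer outstanding.
                pending[key].discard(match.group("endpoint"))
    # Keep only the requests that still have unanswered endpoints.
    return {key: eps for key, eps in pending.items() if eps}
```

Passing the open log file of the node that coordinates the repair to 
pending_merkle_trees() should list the stuck session.  A request that was 
logged only moments ago will also show up, so the result is only meaningful for 
requests that are old enough.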


In fact 172.18.68.138 shows no sign of having received the merkle tree request 
for "trades".  Log snippet from around the time of the request:

 INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,372 Validator.java (line 254) 
[repair #e3b1fc60-3010-11e4-bd56-390059926170] Sending completed merkle tree to 
/192.168.21.13 for zrh_simulation/swxess_connections
 INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,385 RepairSession.java (line 
166) [repair #e0c033a0-3010-11e4-b65451c077eaf311] Received merkle tree for 
table_metadata from /172.18.68.139
 INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,401 RepairSession.java (line 
166) [repair #e0c033a0-3010-11e4-b65451c077eaf311] Received merkle tree for 
table_metadata from /172.18.68.138
 INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,817 Validator.java (line 254) 
[repair #ef7cc8e0-3010-11e4-afa2-87e1f9d131a9] Sending completed merkle tree to 
/192.168.60.142 for OpsCenter/rollups300
 INFO [AntiEntropyStage:1] 2014-08-30 08:43:26,980 Validator.java (line 254) 
[repair #ef7cc8e0-3010-11e4-afa2-87e1f9d131a9] Sending completed merkle tree to 
/192.168.60.142 for OpsCenter/rollups7200
 INFO [STREAM-INIT-/192.168.60.142:43638] 2014-08-30 08:43:27,007 
StreamResultFuture.java (line 121) [Stream 
#f23afe81-3010-11e4-afa2-87e1f9d131a9] Received streaming plan for Repair
 INFO [STREAM-INIT-/192.168.60.141:33019] 2014-08-30 08:43:27,012 
StreamResultFuture.java (line 121) [Stream 
#f23afe80-3010-11e4-847a-71045e056b2b] Received streaming plan for Repair
 INFO [STREAM-IN-/192.168.60.142] 2014-08-30 08:43:27,028 
StreamResultFuture.java (line 173) [Stream 
#f23afe81-3010-11e4-afa2-87e1f9d131a9] Prepare completed. Receiving 4 
files(543791 bytes), sending 4 files(570871 bytes)
 INFO [STREAM-IN-/192.168.60.141] 2014-08-30 08:43:27,032 
StreamResultFuture.java (line 173) [Stream 
#f23afe80-3010-11e4-847a-71045e056b2b] Prepare completed. Receiving 5 
files(569602 bytes), sending 4 files(570871 bytes)

The request doesn't turn up anywhere in the log files for the following 36 
hours.

This seems to be the same issue as discussed in this ticket.

> Repair hanging
> --------------
>
>                 Key: CASSANDRA-6651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Eitan Eibschutz
>            Assignee: Yuki Morishita
>
> Hi,
> We have a 12 node cluster in PROD environment and we've noticed that repairs 
> are never finishing. The behavior that we've observed is that a repair 
> process will run until at some point it hangs and no other processing is 
> happening.
> For example, at the moment, I have a repair process that has been running for 
> two days and not finishing:
> nodetool tpstats is showing 2 active and 2 pending AntiEntropySessions
> nodetool compactionstats is showing:
> pending tasks: 0
> Active compaction remaining time :        n/a
> nodetool netstats is showing:
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 0
> Mismatch (Blocking): 142110
> Mismatch (Background): 0
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0      107589657
> Responses                       n/a         0      116430785 
> The last entry that I see in the log is:
> INFO [AntiEntropySessions:18] 2014-02-03 04:01:39,145 RepairJob.java (line 
> 116) [repair #ae78c6c0-8c2b-11e3-b950-c3b81a36bc9b] requesting merkle trees 
> for MyCF (to [/x.x.x.x, /y.y.y.y, /z.z.z.z])
> The repair started at 4am, so it stopped after about 1 minute 40 seconds.
> On node y.y.y.y I can see this in the log:
> INFO [MiscStage:1] 2014-02-03 04:01:38,985 ColumnFamilyStore.java (line 740) 
> Enqueuing flush of Memtable-MyCF@1290890489(2176/5931 serialized/live bytes, 
> 32 ops)
>  INFO [FlushWriter:411] 2014-02-03 04:01:38,986 Memtable.java (line 333) 
> Writing Memtable-MyCF@1290890489(2176/5931 serialized/live bytes, 32 ops)
>  INFO [FlushWriter:411] 2014-02-03 04:01:39,048 Memtable.java (line 373) 
> Completed flushing 
> /var/lib/cassandra/main-db/data/MyKS/MyCF/MyKS-MyCF-jb-518-Data.db (1789 
> bytes) for commitlog position ReplayPosition(segmentId=1390437013339, 
> position=21868792)
>  INFO [ScheduledTasks:1] 2014-02-03 05:00:04,794 ColumnFamilyStore.java (line 
> 740) Enqueuing flush of Memtable-compaction_history@1649414699(1635/17360 
> serialized/live bytes, 42 ops)
> So for some reason the merkle tree for this CF is never sent back to the node 
> being repaired and it's hanging.
> I've also noticed that sometimes, restarting node y.y.y.y will cause the  
> repair to resume.
> Another observation is that sometimes when restarting y.y.y.y it will not 
> start with these errors:
> ERROR 16:34:18,485 Exception encountered during startup
> java.lang.IllegalStateException: Unfinished compactions reference missing 
> sstables. This should never happen since compactions are marked finished 
> before we start removing the old sstables.
>       at 
> org.apache.cassandra.db.ColumnFamilyStore.removeUnfinishedCompactionLeftovers(ColumnFamilyStore.java:495)
>       at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:264)
>       at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
>       at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)
> java.lang.IllegalStateException: Unfinished compactions reference missing 
> sstables. This should never happen since compactions are marked finished 
> before we start removing the old sstables.
>       at 
> org.apache.cassandra.db.ColumnFamilyStore.removeUnfinishedCompactionLeftovers(ColumnFamilyStore.java:495)
>       at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:264)
>       at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
>       at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)
> Exception encountered during startup: Unfinished compactions reference 
> missing sstables. This should never happen since compactions are marked 
> finished before we start removing the old sstables.
> And it will only restart after manually cleaning the compactions_in-progress 
> folder.
> I'm not sure if these two issues are related but we've seen both on all the 
> nodes in our cluster.
> I'll be happy to provide more info if needed as we are not sure what could 
> cause this behavior.
> Another thing in our environment is that some of the Cassandra nodes have 
> more than one network interface and RPC is listening on 0.0.0.0, not sure if 
> it has anything to do with this.
> Thanks,
> Eitan 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
