[
https://issues.apache.org/jira/browse/CASSANDRA-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116877#comment-14116877
]
Duncan Sands commented on CASSANDRA-6651:
-----------------------------------------
I'm on 2.0.10, and seem to have hit this while running repair this weekend.
The repair #e3b1fc60-3010-11e4-bd56-390059926170 on node 192.168.21.13 got
stuck (see log snippet below). Summary of the log snippet: it requested merkle
trees for "trades" from nodes 172.18.68.138 and 192.168.21.13; it almost
immediately got a merkle tree back from 192.168.21.13 (itself!) but never got
anything back from 172.18.68.138. The log snippet:
INFO [AntiEntropyStage:1] 2014-08-30 08:43:22,823 Validator.java (line 254)
[repair #d1667400-3010-11e4-847a-71045e056b2b] Sending completed merkle tree to
/192.168.60.141 for tick_data/historical_prices
INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,382 Validator.java (line 254)
[repair #e0c033a0-3010-11e4-b654-51c077eaf311] Sending completed merkle tree to
/172.18.68.138 for all_production/table_metadata
INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,517 RepairSession.java (line
166) [repair #e3b1fc60-3010-11e4-bd56-390059926170] Received merkle tree for
swxess_connections from /172.18.68.138
INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,517 RepairJob.java (line 143)
[repair #e3b1fc60-3010-11e4-bd56-390059926170] requesting merkle trees for
trades (to [/172.18.68.138, /192.168.21.13])
INFO [RepairJobTask:1] 2014-08-30 08:43:25,521 Differencer.java (line 67)
[repair #e3b1fc60-3010-11e4-bd56-390059926170] Endpoints /192.168.21.13 and
/172.18.68.138 are consistent for swxess_connections
INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,521 RepairSession.java (line
223) [repair #e3b1fc60-3010-11e4-bd56-390059926170] swxess_connections is fully
synced
INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,551 RepairSession.java (line
166) [repair #e3b1fc60-3010-11e4-bd56-390059926170] Received merkle tree for
trades from /192.168.21.13
It is now 36 hours later and it never got a merkle tree back from 172.18.68.138.
In fact 172.18.68.138 shows no sign of having received the merkle tree request
for "trades". Log snippet from around the time of the request:
INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,372 Validator.java (line 254)
[repair #e3b1fc60-3010-11e4-bd56-390059926170] Sending completed merkle tree to
/192.168.21.13 for zrh_simulation/swxess_connections
INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,385 RepairSession.java (line
166) [repair #e0c033a0-3010-11e4-b654-51c077eaf311] Received merkle tree for
table_metadata from /172.18.68.139
INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,401 RepairSession.java (line
166) [repair #e0c033a0-3010-11e4-b654-51c077eaf311] Received merkle tree for
table_metadata from /172.18.68.138
INFO [AntiEntropyStage:1] 2014-08-30 08:43:25,817 Validator.java (line 254)
[repair #ef7cc8e0-3010-11e4-afa2-87e1f9d131a9] Sending completed merkle tree to
/192.168.60.142 for OpsCenter/rollups300
INFO [AntiEntropyStage:1] 2014-08-30 08:43:26,980 Validator.java (line 254)
[repair #ef7cc8e0-3010-11e4-afa2-87e1f9d131a9] Sending completed merkle tree to
/192.168.60.142 for OpsCenter/rollups7200
INFO [STREAM-INIT-/192.168.60.142:43638] 2014-08-30 08:43:27,007
StreamResultFuture.java (line 121) [Stream
#f23afe81-3010-11e4-afa2-87e1f9d131a9] Received streaming plan for Repair
INFO [STREAM-INIT-/192.168.60.141:33019] 2014-08-30 08:43:27,012
StreamResultFuture.java (line 121) [Stream
#f23afe80-3010-11e4-847a-71045e056b2b] Received streaming plan for Repair
INFO [STREAM-IN-/192.168.60.142] 2014-08-30 08:43:27,028
StreamResultFuture.java (line 173) [Stream
#f23afe81-3010-11e4-afa2-87e1f9d131a9] Prepare completed. Receiving 4
files(543791 bytes), sending 4 files(570871 bytes)
INFO [STREAM-IN-/192.168.60.141] 2014-08-30 08:43:27,032
StreamResultFuture.java (line 173) [Stream
#f23afe80-3010-11e4-847a-71045e056b2b] Prepare completed. Receiving 5
files(569602 bytes), sending 4 files(570871 bytes)
The request doesn't turn up anywhere in the log files for the following 36
hours.
This seems to be the same issue as discussed in this ticket.
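In case it helps anyone else check whether they are hitting the same hang, here is a minimal sketch of how the coordinator's log could be scanned for merkle tree requests that were never answered. It assumes the message wording of the 2.0.x log lines quoted above ("requesting merkle trees for ... (to [...])" and "Received merkle tree for ... from ..."); other versions may word these differently. The function name missing_merkle_trees is made up for illustration and is not part of Cassandra.

```python
import re

# Patterns are an assumption based on the 2.0.x log lines quoted above.
REQUEST = re.compile(
    r"\[repair #(?P<session>[0-9a-f-]+)\] requesting merkle trees for "
    r"(?P<table>\S+) \(to \[(?P<endpoints>[^\]]*)\]\)"
)
RECEIVED = re.compile(
    r"\[repair #(?P<session>[0-9a-f-]+)\] Received merkle tree for "
    r"(?P<table>\S+) from (?P<endpoint>/\S+)"
)


def missing_merkle_trees(log_text):
    """Return {(session, table): endpoints that never sent a merkle tree}.

    Hypothetical helper: it only looks at the text it is given, so a
    request whose answer is in a later log file will show up as missing.
    """
    # Collapse all whitespace so entries wrapped over several lines
    # (as in a mail excerpt) are matched the same way as one-line entries.
    text = " ".join(log_text.split())

    pending = {}
    for m in REQUEST.finditer(text):
        endpoints = {
            e.strip() for e in m.group("endpoints").split(",") if e.strip()
        }
        pending[(m.group("session"), m.group("table"))] = endpoints

    for m in RECEIVED.finditer(text):
        key = (m.group("session"), m.group("table"))
        if key in pending:
            pending[key].discard(m.group("endpoint"))

    # Keep only the requests that still have unanswered endpoints.
    return {key: eps for key, eps in pending.items() if eps}
```

Passing the contents of the coordinator's system.log to missing_merkle_trees would list each repair session and table together with the endpoints that never replied. A request that was only just issued will also appear, so the result is only meaningful for sessions that have been idle for a while.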
> Repair hanging
> --------------
>
> Key: CASSANDRA-6651
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6651
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Eitan Eibschutz
> Assignee: Yuki Morishita
>
> Hi,
> We have a 12 node cluster in PROD environment and we've noticed that repairs
> are never finishing. The behavior that we've observed is that a repair
> process will run until at some point it hangs and no other processing is
> happening.
> For example, at the moment, I have a repair process that has been running for
> two days and not finishing:
> nodetool tpstats is showing 2 active and 2 pending AntiEntropySessions
> nodetool compactionstats is showing:
> pending tasks: 0
> Active compaction remaining time : n/a
> nodetool netstats is showing:
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 0
> Mismatch (Blocking): 142110
> Mismatch (Background): 0
> Pool Name Active Pending Completed
> Commands n/a 0 107589657
> Responses n/a 0 116430785
> The last entry that I see in the log is:
> INFO [AntiEntropySessions:18] 2014-02-03 04:01:39,145 RepairJob.java (line
> 116) [repair #ae78c6c0-8c2b-11e3-b950-c3b81a36bc9b] requesting merkle trees
> for MyCF (to [/x.x.x.x, /y.y.y.y, /z.z.z.z])
> The repair started at 4am, so it stopped making progress after about 1
> minute 40 seconds.
> On node y.y.y.y I can see this in the log:
> INFO [MiscStage:1] 2014-02-03 04:01:38,985 ColumnFamilyStore.java (line 740)
> Enqueuing flush of Memtable-MyCF@1290890489(2176/5931 serialized/live bytes,
> 32 ops)
> INFO [FlushWriter:411] 2014-02-03 04:01:38,986 Memtable.java (line 333)
> Writing Memtable-MyCF@1290890489(2176/5931 serialized/live bytes, 32 ops)
> INFO [FlushWriter:411] 2014-02-03 04:01:39,048 Memtable.java (line 373)
> Completed flushing
> /var/lib/cassandra/main-db/data/MyKS/MyCF/MyKS-MyCF-jb-518-Data.db (1789
> bytes) for commitlog position ReplayPosition(segmentId=1390437013339,
> position=21868792)
> INFO [ScheduledTasks:1] 2014-02-03 05:00:04,794 ColumnFamilyStore.java (line
> 740) Enqueuing flush of Memtable-compaction_history@1649414699(1635/17360
> serialized/live bytes, 42 ops)
> So for some reason the merkle tree for this CF is never sent back to the node
> being repaired and it's hanging.
> I've also noticed that sometimes, restarting node y.y.y.y will cause the
> repair to resume.
> Another observation is that sometimes, when restarting y.y.y.y, it fails to
> start, with these errors:
> ERROR 16:34:18,485 Exception encountered during startup
> java.lang.IllegalStateException: Unfinished compactions reference missing
> sstables. This should never happen since compactions are marked finished
> before we start removing the old sstables.
> at
> org.apache.cassandra.db.ColumnFamilyStore.removeUnfinishedCompactionLeftovers(ColumnFamilyStore.java:495)
> at
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:264)
> at
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
> at
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)
> java.lang.IllegalStateException: Unfinished compactions reference missing
> sstables. This should never happen since compactions are marked finished
> before we start removing the old sstables.
> at
> org.apache.cassandra.db.ColumnFamilyStore.removeUnfinishedCompactionLeftovers(ColumnFamilyStore.java:495)
> at
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:264)
> at
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
> at
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)
> Exception encountered during startup: Unfinished compactions reference
> missing sstables. This should never happen since compactions are marked
> finished before we start removing the old sstables.
> It will only restart after manually cleaning the compactions_in_progress
> folder.
> I'm not sure if these two issues are related but we've seen both on all the
> nodes in our cluster.
> I'll be happy to provide more info if needed as we are not sure what could
> cause this behavior.
> Another thing in our environment is that some of the Cassandra nodes have
> more than one network interface and RPC is listening on 0.0.0.0, not sure if
> it has anything to do with this.
> Thanks,
> Eitan