[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession

2014-07-25 Thread Vladimir Avram (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074685#comment-14074685
 ] 

Vladimir Avram commented on CASSANDRA-7560:
---

Thanks, I will try the patch out on 2.0.7

 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
 --

 Key: CASSANDRA-7560
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Vladimir Avram
Assignee: Yuki Morishita
 Fix For: 2.0.10

 Attachments: 0001-backport-CASSANDRA-6747.patch, 
 cassandra_daemon.log, cassandra_daemon_rep1.log, cassandra_daemon_rep2.log, 
 nodetool_command.log


 Running {{nodetool repair -pr}} will sometimes hang on one of the resulting 
 AntiEntropySessions.
 The system logs will show the repair command starting
 {noformat}
  INFO [Thread-3079] 2014-07-15 02:22:56,514 StorageService.java (line 2569) 
 Starting repair command #1, repairing 256 ranges for keyspace x
 {noformat}
 You can then see a few AntiEntropySessions completing with:
 {noformat}
 INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 RepairSession.java (line 
 282) [repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed 
 successfully
 {noformat}
 Eventually we reach an AntiEntropySession that hangs just before requesting 
 the merkle trees for the next column family in line for repair. We see the 
 previous CF finish, and then the whole repair session hangs with no visible 
 progress or errors on this or any of the related nodes.
 {noformat}
 INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 RepairSession.java (line 
 221) [repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is fully 
 synced
 {noformat}
 Notes:
 * Single DC 6 node cluster with an average load of 86 GB per node.
 * This appears to be random; it does not always happen on the same CF or on 
 the same session.
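
 One way to spot a session in this state is to scan the system log for repair 
 session IDs that never log a completion line. The following is a minimal 
 sketch, not part of the original report. It assumes each log entry sits on a 
 single line (the entries quoted above are wrapped by the mail archive) and 
 that every session logs a "[repair #<id>] ..." message as shown above. The 
 names REPAIR_RE and find_open_sessions are hypothetical. An open session is 
 not necessarily hung; it may simply still be running.

```python
import re

# Matches the "[repair #<session id>] <message>" tail of a log entry, based on
# the log lines quoted in this report (an assumption about the log format).
REPAIR_RE = re.compile(r"\[repair #(?P<sid>[0-9a-f-]{36})\]\s+(?P<msg>.*)$")


def find_open_sessions(lines):
    """Return {session_id: last_message} for sessions with no completion line."""
    last_message = {}
    completed = set()
    for line in lines:
        match = REPAIR_RE.search(line.rstrip())
        if match is None:
            continue  # not a repair-session log entry
        sid, msg = match.group("sid"), match.group("msg")
        last_message[sid] = msg
        # Any "session completed ..." message is treated as the end of a session.
        if msg.startswith("session completed"):
            completed.add(sid)
    return {sid: msg for sid, msg in last_message.items() if sid not in completed}


# The two repair-session entries from this report, joined back onto single lines.
SAMPLE = [
    " INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 RepairSession.java "
    "(line 282) [repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed "
    "successfully",
    " INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 RepairSession.java "
    "(line 221) [repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is "
    "fully synced",
]

if __name__ == "__main__":
    for sid, msg in find_open_sessions(SAMPLE).items():
        print(sid, "->", msg)
```

 Comparing the timestamp of the last message for an open session against the 
 current time would be a natural next step for deciding whether it is stalled.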



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6651) Repair hanging

2014-07-18 Thread Vladimir Avram (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066981#comment-14066981
 ] 

Vladimir Avram commented on CASSANDRA-6651:
---

We've seen something similar: not quite what is described in the issue, but the 
same as what [~mjaszczyk] reported in his comment. It's not that the request 
for a merkle tree never gets fulfilled; we never see the next expected merkle 
tree request get sent at all. I filed a separate bug for this, as it isn't 
quite what is reported here: CASSANDRA-7560 
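
The two symptoms can be told apart by the last message a stalled repair session 
logged. The sketch below is only an illustration under that assumption: the 
message prefixes are taken from the log lines quoted in the two issues, and the 
function name and returned labels are hypothetical.

```python
def classify_stall(last_message):
    """Guess which symptom a stalled session shows from its last log message."""
    msg = last_message.strip()
    # CASSANDRA-6651: trees were requested but the response never arrived.
    if msg.startswith("requesting merkle trees"):
        return "awaiting-merkle-trees"
    # CASSANDRA-7560: the previous CF finished but the next request was never sent.
    if msg.endswith("is fully synced"):
        return "next-request-not-sent"
    return "unknown"


if __name__ == "__main__":
    print(classify_stall("requesting merkle trees for MyCF (to [/x.x.x.x, /y.y.y.y, /z.z.z.z])"))
    print(classify_stall("previous_cf is fully synced"))
```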

 Repair hanging
 --

 Key: CASSANDRA-6651
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6651
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Eitan Eibschutz
Assignee: Yuki Morishita

 Hi,
 We have a 12 node cluster in PROD environment and we've noticed that repairs 
 are never finishing. The behavior that we've observed is that a repair 
 process will run until at some point it hangs and no other processing is 
 happening.
 For example, at the moment, I have a repair process that has been running for 
 two days and not finishing:
 nodetool tpstats is showing 2 active and 2 pending AntiEntropySessions
 nodetool compactionstats is showing:
 pending tasks: 0
 Active compaction remaining time :n/a
 nodetool netstats is showing:
 Mode: NORMAL
 Not sending any streams.
 Read Repair Statistics:
 Attempted: 0
 Mismatch (Blocking): 142110
 Mismatch (Background): 0
 Pool Name                    Active   Pending      Completed
 Commands                        n/a         0      107589657
 Responses                       n/a         0      116430785
 The last entry that I see in the log is:
 INFO [AntiEntropySessions:18] 2014-02-03 04:01:39,145 RepairJob.java (line 
 116) [repair #ae78c6c0-8c2b-11e3-b950-c3b81a36bc9b] requesting merkle trees 
 for MyCF (to [/x.x.x.x, /y.y.y.y, /z.z.z.z])
 The repair started at 4am, so it stalled after about 1 minute 40 seconds.
 On node y.y.y.y I can see this in the log:
 INFO [MiscStage:1] 2014-02-03 04:01:38,985 ColumnFamilyStore.java (line 740) 
 Enqueuing flush of Memtable-MyCF@1290890489(2176/5931 serialized/live bytes, 
 32 ops)
  INFO [FlushWriter:411] 2014-02-03 04:01:38,986 Memtable.java (line 333) 
 Writing Memtable-MyCF@1290890489(2176/5931 serialized/live bytes, 32 ops)
  INFO [FlushWriter:411] 2014-02-03 04:01:39,048 Memtable.java (line 373) 
 Completed flushing 
 /var/lib/cassandra/main-db/data/MyKS/MyCF/MyKS-MyCF-jb-518-Data.db (1789 
 bytes) for commitlog position ReplayPosition(segmentId=1390437013339, 
 position=21868792)
  INFO [ScheduledTasks:1] 2014-02-03 05:00:04,794 ColumnFamilyStore.java (line 
 740) Enqueuing flush of Memtable-compaction_history@1649414699(1635/17360 
 serialized/live bytes, 42 ops)
 So for some reason the merkle tree for this CF is never sent back to the node 
 being repaired and it's hanging.
 I've also noticed that sometimes, restarting node y.y.y.y will cause the  
 repair to resume.
 Another observation is that sometimes when restarting y.y.y.y it will not 
 start with these errors:
 ERROR 16:34:18,485 Exception encountered during startup
 java.lang.IllegalStateException: Unfinished compactions reference missing 
 sstables. This should never happen since compactions are marked finished 
 before we start removing the old sstables.
   at org.apache.cassandra.db.ColumnFamilyStore.removeUnfinishedCompactionLeftovers(ColumnFamilyStore.java:495)
   at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:264)
   at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
   at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)
 java.lang.IllegalStateException: Unfinished compactions reference missing 
 sstables. This should never happen since compactions are marked finished 
 before we start removing the old sstables.
   at org.apache.cassandra.db.ColumnFamilyStore.removeUnfinishedCompactionLeftovers(ColumnFamilyStore.java:495)
   at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:264)
   at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
   at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)
 Exception encountered during startup: Unfinished compactions reference 
 missing sstables. This should never happen since compactions are marked 
 finished before we start removing the old sstables.
 And it will only restart after manually cleaning the compactions_in-progress 
 folder.
 I'm not sure if these two issues are related but we've seen both on all the 
 nodes in our cluster.
 I'll be happy to provide more info if needed as we are not sure what could 
 cause this behavior.
 Another thing in our environment is 

[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession

2014-07-18 Thread Vladimir Avram (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067194#comment-14067194
 ] 

Vladimir Avram commented on CASSANDRA-7560:
---

Here is what 'nodetool tpstats' looks like
{noformat}
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         2         2      125806824         0                 0
RequestResponseStage              0         0      355784492         0                 0
MutationStage                    32       766      333060443         0                 0
ReadRepairStage                   0         0        4972365         0                 0
ReplicateOnWriteStage             0         0       47863116         0                 0
GossipStage                       0         0        1110849         0                 0
AntiEntropyStage                  0         0           2384         0                 0
MigrationStage                    0         0              0         0                 0
MemoryMeter                       0         0          31508         0                 0
MemtablePostFlusher               0         0          21543         0                 0
FlushWriter                       0         0          20196         0                10
MiscStage                         0         0           1049         0                 0
PendingRangeCalculator            0         0              6         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               1         1              3         0                 0
InternalResponseStage             0         0            146         0                 0
HintedHandoff                     0         0            150         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  6
PAGED_RANGE                  0
BINARY                       0
READ                        62
MUTATION                  2377
_TRACE                       0
REQUEST_RESPONSE            46
COUNTER_MUTATION           347
{noformat}

netstats

{noformat}
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 4395496
Mismatch (Blocking): 49764
Mismatch (Background): 3505
Pool Name                    Active   Pending      Completed
Commands                        n/a         0      355976985
Responses                       n/a         0      407590806
{noformat}
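
The thread pool section of this output can also be checked with a script, for 
example to watch whether AntiEntropySessions stays active while 
AntiEntropyStage is idle. This is a rough sketch only: it assumes the raw 
'nodetool tpstats' text has whitespace-separated columns in the order shown 
above, and the names FIELDS, parse_tpstats and busy_pools are hypothetical.

```python
FIELDS = ("active", "pending", "completed", "blocked", "all_time_blocked")


def parse_tpstats(text):
    """Parse thread pool rows (a name plus five integer columns) into a dict."""
    pools = {}
    for line in text.splitlines():
        parts = line.split()
        # Pool rows have exactly six tokens; the header and the dropped-message
        # rows have a different token count and are skipped.
        if len(parts) != 6:
            continue
        name, values = parts[0], parts[1:]
        if not all(value.isdigit() for value in values):
            continue
        pools[name] = dict(zip(FIELDS, (int(value) for value in values)))
    return pools


def busy_pools(pools):
    """Return the names of pools with active or pending tasks."""
    return [name for name, p in pools.items() if p["active"] or p["pending"]]


# A few rows copied from the output in this comment.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                    32       766      333060443         0                 0
AntiEntropyStage                  0         0           2384         0                 0
AntiEntropySessions               1         1              3         0                 0
"""

if __name__ == "__main__":
    pools = parse_tpstats(SAMPLE)
    print(sorted(busy_pools(pools)))
```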

 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
 --

 Key: CASSANDRA-7560
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Vladimir Avram






[jira] [Updated] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession

2014-07-18 Thread Vladimir Avram (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Avram updated CASSANDRA-7560:
--

Attachment: cassandra_daemon.log

jstack output from JVM running C*

 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
 --

 Key: CASSANDRA-7560
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Vladimir Avram
 Attachments: cassandra_daemon.log







[jira] [Updated] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession

2014-07-18 Thread Vladimir Avram (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Avram updated CASSANDRA-7560:
--

Attachment: nodetool_command.log

jstack output of the JVM running nodetool

 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
 --

 Key: CASSANDRA-7560
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Vladimir Avram
 Attachments: cassandra_daemon.log, nodetool_command.log







[jira] [Updated] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession

2014-07-18 Thread Vladimir Avram (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Avram updated CASSANDRA-7560:
--

Attachment: cassandra_daemon_rep1.log

There is also a stalled AntiEntropySession on this node.

 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
 --

 Key: CASSANDRA-7560
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Vladimir Avram
 Attachments: cassandra_daemon.log, cassandra_daemon_rep1.log, 
 nodetool_command.log







[jira] [Comment Edited] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession

2014-07-18 Thread Vladimir Avram (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067338#comment-14067338
 ] 

Vladimir Avram edited comment on CASSANDRA-7560 at 7/19/14 1:56 AM:


There is also a stalled AntiEntropySession on rep1


was (Author: vladmore):
There is also a stalled AntiEntropySession on this node.

 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
 --

 Key: CASSANDRA-7560
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Vladimir Avram
 Attachments: cassandra_daemon.log, cassandra_daemon_rep1.log, 
 cassandra_daemon_rep2.log, nodetool_command.log







[jira] [Updated] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession

2014-07-18 Thread Vladimir Avram (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Avram updated CASSANDRA-7560:
--

Attachment: cassandra_daemon_rep2.log

 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
 --

 Key: CASSANDRA-7560
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Vladimir Avram
 Attachments: cassandra_daemon.log, cassandra_daemon_rep1.log, 
 cassandra_daemon_rep2.log, nodetool_command.log







[jira] [Created] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession

2014-07-16 Thread Vladimir Avram (JIRA)
Vladimir Avram created CASSANDRA-7560:
-

 Summary: 'nodetool repair -pr' leads to indefinitely hanging 
AntiEntropySession
 Key: CASSANDRA-7560
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Vladimir Avram


Running {{nodetool repair -pr}} will sometimes hang on one of the resulting 
AntiEntropySessions.

The system logs will show the repair command starting

{panel}
 INFO [Thread-3079] 2014-07-15 02:22:56,514 StorageService.java (line 2569) 
Starting repair command #1, repairing 256 ranges for keyspace x
{panel}

You can then see a few AntiEntropySessions completing with:

{panel}
INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 RepairSession.java (line 
282) [repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed 
successfully
{panel}

Eventually we reach an AntiEntropySession that hangs just before requesting 
the merkle trees for the next column family in line for repair. We see the 
previous CF finish, and then the whole repair session hangs with no visible 
progress or errors on this or any of the related nodes.

{panel}
INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 RepairSession.java (line 221) 
[repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is fully synced
{panel}





[jira] [Updated] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession

2014-07-16 Thread Vladimir Avram (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Avram updated CASSANDRA-7560:
--

Description: 
Running {{nodetool repair -pr}} will sometimes hang on one of the resulting 
AntiEntropySessions.

The system logs will show the repair command starting

{noformat}
 INFO [Thread-3079] 2014-07-15 02:22:56,514 StorageService.java (line 2569) 
Starting repair command #1, repairing 256 ranges for keyspace x
{noformat}

You can then see a few AntiEntropySessions completing with:

{noformat}
INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 RepairSession.java (line 
282) [repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed 
successfully
{noformat}

Eventually we reach an AntiEntropySession that hangs just before requesting 
the merkle trees for the next column family in line for repair. We see the 
previous CF finish, and then the whole repair session hangs with no visible 
progress or errors on this or any of the related nodes.

{noformat}
INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 RepairSession.java (line 221) 
[repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is fully synced
{noformat}

  was:
Running {{nodetool repair -pr}} will sometimes hang on one of the resulting 
AntiEntropySessions.

The system logs will show the repair command starting

{panel}
 INFO [Thread-3079] 2014-07-15 02:22:56,514 StorageService.java (line 2569) 
Starting repair command #1, repairing 256 ranges for keyspace x
{panel}

You can then see a few AntiEntropySessions completing with:

{panel}
INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 RepairSession.java (line 
282) [repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed 
successfully
{panel}

Eventually we reach an AntiEntropySession that hangs just before requesting 
the merkle trees for the next column family in line for repair. We see the 
previous CF finish, and then the whole repair session hangs with no visible 
progress or errors on this or any of the related nodes.

{panel}
INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 RepairSession.java (line 221) 
[repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is fully synced
{panel}


 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
 --

 Key: CASSANDRA-7560
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Vladimir Avram






[jira] [Updated] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession

2014-07-16 Thread Vladimir Avram (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Avram updated CASSANDRA-7560:
--

Description: 
Running {{nodetool repair -pr}} will sometimes hang on one of the resulting 
AntiEntropySessions.

The system logs will show the repair command starting

{noformat}
 INFO [Thread-3079] 2014-07-15 02:22:56,514 StorageService.java (line 2569) 
Starting repair command #1, repairing 256 ranges for keyspace x
{noformat}

You can then see a few AntiEntropySessions completing with:

{noformat}
INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 RepairSession.java (line 
282) [repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed 
successfully
{noformat}

Eventually we reach an AntiEntropySession that hangs just before requesting 
the merkle trees for the next column family in line for repair. We see the 
previous CF finish, and then the whole repair session hangs with no visible 
progress or errors on this or any of the related nodes.

{noformat}
INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 RepairSession.java (line 221) 
[repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is fully synced
{noformat}

Notes:
* Single DC 6 node cluster with an average load of 86 GB per node.
* This appears to be random; it does not always happen on the same CF or on the 
same session.

  was:
Running {{nodetool repair -pr}} will sometimes hang on one of the resulting 
AntiEntropySessions.

The system logs will show the repair command starting

{noformat}
 INFO [Thread-3079] 2014-07-15 02:22:56,514 StorageService.java (line 2569) 
Starting repair command #1, repairing 256 ranges for keyspace x
{noformat}

You can then see a few AntiEntropySessions completing with:

{noformat}
INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 RepairSession.java (line 
282) [repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed 
successfully
{noformat}

Eventually we reach an AntiEntropySession that hangs just before requesting 
the merkle trees for the next column family in line for repair. We see the 
previous CF finish, and then the whole repair session hangs with no visible 
progress or errors on this or any of the related nodes.

{noformat}
INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 RepairSession.java (line 221) 
[repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is fully synced
{noformat}


 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
 --

 Key: CASSANDRA-7560
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Vladimir Avram



