[jira] [Issue Comment Edited] (CASSANDRA-2433) Failed Streams Break Repair

2011-05-23 Thread Stu Hood (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038181#comment-13038181
 ] 

Stu Hood edited comment on CASSANDRA-2433 at 5/23/11 8:09 PM:
--

0001
* Since we're not trying to control throughput or monitor sessions, could we 
just use Stage.MISC?

0002
* I think RepairSession.exception needs to be volatile to ensure that the 
awoken thread sees it
* Would it be better if RepairSession implemented 
IEndpointStateChangeSubscriber directly?
* The endpoint set needs to be threadsafe, since it will be modified by both 
the endpoint state change thread and the AE_STAGE thread
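
To illustrate the two thread-safety points above, here is a minimal sketch. It is not code from the patch: RepairSessionSketch, onEndpointFailure, awaitResult and the string endpoint names are all hypothetical stand-ins. It assumes one thread reports a dead endpoint while another thread is blocked waiting on the session.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a repair session; not the class from the patch.
class RepairSessionSketch {
    // volatile: a write made by the failure-reporting thread is visible
    // to any thread that reads the field afterwards.
    private volatile Exception exception;

    // Concurrent set: safe to modify from the endpoint state change
    // thread and the repair stage thread at the same time.
    private final Set<String> endpoints = ConcurrentHashMap.newKeySet();

    private final CountDownLatch done = new CountDownLatch(1);

    RepairSessionSketch(Set<String> initialEndpoints) {
        endpoints.addAll(initialEndpoints);
    }

    // Called from the (assumed) endpoint state change thread.
    void onEndpointFailure(String endpoint) {
        if (endpoints.remove(endpoint)) {
            // Publish the failure first, then wake the waiter.
            exception = new RuntimeException("Endpoint " + endpoint + " died");
            done.countDown();
        }
    }

    // Called from the repair thread when all work has completed.
    void onCompleted() {
        done.countDown();
    }

    // Blocks until completion or failure; returns the failure, or null.
    Exception awaitResult() {
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        }
        return exception;
    }

    int remainingEndpoints() {
        return endpoints.size();
    }
}
```

In this sketch the CountDownLatch already gives a happens-before edge between countDown() and the return from await(), so volatile is belt-and-braces on that path. It matters for any read of the field that does not go through the latch.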

0003
* Should StreamInSession.retries be volatile/atomic? (likely they won't retry 
quickly enough for it to be a problem, but...)
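
As a sketch of the atomic option, assuming the counter is a plain int today: volatile alone would make updates visible, but an increment is still a read-modify-write, so two concurrent retries could both pass a limit check. RetryCounterSketch and tryRetry are hypothetical names, not the patch's API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a streaming session's retry bookkeeping.
class RetryCounterSketch {
    private final int maxRetries;
    private final AtomicInteger retries = new AtomicInteger(0);

    RetryCounterSketch(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // Atomically consumes one retry attempt; returns false once the
    // limit is reached. The compare-and-set loop ensures the check and
    // the increment cannot be interleaved with another thread's update.
    boolean tryRetry() {
        while (true) {
            int current = retries.get();
            if (current >= maxRetries) {
                return false;
            }
            if (retries.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    int retriesSoFar() {
        return retries.get();
    }
}
```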

0004
* Playing devil's advocate: would sending a half-built tree in case of failure 
still be useful?

Thanks Sylvain!

 Failed Streams Break Repair
 ---

 Key: CASSANDRA-2433
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benjamin Coverston
Assignee: Sylvain Lebresne
  Labels: repair
 Fix For: 0.8.1

 Attachments: 
 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v2.patch, 
 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 
 0002-Register-in-gossip-to-handle-node-failures-v2.patch, 
 0002-Register-in-gossip-to-handle-node-failures.patch, 
 0003-Report-streaming-errors-back-to-repair-v2.patch, 
 0003-Report-streaming-errors-back-to-repair.patch, 
 0004-Reports-validation-compaction-errors-back-to-repair-v2.patch, 
 0004-Reports-validation-compaction-errors-back-to-repair.patch


 When a stream fails during repair, we are seeing multiple problems:
 1. Although a retry is initiated and completes, the old stream doesn't seem to 
 clean itself up, and repair hangs.
 2. The temp files are left behind, and multiple failures can end up filling up 
 the data partition.
 Together, these issues are making repair very difficult for nearly everyone 
 running repair on a non-trivially sized data set.
 This issue is also being worked on w.r.t. CASSANDRA-2088; however, that was 
 moved to 0.8 for a few reasons. This ticket is to fix the immediate issues 
 that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Issue Comment Edited] (CASSANDRA-2433) Failed Streams Break Repair

2011-05-23 Thread Stu Hood (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038181#comment-13038181
 ] 

Stu Hood edited comment on CASSANDRA-2433 at 5/23/11 8:10 PM:
--

0001
* Since we're not trying to control throughput or monitor sessions, could we 
just use Stage.MISC?

0002
* I think RepairSession.exception needs to be volatile to ensure that the 
awoken thread sees it
* Would it be better if RepairSession implemented 
IEndpointStateChangeSubscriber directly?
* The endpoint set needs to be threadsafe, since it will be modified by both 
the endpoint state change thread and the AE_STAGE thread

0003
* Should StreamInSession.retries be volatile/atomic? (likely they won't retry 
quickly enough for it to be a problem, but...)

0004
* Playing devil's advocate: would sending a half-built tree in case of failure 
still be useful?
* success might need to be volatile as well

Thanks Sylvain!
