[jira] [Issue Comment Edited] (CASSANDRA-2433) Failed Streams Break Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038181#comment-13038181 ]

Stu Hood edited comment on CASSANDRA-2433 at 5/23/11 8:09 PM:
--------------------------------------------------------------

0001
* Since we're not trying to control throughput or monitor sessions, could we just use Stage.MISC?

0002
* I think RepairSession.exception needs to be volatile to ensure that the awoken thread sees it
* Would it be better if RepairSession implemented IEndpointStateChangeSubscriber directly?
* The endpoint set needs to be threadsafe, since it will be modified by the endpoint state change thread, and the AE_STAGE thread

0003
* Should StreamInSession.retries be volatile/atomic? (likely they won't retry quickly enough for it to be a problem, but...)

0004
* Playing devil's advocate: would sending a half-built tree in case of failure still be useful?

Thanks Sylvain!

was (Author: stuhood):
0001
* Since we're not trying to control throughput or monitor sessions, could we just use Stage.MISC?

0002
* I think RepairSession.exception needs to be volatile to ensure that the awoken thread sees it
* Would it be better if RepairSession implemented IEndpointStateChangeSubscriber directly?
* The endpoint set needs to be threadsafe, since it will be modified by the endpoint state change thread, and the AE_STAGE thread

0003
* Should StreamInSession.retries be volatile/atomic? (likely they won't retry quickly enough for it to be a problem, but...)

0004
* Playing devil's advocate: would sending a half-built tree in case of failure still be useful?

Thanks Sylvain!
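The visibility concern raised for RepairSession.exception in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: RepairSessionSketch and its methods are hypothetical names and not the code in the attached patches, and a CountDownLatch stands in for whatever condition the real session waits on. The latch already orders the write before the wake-up for the waiting thread, so the volatile modifier matters mostly for readers such as hasFailed() that look at the field without going through the latch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for a repair session; names are illustrative only.
class RepairSessionSketch {
    // volatile so that any thread reading the field directly (without
    // passing through the latch) still sees the most recent write.
    private volatile Exception exception;
    private final CountDownLatch finished = new CountDownLatch(1);

    // Called from another thread, e.g. a failure-detection callback.
    void fail(Exception e) {
        exception = e;        // publish the failure first
        finished.countDown(); // then wake the waiting thread
    }

    // Called when the session finishes normally.
    void complete() {
        finished.countDown();
    }

    // Non-blocking check usable from any thread.
    boolean hasFailed() {
        return exception != null;
    }

    // Blocks until the session finishes or fails; rethrows a recorded failure.
    void awaitCompletion(long timeout, TimeUnit unit) throws Exception {
        if (!finished.await(timeout, unit))
            throw new TimeoutException("session did not finish in time");
        Exception e = exception;
        if (e != null)
            throw e;
    }

    public static void main(String[] args) throws Exception {
        RepairSessionSketch session = new RepairSessionSketch();
        // Simulate a stream failure reported from a different thread.
        Thread failer = new Thread(() -> session.fail(new java.io.IOException("stream failed")));
        failer.start();
        try {
            session.awaitCompletion(5, TimeUnit.SECONDS);
            System.out.println("completed");
        } catch (java.io.IOException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

Writing the field before signalling is the important ordering here; reversing the two statements in fail() would let the waiter wake up and find no exception recorded.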
Failed Streams Break Repair
---------------------------

Key: CASSANDRA-2433
URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
Project: Cassandra
Issue Type: Bug
Components: Core
Reporter: Benjamin Coverston
Assignee: Sylvain Lebresne
Labels: repair
Fix For: 0.8.1
Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v2.patch, 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 0002-Register-in-gossip-to-handle-node-failures-v2.patch, 0002-Register-in-gossip-to-handle-node-failures.patch, 0003-Report-streaming-errors-back-to-repair-v2.patch, 0003-Report-streaming-errors-back-to-repair.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v2.patch, 0004-Reports-validation-compaction-errors-back-to-repair.patch

Running repair in cases where a stream fails we are seeing multiple problems.

1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
2. The temp files are left behind and multiple failures can end up filling up the data partition.

These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.

This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Issue Comment Edited] (CASSANDRA-2433) Failed Streams Break Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038181#comment-13038181 ]

Stu Hood edited comment on CASSANDRA-2433 at 5/23/11 8:10 PM:
--------------------------------------------------------------

0001
* Since we're not trying to control throughput or monitor sessions, could we just use Stage.MISC?

0002
* I think RepairSession.exception needs to be volatile to ensure that the awoken thread sees it
* Would it be better if RepairSession implemented IEndpointStateChangeSubscriber directly?
* The endpoint set needs to be threadsafe, since it will be modified by the endpoint state change thread, and the AE_STAGE thread

0003
* Should StreamInSession.retries be volatile/atomic? (likely they won't retry quickly enough for it to be a problem, but...)

0004
* Playing devil's advocate: would sending a half-built tree in case of failure still be useful?
* success might need to be volatile as well

Thanks Sylvain!

was (Author: stuhood):
0001
* Since we're not trying to control throughput or monitor sessions, could we just use Stage.MISC?

0002
* I think RepairSession.exception needs to be volatile to ensure that the awoken thread sees it
* Would it be better if RepairSession implemented IEndpointStateChangeSubscriber directly?
* The endpoint set needs to be threadsafe, since it will be modified by the endpoint state change thread, and the AE_STAGE thread

0003
* Should StreamInSession.retries be volatile/atomic? (likely they won't retry quickly enough for it to be a problem, but...)

0004
* Playing devil's advocate: would sending a half-built tree in case of failure still be useful?

Thanks Sylvain!
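The two remaining concurrency points in the comment above, a thread-safe endpoint set and an atomic retry counter, might look something like the sketch below. StreamStateSketch, its method names, and the retry budget are all assumptions made for illustration; they are not taken from the patches. The sketch also uses ConcurrentHashMap.newKeySet(), a Java 8 API chosen for brevity, so it shows the idea and not what the code base of the time would have used.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder for per-session state touched by more than one thread.
class StreamStateSketch {
    // Concurrent set: safe to modify from a gossip callback thread and a
    // repair stage thread at the same time.
    private final Set<String> pendingEndpoints = ConcurrentHashMap.newKeySet();
    // Atomic counter: increments from different threads are never lost.
    private final AtomicInteger retries = new AtomicInteger();
    private final int maxRetries; // assumed retry budget, illustrative only

    StreamStateSketch(int maxRetries, String... endpoints) {
        this.maxRetries = maxRetries;
        pendingEndpoints.addAll(Arrays.asList(endpoints));
    }

    // Removes an endpoint; returns true only if it was still pending.
    boolean markDone(String endpoint) {
        return pendingEndpoints.remove(endpoint);
    }

    boolean isFinished() {
        return pendingEndpoints.isEmpty();
    }

    // Atomically counts one more attempt; false once the budget is exhausted.
    boolean tryRetry() {
        return retries.incrementAndGet() <= maxRetries;
    }

    int attempts() {
        return retries.get();
    }

    public static void main(String[] args) throws InterruptedException {
        StreamStateSketch state = new StreamStateSketch(3, "10.0.0.1", "10.0.0.2");
        Thread gossip = new Thread(() -> state.markDone("10.0.0.1")); // endpoint reported dead
        Thread stage = new Thread(() -> state.markDone("10.0.0.2"));  // endpoint finished its work
        gossip.start();
        stage.start();
        gossip.join();
        stage.join();
        System.out.println("finished: " + state.isFinished());
        System.out.println("retry allowed: " + state.tryRetry());
    }
}
```

A plain volatile int would make the retry count visible across threads but would not make retries++ atomic, which is why the sketch uses AtomicInteger for the counter.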