[ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038181#comment-13038181 ]
Stu Hood edited comment on CASSANDRA-2433 at 5/23/11 8:10 PM:
--------------------------------------------------------------

0001
* Since we're not trying to control throughput or monitor sessions, could we just use Stage.MISC?

0002
* I think RepairSession.exception needs to be volatile to ensure that the awoken thread sees it
* Would it be better if RepairSession implemented IEndpointStateChangeSubscriber directly?
* The endpoint set needs to be threadsafe, since it will be modified by the endpoint state change thread, and the AE_STAGE thread

0003
* Should StreamInSession.retries be volatile/atomic? (likely they won't retry quickly enough for it to be a problem, but...)

0004
* Playing devil's advocate: would sending a half-built tree in case of failure still be useful?
* success might need to be volatile as well

Thanks Sylvain!

      was (Author: stuhood):

0001
* Since we're not trying to control throughput or monitor sessions, could we just use Stage.MISC?

0002
* I think RepairSession.exception needs to be volatile to ensure that the awoken thread sees it
* Would it be better if RepairSession implemented IEndpointStateChangeSubscriber directly?
* The endpoint set needs to be threadsafe, since it will be modified by the endpoint state change thread, and the AE_STAGE thread

0003
* Should StreamInSession.retries be volatile/atomic? (likely they won't retry quickly enough for it to be a problem, but...)

0004
* Playing devil's advocate: would sending a half-built tree in case of failure still be useful?

Thanks Sylvain!
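To make the 0002 notes on RepairSession.exception and the endpoint set concrete, here is a minimal sketch of the pattern being suggested. It is not the code from the patches: the class name RepairSessionSketch, its methods, and the use of a CountDownLatch and ConcurrentHashMap.newKeySet() are all assumptions chosen for illustration.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a repair session; names are illustrative only
// and do not mirror the actual Cassandra classes.
class RepairSessionSketch {
    // volatile: a failure recorded by one thread is visible to any other
    // thread that reads the field, including readers that never wait on the latch.
    private volatile Exception exception;

    // Thread-safe set: one thread (e.g. a gossip callback) removes endpoints
    // while the session thread may be reading it.
    private final Set<String> endpoints = ConcurrentHashMap.newKeySet();

    private final CountDownLatch completed = new CountDownLatch(1);

    RepairSessionSketch(Collection<String> participants) {
        endpoints.addAll(participants);
    }

    // Assumed to be called from a different thread when a participant dies.
    void onEndpointFailure(String endpoint) {
        // Only endpoints that are part of this session fail the session.
        if (endpoints.remove(endpoint)) {
            exception = new RuntimeException("Endpoint " + endpoint + " died");
            completed.countDown(); // wake up the waiting session thread
        }
    }

    // Normal completion path: release the waiter without recording an error.
    void complete() {
        completed.countDown();
    }

    // Non-blocking check; relies on the volatile read for visibility.
    boolean hasFailed() {
        return exception != null;
    }

    // Returns a snapshot so callers cannot mutate the live set.
    Set<String> remainingEndpoints() {
        return new HashSet<>(endpoints);
    }

    // Blocks until the session completes or fails, then rethrows any failure.
    void awaitCompletion() throws Exception {
        completed.await();
        Exception e = exception; // single volatile read
        if (e != null) {
            throw e;
        }
    }
}
```

In this sketch the latch already establishes a happens-before edge for the thread blocked in awaitCompletion(), so the volatile modifier matters mainly for readers such as hasFailed() that do not go through the latch. Whether the real wake-up mechanism gives the same guarantee depends on how it is implemented.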
> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments:
>             0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v2.patch,
>             0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch,
>             0002-Register-in-gossip-to-handle-node-failures-v2.patch,
>             0002-Register-in-gossip-to-handle-node-failures.patch,
>             0003-Report-streaming-errors-back-to-repair-v2.patch,
>             0003-Report-streaming-errors-back-to-repair.patch,
>             0004-Reports-validation-compaction-errors-back-to-repair-v2.patch,
>             0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to
> clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up
> the data partition.
> These issues together are making repair very difficult for nearly everyone
> running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was
> moved to 0.8 for a few reasons. This ticket is to fix the immediate issues
> that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira