[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068907#comment-13068907 ]

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

Alright, v5 looks good to me. Committed, thanks.

                Repair doesn't synchronize merkle tree creation properly
                --------------------------------------------------------

                 Key: CASSANDRA-2816
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2816
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Sylvain Lebresne
            Assignee: Sylvain Lebresne
              Labels: repair
             Fix For: 0.8.2
         Attachments: 0001-Schedule-merkle-tree-request-one-by-one.patch, 2816-v2.txt, 2816-v4.txt, 2816-v5.txt, 2816_0.8_v3.patch

Being a little slow, I just realized after having opened CASSANDRA-2811 and CASSANDRA-2815 that there is a more general problem with repair. When a repair is started, it sends a number of merkle tree requests to its neighbors as well as to itself, and it assumes for correctness that the building of those trees starts on every node at roughly the same time (if not, we end up comparing data snapshots taken at different times, and will thus mistakenly repair a lot of data needlessly). This assumption is bogus for several reasons:
* Because validation compaction runs on the same executor as other compactions, the start of the validation on the different nodes depends on those other compactions. 0.8 mitigates this somewhat by being multi-threaded (so there is less chance of being blocked for a long time behind a long-running compaction), but since the compaction executor is bounded, it is still a problem.
* If you run nodetool repair without arguments, it repairs every CF. As a consequence it generates lots of merkle tree requests, and all of those requests are issued at the same time. Because even in 0.8 the compaction executor is bounded, some of those validations end up queued behind the first ones. Even assuming the validations are submitted in the same order on each node (which isn't guaranteed either), there is no guarantee that the first validation takes the same time on all nodes, hence desynchronizing the queued ones.

Overall, it is important for the precision of repair that for a given CF and range (the unit at which trees are computed), all nodes start the validation at the same time (or, since we can't do magic, as close to the same time as possible).

One (reasonably simple) proposition to fix this would be to have repair schedule validation compactions across nodes one by one (i.e., one CF/range at a time), waiting for all nodes to return their tree before submitting the next request (see the sketch after this list). Then, on each node, we should make sure that the validation compaction starts as soon as it is requested. For that, we probably want a dedicated executor for validation compactions, and then either:
* fail the whole repair whenever one node is not able to execute the validation compaction right away (because no thread is available), or
* simply tell the user that if they start too many repairs in parallel, some of those repairs may repair more data than they should.
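To make the proposed scheduling concrete, here is a minimal sketch of the coordinator side, under stated assumptions: the names (RepairDriverSketch, TreeRequestDispatcher, requestTree) are invented for illustration and are not the actual AntiEntropyService API.

{code:java}
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the proposed scheduling: request the merkle tree
// for a single (CF, range) job from every endpoint, and only move on to the
// next job once every endpoint has answered.
public class RepairDriverSketch
{
    interface TreeRequestDispatcher
    {
        // Sends a tree request to one endpoint; assumed to invoke
        // done.countDown() when that endpoint's tree has been received.
        void requestTree(String endpoint, String cf, String range, CountDownLatch done);
    }

    public void repair(List<String> endpoints, List<String[]> cfRangeJobs,
                       TreeRequestDispatcher dispatcher) throws InterruptedException
    {
        for (String[] job : cfRangeJobs)
        {
            String cf = job[0], range = job[1];
            CountDownLatch allTrees = new CountDownLatch(endpoints.size());
            // All endpoints (neighbors and the local node) start validation
            // for the same CF/range at (close to) the same time.
            for (String endpoint : endpoints)
                dispatcher.requestTree(endpoint, cf, range, allTrees);
            // Block until every tree for this job is back before issuing the
            // next request, keeping the snapshots aligned across nodes.
            allTrees.await();
        }
    }
}
{code}

The point is simply that the latch serializes jobs: no node is asked for the tree of job N+1 until every node has answered for job N.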
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068953#comment-13068953 ]

Hudson commented on CASSANDRA-2816:
-----------------------------------

Integrated in Cassandra-0.8 #231 (See [https://builds.apache.org/job/Cassandra-0.8/231/])

Properly synchronize merkle tree computation
patch by slebresne; reviewed by jbellis for CASSANDRA-2816

slebresne: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1149121

Files:
* /cassandra/branches/cassandra-0.8/test/unit/org/apache/cassandra/service/AntiEntropyServiceTestAbstract.java
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/service/AntiEntropyService.java
* /cassandra/branches/cassandra-0.8/CHANGES.txt
* /cassandra/branches/cassandra-0.8/conf/cassandra.yaml
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/service/StorageService.java
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/concurrent/DebuggableThreadPoolExecutor.java
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/compaction/CompactionManager.java
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068508#comment-13068508 ]

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

I think that if we don't want the validation executor of v4 to ever queue tasks (which is what we need), then we need the executor queue to be a bounded queue of size 0 (i.e., one that doesn't accept elements). Indeed, as per the documentation of ThreadPoolExecutor:
{noformat}
If corePoolSize or more threads are running, the Executor always prefers queuing a request rather than adding a new thread.
{noformat}
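For what it's worth, the standard JDK way to get a "queue of size 0" is a SynchronousQueue: its offer() fails immediately when no worker thread is waiting, so a ThreadPoolExecutor grows up to maximumPoolSize and then rejects outright instead of queueing. A self-contained sketch of that behavior (not the patch's actual code; the pool sizes here are arbitrary):

{code:java}
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class NoQueueExecutorDemo
{
    public static void main(String[] args) throws Exception
    {
        // corePoolSize = 1, maximumPoolSize = 2, and a SynchronousQueue:
        // offer() returns false when no worker is waiting for a task, so
        // the pool spawns threads up to the maximum instead of queueing,
        // and any task beyond that is rejected immediately.
        ThreadPoolExecutor validationExecutor = new ThreadPoolExecutor(
                1, 2, 60, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());

        Runnable longValidation = new Runnable()
        {
            public void run()
            {
                try { Thread.sleep(1000); } catch (InterruptedException e) { }
            }
        };

        validationExecutor.execute(longValidation); // runs on thread 1
        validationExecutor.execute(longValidation); // runs on thread 2
        try
        {
            validationExecutor.execute(longValidation); // no free thread: rejected, never queued
        }
        catch (RejectedExecutionException e)
        {
            System.out.println("third validation rejected rather than queued");
        }
        validationExecutor.shutdown();
    }
}
{code}

With an unbounded (or any non-zero-capacity) queue, the third task would instead wait silently behind the others, which is exactly the desynchronization this ticket is trying to avoid.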
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067011#comment-13067011 ]

Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------

bq. I'm kinda +1 on the simple version w/o bounds

Me too. +1 with that change.
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067019#comment-13067019 ]

Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------

(I'll go ahead and submit a version with that change.)
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064874#comment-13064874 ]

Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------

bq. making it unlimited feels dangerous, because if you do so, it means that if the user starts a lot of repairs, all the validation compactions will start right away

But the easy solution is: don't do that. By setting a finite number greater than one, you have to restart machines when you realize "oh, I want 3 simultaneous now." I'd rather keep it simple: make it unbounded, no configuration settings. If you ignore the instructions to only run one repair at once, then either you know what you're doing (maybe you have SSDs) or you will find out very quickly and never do it again. :)
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064898#comment-13064898 ]

Peter Schuller commented on CASSANDRA-2816:
-------------------------------------------

I'm kinda +1 on the simple version w/o bounds, but not too fussy, since I can obviously set it very high for my use case. In any case, the most important part for the mixed small/large type of situation is that concurrent repair is possible at all, even if configuration changes are needed.
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059962#comment-13059962 ]

Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------

bq. The patch implements the idea of scheduling the merkle tree requests one by one, to make sure the trees are started as close as possible to the same time.

Can you point out where this happens in AES?

bq. This also puts validation compactions in their own executor (to avoid them being queued up behind standard compactions). That specific executor is created with 2 core threads, to allow for Peter's use case of wanting to do multiple repairs at the same time. That is, by default, you can do 2 repairs involving the same node and be ok.

That feels like the wrong default to me. I think you can make a case for one (minimal interference with the rest of the system) or unlimited (no weird cliff to catch the unwary repair operator). But two is weird. :)
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059991#comment-13059991 ]

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

bq. Can you point out where this happens in AES?

Mostly in AES.rendezvous and AES.RepairSession. Basically, RepairSession creates a queue of jobs, a job representing the repair of a given column family (for a given range, but that comes from the session itself). AES.rendezvous is then called for each received merkle tree. It waits until it has all the merkle trees for the first job in the queue. When that is done, it dequeues the job (computing the merkle tree differences and scheduling streaming accordingly) and sends the tree requests for the next job in the queue (a sketch of this rendezvous logic is included after this comment). Moreover, in StorageService.forceTableRepair(), when scheduling the repair for all the ranges of the node, we actually start the session for the first range and wait for all the jobs for that range to be done before starting the next session.

bq. That feels like the wrong default to me. I think you can make a case for one (minimal interference with the rest of the system) or unlimited (no weird cliff to catch the unwary repair operator). But two is weird.

Well, the rationale was the following: if you set it to one, then you're saying that as soon as you start 2 repairs in parallel, they will start being inaccurate. But as Peter was suggesting (maybe in another ticket, but anyway), if you have huge CFs and tiny ones, it's nice to be able to run repair on the tiny ones while a repair on the huge one(s) is running. Now, making it unlimited feels dangerous, because if you do so, it means that if the user starts a lot of repairs, all the validation compactions will start right away. This will kill the cluster (at least a few nodes, if all those repairs were started on the same node). It sounded better to have degraded precision for repair in those cases rather than basically killing the nodes. Maybe 3 or 4 would be a better default than 2, but 1 is a bit limited and unlimited is clearly much too dangerous.
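For illustration, here is a hedged model of that rendezvous bookkeeping. The real logic lives in AntiEntropyService; the class and method names below are invented for the sketch, and it assumes a non-empty job queue:

{code:java}
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative model of the rendezvous described above: trees arrive one by
// one, and only when the head job has a tree from every expected endpoint do
// we compute differences and kick off the next job's tree requests.
public class RendezvousSketch
{
    static class Job
    {
        final String cfAndRange;
        final Map<String, Object> trees = new HashMap<String, Object>(); // endpoint -> merkle tree
        final int expectedTrees;

        Job(String cfAndRange, int expectedTrees)
        {
            this.cfAndRange = cfAndRange;
            this.expectedTrees = expectedTrees;
        }
    }

    private final Queue<Job> jobs = new ArrayDeque<Job>();

    // Called for each received merkle tree (cf. AES.rendezvous).
    synchronized void rendezvous(String endpoint, Object tree)
    {
        Job head = jobs.peek();
        head.trees.put(endpoint, tree);
        if (head.trees.size() == head.expectedTrees)
        {
            jobs.poll();
            computeDifferencesAndStream(head); // diff the trees, schedule streaming
            if (!jobs.isEmpty())
                sendTreeRequests(jobs.peek()); // start the next job on all nodes
        }
    }

    void computeDifferencesAndStream(Job job) { /* placeholder */ }
    void sendTreeRequests(Job job) { /* placeholder */ }
}
{code}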
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059445#comment-13059445 ]

Terje Marthinussen commented on CASSANDRA-2816:
-----------------------------------------------

Things definitely seem to be improved overall, but weird things still happen. So... 12 node cluster, and (this is maybe ugly, I know) repair started on all of them. Most nodes are fine, but one goes crazy. Disk use is now 3-4 times what it was before the repair started, and it does not seem to be done yet. I really have no idea if this is the case, but I am getting the hunch that this node has ended up streaming out some of the data it is getting in. Would this be possible?
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059459#comment-13059459 ]

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

bq. So... 12 node cluster, and (this is maybe ugly, I know) repair started on all of them.

Is it started on all of them? If so, this is kind of expected, in the sense that the patch assumes that each node does not do more than 2 repairs (for any column family) at the same time (this is configurable through the new concurrent_validators option, but it's probably better to stick to 2 and stagger the repairs). If you do more than that (that is, if you did repair on all nodes at the same time and RF > 2), then we're back to our old demons.

bq. I really have no idea if this is the case, but I am getting the hunch that this node has ended up streaming out some of the data it is getting in. Would this be possible?

Not really. That is, it could be that you create a merkle tree on some data, and once you start streaming, you're picking up data that was just streamed to you and wasn't there when computing the tree. This patch is supposed to fix that in part, but it can still happen if you do repairs in parallel on neighboring nodes. However, you shouldn't get into a situation where 2 nodes stream forever because they pick up what was just streamed to them, because what is streamed is determined at the very beginning of the streaming session. So my first question would be: were all those repairs started in parallel? If yes, you shall not do this :). CASSANDRA-2606 and CASSANDRA-2610 are there to help make repairing a full cluster much easier (and more efficient), but right now it's more about getting patches in one at a time. If the repairs were started one at a time in a rolling fashion, then we do have an unknown problem somewhere.
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059485#comment-13059485 ]

Terje Marthinussen commented on CASSANDRA-2816:
-----------------------------------------------

Cool! Then you confirmed what I have sort of believed for a while, but my understanding of the code has been a bit in conflict with http://wiki.apache.org/cassandra/Operations, which says:

bq. It is safe to run repair against multiple machines at the same time, but to minimize the impact on your application workload it is recommended to wait for it to complete on one node before invoking it against the next.

I have always read that as: if you have the HW, go for it! May I change it to:

bq. It is safe to run repair against multiple machines at the same time. However, to minimize the amount of data transferred during a repair, careful synchronization is required between the nodes taking part in the repair. This is difficult to achieve if nodes holding replicas of the same data run repair at the same time, and doing so can in extreme cases generate excessive transfers of data. Improvements are being worked on, but for now, avoid scheduling repair on several nodes with replicas of the same data at the same time.
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059636#comment-13059636 ]

Terje Marthinussen commented on CASSANDRA-2816:
-----------------------------------------------

Regardless of the documentation change, however, I don't think it should be possible to actually trigger a scenario like this in the first place. The system should protect the user from that.

I also noticed that in this case we have RF=3. The node which is going somewhat crazy is number 6; however, during the repair it does log that it talks to, compares, and streams data with nodes 4, 5, 7 and 8. Seems like a couple of nodes too many?
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059643#comment-13059643 ]

Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------

bq. May I change it to

Sure.

bq. The system should protect the user from that

I'm not sure that in a p2p design we can posit an omniscient "the system".
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13059655#comment-13059655 ] Terje Marthinussen commented on CASSANDRA-2816: ---
bq. I'm not sure that in a p2p design we can posit an omniscient "the system".
Is that a philosophical statement? :) As Cassandra, at least for now, is a p2p network with fairly clearly defined boundaries, I will continue calling it a system for now :)
However, looking at it from the p2p viewpoint, the user potentially has no clue about where replicas are stored, and given this, it may be impossible for the user to issue repair manually on more than one node at a time without getting into trouble. Given a large enough p2p setup, it would also be non-trivial to actually schedule a complete repair without ending up with two or more repairs running on the same replica set.
Since Cassandra does not checkpoint the synchronization, it is forced to rescan everything on every repair; repairs then easily take so long that you are forced to run them on several nodes at a time if you are going to finish repairing all nodes within 10 days...
Anyway, this is way outside the scope of this jira :)
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13059658#comment-13059658 ] Terje Marthinussen commented on CASSANDRA-2816: ---
bq. I also noticed that in this case we have RF=3. The node that is going somewhat crazy is number 6, yet during the repair it logs that it compares and streams data with nodes 4, 5, 7 and 8.
This may actually be correct. Node 7 will replicate to nodes 6 and 8, so 6 and 8 would share data.
So, to make things safe even with this patch, every 4th node can run repair at the same time if RF=3? But you still need to run repair on each of those 4 nodes to make sure everything is repaired?
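A quick way to sanity-check the node set: on a plain ring with SimpleStrategy and no rack awareness, the nodes whose data overlaps a repairing node's are just its RF-1 predecessors and RF-1 successors. The sketch below is a hypothetical helper, not Cassandra code, and it reproduces the 4/5/7/8 set from the logs.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Which nodes share at least one replica range with a repairing node on a
// simple ring (SimpleStrategy, no rack awareness). With RF=3, node i holds
// replicas for the primary ranges of i-2 and i-1, and its own primary range
// is replicated to i+1 and i+2, so repair on i can involve i-2 .. i+2.
public class RepairNeighbors
{
    // Nodes are labeled 1..ringSize to match the numbering in the thread.
    static List<Integer> neighbors(int node, int ringSize, int rf)
    {
        List<Integer> result = new ArrayList<>();
        for (int offset = -(rf - 1); offset <= rf - 1; offset++)
        {
            if (offset != 0)
                result.add(Math.floorMod(node - 1 + offset, ringSize) + 1);
        }
        return result;
    }

    public static void main(String[] args)
    {
        // For an 8-node ring with RF=3, repair on node 6 involves exactly
        // nodes 4, 5, 7 and 8 - matching what the logs showed.
        System.out.println(neighbors(6, 8, 3)); // prints [4, 5, 7, 8]
    }
}
{code}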
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13057339#comment-13057339 ] Terje Marthinussen commented on CASSANDRA-2816: ---
This is what the heap looks like when GC starts slowing things down so much that even gossip gets delayed long enough for nodes to be marked down for some seconds:
{noformat}
num  #instances  #bytes      class name
1:   9453188     453753024   java.nio.HeapByteBuffer
2:   10081546    392167064   [B
3:   7616875     24374       org.apache.cassandra.db.Column
4:   9739914     233757936   java.util.concurrent.ConcurrentSkipListMap$Node
5:   4131938     99166512    java.util.concurrent.ConcurrentSkipListMap$Index
6:   1549230     49575360    org.apache.cassandra.db.DeletedColumn
{noformat}
I guess this really ends up being the mix of everything going on in total and all the reading and writing that can occur when repair runs (validation compactions, streaming, normal compactions and regular traffic all at the same time, possibly across many CFs at once). However, I have suspected for some time that our young generation was a bit on the small side, and after increasing it and giving the heap a few more GB to work with, things seem to be behaving quite a bit better.
I mentioned issues with this patch when testing for CASSANDRA-2521. That was a problem caused by me: I was playing around with git for the first time and managed to apply 2816 to a different branch than the one I used for testing :( My apologies.
Initial testing with that corrected looks a lot better for my small-scale test case, but I noticed one case where I deleted an sstable and restarted and it did not get repaired (repair scanned but did nothing). Not entirely sure what to make of that; I then deleted another sstable and repair started running. I will test more over the next few days.
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055421#comment-13055421 ] Sylvain Lebresne commented on CASSANDRA-2816: ---
bq. We have also spotted very noticeable issues with full GCs when the merkle trees are passed around. Hopefully this could fix that too.
This does make sure that we don't run multiple validations at the same time and that we keep only a small number of merkle trees in memory at once, so I suppose this could help on the GC side. Overall I don't know how optimistic to be about that, in part because I'm not sure what causes your issues. But it can't hurt on that side, at least.
bq. I will see if I can get this patch tested somewhere if it is ready for that.
I believe it should be ready for that.
bq. would it be a potentially interesting idea to separate tombstones into different sstables
The thing is that some tombstones may be irrelevant because some update supersedes them (this is especially true of row tombstones). Hence, basing a repair on tombstones only may transfer irrelevant data. Depending on the use case, this will be more or less of a big deal. Also, reads would be impacted in that we would often have to hit twice as many sstables. Given that it's not a crazy idea either to want to repair data regularly (if only for durability guarantees), I don't know if it is worth the trouble (we would have to separate tombstones from data at flush time, maintain the two separate sets of data/tombstone sstables, etc...).
bq. make compaction deterministic or synchronized by a master across nodes
I'm pretty sure we want to avoid going to a master architecture for anything if we can. Having a master means that failure handling is more difficult (think network partition, for instance) and requires leader election and such, and the whole point of Cassandra's fully distributed design is to avoid those. Even leaving that aside, synchronizing compaction means synchronizing flush somehow, and you would need to be very precise if you're going to use whole-sstable md5s, which will be hard and quite probably inefficient.
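For reference, the scheduling change under discussion amounts to a fan-out-then-barrier loop over one CF/range at a time. The sketch below illustrates that idea only; all interfaces here are hypothetical and do not match the actual AntiEntropyService API.
{code:java}
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hypothetical types, for illustration only.
interface TreeRequester
{
    // Asks one node to run a validation compaction for one CF + range and
    // completes when that node returns its merkle tree.
    CompletableFuture<MerkleTree> requestTree(String node, String cf, String range);
}

class SequentialRepair
{
    private final TreeRequester requester;

    SequentialRepair(TreeRequester requester) { this.requester = requester; }

    void repair(List<String> replicas, List<String> cfs, List<String> ranges)
    {
        for (String cf : cfs)
        {
            for (String range : ranges)
            {
                // Fan out one request per replica (including ourselves), then
                // wait for *all* trees before moving to the next CF/range, so
                // every node validates roughly the same point in time.
                List<CompletableFuture<MerkleTree>> trees = replicas.stream()
                        .map(node -> requester.requestTree(node, cf, range))
                        .collect(Collectors.toList());
                CompletableFuture.allOf(trees.toArray(new CompletableFuture[0])).join();
                compareAndStream(cf, range, trees); // diff trees, stream mismatches (omitted)
            }
        }
    }

    private void compareAndStream(String cf, String range,
                                  List<CompletableFuture<MerkleTree>> trees) {}
}

class MerkleTree {}
{code}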
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055455#comment-13055455 ] Terje Marthinussen commented on CASSANDRA-2816: ---
I don't know what causes GC pressure during repairs either, but fire off repair on a few nodes with 100 million docs/node and there is a reasonable chance that a node here and there will log messages about reducing cache sizes due to memory pressure (I am not really sure it is a good idea to do this at all; reducing caches under stress rarely improves anything) or about full GC.
The thought about master-controlled compaction would not really affect network splits etc. Reconciliation after a network split is just as complex with or without a master: we need to get back to a state where all the nodes have the same data, which is a complex task either way. This is more a consideration of the fact that we do not necessarily need to live in a quorum-based world during compaction, and we are free to use alternative approaches in the compactor without changing the read/write path or affecting availability.
Master selection is not really a problem here: start compaction, talk to other nodes with the same token ranges, select a leader. It does not even have to be the same master every time, and we could consider making compaction part of a background read repair to reduce the number of times we need to read/write data.
For instance, if we can verify that the oldest/biggest sstable is 100% in sync with the data on other replicas when it is compacted (why not do it during compaction, when we go through the data anyway, rather than later?), can we use that info to optimize the scans done during repairs by using only sstables containing data received after some checkpoint in time as the starting point for the consistency check?
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055493#comment-13055493 ] Jonathan Ellis commented on CASSANDRA-2816: ---
bq. I am not really sure it is a good idea to do this at all; reducing caches under stress rarely improves anything
(This is on by default because the most common cause of OOMing is people configuring their caches too large.)
It sounds odd to me that repair would balloon memory usage dramatically. Do you have monitoring graphs that show the difference in heap usage between normal operation and a repair in progress?
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054841#comment-13054841 ] Terje Marthinussen commented on CASSANDRA-2816: ---
Sounds good to me. This sounds very interesting.
We have also spotted very noticeable issues with full GCs when the merkle trees are passed around. Hopefully this could fix that too. I will see if I can get this patch tested somewhere if it is ready for that.
On a side topic, given the importance of getting tombstones properly synchronized within GCGraceSeconds, would it be a potentially interesting idea to separate tombstones into different sstables, to reduce the need to scan the whole dataset so frequently in the first place?
Another thought may be to make compaction deterministic, or synchronized by a master across nodes, so that for older data all we needed was to compare pre-stored md5s of whole sstables. That is, while keeping the masterless design for updates, we could consider a master-based design for how older data is organized by the compactor, so it would be much easier to verify that old data is the same without any large regular scans, and that data is really the same after big compactions etc.
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054289#comment-13054289 ] Peter Schuller commented on CASSANDRA-2816: ---
I've thought about this problem too, and it is really significant for some use cases, again because so few writes are needed to trigger large amounts of data being sent, given the merkle tree granularity. While I'm all for fixing it by e.g. more immediate snapshotting, I would like to raise the issue that repairs overall have pretty significant side effects, particularly ones that can self-magnify and cause further problems. Beyond the obvious "it does disk I/O and uses CPU", we have:
* Over-repair due to merkle tree granularity can cause jumps in CF sizes, killing cache locality.
* Combine that with concurrent repairs then repairing the size-jumped set of sstables, and you can magnify that effect on other nodes, causing huge size increases.
* Until recently, mixing large and small CFs was a significant problem if you wanted different repair frequencies and different gc grace times, due to one repair blocking on another. But fixes to this and the other JIRA about concurrency might disable the fix for that (which was concurrent compaction) - so back to square one.
I guess overall it seems very easy to shoot yourself in the foot with repair. Any opinions on CASSANDRA-2699 for longer-term changes to repair?
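To put a rough number on the over-repair point: with a fixed-depth merkle tree, a single mismatching row invalidates its entire leaf range, so the data streamed per stray write is roughly (data per node) / (number of leaves). The figures below are purely illustrative; the real tree depth and data sizes vary.
{code:java}
// Back-of-the-envelope sketch of the over-repair effect described above.
// All numbers are illustrative assumptions, not Cassandra defaults.
public class OverRepairEstimate
{
    public static void main(String[] args)
    {
        long bytesPerNode = 100L * 1024 * 1024 * 1024; // assume 100 GB on the node
        int treeDepth = 15;                            // assume 2^15 = 32768 leaves
        long leaves = 1L << treeDepth;

        long bytesPerLeaf = bytesPerNode / leaves;     // ~3 MB per leaf range
        System.out.printf("One differing row can trigger streaming ~%d MB%n",
                          bytesPerLeaf / (1024 * 1024));

        // 1000 scattered differing rows, worst case all in distinct leaves:
        long streamed = 1000 * bytesPerLeaf;           // ~3 GB streamed for ~1000 rows
        System.out.printf("1000 scattered rows: ~%d GB streamed%n",
                          streamed / (1024L * 1024 * 1024));
    }
}
{code}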
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054297#comment-13054297 ] Sylvain Lebresne commented on CASSANDRA-2816: ---
I'm not sure what you mean by snapshotting immediately or polishing our snapshot support, but one approach that I think is equivalent (or maybe that is what you meant by 'snapshotting') would be to grab references to the sstables at the very beginning for each request and use those throughout the repair. This has a problem, however: it means we prevent sstables from being deleted during repair, including sstables that are compacted in the meantime. Because repair can take a while, this will be bad.
It would also require changes to the wire protocol (because we'd need a way to indicate during streaming the set of sstables to consider), and since we've kind of decided not to do that in minor releases (at least until we've discussed it), this cannot be released quickly. Which is bad, because I'm pretty sure this is a good part of the reason why some people with big data sets have had huge pain with repair.
Scheduling the validations one by one avoids those problems. In theory this means we'll do less work in parallel, but in practice I doubt this is a big deal, since the goal is probably to have repair have less impact on the node rather than more. It will also make this easier to reason about.
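The retention problem comes from reference counting: as long as any reader (here, the repair) holds a reference, a compacted sstable's files cannot be unlinked, so a repair that pins references for hours keeps both the old and the newly compacted files on disk. A simplified sketch of that mechanism; Cassandra's real SSTableReader uses a similar scheme, but this is not its actual API.
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Simplified, hypothetical sstable handle with reference counting.
class SSTableHandle
{
    private final AtomicInteger refs = new AtomicInteger(1); // 1 = live in the data view
    private volatile boolean compactedAway = false;

    boolean acquire()
    {
        // Classic increment-if-not-zero loop: fail if already fully released.
        int current;
        do
        {
            current = refs.get();
            if (current == 0)
                return false;
        } while (!refs.compareAndSet(current, current + 1));
        return true;
    }

    void release()
    {
        // The file is deleted only once it is compacted away AND unreferenced.
        if (refs.decrementAndGet() == 0 && compactedAway)
            deleteFile();
    }

    void markCompactedAway()
    {
        compactedAway = true;
        release(); // drop the "live" reference held by the data view
    }

    private void deleteFile() { /* unlink the data/index files (omitted) */ }
}
{code}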
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053820#comment-13053820 ] Jonathan Ellis commented on CASSANDRA-2816: ---
I guess a dedicated validation executor is OK as long as it still obeys the global compaction I/O limit.
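One way to satisfy both constraints (start validation immediately, yet stay within the global compaction throughput budget) is a dedicated executor whose reads draw permits from the same limiter as regular compactions. A sketch under those assumptions using Guava's RateLimiter; the class names and the 16 MB/s figure are hypothetical:
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.google.common.util.concurrent.RateLimiter;

// Hypothetical sketch: validation gets its own executor so it starts right
// away instead of queueing behind regular compactions, but all compaction
// and validation reads share one global throughput limiter.
class ValidationExecutor
{
    // One shared limiter for every kind of compaction I/O; permits = bytes/sec.
    static final RateLimiter GLOBAL_COMPACTION_LIMITER =
            RateLimiter.create(16.0 * 1024 * 1024); // assume a 16 MB/s budget

    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    void submitValidation(Runnable buildMerkleTree)
    {
        executor.execute(buildMerkleTree);
    }

    // Called from the validation read loop for each chunk of sstable data:
    // blocks if compactions + validations together exceed the global budget.
    static void accountBytesRead(int bytes)
    {
        GLOBAL_COMPACTION_LIMITER.acquire(bytes);
    }
}
{code}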
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054233#comment-13054233 ] Stu Hood commented on CASSANDRA-2816: ---
I'm a fan of the snapshot-immediately-after-receiving-the-request approach. In general, polishing our snapshot support to allow for this kind of use case is likely to open up other interesting possibilities.
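For a sense of what snapshot-on-request could look like: Cassandra snapshots are hard links, so fixing the data set the moment the validation request arrives costs no data copy, even if the validation itself then waits in a queue. A rough sketch with a hypothetical directory layout (real sstables have multiple component files; this is a simplification):
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch: hard-link the current sstable files into a snapshot
// directory as soon as the validation request arrives, then build the merkle
// tree later from the linked files, pinning the point in time for free.
class SnapshotValidation
{
    static List<Path> snapshotSSTables(Path cfDirectory, String snapshotName) throws IOException
    {
        Path snapshotDir = cfDirectory.resolve("snapshots").resolve(snapshotName);
        Files.createDirectories(snapshotDir);
        List<Path> linked = new ArrayList<>();
        try (Stream<Path> files = Files.list(cfDirectory))
        {
            for (Path sstable : (Iterable<Path>) files::iterator)
            {
                if (!Files.isRegularFile(sstable))
                    continue;
                Path link = snapshotDir.resolve(sstable.getFileName());
                Files.createLink(link, sstable); // hard link: no data is copied
                linked.add(link);
            }
        }
        return linked; // scan these files to build the merkle tree later
    }
}
{code}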
[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054237#comment-13054237 ] Jonathan Ellis commented on CASSANDRA-2816: ---
Supporting actual live-reading of snapshotted sstables is a little more than polishing.