> Thing is, why is it so easy for the repair process to break? OK, I admit I'm
> not sure why nodes are reported as "dead" once in a while, but it's
> absolutely certain that they simply don't fall off the edge, are knocked out
> for 10 min or anything like that. Why is there no built-in tolerance/retry
> mechanism so that a node that may seem silent for a minute can be contacted
> later, or, better yet, a different node with a relevant replica is
> contacted?
>
> As was evident from some presentations at Cassandra-NYC yesterday, failed
> compactions and repairs are a major problem for a number of users. The
> cluster can quickly become unusable. I think it would be a good idea to
> build more robustness into these procedures,

I am trying to argue for removing the failure-detector-kills-repair behavior in
https://issues.apache.org/jira/browse/CASSANDRA-3569, but I don't know whether
that will happen since there is opposition.

However, that only fixes the particular issue you are having right now. There
are significant problems with repair, and the reason there is no retry is
probably that it takes a non-trivial amount of work to make the current repair
process fault-tolerant in the face of TCP connections dying.

Personally, my pet ticket to fix repair once and for all is
https://issues.apache.org/jira/browse/CASSANDRA-2699, which should, at least as
I envisioned it, fix a lot of problems, including making repair much, much more
robust to transient failures. (It would be robust automatically, without
specific code to deal with failures, because repair work would happen piecemeal
and incrementally in a repeating fashion anyway.) Nodes could basically be
going up and down in any wild haywire pattern and things would just continue to
work in the background. Repair would become irrelevant to cluster maintenance,
and you wouldn't really have to think about whether or not someone is
repairing. You would also not have to think about repair vs. gc grace time,
because it would all just sit there and work without intervention.

It's a pretty big ticket though, and not something I'm gonna be working on in
my spare time, so I don't know whether or when I would actually work on it
(depends on priorities). I have the ideas but I can't promise to fix it :)

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)