[jira] [Issue Comment Edited] (CASSANDRA-4032) memtable.updateLiveRatio() is blocking, causing insane latencies for writes
[ https://issues.apache.org/jira/browse/CASSANDRA-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226166#comment-13226166 ]

Peter Schuller edited comment on CASSANDRA-4032 at 3/9/12 4:05 PM:
---

{quote}
Are we sure that what we want is a SynchronousQueue with the task rejected? After all, there is only one global memoryMeter, so we could end up failing to updateLiveRatio just based on a race, even if calculations are fast. I'd suggest instead a bounded queue (but maybe not infinite, and we could indeed just skip the task if that queue gets full).
{quote}

I agree it's fishy, though I'd suggest a separate ticket. This patch is intended to make the code behave the way the original commit intended. This (from the code, not my patch) seems legit though:

{code}
// we're careful to only allow one count to run at a time because counting is slow
// (can be minutes, for a large memtable and a busy server), so we could keep memtables
// alive after they're flushed and would otherwise be GC'd.
{code}

We could have one queue per unique CF and have a consumer that iterates over the set of queues, guaranteeing that each CF gets processed once per cycle. A simpler solution is probably preferable, though, if we can think of one.

memtable.updateLiveRatio() is blocking, causing insane latencies for writes
---

Key: CASSANDRA-4032
URL: https://issues.apache.org/jira/browse/CASSANDRA-4032
Project: Cassandra
Issue Type: Bug
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Fix For: 1.1.0
Attachments: CASSANDRA-4032-1.1.0-v1.txt

Reproduce by just starting a fresh Cassandra with a heap large enough for live ratio calculation (which is {{O(n)}}) to be insanely slow, and then running {{./bin/stress -d host -n1 -t10}}. With a large enough heap and default flushing behavior this is bad enough that stress gets timeouts. Example ("blocked for" is my debug log added around submit()):

{code}
 INFO [MemoryMeter:1] 2012-03-09 15:07:30,857 Memtable.java (line 198) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 8.89014894083727 (just-counted was 8.89014894083727). calculation took 28273ms for 1320245 columns
 WARN [MutationStage:8] 2012-03-09 15:07:30,857 Memtable.java (line 209) submit() blocked for: 231135
{code}

The calling code was written assuming a RejectedExecutionException is thrown, but it isn't, because {{DebuggableThreadPoolExecutor}} installs a blocking rejection handler.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
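The intended non-blocking behavior the comment describes can be sketched as follows. This is not Cassandra's actual executor code; the class and method names are hypothetical, but it shows the mechanism: a single-threaded pool over a SynchronousQueue with the default AbortPolicy rejects a submission immediately (rather than blocking the mutation thread) whenever the lone worker is already busy counting.

```java
import java.util.concurrent.*;

// Sketch (assumed names, not Cassandra's code): one worker, no buffering,
// and AbortPolicy so a busy worker causes an immediate rejection instead
// of blocking the latency-sensitive caller.
public class LiveRatioExecutorSketch {
    public static ThreadPoolExecutor newMeterExecutor() {
        return new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new SynchronousQueue<Runnable>(),         // direct handoff, no queueing
            new ThreadPoolExecutor.AbortPolicy());    // throw instead of block
    }

    // Returns true if the count was scheduled, false if one is already running.
    public static boolean tryUpdateLiveRatio(ThreadPoolExecutor meter, Runnable count) {
        try {
            meter.execute(count);
            return true;
        } catch (RejectedExecutionException e) {
            return false; // skip this recalculation; a count is in progress
        }
    }
}
```

The key point relative to the bug above: {{DebuggableThreadPoolExecutor}} replaces AbortPolicy with a handler that blocks, which silently defeats the caller's try/catch.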
[jira] [Issue Comment Edited] (CASSANDRA-3952) avoid quadratic startup time in LeveledManifest
[ https://issues.apache.org/jira/browse/CASSANDRA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223091#comment-13223091 ]

Peter Schuller edited comment on CASSANDRA-3952 at 3/6/12 8:41 AM:
---

Committed with an additional assertion and the map renamed to {{sstableGenerations}}, and including 1.1.0. It was marked for 1.1.1, but frankly, if this *does* introduce some kind of bug, it feels more dangerous to have that crop up in an upgrade to 1.1.1 than to have it in the initial release.

avoid quadratic startup time in LeveledManifest
---

Key: CASSANDRA-3952
URL: https://issues.apache.org/jira/browse/CASSANDRA-3952
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Jonathan Ellis
Assignee: Dave Brosius
Priority: Minor
Labels: lhf
Fix For: 1.1.0
Attachments: speed_up_level_of.diff

Checking that each sstable is in the manifest on startup is O(N**2) in the number of sstables:

{code}
// ensure all SSTables are in the manifest
for (SSTableReader ssTableReader : cfs.getSSTables())
{
    if (manifest.levelOf(ssTableReader) < 0)
        manifest.add(ssTableReader);
}
{code}

{code}
private int levelOf(SSTableReader sstable)
{
    for (int level = 0; level < generations.length; level++)
    {
        if (generations[level].contains(sstable))
            return level;
    }
    return -1;
}
{code}

Note that the contains call is a linear List.contains. We need to switch to a sorted list and bsearch, or a tree, to support TB-levels of data in LeveledCompactionStrategy.
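The committed fix's idea (a map named {{sstableGenerations}}, per the comment above) can be sketched like this. The class below is a simplified stand-in, not the actual LeveledManifest code: tracking each sstable's level in a hash map makes {{levelOf()}} O(1) instead of a linear scan over every level's list.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch: a type parameter stands in for SSTableReader. Levels
// are recorded in a map so membership/level lookup is constant time.
public class ManifestSketch<SSTable> {
    private final Map<SSTable, Integer> sstableGenerations = new HashMap<SSTable, Integer>();

    public void add(SSTable sstable, int level) {
        sstableGenerations.put(sstable, level);
    }

    // O(1) lookup, matching the old levelOf() contract: -1 when absent.
    public int levelOf(SSTable sstable) {
        Integer level = sstableGenerations.get(sstable);
        return level == null ? -1 : level;
    }
}
```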
[jira] [Issue Comment Edited] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes
[ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217415#comment-13217415 ]

Peter Schuller edited comment on CASSANDRA-3294 at 2/27/12 7:21 PM:
---

{quote}
This sounds like reinventing the existing failure detector to me.
{quote}

Except we don't use it that way at all (see CASSANDRA-3927). Even if we did, I personally think it's totally the wrong solution to this problem, since we have the *perfect* measurement - whether the TCP connection is up. It's fine if we have other information that actively indicates we shouldn't send messages to a node (whether it's the FD or the fact that we have 500,000 messages queued to the node), but if we *know* the TCP connection is down, we should just not send messages to it, period. The only caveat is that we'd of course have to make sure TCP connections are in fact pro-actively kept up under all circumstances (I'd have to look at the code to figure out what issues there are, if any, in detail).

{quote}
The main idea of the algorithm I have mentioned is to make sure that we always do operations (write/read etc.) on the nodes that have the highest probability to be alive, determined by live traffic going there, instead of passively relying on the failure detector.
{quote}

I have an unfiled ticket suggesting we make the proximity sorting probabilistic, to avoid the binary "either we get traffic or we don't" (or "either we get data or we get digest") situation. That would certainly help. As would least-requests-outstanding. You can make this ticket irrelevant by making the general case well-supported enough that there is no reason to special-case this. This was originally filed when we had none of that (and we still don't), and it seemed like a very trivial case to handle when the TCP connection is actively reset by the other side.

{quote}
After reading CASSANDRA-3722 it seems we can implement required logic at the snitch level taking latency measurements into account. I think we can close this one in favor of CASSANDRA-3722 and continue work/discussion there. What do you think, Brandon, Peter?
{quote}

I think CASSANDRA-3722's original premise doesn't address the concerns I see in real life (I don't want special cases trying to communicate that X is happening), but towards the end I start agreeing with the ticket more. In any case, feel free to close if you want. If I ever get to actually implementing this (and if at that point there is no other mechanism that removes the need), I'll just re-file or re-open with a patch. We don't need to track this if others aren't interested.
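The "probabilistic proximity sorting" idea mentioned above could look something like the sketch below. This is entirely hypothetical (the weighting scheme and names are my own, not from any patch): instead of always routing to the single best-scored endpoint, pick one at random with probability inversely proportional to its latency score, so slower replicas still receive some traffic rather than flipping binarily between all and none.

```java
import java.util.Random;

// Hypothetical sketch of probabilistic endpoint selection. scores[i] is a
// latency score for endpoint i (lower = better); selection probability is
// proportional to 1/score (an assumed weighting, chosen for illustration).
public class ProbabilisticSnitchSketch {
    public static int pickEndpoint(double[] scores, Random rng) {
        double[] weights = new double[scores.length];
        double total = 0;
        for (int i = 0; i < scores.length; i++) {
            weights[i] = 1.0 / scores[i]; // inverse-latency weighting
            total += weights[i];
        }
        // Roulette-wheel selection over the weights.
        double r = rng.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r <= 0)
                return i;
        }
        return weights.length - 1; // guard against floating-point rounding
    }
}
```

With scores of 1.0 and 9.0, the faster endpoint is chosen roughly 90% of the time, so the slower node stays warm without taking the bulk of the traffic.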
[jira] [Issue Comment Edited] (CASSANDRA-3722) Send Hints to Dynamic Snitch when Compaction or repair is going on for a node.
[ https://issues.apache.org/jira/browse/CASSANDRA-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217445#comment-13217445 ]

Peter Schuller edited comment on CASSANDRA-3722 at 2/27/12 7:38 PM:
---

I'm -0 on the original bit of this ticket, but +1 on more generic changes that cover the original use case as well if not better anyway. I think that instead of trying to predict exactly the behavior of some particular event like compaction, we should just be better at responding to what is actually going on:

* We have CASSANDRA-2540, which can help avoid blocking uselessly on a dropped or slow request even if we haven't had the opportunity to react to overall behavior yet (I have a partial patch that breaks read repair; I haven't had time to finish it).
* Taking into account the number of outstanding requests is IMO a necessity. There is plenty of precedent for anyone who wants it (least-used-connection policies in various LBs), but more importantly it would so clearly help in several situations, including:
** Sudden GC pause of a node
** Sudden death of a node
** Sudden page cache eviction and slowness of a node, before snitching figures it out
** A constantly overloaded node; even with the dynsnitch it would improve the situation, as the number of requests affected by a dynsnitch reset is lessened
** Packet loss/hiccup/whatever across DCs

There is some potential for foot-shooting in the sense that if a node is broken in a way that makes it respond with incorrect data, but faster than anyone else, it will tend to swallow all the traffic. But honestly, that feels like a minor concern to me based on what I've seen actually happen in production clusters. If we ever start sending non-successes back over inter-node RPC, this would change, however.

My only major concern is the potential performance impact of keeping track of the number of outstanding requests, but if that *does* become a problem one can make it probabilistic - have N% of all requests be tracked. Less impact, but also less immediate response to what's happening. This will also have the side effect of mitigating sudden bursts of promotion into old-gen if we combine it with pro-actively dropping read-repair messages for nodes that are overloaded (effectively prioritizing data reads), hence helping with CASSANDRA-3853.

{quote}
Should we T (send additional requests which are not part of the normal operations) the requests until the other node recovers?
{quote}

In the absence of read repair, we'd have to do speculative reads, as Stu has previously noted. With read repair turned on, this is not an issue, because the node will still receive requests and eventually warm up. Only with read repair turned off do we not send requests to more than the first N endpoints, with N being what is required by the CL.

Semi-relatedly, I think it would be a good idea to make the proximity sorting probabilistic in nature so that we don't do a binary flip back and forth between who gets data vs. digest reads, or who doesn't get reads at all. That might mitigate this problem, but not help fundamentally, since the rate of warm-up would decrease with a node being slow.

I do want to make this point though: *every single production cluster* I have ever been involved with so far has been such that you basically never want to turn read repair off. Not because of read repair itself, but because of the traffic it generates. Having nodes not receive traffic is extremely dangerous under most circumstances, as it leaves nodes cold, only to suddenly explode and cause timeouts and other bad behavior as soon as e.g. some neighbor goes down and the cold node suddenly starts taking traffic. This is an easy way to make production clusters fall over.

If your workload is entirely in memory, or otherwise not reliant on caching, the problem is much less pronounced; but even then I would generally recommend that you keep it turned on, if only because your nodes will have to be able to take the additional load *anyway* if you are to survive other nodes in the neighborhood going down. It just makes clusters much easier to reason about.
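The "number of outstanding requests" bookkeeping argued for above can be sketched simply. This is a hypothetical illustration (names and structure are my own, not from any Cassandra patch): a counter per endpoint, incremented on send and decremented on response or timeout, with selection preferring the endpoint with the fewest requests in flight.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical least-outstanding-requests selection. Endpoints are indexed
// 0..n-1; callers report sends and completions, and pick() returns the
// endpoint with the fewest requests currently in flight.
public class LeastOutstandingSketch {
    private final AtomicInteger[] inFlight;

    public LeastOutstandingSketch(int endpoints) {
        inFlight = new AtomicInteger[endpoints];
        for (int i = 0; i < endpoints; i++)
            inFlight[i] = new AtomicInteger();
    }

    public int pick() {
        int best = 0;
        for (int i = 1; i < inFlight.length; i++)
            if (inFlight[i].get() < inFlight[best].get())
                best = i;
        return best;
    }

    public void sent(int endpoint)      { inFlight[endpoint].incrementAndGet(); }
    public void completed(int endpoint) { inFlight[endpoint].decrementAndGet(); }
}
```

A GC-paused or dead node stops completing requests, so its counter climbs and it naturally stops being picked - no prediction of the cause required. The "track only N% of requests" mitigation from the comment would sample which sends update the counters.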
[jira] [Issue Comment Edited] (CASSANDRA-3797) StorageProxy static initialization not triggered until thrift requests come in
[ https://issues.apache.org/jira/browse/CASSANDRA-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217491#comment-13217491 ]

Peter Schuller edited comment on CASSANDRA-3797 at 2/27/12 8:09 PM:
---

Looks like {{3797-forname.txt}} is the same file as the original patch. In any case, supposing we just go for Class.forName() to avoid introducing that annoying method, and assuming it makes the metrics from CASSANDRA-3671 work, can I get a +1?

StorageProxy static initialization not triggered until thrift requests come in
--

Key: CASSANDRA-3797
URL: https://issues.apache.org/jira/browse/CASSANDRA-3797
Project: Cassandra
Issue Type: Bug
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Minor
Fix For: 1.1.0
Attachments: 3797-forname.txt, CASSANDRA-3797-trunk-v1.txt

While plugging in the metrics library for CASSANDRA-3671, I realized (because the metrics library was trying to add a shutdown hook on metric creation) that starting Cassandra and simply shutting it down causes StorageProxy to not be initialized until the drain shutdown hook. Effects:

* The StorageProxy mbean is missing in visualvm/jconsole after initial startup (seriously, I thought I was going nuts ;))
* In general, anything that makes assumptions about running early, or at least not during JVM shutdown, such as the metrics library, will be problematic
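The Class.forName() approach discussed above relies on a JVM rule: a class's static initializer does not run until the class is first actively used, and Class.forName() counts as such a use. The minimal illustration below uses hypothetical names ("Holder" stands in for StorageProxy; the log stands in for mbean/metric registration):

```java
// Illustration of forcing static initialization via Class.forName().
public class ForNameSketch {
    public static final StringBuilder log = new StringBuilder();

    static class Holder {
        // Runs only when Holder is initialized, e.g. mbean/metric registration.
        static { log.append("init;"); }
    }

    public static void forceInit() {
        try {
            // A class literal (Holder.class) does NOT initialize the class;
            // Class.forName with initialize=true does.
            Class.forName(Holder.class.getName(), true, ForNameSketch.class.getClassLoader());
        } catch (ClassNotFoundException e) {
            throw new AssertionError(e);
        }
    }
}
```

Without such a call, StorageProxy's static block (and hence its mbean) only runs once the first thrift request touches the class - which is exactly the reported bug.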
[jira] [Issue Comment Edited] (CASSANDRA-3912) support incremental repair controlled by external agent
[ https://issues.apache.org/jira/browse/CASSANDRA-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209117#comment-13209117 ]

Peter Schuller edited comment on CASSANDRA-3912 at 2/16/12 5:22 AM:
---

Agreed. The good news is that the actual commands necessary ({{getprimaryrange}} and {{repairrange}}) are easy patches. The bad news is that it turns out the AntiEntropyService does not support arbitrary ranges. Attaching {{CASSANDRA\-3912\-v2\-001\-add\-nodetool\-commands.txt}} and {{CASSANDRA\-3912\-v2\-002\-fix\-antientropyservice.txt}}.

Had it not been for AES, I'd want to propose we commit this to 1.1, since it would be additive only; but given the AES fix, I don't know... I guess probably not? It's a shame, because I think it would be a boon to users with large nodes struggling with repair (despite the fact that, as you point out, each repair implies a flush).

support incremental repair controlled by external agent
---

Key: CASSANDRA-3912
URL: https://issues.apache.org/jira/browse/CASSANDRA-3912
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Attachments: CASSANDRA-3912-trunk-v1.txt, CASSANDRA-3912-v2-001-add-nodetool-commands.txt, CASSANDRA-3912-v2-002-fix-antientropyservice.txt

As a poor man's precursor to CASSANDRA-2699, exposing the ability to repair small parts of a range is extremely useful because it allows one (with external scripting logic) to slowly repair a node's content over time. Other than avoiding the bulkiness of complete repairs, it means that you can safely do repairs even if you absolutely cannot afford e.g. disk space spikes (see CASSANDRA-2699 for what the issues are).

Attaching a patch that exposes a repairincremental command to nodetool, where you specify a step and the number of total steps. Incrementally performing a repair in 100 steps, for example, would be done by:

{code}
nodetool repairincremental 0 100
nodetool repairincremental 1 100
...
nodetool repairincremental 99 100
{code}

An external script can be used to keep track of what has been repaired and when. This should (1) allow incremental repair to happen now/soon, and (2) allow experimentation and evaluation for an implementation of CASSANDRA-2699, which I still think is a good idea. This patch does nothing to help the average deployment, but it at least makes incremental repair possible given sufficient effort spent on external scripting.

The big no-no about the patch is that it is entirely specific to RandomPartitioner and BigIntegerToken. If someone can suggest a way to implement this command generically using the Range/Token abstractions, I'd be happy to hear suggestions. An alternative would be to provide a nodetool command that allows you to simply specify the specific token ranges on the command line. It makes using it a bit more difficult, but would mean that it works for any partitioner and token type. Unless someone can suggest a better way to do this, I think I'll provide a patch that does this. I'm still leaning towards supporting the simple "step N out of M" form, though.
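The "step N out of M" token math for RandomPartitioner (whose token space is the integers in [0, 2^127)) can be sketched with BigInteger. This is an illustrative sketch of the range-splitting arithmetic, not the patch's actual code: step i of M covers tokens in [i * 2^127 / M, (i+1) * 2^127 / M), with the last step absorbing any division remainder so the union of all steps is the full ring.

```java
import java.math.BigInteger;

// Sketch of splitting RandomPartitioner's token space [0, 2^127) into
// "total" contiguous steps for incremental repair.
public class IncrementalRepairSketch {
    private static final BigInteger MAX = BigInteger.valueOf(2).pow(127);

    // Returns {startToken, endToken} (end exclusive) for 0-based step i of total.
    public static BigInteger[] stepRange(int i, int total) {
        BigInteger t = BigInteger.valueOf(total);
        BigInteger start = MAX.multiply(BigInteger.valueOf(i)).divide(t);
        BigInteger end = (i == total - 1)
            ? MAX  // last step absorbs the remainder
            : MAX.multiply(BigInteger.valueOf(i + 1)).divide(t);
        return new BigInteger[] { start, end };
    }
}
```

Multiplying before dividing keeps the boundaries exact, so consecutive steps share a boundary token and no range is repaired twice or skipped.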
[jira] [Issue Comment Edited] (CASSANDRA-3892) improve TokenMetadata abstraction, naming - audit current uses
[ https://issues.apache.org/jira/browse/CASSANDRA-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206391#comment-13206391 ]

Peter Schuller edited comment on CASSANDRA-3892 at 2/12/12 10:39 AM:
---

Attaching {{CASSANDRA\-3892\-draft.txt}}, which is a draft/work in progress. Mainly I'm asking for a "stop right there" if these types of changes seem like something that will never be accepted (they're semi-significant, even though most of the patch constitutes non-functional changes). I'm not asking nor suggesting careful review, as it's better that I submit a more finished patch before that happens. Any requests for patch-splitting strategies, or overall "don't do this/don't do that", would be helpful though, if someone has them.

Other than what's there in the current version, I want to move pending range calculation into the token metadata (it will need to be given a strategy), and I want things like {{StorageService.handleStateNormal()}} being responsible for keeping the internal state of the token metadata (removing from moving) up to date to be gone. I've begun making naming and concepts a bit more consistent; the token metadata now more consistently (but not fully yet) talks about endpoints as the main abstraction, rather than mixing endpoints and tokens, and we have joining endpoints instead of bootstrap tokens. Moving endpoints is now also a map with O(n) access, and kept up to date in {{removeEndpoint()}} (there may be other places that need fixing). I adjusted the comments for {{calculatePendingRanges}} to be clearer; for example, the old comments made it sound like we were sending writes to places for good measure because we're in doubt, rather than because it is strictly necessary.

Unless I hear objections, I'll likely continue this on Sunday and submit another patch.

improve TokenMetadata abstraction, naming - audit current uses
--

Key: CASSANDRA-3892
URL: https://issues.apache.org/jira/browse/CASSANDRA-3892
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Attachments: CASSANDRA-3892-draft.txt

CASSANDRA-3417 has some background. I want to make the distinction clearer between looking at the ring from different perspectives (reads, writes, others) and adjust naming to be clearer on this. I also want to go through each use case and try to spot any subtle pre-existing bugs like the ones I almost introduced in CASSANDRA-3417, had Jonathan not caught me. I will submit a patch soonish.
[jira] [Issue Comment Edited] (CASSANDRA-3892) improve TokenMetadata abstraction, naming - audit current uses
[ https://issues.apache.org/jira/browse/CASSANDRA-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206596#comment-13206596 ]

Peter Schuller edited comment on CASSANDRA-3892 at 2/13/12 1:00 AM:
---

Attaching {{CASSANDRA\-3892\-draft\-v2.txt}} with some more changes. I still consider it a draft because I have not yet done any testing, but it's riper for review now. A few of the sub-tasks I created are IMO serious as well.

improve TokenMetadata abstraction, naming - audit current uses
--

Key: CASSANDRA-3892
URL: https://issues.apache.org/jira/browse/CASSANDRA-3892
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Attachments: CASSANDRA-3892-draft-v2.txt, CASSANDRA-3892-draft.txt

CASSANDRA-3417 has some background. I want to make the distinction clearer between looking at the ring from different perspectives (reads, writes, others) and adjust naming to be clearer on this. I also want to go through each use case and try to spot any subtle pre-existing bugs like the ones I almost introduced in CASSANDRA-3417, had Jonathan not caught me. I will submit a patch soonish.
[jira] [Issue Comment Edited] (CASSANDRA-3897) StorageService.onAlive() only schedules hints for joined endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206600#comment-13206600 ]

Peter Schuller edited comment on CASSANDRA-3897 at 2/13/12 1:08 AM:

Why would this be the case? They are supposed to receive writes; I see no reason why hints should not be delivered. Hints are just a way to deliver writes more quickly in cases where nodes are down (i.e., more quickly once they come back up) and to avoid the need for AES. I don't see why a node actively bootstrapping into the ring should be discriminated against, in terms of seeing as reliable a delivery of writes as other nodes. In other words, I don't buy your first sentence unless you explain why. I don't accept it axiomatically :) Obviously sending hints requires that hints are *there* first too, but the same argument applies. If a node is supposed to see certain writes and it's considered down - hint it. Statistically I can see the argument that if a node is bootstrapping and down, it might in practice be more likely that the node is just going to be down for a longer period, and/or that the node will completely re-bootstrap anyway (since normally a node is down because it's being restarted, which would imply re-bootstrap if the node is bootstrapping).

StorageService.onAlive() only schedules hints for joined endpoints
--

Key: CASSANDRA-3897
URL: https://issues.apache.org/jira/browse/CASSANDRA-3897
Project: Cassandra
Issue Type: Sub-task
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Minor

It seems incorrect not to do hint delivery for nodes that are bootstrapping, as that would cause sudden spikes in read repair need or inconsistent reads when a node joins the ring. Particularly if the user is expecting to rely on the new hinted handoff code making AES much less needed. It would be a POLA violation for bootstrapping nodes to be an exception to that.
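To make the argued-for change concrete, here is a minimal sketch of the two behaviors. All class, field, and method names below are made up for illustration; this is not the actual StorageService/HintedHandOffManager code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class OnAliveSketch {
    enum State { JOINED, BOOTSTRAPPING }

    final Map<String, State> endpointState = new HashMap<>();
    final Set<String> endpointsWithHints = new HashSet<>();
    final List<String> deliveriesScheduled = new ArrayList<>();

    // Behavior described in the ticket title: only joined endpoints
    // get hint delivery scheduled when they come alive.
    void onAliveJoinedOnly(String ep) {
        if (endpointState.get(ep) == State.JOINED && endpointsWithHints.contains(ep))
            deliveriesScheduled.add(ep);
    }

    // Behavior argued for in the comment: if the endpoint is supposed to
    // see writes and hints are pending for it, schedule delivery
    // regardless of whether it has finished joining the ring.
    void onAliveAnyWriteTarget(String ep) {
        if (endpointsWithHints.contains(ep))
            deliveriesScheduled.add(ep);
    }
}
```

The only difference is dropping the joined-state check: a bootstrapping endpoint with pending hints gets delivery scheduled instead of being skipped.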
[jira] [Issue Comment Edited] (CASSANDRA-3895) Gossiper.doStatusCheck() uses isMember() suspiciously
[ https://issues.apache.org/jira/browse/CASSANDRA-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206627#comment-13206627 ]

Peter Schuller edited comment on CASSANDRA-3895 at 2/13/12 2:29 AM:

{quote} If fat clients disappear, no one really cares because they were never ring members. {quote}

Ok. Well, they care in the joining (bootstrapping) node case since they are taking writes. But all nodes will naturally drop them by themselves without needing others to propagate. As long as we consider aVeryLongTime (3 days) long enough that it's safe to assume this only triggers in legitimate cases, we're fine (if not, we won't be as soon as CASSANDRA-3892 is fixed - or if I am wrong, it's not a bug to begin with). I'll resolve this as INVALID then, though I'm skeptical about the 3 day "very long time".

Gossiper.doStatusCheck() uses isMember() suspiciously
-

Key: CASSANDRA-3895
URL: https://issues.apache.org/jira/browse/CASSANDRA-3895
Project: Cassandra
Issue Type: Sub-task
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Minor

There is code for fat client removal and old endpoint (non-fat) removal which uses {{TokenMetadata.isMember()}}, which only considers nodes that are joined (taking reads) in the cluster. aVeryLongTime is set to 3 days. I could very well be wrong, but the fat client identification code, the way I interpret it, is using isMember() to check basically whether a node is part of the cluster (in the most vague/broad sense) in order to differentiate a real node (part of the cluster) from just a fat client. But a node that is bootstrapping is not a fat client, nor will it be a member according to isMember(). I'm also a bit scared of, even in the case of there not being a fat client identification, simply forgetting an endpoint. It seems that an operator request should be relied upon to actively forget an endpoint (i.e., forced remove token).
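For illustration, the distinction being probed can be sketched like this. The states, names, and threshold handling are assumptions for the sketch, not Gossiper's actual fields:

```java
public class StatusCheckSketch {
    // A fat client gossips but never owns a token; a bootstrapping node is
    // not yet joined (so isMember() is false for both) but *does* claim a
    // token. Conflating the two is the suspicion raised in the ticket.
    enum State { JOINED, BOOTSTRAPPING, FAT_CLIENT }

    static boolean looksLikeFatClient(State s, boolean hasToken) {
        return s == State.FAT_CLIENT && !hasToken;
    }

    // Only auto-forget endpoints we are confident are fat clients; anything
    // holding a token should require an explicit operator removal instead
    // of being silently dropped after aVeryLongTime.
    static boolean safeToForget(State s, boolean hasToken, long quietMillis, long aVeryLongTime) {
        return looksLikeFatClient(s, hasToken) && quietMillis > aVeryLongTime;
    }
}
```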
[jira] [Issue Comment Edited] (CASSANDRA-3735) Fix Unable to create hard link SSTableReaderTest error messages
[ https://issues.apache.org/jira/browse/CASSANDRA-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200642#comment-13200642 ]

Peter Schuller edited comment on CASSANDRA-3735 at 2/5/12 2:54 AM:
---

Attaching new version of 0002* that works (but still with the left-overs already mentioned by jbellis/sylvain) post CASSANDRA-2794.

Fix Unable to create hard link SSTableReaderTest error messages
-

Key: CASSANDRA-3735
URL: https://issues.apache.org/jira/browse/CASSANDRA-3735
Project: Cassandra
Issue Type: Bug
Reporter: Jonathan Ellis
Assignee: Jonathan Ellis
Priority: Minor
Attachments: 0001-fix-generation-update-in-loadNewSSTables.patch, 0002-remove-incremental-backups-before-reloading-sstables-v2.patch, 0002-remove-incremental-backups-before-reloading-sstables.patch

Sample failure (on Windows):
{noformat}
[junit] java.io.IOException: Exception while executing the command: cmd /c mklink /H C:\Users\Jonathan\projects\cassandra\git\build\test\cassandra\data\Keyspace1\backups\Standard1-hc-1-Index.db c:\Users\Jonathan\projects\cassandra\git\build\test\cassandra\data\Keyspace1\Standard1-hc-1-Index.db, command error Code: 1, command output: Cannot create a file when that file already exists.
[junit]
[junit] at org.apache.cassandra.utils.CLibrary.exec(CLibrary.java:213)
[junit] at org.apache.cassandra.utils.CLibrary.createHardLinkWithExec(CLibrary.java:188)
[junit] at org.apache.cassandra.utils.CLibrary.createHardLink(CLibrary.java:151)
[junit] at org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:833)
[junit] at org.apache.cassandra.db.DataTracker$1.runMayThrow(DataTracker.java:161)
[junit] at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
[junit] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
[junit] at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
[junit] at java.util.concurrent.FutureTask.run(FutureTask.java:138)
[junit] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
[junit] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
[junit] at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
[junit] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
[junit] at java.lang.Thread.run(Thread.java:662)
[junit] ERROR 17:10:17,111 Fatal exception in thread Thread[NonPeriodicTasks:1,5,main]
{noformat}
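The patch names suggest the fix direction: remove stale incremental backups before re-linking, so a leftover file from a previous run cannot trigger "file already exists". Here is a hedged sketch of that idea using java.nio rather than the CLibrary/mklink path the test actually exercises (method names are mine, for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class HardLinkSketch {
    // Delete any stale copy of the target before creating the hard link,
    // so repeated runs over the same backup directory are idempotent.
    static void createHardLinkSafely(Path source, Path target) throws IOException {
        Files.createDirectories(target.getParent());
        Files.deleteIfExists(target);      // stale file from an earlier run
        Files.createLink(target, source);  // NIO equivalent of `mklink /H`
    }

    // Demo: linking the same file twice does not throw, unlike a bare createLink().
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("hardlink-sketch");
            Path src = Files.write(dir.resolve("Index.db"), new byte[] { 1, 2, 3 });
            Path tgt = dir.resolve("backups").resolve("Index.db");
            createHardLinkSafely(src, tgt);
            createHardLinkSafely(src, tgt); // would fail without deleteIfExists
            return Files.exists(tgt);
        } catch (IOException e) {
            return false;
        }
    }
}
```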
[jira] [Issue Comment Edited] (CASSANDRA-3831) scaling to large clusters in GossipStage impossible due to calculatePendingRanges
[ https://issues.apache.org/jira/browse/CASSANDRA-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200251#comment-13200251 ]

Peter Schuller edited comment on CASSANDRA-3831 at 2/4/12 1:25 AM:
---

I am attaching {{CASSANDRA-3831-memoization-not-for-inclusion.txt}} as an FYI and in case it helps others. It's against 0.8, and implements memoization of calculatePendingRanges. The correct/clean fix is probably to change behavior so that it doesn't get called unnecessarily to begin with (and to make sure the computational complexity is reasonable when it does get called). This patch was made specifically to address the production issue we are having in a minimally dangerous fashion, and is not to be taken as a suggested fix.
scaling to large clusters in GossipStage impossible due to calculatePendingRanges
--

Key: CASSANDRA-3831
URL: https://issues.apache.org/jira/browse/CASSANDRA-3831
Project: Cassandra
Issue Type: Bug
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Critical
Attachments: CASSANDRA-3831-memoization-not-for-inclusion.txt

(most observations below are from 0.8, but I just now tested on trunk and I can trigger this problem *just* by bootstrapping a ~180 node cluster concurrently, presumably due to the number of nodes that are simultaneously in bootstrap state)

It turns out that:
* (1) calculatePendingRanges is not just expensive, it's computationally complex - cubic or worse
* (2) it gets called *NOT* just once per node being bootstrapped/leaving etc, but is called repeatedly *while* nodes are in these states

As a result, clusters start exploding when you start reaching 100-300 nodes. The GossipStage will get backed up because a single calculatePendingRanges takes seconds, and depending on what the average heartbeat interval is in relation to this, this can lead to *massive* cluster-wide flapping. This all started because we hit this in production; several nodes would start flapping several other nodes as down, with many nodes seeing the entire cluster, or a large portion of it, as down. Logging in to some of these nodes you would see that they would be constantly flapping up/down for minutes at a time until one became lucky and it stabilized. In the end we had to perform an emergency full-cluster restart with gossip patched to force-forget certain nodes in bootstrapping state. I can't go into all details here from the post-mortem (just the write-up would take a day), but in short:
* We graphed the number of hosts in the cluster that had more than 5 Down (in a cluster that should have 0 down) on a minutely timeline.
* We also graphed the number of hosts in the cluster that had GossipStage backed up.
* The two graphs correlated *extremely* well
* jstack sampling showed it being CPU bound doing mostly sorting under calculatePendingRanges
* We were never able to exactly reproduce it with normal RING_DELAY and gossip intervals, even on a 184 node cluster (the production cluster is around 180).
* Dropping RING_DELAY and in particular dropping the gossip interval to 10 ms instead of 1000 ms, we were able to observe all of the behavior we saw in production.

So our steps to reproduce are:
* Launch a 184 node cluster w/ gossip interval at 10ms and RING_DELAY at 1 second.
* Do something like: {{while [ 1 ] ; do date ; echo decom ; nodetool decommission ; date ; echo done leaving decommed for a while ; sleep 3 ; date ; echo done restarting; sudo rm -rf /data/disk1/commitlog/* ; sudo rm -rf /data/diskarray/tables/* ; sudo monit restart cassandra ;date ; echo restarted waiting for a while ; sleep 40; done}} (or just do a manual decom/bootstrap once, it triggers every time)
* Watch all nodes flap massively and not recover at all, or maybe after a *long* time.

I observed the flapping using a python script that every 5 seconds (randomly spread out) asked for unreachable nodes from *all* nodes in the cluster, and printed any nodes and their counts when they had more than 5 unreachables. The cluster can be observed instantly going into massive flapping when leaving/bootstrap is initiated. Script
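The memoization approach in the attached not-for-inclusion patch can be sketched roughly as follows. The ring-version parameter is a stand-in I introduced for illustration; the real patch keys off whatever state calculatePendingRanges actually depends on:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

public class PendingRangesMemo {
    // Recompute pending ranges only when the ring/token state it depends on
    // has changed, instead of on every gossip-driven call.
    private long lastSeenVersion = -1;
    private Map<String, List<String>> cached = Collections.emptyMap();
    int computations = 0; // exposed for the demo/test below

    Map<String, List<String>> pendingRanges(long ringVersion,
                                            Supplier<Map<String, List<String>>> expensiveCalc) {
        if (ringVersion != lastSeenVersion) {
            cached = expensiveCalc.get(); // the cubic-or-worse work happens only here
            lastSeenVersion = ringVersion;
            computations++;
        }
        return cached;
    }
}
```

Repeated calls at the same ring version return the cached result, so GossipStage no longer pays seconds of CPU per heartbeat while nodes sit in bootstrap/leaving state.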
[jira] [Issue Comment Edited] (CASSANDRA-3820) Columns missing after upgrade from 0.8.5 to 1.0.7.
[ https://issues.apache.org/jira/browse/CASSANDRA-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197496#comment-13197496 ]

Peter Schuller edited comment on CASSANDRA-3820 at 2/1/12 1:53 AM:
---

Check whether the .bf files contain all zeroes above roughly 235 mb or so. If you have lots of rows, your BF will be that large. We encountered a bug internally whereby all bloom filters larger than 2^31 bits were large on disk, but everything after the first 2^31 bits was all zeroes. Unfortunately I don't know whether this is specific to patches made to our branch, and I have been so busy I haven't been able to follow up to figure out whether it affects the upstream version. But - just tail -c 1000 | hexdump. If you only have zeroes, this is the bug. Make sure to tail on a large .bf file (take the largest, easiest).

Columns missing after upgrade from 0.8.5 to 1.0.7.
--

Key: CASSANDRA-3820
URL: https://issues.apache.org/jira/browse/CASSANDRA-3820
Project: Cassandra
Issue Type: Bug
Affects Versions: 1.0.7
Reporter: Jason Harvey

After an upgrade, one of our CFs had a lot of rows with missing columns. I've been able to reproduce in test conditions. Working on getting the tables to DataStax (data is private).

0.8 results:
{code}
[default@reddit] get CommentVote[36353467625f6837336f32];
=> (column=date, value=313332333932323930392e3531, timestamp=1323922909506508)
=> (column=ip, value=REDACTED, timestamp=1327048432717348, ttl=2592000)
=> (column=name, value=31, timestamp=1327048433000740)
=> (column=REDACTED, value=30, timestamp=1323922909506432)
=> (column=thing1_id, value=REDACTED, timestamp=1323922909506475)
=> (column=thing2_id, value=REDACTED, timestamp=1323922909506486)
=> (column=REDACTED, value=31, timestamp=1323922909506518)
=> (column=REDACTED, value=30, timestamp=1323922909506497)
{code}

1.0 results:
{code}
[default@reddit] get CommentVote[36353467625f6837336f32];
=> (column=ip, value=REDACTED, timestamp=1327048432717348, ttl=2592000)
=> (column=name, value=31, timestamp=1327048433000740)
{code}

A few notes:
* The rows with missing data were fully restored after scrubbing the sstables.
* The row which I reproduced on happened to be split across multiple sstables.
* When I copied the first sstable I found the row on, I was able to 'list' rows from the sstable, but any and all 'get' calls failed.
* These SSTables were natively created on 0.8.5; they did not come from any previous upgrade.
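The suggested {{tail -c 1000 | hexdump}} check can also be expressed programmatically. A hedged sketch (the method and class names are mine; it only checks whether the last n bytes of a file are all zero, which for a Bloom filter larger than 2^31 bits would indicate the truncation bug described above):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class BloomFilterTailCheck {
    // Equivalent of `tail -c n FILE | hexdump` followed by eyeballing for
    // zeroes: returns true iff the last n bytes of the file are all 0x00.
    static boolean tailIsAllZeroes(String bfPath, int n) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(bfPath, "r")) {
            long start = Math.max(0, f.length() - n);
            f.seek(start);
            byte[] buf = new byte[(int) (f.length() - start)];
            f.readFully(buf);
            for (byte b : buf)
                if (b != 0)
                    return false;
            return true;
        }
    }
}
```

As with the shell version, run it against the largest .bf file available; a small filter legitimately ending in zeroes would be a false positive.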
[jira] [Issue Comment Edited] (CASSANDRA-3670) provide red flags JMX instrumentation
[ https://issues.apache.org/jira/browse/CASSANDRA-3670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13195216#comment-13195216 ]

Peter Schuller edited comment on CASSANDRA-3670 at 1/27/12 11:13 PM:
-

CodaHale Metrics being evaluated in CASSANDRA-3671. If there's a +1 there, will go for the same here.

provide red flags JMX instrumentation
---

Key: CASSANDRA-3670
URL: https://issues.apache.org/jira/browse/CASSANDRA-3670
Project: Cassandra
Issue Type: Improvement
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Minor

As discussed in CASSANDRA-3641, it would be nice to expose through JMX certain information which is almost without exception indicative of something being wrong with the node or cluster. In the CASSANDRA-3641 case, it was the detection of corrupt counter shards. Other examples include:
* Number of times the selection of files to compact was adjusted due to disk space heuristics
* Number of times compaction has failed
* Any I/O error reading from or writing to disk (the work here is collecting, not exposing, so maybe not in an initial version)
* Any data skipped due to checksum mismatches (when checksumming is being used); e.g., number of skips.
* Any arbitrary exception, at least in certain code paths (compaction, scrub, cleanup for starters)

Probably other things. The motivation is that if we have clear and obvious indications that something truly is wrong, it seems suboptimal to just leave that information in the log somewhere, for someone to discover later when something else broke as a result and a human investigates. You might argue that one should use non-trivial log analysis to detect these things, but I highly doubt a lot of people do this, and it seems very wasteful to require that in comparison to just providing the MBean.

It is important to note that the *lack* of a certain problem being advertised in this MBean is not supposed to be indicative of a *lack* of a problem. Rather, the point is that, to the extent we can easily do so, it is nice to have a clear method of communicating to monitoring systems where there *is* a clear indication of something being wrong. The main part of this ticket is not to cover everything under the sun, but rather to reach agreement on adding an MBean where these types of indicators can be collected. Individual counters can then be added over time as one thinks of them. I propose:
* Create an org.apache.cassandra.db.RedFlags MBean
* Populate with a few things to begin with.

I'll submit the patch if there is agreement.
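A minimal sketch of what the proposed MBean could look like. The ticket only proposes the class name org.apache.cassandra.db.RedFlags; the attributes and recording methods below are illustrative guesses, not a committed API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Standard MBean convention: the management interface is the class name
// plus "MBean". Attributes are monotonically increasing counters that
// should only ever be non-zero when something is clearly wrong.
interface RedFlagsMBean {
    long getCompactionFailures();
    long getChecksumMismatchesSkipped();
}

public class RedFlags implements RedFlagsMBean {
    private final AtomicLong compactionFailures = new AtomicLong();
    private final AtomicLong checksumMismatchesSkipped = new AtomicLong();

    // Called from the relevant code paths (compaction, read path, etc.)
    // when a red-flag event occurs.
    public void recordCompactionFailure() { compactionFailures.incrementAndGet(); }
    public void recordChecksumSkip() { checksumMismatchesSkipped.incrementAndGet(); }

    @Override public long getCompactionFailures() { return compactionFailures.get(); }
    @Override public long getChecksumMismatchesSkipped() { return checksumMismatchesSkipped.get(); }
}
```

Registration would then be a one-liner against ManagementFactory.getPlatformMBeanServer() with an ObjectName such as "org.apache.cassandra.db:type=RedFlags", after which any JMX-based monitoring system can alert on non-zero values.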
[jira] [Issue Comment Edited] (CASSANDRA-3070) counter repair
[ https://issues.apache.org/jira/browse/CASSANDRA-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13166617#comment-13166617 ]

Peter Schuller edited comment on CASSANDRA-3070 at 12/9/11 10:19 PM:
-

This may be relevant, quoting myself from IRC:
{code}
21:20:01 scode pcmanus: Hey, are you there?
21:20:21 scode pcmanus: I am investigating something which might be https://issues.apache.org/jira/browse/CASSANDRA-3070
21:20:37 scode pcmanus: And I could use the help of someone with his brain all over counters, and Stu isn't here atm. :)
21:21:16 scode pcmanus: https://gist.github.com/8202cb46c8bd00c8391b
21:21:37 scode pcmanus: I am investigating why with CL.ALL and CL.QUORUM, I get seemingly random/varying results when I read a counter.
21:21:53 scode pcmanus: I have the offending sstables on a three-node test setup and am inserting debug printouts in the code to trace the reconciliation.
21:21:57 scode pcmanus: The gist above shows what's happening.
21:22:11 scode pcmanus: The latter is the wrong one, and the former is the correct one.
21:22:28 scode pcmanus: The interesting bit is that I see shards with the same node_id *AND* clock, but *DIFFERENT* counts.
21:22:53 scode pcmanus: My understanding of counters is that there should never (globally across an entire cluster in all sstables) exist two shards for the same node_id+clock but with different counts.
21:22:57 scode pcmanus: Is my understanding correct there?
21:25:10 scode pcmanus: There is one node out of the three that has the offending shard (with a count of 2 instead of 1).
{code}
Like with 3070, we observed this after having expanded a cluster (though I'm not sure how that would cause it, and we don't know if there existed a problem before the expansion).