[jira] [Issue Comment Edited] (CASSANDRA-4032) memtable.updateLiveRatio() is blocking, causing insane latencies for writes
[ https://issues.apache.org/jira/browse/CASSANDRA-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226166#comment-13226166 ]

Peter Schuller edited comment on CASSANDRA-4032 at 3/9/12 4:05 PM:
---

{quote}
Are we sure that what we want is a SynchronousQueue with the task rejected? After all, there is only one global memoryMeter, so we could end up failing to updateLiveRatio just based on a race, even if calculations are fast. I'd suggest instead a bounded queue (but maybe not infinite, and we could indeed just skip the task if that queue gets full).
{quote}

I agree it's fishy, though I'd suggest a separate ticket. This patch is intended to make the code behave the way the original commit intended. This (from the code, not my patch) seems legit though:

{code}
// we're careful to only allow one count to run at a time because counting is slow
// (can be minutes, for a large memtable and a busy server), so we could keep memtables
// alive after they're flushed and would otherwise be GC'd.
{code}

We could have one queue per unique CF and have a consumer that iterates over the set of queues, guaranteeing that each CF gets processed once per cycle. A simpler solution is probably preferable, though, if we can think of one.

memtable.updateLiveRatio() is blocking, causing insane latencies for writes
---

Key: CASSANDRA-4032
URL: https://issues.apache.org/jira/browse/CASSANDRA-4032
Project: Cassandra
Issue Type: Bug
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Fix For: 1.1.0
Attachments: CASSANDRA-4032-1.1.0-v1.txt

Reproduce by just starting a fresh Cassandra with a heap large enough for live ratio calculation (which is {{O(n)}}) to be insanely slow, and then running {{./bin/stress -d host -n1 -t10}}. With a large enough heap and default flushing behavior this is bad enough that stress gets timeouts. Example ("blocked for" is my debug log added around submit()):

{code}
 INFO [MemoryMeter:1] 2012-03-09 15:07:30,857 Memtable.java (line 198) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 8.89014894083727 (just-counted was 8.89014894083727). calculation took 28273ms for 1320245 columns
 WARN [MutationStage:8] 2012-03-09 15:07:30,857 Memtable.java (line 209) submit() blocked for: 231135
{code}

The calling code was written assuming a RejectedExecutionException is thrown, but it isn't, because {{DebuggableThreadPoolExecutor}} installs a blocking rejection handler.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
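The intended non-blocking behavior the comment describes can be sketched as follows. This is not Cassandra's actual executor code; the class and method names are hypothetical, but it shows the mechanism: a single-threaded pool over a SynchronousQueue with the default AbortPolicy rejects a submission immediately (rather than blocking the mutation thread) whenever the lone worker is already busy counting.

```java
import java.util.concurrent.*;

// Sketch (assumed names, not Cassandra's code): one worker, no buffering,
// and AbortPolicy so a busy worker causes an immediate rejection instead
// of blocking the latency-sensitive caller.
public class LiveRatioExecutorSketch {
    public static ThreadPoolExecutor newMeterExecutor() {
        return new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new SynchronousQueue<Runnable>(),         // direct handoff, no queueing
            new ThreadPoolExecutor.AbortPolicy());    // throw instead of block
    }

    // Returns true if the count was scheduled, false if one is already running.
    public static boolean tryUpdateLiveRatio(ThreadPoolExecutor meter, Runnable count) {
        try {
            meter.execute(count);
            return true;
        } catch (RejectedExecutionException e) {
            return false; // skip this recalculation; a count is in progress
        }
    }
}
```

The key point relative to the bug above: {{DebuggableThreadPoolExecutor}} replaces AbortPolicy with a handler that blocks, which silently defeats the caller's try/catch.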
[jira] [Issue Comment Edited] (CASSANDRA-3952) avoid quadratic startup time in LeveledManifest
[ https://issues.apache.org/jira/browse/CASSANDRA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223091#comment-13223091 ]

Peter Schuller edited comment on CASSANDRA-3952 at 3/6/12 8:41 AM:
---

Committed with an additional assertion and the map renamed to {{sstableGenerations}}, and including 1.1.0. It was marked for 1.1.1, but frankly, if this *does* introduce some kind of bug, it feels more dangerous to have that crop up in an upgrade to 1.1.1 than to have it in the initial release.

avoid quadratic startup time in LeveledManifest
---

Key: CASSANDRA-3952
URL: https://issues.apache.org/jira/browse/CASSANDRA-3952
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Jonathan Ellis
Assignee: Dave Brosius
Priority: Minor
Labels: lhf
Fix For: 1.1.0
Attachments: speed_up_level_of.diff

Checking that each sstable is in the manifest on startup is O(N**2) in the number of sstables:

{code}
// ensure all SSTables are in the manifest
for (SSTableReader ssTableReader : cfs.getSSTables())
{
    if (manifest.levelOf(ssTableReader) < 0)
        manifest.add(ssTableReader);
}
{code}

{code}
private int levelOf(SSTableReader sstable)
{
    for (int level = 0; level < generations.length; level++)
    {
        if (generations[level].contains(sstable))
            return level;
    }
    return -1;
}
{code}

Note that the contains call is a linear List.contains. We need to switch to a sorted list and bsearch, or a tree, to support TB-levels of data in LeveledCompactionStrategy.
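The committed fix's idea (a map named {{sstableGenerations}}, per the comment above) can be sketched like this. The class below is a simplified stand-in, not the actual LeveledManifest code: tracking each sstable's level in a hash map makes {{levelOf()}} O(1) instead of a linear scan over every level's list.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch: a type parameter stands in for SSTableReader. Levels
// are recorded in a map so membership/level lookup is constant time.
public class ManifestSketch<SSTable> {
    private final Map<SSTable, Integer> sstableGenerations = new HashMap<SSTable, Integer>();

    public void add(SSTable sstable, int level) {
        sstableGenerations.put(sstable, level);
    }

    // O(1) lookup, matching the old levelOf() contract: -1 when absent.
    public int levelOf(SSTable sstable) {
        Integer level = sstableGenerations.get(sstable);
        return level == null ? -1 : level;
    }
}
```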
[jira] [Issue Comment Edited] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes
[ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217415#comment-13217415 ]

Peter Schuller edited comment on CASSANDRA-3294 at 2/27/12 7:21 PM:
---

{quote}
This sounds like reinventing the existing failure detector to me.
{quote}

Except we don't use it that way at all (see CASSANDRA-3927). Even if we did, I personally think it's totally the wrong solution to this problem, since we have the *perfect* measurement - whether the TCP connection is up. It's fine if we have other information that actively indicates we shouldn't send messages to a node (whether it's the FD or the fact that we have 500,000 messages queued to the node), but if we *know* the TCP connection is down, we should just not send messages to it, period. The only caveat is that we'd of course have to make sure TCP connections are in fact pro-actively kept up under all circumstances (I'd have to look at the code to figure out what issues there are, if any, in detail).

{quote}
The main idea of the algorithm I have mentioned is to make sure that we always do operations (write/read etc.) on the nodes that have the highest probability to be alive, determined by live traffic going there, instead of passively relying on the failure detector.
{quote}

I have an unfiled ticket suggesting we make the proximity sorting probabilistic, to avoid the binary "either we get traffic or we don't" (or "either we get data or we get digest") situation. That would certainly help. As would least-requests-outstanding. You can make this ticket irrelevant by making the general case well-supported enough that there is no reason to special-case this. This was originally filed when we had none of that (and we still don't), and it seemed like a very trivial case to handle when the TCP connection is actively reset by the other side.

{quote}
After reading CASSANDRA-3722 it seems we can implement required logic at the snitch level taking latency measurements into account. I think we can close this one in favor of CASSANDRA-3722 and continue work/discussion there. What do you think, Brandon, Peter?
{quote}

I think CASSANDRA-3722's original premise doesn't address the concerns I see in real life (I don't want special cases trying to communicate that X is happening), but towards the end I start agreeing with the ticket more. In any case, feel free to close if you want. If I ever get to actually implementing this (and if at that point there is no other mechanism that removes the need), I'll just re-file or re-open with a patch. We don't need to track this if others aren't interested.
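The "probabilistic proximity sorting" idea mentioned above could look something like the sketch below. This is entirely hypothetical (the weighting scheme and names are my own, not from any patch): instead of always routing to the single best-scored endpoint, pick one at random with probability inversely proportional to its latency score, so slower replicas still receive some traffic rather than flipping binarily between all and none.

```java
import java.util.Random;

// Hypothetical sketch of probabilistic endpoint selection. scores[i] is a
// latency score for endpoint i (lower = better); selection probability is
// proportional to 1/score (an assumed weighting, chosen for illustration).
public class ProbabilisticSnitchSketch {
    public static int pickEndpoint(double[] scores, Random rng) {
        double[] weights = new double[scores.length];
        double total = 0;
        for (int i = 0; i < scores.length; i++) {
            weights[i] = 1.0 / scores[i]; // inverse-latency weighting
            total += weights[i];
        }
        // Roulette-wheel selection over the weights.
        double r = rng.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r <= 0)
                return i;
        }
        return weights.length - 1; // guard against floating-point rounding
    }
}
```

With scores of 1.0 and 9.0, the faster endpoint is chosen roughly 90% of the time, so the slower node stays warm without taking the bulk of the traffic.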
[jira] [Issue Comment Edited] (CASSANDRA-3722) Send Hints to Dynamic Snitch when Compaction or repair is going on for a node.
[ https://issues.apache.org/jira/browse/CASSANDRA-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217445#comment-13217445 ]

Peter Schuller edited comment on CASSANDRA-3722 at 2/27/12 7:38 PM:
---

I'm -0 on the original bit of this ticket, but +1 on more generic changes that cover the original use case as well if not better anyway. I think that instead of trying to predict exactly the behavior of some particular event like compaction, we should just be better at responding to what is actually going on:

* We have CASSANDRA-2540, which can help avoid blocking uselessly on a dropped or slow request even if we haven't had the opportunity to react to overall behavior yet (I have a partial patch that breaks read repair; I haven't had time to finish it).
* Taking into account the number of outstanding requests is IMO a necessity. There is plenty of precedent for anyone who wants it (least-used-connection policies in various LBs), but more importantly it would so clearly help in several situations, including:
** Sudden GC pause of a node
** Sudden death of a node
** Sudden page cache eviction and slowness of a node, before snitching figures it out
** A constantly overloaded node; even with the dynsnitch it would improve the situation, as the number of requests affected by a dynsnitch reset is lessened
** Packet loss/hiccup/whatever across DCs

There is some potential for foot-shooting in the sense that if a node is broken in a way that makes it respond with incorrect data, but faster than anyone else, it will tend to swallow all the traffic. But honestly, that feels like a minor concern to me based on what I've seen actually happen in production clusters. If we ever start sending non-successes back over inter-node RPC, this would change, however.

My only major concern is the potential performance impact of keeping track of the number of outstanding requests, but if that *does* become a problem one can make it probabilistic - have N% of all requests be tracked. Less impact, but also less immediate response to what's happening. This will also have the side effect of mitigating sudden bursts of promotion into old-gen if we combine it with pro-actively dropping read-repair messages for nodes that are overloaded (effectively prioritizing data reads), hence helping with CASSANDRA-3853.

{quote}
Should we T (send additional requests which are not part of the normal operations) the requests until the other node recovers?
{quote}

In the absence of read repair, we'd have to do speculative reads, as Stu has previously noted. With read repair turned on, this is not an issue, because the node will still receive requests and eventually warm up. Only with read repair turned off do we not send requests to more than the first N endpoints, with N being what is required by the CL.

Semi-relatedly, I think it would be a good idea to make the proximity sorting probabilistic in nature so that we don't do a binary flip back and forth between who gets data vs. digest reads, or who doesn't get reads at all. That might mitigate this problem, but not help fundamentally, since the rate of warm-up would decrease with a node being slow.

I do want to make this point though: *every single production cluster* I have ever been involved with so far has been such that you basically never want to turn read repair off. Not because of read repair itself, but because of the traffic it generates. Having nodes not receive traffic is extremely dangerous under most circumstances, as it leaves nodes cold, only to suddenly explode and cause timeouts and other bad behavior as soon as e.g. some neighbor goes down and the cold node suddenly starts taking traffic. This is an easy way to make production clusters fall over.

If your workload is entirely in memory, or otherwise not reliant on caching, the problem is much less pronounced; but even then I would generally recommend that you keep it turned on, if only because your nodes will have to be able to take the additional load *anyway* if you are to survive other nodes in the neighborhood going down. It just makes clusters much easier to reason about.
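The "number of outstanding requests" bookkeeping argued for above can be sketched simply. This is a hypothetical illustration (names and structure are my own, not from any Cassandra patch): a counter per endpoint, incremented on send and decremented on response or timeout, with selection preferring the endpoint with the fewest requests in flight.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical least-outstanding-requests selection. Endpoints are indexed
// 0..n-1; callers report sends and completions, and pick() returns the
// endpoint with the fewest requests currently in flight.
public class LeastOutstandingSketch {
    private final AtomicInteger[] inFlight;

    public LeastOutstandingSketch(int endpoints) {
        inFlight = new AtomicInteger[endpoints];
        for (int i = 0; i < endpoints; i++)
            inFlight[i] = new AtomicInteger();
    }

    public int pick() {
        int best = 0;
        for (int i = 1; i < inFlight.length; i++)
            if (inFlight[i].get() < inFlight[best].get())
                best = i;
        return best;
    }

    public void sent(int endpoint)      { inFlight[endpoint].incrementAndGet(); }
    public void completed(int endpoint) { inFlight[endpoint].decrementAndGet(); }
}
```

A GC-paused or dead node stops completing requests, so its counter climbs and it naturally stops being picked - no prediction of the cause required. The "track only N% of requests" mitigation from the comment would sample which sends update the counters.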
[jira] [Issue Comment Edited] (CASSANDRA-3797) StorageProxy static initialization not triggered until thrift requests come in
[ https://issues.apache.org/jira/browse/CASSANDRA-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217491#comment-13217491 ]

Peter Schuller edited comment on CASSANDRA-3797 at 2/27/12 8:09 PM:
---

Looks like {{3797-forname.txt}} is the same file as the original patch. In any case, supposing we just go for Class.forName() to avoid introducing that annoying method, and assuming it makes the metrics from CASSANDRA-3671 work, can I get a +1?

StorageProxy static initialization not triggered until thrift requests come in
--

Key: CASSANDRA-3797
URL: https://issues.apache.org/jira/browse/CASSANDRA-3797
Project: Cassandra
Issue Type: Bug
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Minor
Fix For: 1.1.0
Attachments: 3797-forname.txt, CASSANDRA-3797-trunk-v1.txt

While plugging in the metrics library for CASSANDRA-3671, I realized (because the metrics library was trying to add a shutdown hook on metric creation) that starting Cassandra and simply shutting it down causes StorageProxy to not be initialized until the drain shutdown hook. Effects:

* The StorageProxy mbean is missing in visualvm/jconsole after initial startup (seriously, I thought I was going nuts ;))
* In general, anything that makes assumptions about running early, or at least not during JVM shutdown, such as the metrics library, will be problematic
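The Class.forName() approach discussed above relies on a JVM rule: a class's static initializer does not run until the class is first actively used, and Class.forName() counts as such a use. The minimal illustration below uses hypothetical names ("Holder" stands in for StorageProxy; the log stands in for mbean/metric registration):

```java
// Illustration of forcing static initialization via Class.forName().
public class ForNameSketch {
    public static final StringBuilder log = new StringBuilder();

    static class Holder {
        // Runs only when Holder is initialized, e.g. mbean/metric registration.
        static { log.append("init;"); }
    }

    public static void forceInit() {
        try {
            // A class literal (Holder.class) does NOT initialize the class;
            // Class.forName with initialize=true does.
            Class.forName(Holder.class.getName(), true, ForNameSketch.class.getClassLoader());
        } catch (ClassNotFoundException e) {
            throw new AssertionError(e);
        }
    }
}
```

Without such a call, StorageProxy's static block (and hence its mbean) only runs once the first thrift request touches the class - which is exactly the reported bug.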
[jira] [Issue Comment Edited] (CASSANDRA-3912) support incremental repair controlled by external agent
[ https://issues.apache.org/jira/browse/CASSANDRA-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209117#comment-13209117 ]

Peter Schuller edited comment on CASSANDRA-3912 at 2/16/12 5:22 AM:
---

Agreed. The good news is that the actual commands necessary ({{getprimaryrange}} and {{repairrange}}) are easy patches. The bad news is that it turns out the AntiEntropyService does not support arbitrary ranges. Attaching {{CASSANDRA\-3912\-v2\-001\-add\-nodetool\-commands.txt}} and {{CASSANDRA\-3912\-v2\-002\-fix\-antientropyservice.txt}}.

Had it not been for AES, I'd want to propose we commit this to 1.1, since it would be additive only; but given the AES fix, I don't know... I guess probably not? It's a shame, because I think it would be a boon to users with large nodes struggling with repair (despite the fact that, as you point out, each repair implies a flush).

support incremental repair controlled by external agent
---

Key: CASSANDRA-3912
URL: https://issues.apache.org/jira/browse/CASSANDRA-3912
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Attachments: CASSANDRA-3912-trunk-v1.txt, CASSANDRA-3912-v2-001-add-nodetool-commands.txt, CASSANDRA-3912-v2-002-fix-antientropyservice.txt

As a poor man's precursor to CASSANDRA-2699, exposing the ability to repair small parts of a range is extremely useful because it allows one (with external scripting logic) to slowly repair a node's content over time. Other than avoiding the bulkiness of complete repairs, it means that you can safely do repairs even if you absolutely cannot afford e.g. disk space spikes (see CASSANDRA-2699 for what the issues are).

Attaching a patch that exposes a repairincremental command to nodetool, where you specify a step and the number of total steps. Incrementally performing a repair in 100 steps, for example, would be done by:

{code}
nodetool repairincremental 0 100
nodetool repairincremental 1 100
...
nodetool repairincremental 99 100
{code}

An external script can be used to keep track of what has been repaired and when. This should (1) allow incremental repair to happen now/soon, and (2) allow experimentation and evaluation for an implementation of CASSANDRA-2699, which I still think is a good idea. This patch does nothing to help the average deployment, but it at least makes incremental repair possible given sufficient effort spent on external scripting.

The big no-no about the patch is that it is entirely specific to RandomPartitioner and BigIntegerToken. If someone can suggest a way to implement this command generically using the Range/Token abstractions, I'd be happy to hear suggestions. An alternative would be to provide a nodetool command that allows you to simply specify the specific token ranges on the command line. It makes using it a bit more difficult, but would mean that it works for any partitioner and token type. Unless someone can suggest a better way to do this, I think I'll provide a patch that does this. I'm still leaning towards supporting the simple "step N out of M" form, though.
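The "step N out of M" token math for RandomPartitioner (whose token space is the integers in [0, 2^127)) can be sketched with BigInteger. This is an illustrative sketch of the range-splitting arithmetic, not the patch's actual code: step i of M covers tokens in [i * 2^127 / M, (i+1) * 2^127 / M), with the last step absorbing any division remainder so the union of all steps is the full ring.

```java
import java.math.BigInteger;

// Sketch of splitting RandomPartitioner's token space [0, 2^127) into
// "total" contiguous steps for incremental repair.
public class IncrementalRepairSketch {
    private static final BigInteger MAX = BigInteger.valueOf(2).pow(127);

    // Returns {startToken, endToken} (end exclusive) for 0-based step i of total.
    public static BigInteger[] stepRange(int i, int total) {
        BigInteger t = BigInteger.valueOf(total);
        BigInteger start = MAX.multiply(BigInteger.valueOf(i)).divide(t);
        BigInteger end = (i == total - 1)
            ? MAX  // last step absorbs the remainder
            : MAX.multiply(BigInteger.valueOf(i + 1)).divide(t);
        return new BigInteger[] { start, end };
    }
}
```

Multiplying before dividing keeps the boundaries exact, so consecutive steps share a boundary token and no range is repaired twice or skipped.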
[jira] [Issue Comment Edited] (CASSANDRA-3892) improve TokenMetadata abstraction, naming - audit current uses
[ https://issues.apache.org/jira/browse/CASSANDRA-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206391#comment-13206391 ]

Peter Schuller edited comment on CASSANDRA-3892 at 2/12/12 10:39 AM:
---

Attaching {{CASSANDRA\-3892\-draft.txt}}, which is a draft/work in progress. Mainly I'm asking for a "stop right there" if these types of changes seem like something that will never be accepted (they're semi-significant, even though most of the patch constitutes non-functional changes). I'm not asking nor suggesting careful review, as it's better that I submit a more finished patch before that happens. Any requests for patch-splitting strategies, or overall "don't do this/don't do that", would be helpful though, if someone has them.

Other than what's there in the current version, I want to move pending range calculation into the token metadata (it will need to be given a strategy), and I want things like {{StorageService.handleStateNormal()}} being responsible for keeping the internal state of the token metadata (removing from moving) up to date to be gone. I've begun making naming and concepts a bit more consistent; the token metadata now more consistently (but not fully yet) talks about endpoints as the main abstraction, rather than mixing endpoints and tokens, and we have joining endpoints instead of bootstrap tokens. Moving endpoints is now also a map with O(n) access, and kept up to date in {{removeEndpoint()}} (there may be other places that need fixing). I adjusted the comments for {{calculatePendingRanges}} to be clearer; for example, the old comments made it sound like we were sending writes to places for good measure because we're in doubt, rather than because it is strictly necessary.

Unless I hear objections, I'll likely continue this on Sunday and submit another patch.

improve TokenMetadata abstraction, naming - audit current uses
--

Key: CASSANDRA-3892
URL: https://issues.apache.org/jira/browse/CASSANDRA-3892
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Attachments: CASSANDRA-3892-draft.txt

CASSANDRA-3417 has some background. I want to make the distinction clearer between looking at the ring from different perspectives (reads, writes, others) and adjust naming to be clearer on this. I also want to go through each use case and try to spot any subtle pre-existing bugs like the ones I almost introduced in CASSANDRA-3417, had Jonathan not caught me. I will submit a patch soonish.
[jira] [Issue Comment Edited] (CASSANDRA-3892) improve TokenMetadata abstraction, naming - audit current uses
[ https://issues.apache.org/jira/browse/CASSANDRA-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206596#comment-13206596 ]

Peter Schuller edited comment on CASSANDRA-3892 at 2/13/12 1:00 AM:
---

Attaching {{CASSANDRA\-3892\-draft\-v2.txt}} with some more changes. I still consider it a draft because I have not yet done any testing, but it's riper for review now. A few of the sub-tasks I created are IMO serious as well.

improve TokenMetadata abstraction, naming - audit current uses
--

Key: CASSANDRA-3892
URL: https://issues.apache.org/jira/browse/CASSANDRA-3892
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Attachments: CASSANDRA-3892-draft-v2.txt, CASSANDRA-3892-draft.txt

CASSANDRA-3417 has some background. I want to make the distinction clearer between looking at the ring from different perspectives (reads, writes, others) and adjust naming to be clearer on this. I also want to go through each use case and try to spot any subtle pre-existing bugs like the ones I almost introduced in CASSANDRA-3417, had Jonathan not caught me. I will submit a patch soonish.
[jira] [Issue Comment Edited] (CASSANDRA-3897) StorageService.onAlive() only schedules hints for joined endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206600#comment-13206600 ]

Peter Schuller edited comment on CASSANDRA-3897 at 2/13/12 1:08 AM:

Why would this be the case? They are supposed to receive writes; I see no reason why hints should not be delivered. Hints are just a way to deliver writes more quickly in cases where nodes are down (i.e., more quickly once they come back up) and to avoid the need for AES. I don't see why a node actively bootstrapping into the ring should be discriminated against, in terms of seeing as reliable a delivery of writes as other nodes. In other words, I don't buy your first sentence unless you explain why. I don't accept it axiomatically :) Obviously sending hints requires that hints are *there* first too, but the same argument applies. If a node is supposed to see certain writes and it's considered down - hint it. Statistically I can see the argument that if a node is bootstrapping and down, it might in practice be more likely that the node is just going to be down for a longer period, and/or that the node will completely re-bootstrap anyway (since normally a node is down because it's being restarted, which would imply re-bootstrap if the node is bootstrapping).

StorageService.onAlive() only schedules hints for joined endpoints
--

Key: CASSANDRA-3897
URL: https://issues.apache.org/jira/browse/CASSANDRA-3897
Project: Cassandra
Issue Type: Sub-task
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Minor

It seems incorrect not to do hint delivery for nodes that are bootstrapping, as that would cause sudden spikes in read repair need or inconsistent reads when a node joins the ring. Particularly if the user is expecting to rely on the new hinted handoff code making AES much less needed. It would be a POLA violation for bootstrapping nodes to be an exception to that.
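To make the argued-for change concrete, here is a minimal sketch of the two behaviors. All class, field, and method names below are made up for illustration; this is not the actual StorageService/HintedHandOffManager code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class OnAliveSketch {
    enum State { JOINED, BOOTSTRAPPING }

    final Map<String, State> endpointState = new HashMap<>();
    final Set<String> endpointsWithHints = new HashSet<>();
    final List<String> deliveriesScheduled = new ArrayList<>();

    // Behavior described in the ticket title: only joined endpoints
    // get hint delivery scheduled when they come alive.
    void onAliveJoinedOnly(String ep) {
        if (endpointState.get(ep) == State.JOINED && endpointsWithHints.contains(ep))
            deliveriesScheduled.add(ep);
    }

    // Behavior argued for in the comment: if the endpoint is supposed to
    // see writes and hints are pending for it, schedule delivery
    // regardless of whether it has finished joining the ring.
    void onAliveAnyWriteTarget(String ep) {
        if (endpointsWithHints.contains(ep))
            deliveriesScheduled.add(ep);
    }
}
```

The only difference is dropping the joined-state check: a bootstrapping endpoint with pending hints gets delivery scheduled instead of being skipped.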
[jira] [Issue Comment Edited] (CASSANDRA-3895) Gossiper.doStatusCheck() uses isMember() suspiciously
[ https://issues.apache.org/jira/browse/CASSANDRA-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206627#comment-13206627 ]

Peter Schuller edited comment on CASSANDRA-3895 at 2/13/12 2:29 AM:

{quote} If fat clients disappear, no one really cares because they were never ring members. {quote}

Ok. Well, they care in the joining (bootstrapping) node case since they are taking writes. But all nodes will naturally drop them by themselves without needing others to propagate. As long as we consider aVeryLongTime (3 days) long enough that it's safe to assume this only triggers in legitimate cases, we're fine (if not, we won't be as soon as CASSANDRA-3892 is fixed - or if I am wrong, it's not a bug to begin with). I'll resolve this as INVALID then, though I'm skeptical about the 3 day "very long time".

Gossiper.doStatusCheck() uses isMember() suspiciously
-

Key: CASSANDRA-3895
URL: https://issues.apache.org/jira/browse/CASSANDRA-3895
Project: Cassandra
Issue Type: Sub-task
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Minor

There is code for fat client removal and old endpoint (non-fat) removal which uses {{TokenMetadata.isMember()}}, which only considers nodes that are joined (taking reads) in the cluster. aVeryLongTime is set to 3 days. I could very well be wrong, but the fat client identification code, the way I interpret it, is using isMember() to check basically whether a node is part of the cluster (in the most vague/broad sense) in order to differentiate a real node (part of the cluster) from just a fat client. But a node that is bootstrapping is not a fat client, nor will it be a member according to isMember(). I'm also a bit scared of, even in the case of there not being a fat client identification, simply forgetting an endpoint. It seems that an operator request should be relied upon to actively forget an endpoint (i.e., forced remove token).
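For illustration, the distinction being probed can be sketched like this. The states, names, and threshold handling are assumptions for the sketch, not Gossiper's actual fields:

```java
public class StatusCheckSketch {
    // A fat client gossips but never owns a token; a bootstrapping node is
    // not yet joined (so isMember() is false for both) but *does* claim a
    // token. Conflating the two is the suspicion raised in the ticket.
    enum State { JOINED, BOOTSTRAPPING, FAT_CLIENT }

    static boolean looksLikeFatClient(State s, boolean hasToken) {
        return s == State.FAT_CLIENT && !hasToken;
    }

    // Only auto-forget endpoints we are confident are fat clients; anything
    // holding a token should require an explicit operator removal instead
    // of being silently dropped after aVeryLongTime.
    static boolean safeToForget(State s, boolean hasToken, long quietMillis, long aVeryLongTime) {
        return looksLikeFatClient(s, hasToken) && quietMillis > aVeryLongTime;
    }
}
```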
[jira] [Issue Comment Edited] (CASSANDRA-3735) Fix Unable to create hard link SSTableReaderTest error messages
[ https://issues.apache.org/jira/browse/CASSANDRA-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200642#comment-13200642 ]

Peter Schuller edited comment on CASSANDRA-3735 at 2/5/12 2:54 AM:
---

Attaching new version of 0002* that works (but still with the left-overs already mentioned by jbellis/sylvain) post CASSANDRA-2794.

Fix Unable to create hard link SSTableReaderTest error messages
-

Key: CASSANDRA-3735
URL: https://issues.apache.org/jira/browse/CASSANDRA-3735
Project: Cassandra
Issue Type: Bug
Reporter: Jonathan Ellis
Assignee: Jonathan Ellis
Priority: Minor
Attachments: 0001-fix-generation-update-in-loadNewSSTables.patch, 0002-remove-incremental-backups-before-reloading-sstables-v2.patch, 0002-remove-incremental-backups-before-reloading-sstables.patch

Sample failure (on Windows):
{noformat}
[junit] java.io.IOException: Exception while executing the command: cmd /c mklink /H C:\Users\Jonathan\projects\cassandra\git\build\test\cassandra\data\Keyspace1\backups\Standard1-hc-1-Index.db c:\Users\Jonathan\projects\cassandra\git\build\test\cassandra\data\Keyspace1\Standard1-hc-1-Index.db, command error Code: 1, command output: Cannot create a file when that file already exists.
[junit]
[junit] at org.apache.cassandra.utils.CLibrary.exec(CLibrary.java:213)
[junit] at org.apache.cassandra.utils.CLibrary.createHardLinkWithExec(CLibrary.java:188)
[junit] at org.apache.cassandra.utils.CLibrary.createHardLink(CLibrary.java:151)
[junit] at org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:833)
[junit] at org.apache.cassandra.db.DataTracker$1.runMayThrow(DataTracker.java:161)
[junit] at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
[junit] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
[junit] at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
[junit] at java.util.concurrent.FutureTask.run(FutureTask.java:138)
[junit] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
[junit] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
[junit] at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
[junit] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
[junit] at java.lang.Thread.run(Thread.java:662)
[junit] ERROR 17:10:17,111 Fatal exception in thread Thread[NonPeriodicTasks:1,5,main]
{noformat}
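The patch names suggest the fix direction: remove stale incremental backups before re-linking, so a leftover file from a previous run cannot trigger "file already exists". Here is a hedged sketch of that idea using java.nio rather than the CLibrary/mklink path the test actually exercises (method names are mine, for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class HardLinkSketch {
    // Delete any stale copy of the target before creating the hard link,
    // so repeated runs over the same backup directory are idempotent.
    static void createHardLinkSafely(Path source, Path target) throws IOException {
        Files.createDirectories(target.getParent());
        Files.deleteIfExists(target);      // stale file from an earlier run
        Files.createLink(target, source);  // NIO equivalent of `mklink /H`
    }

    // Demo: linking the same file twice does not throw, unlike a bare createLink().
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("hardlink-sketch");
            Path src = Files.write(dir.resolve("Index.db"), new byte[] { 1, 2, 3 });
            Path tgt = dir.resolve("backups").resolve("Index.db");
            createHardLinkSafely(src, tgt);
            createHardLinkSafely(src, tgt); // would fail without deleteIfExists
            return Files.exists(tgt);
        } catch (IOException e) {
            return false;
        }
    }
}
```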
[jira] [Issue Comment Edited] (CASSANDRA-3831) scaling to large clusters in GossipStage impossible due to calculatePendingRanges
[ https://issues.apache.org/jira/browse/CASSANDRA-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200251#comment-13200251 ]

Peter Schuller edited comment on CASSANDRA-3831 at 2/4/12 1:25 AM:
---

I am attaching {{CASSANDRA-3831-memoization-not-for-inclusion.txt}} as an FYI and in case it helps others. It's against 0.8, and implements memoization of calculatePendingRanges. The correct/clean fix is probably to change behavior so that it doesn't get called unnecessarily to begin with (and to make sure the computational complexity is reasonable when it does get called). This patch was made specifically to address the production issue we are having in a minimally dangerous fashion, and is not to be taken as a suggested fix.
scaling to large clusters in GossipStage impossible due to calculatePendingRanges
--

Key: CASSANDRA-3831
URL: https://issues.apache.org/jira/browse/CASSANDRA-3831
Project: Cassandra
Issue Type: Bug
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Critical
Attachments: CASSANDRA-3831-memoization-not-for-inclusion.txt

(most observations below are from 0.8, but I just now tested on trunk and I can trigger this problem *just* by bootstrapping a ~180 node cluster concurrently, presumably due to the number of nodes that are simultaneously in bootstrap state)

It turns out that:
* (1) calculatePendingRanges is not just expensive, it's computationally complex - cubic or worse
* (2) it gets called *NOT* just once per node being bootstrapped/leaving etc, but is called repeatedly *while* nodes are in these states

As a result, clusters start exploding when you start reaching 100-300 nodes. The GossipStage will get backed up because a single calculatePendingRanges takes seconds, and depending on what the average heartbeat interval is in relation to this, this can lead to *massive* cluster-wide flapping. This all started because we hit this in production; several nodes would start flapping several other nodes as down, with many nodes seeing the entire cluster, or a large portion of it, as down. Logging in to some of these nodes you would see that they would be constantly flapping up/down for minutes at a time until one became lucky and it stabilized. In the end we had to perform an emergency full-cluster restart with gossip patched to force-forget certain nodes in bootstrapping state. I can't go into all details here from the post-mortem (just the write-up would take a day), but in short:
* We graphed the number of hosts in the cluster that had more than 5 Down (in a cluster that should have 0 down) on a minutely timeline.
* We also graphed the number of hosts in the cluster that had GossipStage backed up.
* The two graphs correlated *extremely* well
* jstack sampling showed it being CPU bound doing mostly sorting under calculatePendingRanges
* We were never able to exactly reproduce it with normal RING_DELAY and gossip intervals, even on a 184 node cluster (the production cluster is around 180).
* Dropping RING_DELAY and in particular dropping the gossip interval to 10 ms instead of 1000 ms, we were able to observe all of the behavior we saw in production.

So our steps to reproduce are:
* Launch a 184 node cluster w/ gossip interval at 10ms and RING_DELAY at 1 second.
* Do something like: {{while [ 1 ] ; do date ; echo decom ; nodetool decommission ; date ; echo done leaving decommed for a while ; sleep 3 ; date ; echo done restarting; sudo rm -rf /data/disk1/commitlog/* ; sudo rm -rf /data/diskarray/tables/* ; sudo monit restart cassandra ;date ; echo restarted waiting for a while ; sleep 40; done}} (or just do a manual decom/bootstrap once, it triggers every time)
* Watch all nodes flap massively and not recover at all, or maybe after a *long* time.

I observed the flapping using a python script that every 5 seconds (randomly spread out) asked for unreachable nodes from *all* nodes in the cluster, and printed any nodes and their counts when they had more than 5 unreachables. The cluster can be observed instantly going into massive flapping when leaving/bootstrap is initiated. Script
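The memoization approach in the attached not-for-inclusion patch can be sketched roughly as follows. The ring-version parameter is a stand-in I introduced for illustration; the real patch keys off whatever state calculatePendingRanges actually depends on:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

public class PendingRangesMemo {
    // Recompute pending ranges only when the ring/token state it depends on
    // has changed, instead of on every gossip-driven call.
    private long lastSeenVersion = -1;
    private Map<String, List<String>> cached = Collections.emptyMap();
    int computations = 0; // exposed for the demo/test below

    Map<String, List<String>> pendingRanges(long ringVersion,
                                            Supplier<Map<String, List<String>>> expensiveCalc) {
        if (ringVersion != lastSeenVersion) {
            cached = expensiveCalc.get(); // the cubic-or-worse work happens only here
            lastSeenVersion = ringVersion;
            computations++;
        }
        return cached;
    }
}
```

Repeated calls at the same ring version return the cached result, so GossipStage no longer pays seconds of CPU per heartbeat while nodes sit in bootstrap/leaving state.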
[jira] [Issue Comment Edited] (CASSANDRA-3820) Columns missing after upgrade from 0.8.5 to 1.0.7.
[ https://issues.apache.org/jira/browse/CASSANDRA-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197496#comment-13197496 ]

Peter Schuller edited comment on CASSANDRA-3820 at 2/1/12 1:53 AM:
---

Check whether the .bf files contain all zeroes above roughly 235 mb or so. If you have lots of rows, your BF will be that large. We encountered a bug internally whereby all bloom filters larger than 2^31 bits were large on disk, but everything after the first 2^31 bits was all zeroes. Unfortunately I don't know whether this is specific to patches made to our branch, and I have been so busy I haven't been able to follow up to figure out whether it affects the upstream version. But - just tail -c 1000 | hexdump. If you only have zeroes, this is the bug. Make sure to tail on a large .bf file (take the largest, easiest).

Columns missing after upgrade from 0.8.5 to 1.0.7.
--

Key: CASSANDRA-3820
URL: https://issues.apache.org/jira/browse/CASSANDRA-3820
Project: Cassandra
Issue Type: Bug
Affects Versions: 1.0.7
Reporter: Jason Harvey

After an upgrade, one of our CFs had a lot of rows with missing columns. I've been able to reproduce in test conditions. Working on getting the tables to DataStax (data is private).

0.8 results:
{code}
[default@reddit] get CommentVote[36353467625f6837336f32];
=> (column=date, value=313332333932323930392e3531, timestamp=1323922909506508)
=> (column=ip, value=REDACTED, timestamp=1327048432717348, ttl=2592000)
=> (column=name, value=31, timestamp=1327048433000740)
=> (column=REDACTED, value=30, timestamp=1323922909506432)
=> (column=thing1_id, value=REDACTED, timestamp=1323922909506475)
=> (column=thing2_id, value=REDACTED, timestamp=1323922909506486)
=> (column=REDACTED, value=31, timestamp=1323922909506518)
=> (column=REDACTED, value=30, timestamp=1323922909506497)
{code}

1.0 results:
{code}
[default@reddit] get CommentVote[36353467625f6837336f32];
=> (column=ip, value=REDACTED, timestamp=1327048432717348, ttl=2592000)
=> (column=name, value=31, timestamp=1327048433000740)
{code}

A few notes:
* The rows with missing data were fully restored after scrubbing the sstables.
* The row which I reproduced on happened to be split across multiple sstables.
* When I copied the first sstable I found the row on, I was able to 'list' rows from the sstable, but any and all 'get' calls failed.
* These SSTables were natively created on 0.8.5; they did not come from any previous upgrade.
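The suggested {{tail -c 1000 | hexdump}} check can also be expressed programmatically. A hedged sketch (the method and class names are mine; it only checks whether the last n bytes of a file are all zero, which for a Bloom filter larger than 2^31 bits would indicate the truncation bug described above):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class BloomFilterTailCheck {
    // Equivalent of `tail -c n FILE | hexdump` followed by eyeballing for
    // zeroes: returns true iff the last n bytes of the file are all 0x00.
    static boolean tailIsAllZeroes(String bfPath, int n) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(bfPath, "r")) {
            long start = Math.max(0, f.length() - n);
            f.seek(start);
            byte[] buf = new byte[(int) (f.length() - start)];
            f.readFully(buf);
            for (byte b : buf)
                if (b != 0)
                    return false;
            return true;
        }
    }
}
```

As with the shell version, run it against the largest .bf file available; a small filter legitimately ending in zeroes would be a false positive.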
[jira] [Issue Comment Edited] (CASSANDRA-3670) provide red flags JMX instrumentation
[ https://issues.apache.org/jira/browse/CASSANDRA-3670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13195216#comment-13195216 ]

Peter Schuller edited comment on CASSANDRA-3670 at 1/27/12 11:13 PM:
-

CodaHale Metrics being evaluated in CASSANDRA-3671. If there's a +1 there, will go for the same here.

provide red flags JMX instrumentation
---

Key: CASSANDRA-3670
URL: https://issues.apache.org/jira/browse/CASSANDRA-3670
Project: Cassandra
Issue Type: Improvement
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Minor

As discussed in CASSANDRA-3641, it would be nice to expose through JMX certain information which is almost without exception indicative of something being wrong with the node or cluster. In the CASSANDRA-3641 case, it was the detection of corrupt counter shards. Other examples include:
* Number of times the selection of files to compact was adjusted due to disk space heuristics
* Number of times compaction has failed
* Any I/O error reading from or writing to disk (the work here is collecting, not exposing, so maybe not in an initial version)
* Any data skipped due to checksum mismatches (when checksumming is being used); e.g., number of skips.
* Any arbitrary exception, at least in certain code paths (compaction, scrub, cleanup for starters)

Probably other things. The motivation is that if we have clear and obvious indications that something truly is wrong, it seems suboptimal to just leave that information in the log somewhere, for someone to discover later when something else broke as a result and a human investigates. You might argue that one should use non-trivial log analysis to detect these things, but I highly doubt a lot of people do this, and it seems very wasteful to require that in comparison to just providing the MBean.

It is important to note that the *lack* of a certain problem being advertised in this MBean is not supposed to be indicative of a *lack* of a problem. Rather, the point is that, to the extent we can easily do so, it is nice to have a clear method of communicating to monitoring systems where there *is* a clear indication of something being wrong. The main part of this ticket is not to cover everything under the sun, but rather to reach agreement on adding an MBean where these types of indicators can be collected. Individual counters can then be added over time as one thinks of them. I propose:
* Create an org.apache.cassandra.db.RedFlags MBean
* Populate with a few things to begin with.

I'll submit the patch if there is agreement.
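A minimal sketch of what the proposed MBean could look like. The ticket only proposes the class name org.apache.cassandra.db.RedFlags; the attributes and recording methods below are illustrative guesses, not a committed API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Standard MBean convention: the management interface is the class name
// plus "MBean". Attributes are monotonically increasing counters that
// should only ever be non-zero when something is clearly wrong.
interface RedFlagsMBean {
    long getCompactionFailures();
    long getChecksumMismatchesSkipped();
}

public class RedFlags implements RedFlagsMBean {
    private final AtomicLong compactionFailures = new AtomicLong();
    private final AtomicLong checksumMismatchesSkipped = new AtomicLong();

    // Called from the relevant code paths (compaction, read path, etc.)
    // when a red-flag event occurs.
    public void recordCompactionFailure() { compactionFailures.incrementAndGet(); }
    public void recordChecksumSkip() { checksumMismatchesSkipped.incrementAndGet(); }

    @Override public long getCompactionFailures() { return compactionFailures.get(); }
    @Override public long getChecksumMismatchesSkipped() { return checksumMismatchesSkipped.get(); }
}
```

Registration would then be a one-liner against ManagementFactory.getPlatformMBeanServer() with an ObjectName such as "org.apache.cassandra.db:type=RedFlags", after which any JMX-based monitoring system can alert on non-zero values.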
[jira] [Issue Comment Edited] (CASSANDRA-3070) counter repair
[ https://issues.apache.org/jira/browse/CASSANDRA-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13166617#comment-13166617 ]

Peter Schuller edited comment on CASSANDRA-3070 at 12/9/11 10:19 PM:
-

This may be relevant, quoting myself from IRC:
{code}
21:20:01 scode pcmanus: Hey, are you there?
21:20:21 scode pcmanus: I am investigating something which might be https://issues.apache.org/jira/browse/CASSANDRA-3070
21:20:37 scode pcmanus: And I could use the help of someone with his brain all over counters, and Stu isn't here atm. :)
21:21:16 scode pcmanus: https://gist.github.com/8202cb46c8bd00c8391b
21:21:37 scode pcmanus: I am investigating why with CL.ALL and CL.QUORUM, I get seemingly random/varying results when I read a counter.
21:21:53 scode pcmanus: I have the offending sstables on a three-node test setup and am inserting debug printouts in the code to trace the reconciliation.
21:21:57 scode pcmanus: The gist above shows what's happening.
21:22:11 scode pcmanus: The latter is the wrong one, and the former is the correct one.
21:22:28 scode pcmanus: The interesting bit is that I see shards with the same node_id *AND* clock, but *DIFFERENT* counts.
21:22:53 scode pcmanus: My understanding of counters is that there should never (globally across an entire cluster in all sstables) exist two shards for the same node_id+clock but with different counts.
21:22:57 scode pcmanus: Is my understanding correct there?
21:25:10 scode pcmanus: There is one node out of the three that has the offending shard (with a count of 2 instead of 1).
{code}
Like with 3070, we observed this after having expanded a cluster (though I'm not sure how that would cause it, and we don't know if there existed a problem before the expansion).