[jira] [Issue Comment Edited] (CASSANDRA-4032) memtable.updateLiveRatio() is blocking, causing insane latencies for writes

2012-03-09 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226166#comment-13226166
 ] 

Peter Schuller edited comment on CASSANDRA-4032 at 3/9/12 4:05 PM:
---

{quote}
Are we sure that what we want is a SynchronousQueue with task rejected? After 
all, there is only one global memoryMeter, so we could end up failing to 
updateLiveRatio just based on a race, even if calculations are fast. I'd 
suggest instead a bounded queue (but maybe not infinite and we could indeed 
just skip the task if that queue gets full).
{quote}

I agree it's fishy, though I'd suggest a separate ticket. This patch is 
intended to make the code behave the way the original commit intended.

This (from the code, not my patch) seems legit though:

{code}
// we're careful to only allow one count to run at a time because counting is slow
// (can be minutes, for a large memtable and a busy server), so we could keep memtables
// alive after they're flushed and would otherwise be GC'd.
{code}

We could have one queue per unique CF and have a consumer that iterates over 
the set of queues, guaranteeing that each CF gets processed once per cycle. A 
simpler solution is probably preferable though if we can think of one.
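
For illustration, a minimal sketch of the per-CF queue idea (class and method names here are hypothetical, not part of the attached patch): one single-slot queue per column family and a single consumer thread that round-robins over them, so writers never block and each CF still gets metered once per cycle.

{code}
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch (not the attached patch): one bounded single-slot queue per CF,
// plus a single consumer that round-robins over the queues, so every CF gets its
// liveRatio recalculated once per cycle and nothing ever blocks the mutation path.
public class PerCfMeterQueues
{
    private final ConcurrentMap<String, Queue<Runnable>> queues =
        new ConcurrentHashMap<String, Queue<Runnable>>();

    public boolean submit(String cfName, Runnable meterTask)
    {
        Queue<Runnable> q = queues.get(cfName);
        if (q == null)
        {
            Queue<Runnable> candidate = new ArrayBlockingQueue<Runnable>(1);
            Queue<Runnable> raced = queues.putIfAbsent(cfName, candidate);
            q = (raced == null) ? candidate : raced;
        }
        return q.offer(meterTask); // false == a measurement is already pending; skip silently
    }

    // Runs on the single MemoryMeter thread; one pass over all CFs per cycle.
    public void runOneCycle()
    {
        for (Queue<Runnable> q : queues.values())
        {
            Runnable task = q.poll();
            if (task != null)
                task.run(); // the slow O(n) liveRatio calculation
        }
    }
}
{code}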


  was (Author: scode):
{code}
Are we sure that what we want is a SynchronousQueue with task rejected? After 
all, there is only on global memoryMeter, so we could end up failing to 
updateLiveRatio just based on a race, even if calculations are fast. I'd 
suggest instead a bounded queue (but maybe not infinite and we could indeed 
just skip task if that queue gets full).
{code}

I agree it's fishy, though I'd suggest a separate ticket. This patch is 
intended to make the code behave the way the original commit intended.

This (from the code, not my patch) seems legit though:

{code}
// we're careful to only allow one count to run at a time because counting 
is slow
// (can be minutes, for a large memtable and a busy server), so we could 
keep memtables
// alive after they're flushed and would otherwise be GC'd.
{code}

We could have one queue per unique CF and have a consumer that iterates over 
the set of queues, guaranteeing that each CF gets processed once per cycle. A 
simpler solution is probably preferable though if we can think of one.

  
 memtable.updateLiveRatio() is blocking, causing insane latencies for writes
 ---

 Key: CASSANDRA-4032
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4032
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
 Fix For: 1.1.0

 Attachments: CASSANDRA-4032-1.1.0-v1.txt


 Reproduce by just starting a fresh cassandra with a heap large enough for 
 live ratio calculation (which is {{O(n)}}) to be insanely slow, and then 
 running {{./bin/stress -d host -n1 -t10}}. With a large enough heap 
 and default flushing behavior this is bad enough that stress gets timeouts.
 Example ("blocked for" is my debug log added around submit()):
 {code}
  INFO [MemoryMeter:1] 2012-03-09 15:07:30,857 Memtable.java (line 198) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 8.89014894083727 (just-counted was 8.89014894083727).  calculation took 28273ms for 1320245 columns
  WARN [MutationStage:8] 2012-03-09 15:07:30,857 Memtable.java (line 209) submit() blocked for: 231135
 {code}
 The calling code was written assuming a RejectedExecutionException is thrown, 
 but it's not because {{DebuggableThreadPoolExecutor}} installs a blocking 
 rejection handler.
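 For illustration only (this is not Cassandra code), roughly the behavior the calling code assumed: a single-threaded executor over a {{SynchronousQueue}} with the default {{AbortPolicy}} rejects a second submission immediately instead of blocking the caller for the duration of the count.
{code}
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Standalone illustration: with the default AbortPolicy, a second submission while the
// single meter thread is busy throws RejectedExecutionException instead of blocking.
public class NonBlockingMeterSubmit
{
    public static void main(String[] args)
    {
        ThreadPoolExecutor meter = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new SynchronousQueue<Runnable>()); // default rejection handler: AbortPolicy

        meter.execute(slowCount());            // accepted: the one meter thread takes it
        try
        {
            meter.execute(slowCount());        // no idle thread and no queue slot...
        }
        catch (RejectedExecutionException e)
        {
            // ...so we land here and can simply skip this liveRatio update
            System.out.println("meter busy, skipping update");
        }
        meter.shutdown();
    }

    private static Runnable slowCount()
    {
        return new Runnable()
        {
            public void run()
            {
                // stand-in for the minutes-long counting pass
                try { Thread.sleep(5000); } catch (InterruptedException e) { }
            }
        };
    }
}
{code}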

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-3952) avoid quadratic startup time in LeveledManifest

2012-03-06 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223091#comment-13223091
 ] 

Peter Schuller edited comment on CASSANDRA-3952 at 3/6/12 8:41 AM:
---

Committed with an additional assertion and the map renamed to 
{{sstableGenerations}}, and including 1.1.0. It was marked for 1.1.1 but 
frankly if this *does* introduce some kind of bug, it feels more dangerous to 
have that crop up in an upgrade to 1.1.1 than to have it in the initial release.

  was (Author: scode):
Committed, including 1.1.0. It was marked for 1.1.1 but frankly if this 
*does* introduce some kind of bug, it feels more dangerous to have that crop up 
in an upgrade to 1.1.1 than to have it in the initial release.
  
 avoid quadratic startup time in LeveledManifest
 ---

 Key: CASSANDRA-3952
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3952
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Jonathan Ellis
Assignee: Dave Brosius
Priority: Minor
  Labels: lhf
 Fix For: 1.1.0

 Attachments: speed_up_level_of.diff


 Checking that each sstable is in the manifest on startup is O(N**2) in the 
 number of sstables:
 {code}
 // ensure all SSTables are in the manifest
 for (SSTableReader ssTableReader : cfs.getSSTables())
 {
     if (manifest.levelOf(ssTableReader) < 0)
         manifest.add(ssTableReader);
 }
 {code}
 {code}
 private int levelOf(SSTableReader sstable)
 {
     for (int level = 0; level < generations.length; level++)
     {
         if (generations[level].contains(sstable))
             return level;
     }
     return -1;
 }
 {code}
 Note that the contains call is a linear List.contains.
 We need to switch to a sorted list and bsearch, or a tree, to support 
 TB-levels of data in LeveledCompactionStrategy.
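 As a rough, hypothetical sketch of that direction (a map keyed by sstable, like the {{sstableGenerations}} renaming mentioned above, so {{levelOf}} becomes a constant-time lookup rather than a scan over every level's list; the committed code may differ in detail):
{code}
import java.util.HashMap;
import java.util.Map;

// Sketch only: maintain a generation map alongside the per-level lists so levelOf()
// is an O(1) hash lookup instead of a linear List.contains over every level.
public class LevelIndex<T>
{
    private final Map<T, Integer> sstableGenerations = new HashMap<T, Integer>();

    public void add(T sstable, int level)
    {
        sstableGenerations.put(sstable, level);
    }

    public int levelOf(T sstable)
    {
        Integer level = sstableGenerations.get(sstable);
        return level == null ? -1 : level; // -1 == not in the manifest, mirroring the old contract
    }
}
{code}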

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

2012-02-27 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217415#comment-13217415
 ] 

Peter Schuller edited comment on CASSANDRA-3294 at 2/27/12 7:21 PM:


{quote}
This sounds like reinventing the existing failure detector to me.
{quote}

Except we don't use it that way at all (see CASSANDRA-3927). Even if we did 
though, I personally think it's totally the wrong solution to this problem 
since we have the *perfect* measurement - whether the TCP connection is up.

It's fine if we have other information that actively indicates we shouldn't 
send messages to it (whether it's the FD or the fact that we have 500 000 
messages queued to the node), but if we *know* the TCP connection is down, we 
should just not send messages to it, period. With the only caveat being that of 
course we'd have to make sure TCP connections are in fact pro-actively kept up 
under all circumstances (I'd have to look at code to figure out what issues 
there are, if any, in detail).

{quote}
The main idea of the algorithm I have mentioned is to make sure that we always 
do operations (write/read etc.) on the nodes that have the highest probability 
to be alive determined by live traffic going there instead of passively relying 
on the failure detector.
{quote}

I have an unfiled ticket to suggest making the proximity sorting probabilistic 
to avoid the binary "either we get traffic or we don't" (or "either we get data 
or we get digest") situation. That would certainly help. As would 
least-requests-outstanding.
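
To make the idea concrete, a hypothetical sketch (none of these names exist in Cassandra) of what probabilistic proximity sorting could look like: the data read goes to an endpoint chosen with probability weighted by its snitch score, rather than always to the single best-scoring one.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of "probabilistic proximity sorting": slightly-worse replicas still see some
// traffic (and stay warm) instead of a binary best-gets-everything decision.
public final class ProbabilisticSorter
{
    private static final Random random = new Random();

    // Lower score == closer/faster. Weight each endpoint by 1/(score + epsilon).
    public static String pickDataEndpoint(Map<String, Double> scores)
    {
        List<String> endpoints = new ArrayList<String>(scores.keySet());
        double[] weights = new double[endpoints.size()];
        double total = 0;
        for (int i = 0; i < endpoints.size(); i++)
        {
            weights[i] = 1.0 / (scores.get(endpoints.get(i)) + 0.001);
            total += weights[i];
        }
        double r = random.nextDouble() * total;
        for (int i = 0; i < weights.length; i++)
        {
            r -= weights[i];
            if (r <= 0)
                return endpoints.get(i);
        }
        return endpoints.get(endpoints.size() - 1); // floating-point rounding fallback
    }
}
{code}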

You can totally make it so that this ticket is irrelevant by just making the 
general case well-supported enough that there is no reason to special case 
this. This was originally filed since we had none of that, and we still don't, 
and the TCP connection being actively reset by the other side seemed like a very 
trivial case to handle.

{quote}
After reading CASSANDRA-3722 it seems we can implement required logic at the 
snitch level taking latency measurements into account. I think we can close 
this one in favor of CASSANDRA-3722 and continue work/discussion there. What do 
you think, Brandon, Peter?
{quote}

I think CASSANDRA-3722's original premise doesn't address the concerns I see in 
real life (I don't want special cases trying to communicate "X is happening"), 
but towards the end I start agreeing with the ticket more.

In any case, feel free to close if you want. If I ever get to actually 
implementing this (if at that point there is no other mechanism to remove the 
need) I'll just re-file or re-open with a patch. We don't need to track this if 
others aren't interested.

  was (Author: scode):
{quote}
This sounds like reinventing the existing failure detector to me.
{quote}

Except we don't use it that way at all (see CASSANDRA-3927). Even if we did 
though, I personally think it's totally the wrong solution to this problem 
since we have the *perfect* measurement - whether the TCP connection is up.

It's fine if we have other information that actively indicates we shouldn't 
send messages to it (whether it's the FD or the fact that we have 500 000 
messages queued to the node), but if we *know* the TCP connection is down, we 
should just not send messages to it, period. With the only caveat being that of 
course we'd have to make sure TCP connections are in fact pro-actively kept up 
under all circumstances (I'd have to look at code to figure out what issues 
there are, if any, in detail).

{quote}
The main idea of the algorithm I have mentioned is to make sure that we always 
do operations (write/read etc.) on the nodes that have the highest probability 
to be alive determined by live traffic going there instead of passively relying 
on the failure detector.
{quote}

I have an unfiled ticket to suggest making the proximity sorting probabilistic 
to avoid the binary either we get traffic or we dont (or either we get data 
or we get digest) situation. That would certainly help. As would 
least-requests-outstanding.

You can totally make it so that this ticket is irrelevant by just making the 
general case well-supported enough that there is no reason to special case 
this. This was originally filed since we had none of that, and we still don't, 
and it seemed like a very trivial case to handle for the TCP connection to be 
actively reset by the other side.

{code}
After reading CASSANDRA-3722 it seems we can implement required logic at the 
snitch level taking latency measurements into account. I think we can close 
this one in favor of CASSANDRA-3722 and continue work/discussion there. What do 
you think, Brandon, Peter?
{code}

I think CASSANDRA-3722's original premise doesn't address the concerns I see in 
real life (I don't want special cases trying to communicate X is happening), 
but towards the end I start agreeing with the 

[jira] [Issue Comment Edited] (CASSANDRA-3722) Send Hints to Dynamic Snitch when Compaction or repair is going on for a node.

2012-02-27 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217445#comment-13217445
 ] 

Peter Schuller edited comment on CASSANDRA-3722 at 2/27/12 7:38 PM:


I'm -0 on the original bit of this ticket, but +1 on more generic changes that 
cover the original use case as well if not better anyway. I think that instead 
of trying to predict exactly the behavior of some particular event like 
compaction, we should just be better at responding to what is actually 
going on:

* We have CASSANDRA-2540 which can help avoid blocking uselessly on a dropped 
or slow request even if we haven't had the opportunity to react to overall 
behavior yet (I have a partial patch that breaks read repair, I haven't had 
time to finish it).
* Taking into account the number of outstanding requests is IMO a necessity. 
There is plenty of precedent for anyone who wants that (least-used-connections 
policies in various LBs), but more importantly it would so clearly help in 
several situations, including:
** Sudden GC pause of a node
** Sudden death of a node
** Sudden page cache eviction and slowness of a node, before snitching figures 
it out
** Constantly overloaded node; even with the dynsnitch it would improve the 
situation as the number of requests affected by a dynsnitch reset is lessened
** Packet loss/hiccup/whatever across DCs

There is some potential for foot-shooting in the sense that if a node is broken 
in such a way that it responds with incorrect data, but responds faster than anyone 
else, it will tend to swallow all the traffic. But honestly, that feels like 
a minor concern to me based on what I've seen actually happen in production 
clusters. If we ever start sending non-successes back over inter-node RPC, this 
would change however.

My only major concern is potential performance impacts of keeping track of the 
number of outstanding requests, but if that *does* become a problem one can 
make it probabilistic - have N % of all requests be tracked. Less impact, but 
also less immediate response to what's happening.
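
A hypothetical sketch of that probabilistic tracking (names invented for illustration; not an existing class): only a sampled fraction of requests touch the per-endpoint counter, so the hot path pays less and the load signal is merely approximate.

{code}
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: sample a configurable fraction of requests into a per-endpoint
// outstanding-request counter, trading responsiveness for lower bookkeeping cost.
public class SampledOutstandingTracker
{
    private final double sampleRate;           // e.g. 0.10 == track ~10% of requests
    private final Random random = new Random();
    private final ConcurrentMap<String, AtomicInteger> outstanding =
        new ConcurrentHashMap<String, AtomicInteger>();

    public SampledOutstandingTracker(double sampleRate)
    {
        this.sampleRate = sampleRate;
    }

    // Returns true if this request was sampled; the caller must then call completed().
    public boolean sent(String endpoint)
    {
        if (random.nextDouble() >= sampleRate)
            return false;
        counter(endpoint).incrementAndGet();
        return true;
    }

    public void completed(String endpoint)
    {
        counter(endpoint).decrementAndGet();
    }

    // Approximate load signal used when ranking endpoints.
    public int outstandingFor(String endpoint)
    {
        return counter(endpoint).get();
    }

    private AtomicInteger counter(String endpoint)
    {
        AtomicInteger c = outstanding.get(endpoint);
        if (c == null)
        {
            AtomicInteger raced = outstanding.putIfAbsent(endpoint, c = new AtomicInteger());
            if (raced != null)
                c = raced;
        }
        return c;
    }
}
{code}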

This will also have the side-effect of mitigating sudden bursts of promotion 
into old-gen if we combine it with pro-actively dropping read-repair messages 
for nodes that are overloaded (effectively prioritizing data reads), hence 
helping for CASSANDRA-3853.

{quote}
Should we "T" (send additional requests which are not part of the normal 
operations) the requests until the other node recovers?
{quote}

In the absence of read repair, we'd have to do speculative reads as Stu has 
previously noted. With read repair turned on, this is not an issue because the 
node will still receive requests and eventually warm up. Only with read repair 
turned off do we not send requests to more than the first N of endpoints, with 
N being what is required by CL.

Semi-relatedly, I think it would be a good idea to make the proximity sorting 
probabilistic in nature so that we don't do a binary flip back and forth 
between who gets data vs. digest reads or who doesn't get reads at all. That 
might mitigate this problem, but not help fundamentally since the rate of 
warm-up would decrease with a node being slow.

I do want to make this point though: *Every single production cluster* I have 
ever been involved with so far has been such that you basically never want to 
turn read repair off. Not because of read repair itself, but because of the 
traffic it generates. Having nodes not receive traffic is extremely dangerous 
under most circumstances as it leaves nodes cold, only to suddenly explode and 
cause timeouts and other bad behavior as soon as e.g. some neighbor goes down 
and it suddenly starts taking traffic. This is an easy way to make production 
clusters fall over. If your workload is entirely in memory or otherwise not 
reliant on caching the problem is much less pronounced, but even then I would 
generally recommend that you keep it turned on if only because your nodes will 
have to be able to take the additional load *anyway* if you are to survive 
other nodes in the neighborhood going down. It just makes clusters much 
easier to reason about.

  was (Author: scode):
I'm -0 on the original bit of this ticket, but +1 on more generic changes 
that covers the original use case as good if not better anyway. I think that 
instead of trying to predict exactly the behavior of some particular event like 
compaction, we should just be better at actually responding to what is actually 
going on:

* We have CASSANDRA-2540 which can help avoid blocking uselessly on a dropped 
or slow request even if we haven't had the opportunity to react to overall 
behavior yet (I have a partial patch that breaks read repair, I haven't had 
time to finish it).
* Taking into account the number of outstanding requests is IMO a necessity. 
There is 

[jira] [Issue Comment Edited] (CASSANDRA-3797) StorageProxy static initialization not triggered until thrift requests come in

2012-02-27 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217491#comment-13217491
 ] 

Peter Schuller edited comment on CASSANDRA-3797 at 2/27/12 8:09 PM:


Looks like {{3797-forname.txt}} is the same file as the original patch. In any 
case, suppose we just go for Class.forName() to avoid introducing that annoying 
method, and assuming it makes the metrics from CASSANDRA-3671 work, can I get a 
+1?
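
For reference, the mechanism being discussed is just the class-loading side effect of {{Class.forName()}}; a trivial sketch (needs Cassandra on the classpath at runtime):

{code}
// Loading a class by name with the default initialize=true semantics forces its
// static initializers (and therefore mbean/metric registration) to run at startup
// instead of at the first thrift request. Sketch only.
public class ForceInit
{
    public static void main(String[] args) throws ClassNotFoundException
    {
        Class.forName("org.apache.cassandra.service.StorageProxy");
    }
}
{code}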

  was (Author: scode):
Looks like {{3797-forname.txt}} is the same file as the original patch. In 
any case, suppose we just go for Class.forName() to avoid introducing that 
annoying method, and assuming it makes the metrics from CASSANDRA-3671, can I 
get a +1?
  
 StorageProxy static initialization not triggered until thrift requests come in
 --

 Key: CASSANDRA-3797
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3797
 Project: Cassandra
  Issue Type: Bug
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Minor
 Fix For: 1.1.0

 Attachments: 3797-forname.txt, CASSANDRA-3797-trunk-v1.txt


 While plugging in the metrics library for CASSANDRA-3671 I realized (because 
 the metrics library was trying to add a shutdown hook on metric creation) 
 that starting cassandra and simply shutting it down, causes StorageProxy to 
 not be initialized until the drain shutdown hook.
 Effects:
 * StorageProxy mbean missing in visualvm/jconsole after initial startup 
 (seriously, I thought I was going nuts ;))
 * And in general anything that makes assumptions about running early, or at 
 least not during JVM shutdown, such as the metrics library, will be 
 problematic

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-3912) support incremental repair controlled by external agent

2012-02-15 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209117#comment-13209117
 ] 

Peter Schuller edited comment on CASSANDRA-3912 at 2/16/12 5:22 AM:


Agreed.

The good news is that the actual commands necessary ({{getprimaryrange}} and 
{{repairrange}}) are easy patches.

The bad news is that it turns out the AntiEntropyService does not support 
arbitrary ranges.

Attaching {{CASSANDRA\-3912\-v2\-001\-add\-nodetool\-commands.txt}} and 
{{CASSANDRA\-3912\-v2\-002\-fix\-antientropyservice.txt}}.

Had it not been for AES I'd want to propose we commit this to 1.1 since it 
would be additive only, but given the AES fix I don't know... I guess probably 
not?

It's a shame because I think it would be a boon to users with large nodes 
struggling with repair (despite the fact that, as you point out, each repair 
implies a flush).



  was (Author: scode):
Agreed.

The good news is that the actual commands necessary ({{getprimaryrange}} and 
{{repairrange}}) are easy patches.

The bad news is that it turns out the AntiEntropyService does not support 
arbitrary ranges.

Attaching {{CASSANDRA\-3912\-v2\-001\-add\-nodetool\-commands.txt}} and 
{{CASSANDRA-3912-v2-002-fix-antientropyservice.txt}}.

Had it not been for AES I'd want to propose we commit this to 1.1 since it 
would be additive only, but given the AES fix I don't know... I guess probably 
not?

It's a shame because I think it would be a boon to users with large nodes 
struggling with repair (despite the fact that, as you point out, each repair 
implies a flush).


  
 support incremental repair controlled by external agent
 ---

 Key: CASSANDRA-3912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3912
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
 Attachments: CASSANDRA-3912-trunk-v1.txt, 
 CASSANDRA-3912-v2-001-add-nodetool-commands.txt, 
 CASSANDRA-3912-v2-002-fix-antientropyservice.txt


 As a poor man's pre-cursor to CASSANDRA-2699, exposing the ability to repair 
 small parts of a range is extremely useful because it allows (with external 
 scripting logic) to slowly repair a node's content over time. Other than 
 avoiding the bulkiness of complete repairs, it means that you can safely do 
 repairs even if you absolutely cannot afford e.g. disk space spikes (see 
 CASSANDRA-2699 for what the issues are).
 Attaching a patch that exposes a "repairincremental" command to nodetool, 
 where you specify a step and the number of total steps. Incrementally 
 performing a repair in 100 steps, for example, would be done by:
 {code}
 nodetool repairincremental 0 100
 nodetool repairincremental 1 100
 ...
 nodetool repairincremental 99 100
 {code}
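 For illustration, a sketch of the arithmetic behind "step i out of n" under RandomPartitioner (token space 0..2^127); this only handles a non-wrapping range and is not the attached patch, which operates on the node's primary range:
{code}
import java.math.BigInteger;

// Sketch: split an arbitrary (left, right] BigInteger token range into totalSteps
// equal slices and return the bounds of slice `step`.
public class IncrementalSteps
{
    public static BigInteger[] stepBounds(BigInteger left, BigInteger right, int step, int totalSteps)
    {
        BigInteger width = right.subtract(left);
        BigInteger lo = left.add(width.multiply(BigInteger.valueOf(step))
                                      .divide(BigInteger.valueOf(totalSteps)));
        BigInteger hi = left.add(width.multiply(BigInteger.valueOf(step + 1))
                                      .divide(BigInteger.valueOf(totalSteps)));
        return new BigInteger[] { lo, hi };
    }

    public static void main(String[] args)
    {
        BigInteger zero = BigInteger.ZERO;
        BigInteger max = BigInteger.valueOf(2).pow(127); // RandomPartitioner maximum token
        // Steps 0..99 of 100 together cover the whole (0, 2^127] range.
        System.out.println(java.util.Arrays.toString(stepBounds(zero, max, 0, 100)));
        System.out.println(java.util.Arrays.toString(stepBounds(zero, max, 99, 100)));
    }
}
{code}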
 An external script can be used to keep track of what has been repaired and 
 when. This should (1) allow incremental repair to happen now/soon, and 
 (2) allow experimentation and evaluation for an implementation of 
 CASSANDRA-2699 which I still think is a good idea. This patch does nothing to 
 help the average deployment, but at least makes incremental repair possible 
 given sufficient effort spent on external scripting.
 The big no-no about the patch is that it is entirely specific to 
 RandomPartitioner and BigIntegerToken. If someone can suggest a way to 
 implement this command generically using the Range/Token abstractions, I'd be 
 happy to hear suggestions.
 An alternative would be to provide a nodetool command that allows you to 
 simply specify the specific token ranges on the command line. It makes using 
 it a bit more difficult, but would mean that it works for any partitioner and 
 token type.
 Unless someone can suggest a better way to do this, I think I'll provide a 
 patch that does this. I'm still leaning towards supporting the simple "step N 
 out of M" form though.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-3892) improve TokenMetadata abstraction, naming - audit current uses

2012-02-12 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206391#comment-13206391
 ] 

Peter Schuller edited comment on CASSANDRA-3892 at 2/12/12 10:39 AM:
-

Attaching {{CASSANDRA\-3892\-draft.txt}} which is a draft/work in progress. 
Mainly I'm asking for a "stop right there" if these types of changes seem like 
something that will never be accepted (they're semi-significant even though 
most of it constitutes non-functional changes). I'm not asking for nor suggesting 
careful review, as it's better that I submit a more finished patch before 
that happens. Any requests for patch splitting strategies or overall "don't do 
this"/"don't do that" would be helpful though, if someone has them.

Other than what's there in the current version, I want to move pending range 
calculation into token meta data (it will need to be given a strategy), and 
things like {{StorageService.handleStateNormal()}} being responsible for 
keeping the internal state of tokenmetadata (removing from moving) up-to-date I 
want gone.

I've begun making naming and concepts a bit more consistent; the token meta 
data is now more consistently (but not fully yet) talking about endpoints as 
the main abstraction rather than mixing endpoints and tokens, and we have 
joining endpoints instead of bootstrap tokens.

Moving endpoints is now also a map with O(n) access, and kept up to date in 
{{removeEndpoint()}} (may be other places that need fixing).

I adjusted comments for {{calculatePendingRanges}} to be clearer; for example 
the old comments made it sound like we were sending writes to places for good 
measure because we're in doubt, rather than because it is strictly necessary.

Unless I hear objections I'll likely continue this on Sunday and submit another 
patch.


  was (Author: scode):
Attaching {{CASSANDRA-3892-draft.txt}} which is a draft/work in progress. 
Mainly I'm asking for a stop right there if these types of changes seem like 
something that will never be accepted (they're semi-significant even though 
most of it constitute non-functional changes). I'm not asking nor suggesting 
for careful review, as it's better that I submit a more finished patch before 
that happens. Any requests for patch splitting strategies or overall don't do 
this/don't do that would be helpful though, if someone has them.

Other than what's there in the current version, I want to move pending range 
calculation into token meta data (it will need to be given a strategy), and 
things like {{StorageService.handleStateNormal()}} being responsible for 
keeping the internal state of tokenmetadata (removing from moving) up-to-date I 
want gone.

I've begun making naming and concepts a bit more consistent; the token meta 
data is now more consistently (but not fully yet) talking about endpoints as 
the main abstraction rather than mixing endpoints and tokens, and we have 
joining endpoints instead of bootstrap tokens.

Moving endpoints is now also a map with O(n) access, and kept up to date in 
removeEndpoint() (may be other places that need fixing).

I adjusted comments for {{calculatePendingRanges}} to be clear:er; for example 
the old comments made it sound like we were sending writes to places for good 
measure because we're in doubt, rather than because it is strictly necessary.

Unless I hear objections I'll likely continue this on Sunday and submit another 
patch.

  
 improve TokenMetadata abstraction, naming - audit current uses
 --

 Key: CASSANDRA-3892
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3892
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
 Attachments: CASSANDRA-3892-draft.txt


 CASSANDRA-3417 has some background. I want to make the distinction more clear 
 between looking at the ring from different perspectives (reads, writes, 
 others) and adjust naming to be more clear on this. I also want to go through 
 each use case and try to spot any subtle pre-existing bugs that I almost 
 introduced in CASSANDRA-3417, had not Jonathan caught me.
 I will submit a patch soonish.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-3892) improve TokenMetadata abstraction, naming - audit current uses

2012-02-12 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206391#comment-13206391
 ] 

Peter Schuller edited comment on CASSANDRA-3892 at 2/12/12 10:39 AM:
-

Attaching {{CASSANDRA\-3892\-draft.txt}} which is a draft/work in progress. 
Mainly I'm asking for a "stop right there" if these types of changes seem like 
something that will never be accepted (they're semi-significant even though 
most of it constitutes non-functional changes). I'm not asking for nor suggesting 
careful review, as it's better that I submit a more finished patch before 
that happens. Any requests for patch splitting strategies or overall "don't do 
this"/"don't do that" would be helpful though, if someone has them.

Other than what's there in the current version, I want to move pending range 
calculation into token meta data (it will need to be given a strategy), and 
things like {{StorageService.handleStateNormal()}} being responsible for 
keeping the internal state of tokenmetadata (removing from moving) up-to-date I 
want gone.

I've begun making naming and concepts a bit more consistent; the token meta 
data is now more consistently (but not fully yet) talking about endpoints as 
the main abstraction rather than mixing endpoints and tokens, and we have 
joining endpoints instead of bootstrap tokens.

Moving endpoints is now also a map with O(n) access, and kept up to date in 
{{removeEndpoint()}} (may be other places that need fixing).

I adjusted comments for {{calculatePendingRanges}} to be clearer; for example 
the old comments made it sound like we were sending writes to places for good 
measure because we're in doubt, rather than because it is strictly necessary.

Unless I hear objections I'll likely continue this on Sunday and submit another 
patch.


  was (Author: scode):
Attaching {{CASSANDRA\-3892\-draft.txt}} which is a draft/work in progress. 
Mainly I'm asking for a stop right there if these types of changes seem like 
something that will never be accepted (they're semi-significant even though 
most of it constitute non-functional changes). I'm not asking nor suggesting 
for careful review, as it's better that I submit a more finished patch before 
that happens. Any requests for patch splitting strategies or overall don't do 
this/don't do that would be helpful though, if someone has them.

Other than what's there in the current version, I want to move pending range 
calculation into token meta data (it will need to be given a strategy), and 
things like {{StorageService.handleStateNormal()}} being responsible for 
keeping the internal state of tokenmetadata (removing from moving) up-to-date I 
want gone.

I've begun making naming and concepts a bit more consistent; the token meta 
data is now more consistently (but not fully yet) talking about endpoints as 
the main abstraction rather than mixing endpoints and tokens, and we have 
joining endpoints instead of bootstrap tokens.

Moving endpoints is now also a map with O(n) access, and kept up to date in 
{{removeEndpoint()}} (may be other places that need fixing).

I adjusted comments for {{calculatePendingRanges}} to be clear:er; for example 
the old comments made it sound like we were sending writes to places for good 
measure because we're in doubt, rather than because it is strictly necessary.

Unless I hear objections I'll likely continue this on Sunday and submit another 
patch.

  
 improve TokenMetadata abstraction, naming - audit current uses
 --

 Key: CASSANDRA-3892
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3892
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
 Attachments: CASSANDRA-3892-draft.txt


 CASSANDRA-3417 has some background. I want to make the distinction more clear 
 between looking at the ring from different perspectives (reads, writes, 
 others) and adjust naming to be more clear on this. I also want to go through 
 each use case and try to spot any subtle pre-existing bugs that I almost 
 introduced in CASSANDRA-3417, had not Jonathan caught me.
 I will submit a patch soonish.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-3892) improve TokenMetadata abstraction, naming - audit current uses

2012-02-12 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206596#comment-13206596
 ] 

Peter Schuller edited comment on CASSANDRA-3892 at 2/13/12 1:00 AM:


Attaching {{CASSANDRA\-3892\-draft\-v2.txt}} with some more changes. I still 
consider it a draft because I have not yet done any testing, but it's more ripe 
for review now.

A few of the sub-tasks I created are IMO serious as well.

  was (Author: scode):
Attaching {{CASSANDRA-3892-draft-v2.txt}} with some more changes. I still 
consider it a draft because I have not yet done any testing, but it's more ripe 
for review now.

A few of the sub-tasks I created are IMO serious as well.
  
 improve TokenMetadata abstraction, naming - audit current uses
 --

 Key: CASSANDRA-3892
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3892
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
 Attachments: CASSANDRA-3892-draft-v2.txt, CASSANDRA-3892-draft.txt


 CASSANDRA-3417 has some background. I want to make the distinction more clear 
 between looking at the ring from different perspectives (reads, writes, 
 others) and adjust naming to be more clear on this. I also want to go through 
 each use case and try to spot any subtle pre-existing bugs that I almost 
 introduced in CASSANDRA-3417, had not Jonathan caught me.
 I will submit a patch soonish.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-3897) StorageService.onAlive() only schedules hints for joined endpoints

2012-02-12 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206600#comment-13206600
 ] 

Peter Schuller edited comment on CASSANDRA-3897 at 2/13/12 1:08 AM:


Why would this be the case? They are supposed to receive writes; I see no 
reason why hints should not be delivered. Hints are just a way to deliver 
writes more quickly in cases where nodes are down (i.e., more quickly when they 
come back up) and to avoid the need for AES. I don't see why a node actively 
bootstrapping into the ring should be discriminated against in terms of seeing 
as reliable a delivery of writes as other nodes.

In other words, I don't buy your first sentence unless you explain why. I don't 
accept it axiomatically :)

Obviously sending hints requires that hints are *there* first too, but the same 
argument applies. If a node is supposed to see certain writes and it's 
considered down, hint it.

Statistically I can see the argument that if a node is bootstrapping and down, 
it might be practically more likely that the node is just going to be down for 
a longer period, and/or that the node will completely re-bootstrap anyway 
(since normally a node is down because it's being restarted, which would imply 
re-bootstrap if the node is bootstrapping).


  was (Author: scode):
Why would this be the case? They are supposed to receive writes; I see no 
reason why hints should not be delivered. Hints is just a way to more quickly 
delivery writes in cases where nodes are down (i.e., more quickly when they go 
up) and avoid AES need. I don't see why a node actively bootstrapping into the 
ring should be discriminated against, in terms of seeing as reliable delivery 
of writes as other nodes.

In other words, I don't by your first sentence unless you explain why. I don't 
accept it axiomatically :)

Obviously sending hints requires that hints are *there* first too, but the same 
argument applies. If a node is supposed to see a certain writes and it's 
considered down - hint it.

Statistically I can see the argument that if a node is bootstrapping and down, 
it might be practically more likely that the node is just going to be down for 
a longer period, and/or that the node will completely re-bootstrap anyway 
(since normally a node is down because it's being restarted, which would imply 
re-bootstrap if the node is bootstrapping).

  
 StorageService.onAlive() only schedules hints for joined endpoints
 --

 Key: CASSANDRA-3897
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3897
 Project: Cassandra
  Issue Type: Sub-task
  Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Minor

 It seems incorrect to not do hint delivery for nodes that are bootstrapping, 
 as that would cause sudden spikes in read repair need or inconsistent reads 
 when a node joins the ring. Particularly if the user is expecting to rely on 
 the new hinted handoff code making AES much less needed. It would be a POLA 
 violation for bootstrapping nodes to be an exception to that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-3895) Gossiper.doStatusCheck() uses isMember() suspiciously

2012-02-12 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206627#comment-13206627
 ] 

Peter Schuller edited comment on CASSANDRA-3895 at 2/13/12 2:29 AM:


{quote}
If fat clients disappear, no one really cares because they were never ring 
members.
{quote}

Ok. Well, they care in the joining (bootstrapping) node case since they are 
taking writes. But all nodes will naturally drop them by themselves without 
needing others to propagate. As long as we consider aVeryLongTime (3 days) long 
enough that it's safe to assume that this only triggers in legitimate cases, 
we're fine (if not, we're not as soon as CASSANDRA-3892 is fixed or if I am 
wrong and it's not a bug to begin with).

I'll resolve this as INVALID then, though I'm skeptical about the 3 day very 
long time.


  was (Author: scode):
{quote}
If fat clients disappear, no one really cares because they were never ring 
members.
{quote}

Ok. Well, they care in the joining (bootstrapping) node case since they are 
taking writes. But all nodes will naturally drop them by themselves without 
needing others to propagate. As long as we consider aVeryLongTime (3 days) long 
enough that it's safe to assume that this only triggers in legitimate cases, 
we're fine (if not, we're not as soon as CASSANDRA-3892 is fixed or if I am 
wrong and it's not a bug to begin with).

I'll resolve this as WONTFIX then, though I'm skeptical about the 3 day very 
long time.

  
 Gossiper.doStatusCheck() uses isMember() suspiciously
 -

 Key: CASSANDRA-3895
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3895
 Project: Cassandra
  Issue Type: Sub-task
  Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Minor

 There is code for fat client removal and old endpoint (non-fat) removal 
 which uses {{TokenMetadata.isMember()}} which only considers nodes that are 
 joined (takes reads) in the cluster.
 aVeryLongTime is set to 3 days.
 I could very well be wrong, but the fat client identification code, the way I 
 interpret it, is using isMember() to check basically whether a node is part 
 of the cluster (in the most vague/broad sense) in order to differentiate a 
 real node (part of the cluster) from just a fat client. But a node that is 
 bootstrapping is not a fat client, nor will it be a member according to 
 isMember().
 I'm also a bit scared of, even in the case of there not being a fat client 
 identification, simply forgetting an endpoint. It seems that an operator 
 request should be relied upon to actively forget an endpoint (i.e., forced 
 remove token).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-3735) Fix "Unable to create hard link" SSTableReaderTest error messages

2012-02-04 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200642#comment-13200642
 ] 

Peter Schuller edited comment on CASSANDRA-3735 at 2/5/12 2:54 AM:
---

Attaching new version of 0002* that works (but still with the left-overs 
already mentioned by jbellis/sylvain) post CASSANDRA-2794.

  was (Author: scode):
Attaching new version of 0002* that works post CASSANDRA-2794.
  
 Fix "Unable to create hard link" SSTableReaderTest error messages
 -

 Key: CASSANDRA-3735
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3735
 Project: Cassandra
  Issue Type: Bug
Reporter: Jonathan Ellis
Assignee: Jonathan Ellis
Priority: Minor
 Attachments: 0001-fix-generation-update-in-loadNewSSTables.patch, 
 0002-remove-incremental-backups-before-reloading-sstables-v2.patch, 
 0002-remove-incremental-backups-before-reloading-sstables.patch


 Sample failure (on Windows):
 {noformat}
 [junit] java.io.IOException: Exception while executing the command: cmd 
 /c mklink /H 
 C:\Users\Jonathan\projects\cassandra\git\build\test\cassandra\data\Keyspace1\backups\Standard1-hc-1-Index.db
  
 c:\Users\Jonathan\projects\cassandra\git\build\test\cassandra\data\Keyspace1\Standard1-hc-1-Index.db,command
  error Code: 1, command output: Cannot create a file when that file already 
 exists.
 [junit]
 [junit] at org.apache.cassandra.utils.CLibrary.exec(CLibrary.java:213)
 [junit] at 
 org.apache.cassandra.utils.CLibrary.createHardLinkWithExec(CLibrary.java:188)
 [junit] at 
 org.apache.cassandra.utils.CLibrary.createHardLink(CLibrary.java:151)
 [junit] at 
 org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:833)
 [junit] at 
 org.apache.cassandra.db.DataTracker$1.runMayThrow(DataTracker.java:161)
 [junit] at 
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 [junit] at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 [junit] at 
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 [junit] at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 [junit] at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
 [junit] at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
 [junit] at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 [junit] at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 [junit] at java.lang.Thread.run(Thread.java:662)
 [junit] ERROR 17:10:17,111 Fatal exception in thread 
 Thread[NonPeriodicTasks:1,5,main]
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-3831) scaling to large clusters in GossipStage impossible due to calculatePendingRanges

2012-02-03 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200251#comment-13200251
 ] 

Peter Schuller edited comment on CASSANDRA-3831 at 2/4/12 1:25 AM:
---

I am attaching {{CASSANDRA\-3831\-memoization\-not\-for\-inclusion.txt}} as an 
FYI and in case it helps others. It's against 0.8, and implements memoization 
of calculate pending ranges.

The correct/clean fix is probably to change behavior so that it doesn't get 
called unnecessarily to begin with (and to make sure the computational 
complexity is reasonable when it does get called). This patch was made 
specifically to address the production issue we are having in a minimally 
dangerous fashion, and is not to be taken as a suggested fix.
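
The memoization idea in sketch form (hypothetical types; the attached 0.8 patch is structured differently): cache the result per keyspace keyed by a fingerprint of the inputs (bootstrapping/leaving/moving endpoints plus ring version), and only redo the expensive calculation when that fingerprint changes.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: skip recomputing pending ranges on gossip-driven calls whose inputs
// are unchanged, instead of paying the cubic-or-worse cost every time.
public class PendingRangeCache<R>
{
    public interface Calculator<R>
    {
        R calculate(String keyspace); // the expensive calculatePendingRanges work
    }

    private static final class Entry<R>
    {
        final long fingerprint;
        final R ranges;
        Entry(long fingerprint, R ranges) { this.fingerprint = fingerprint; this.ranges = ranges; }
    }

    private final Map<String, Entry<R>> cache = new ConcurrentHashMap<String, Entry<R>>();

    public R get(String keyspace, long ringFingerprint, Calculator<R> calculator)
    {
        Entry<R> cached = cache.get(keyspace);
        if (cached != null && cached.fingerprint == ringFingerprint)
            return cached.ranges; // inputs unchanged: reuse the previous result
        R fresh = calculator.calculate(keyspace);
        cache.put(keyspace, new Entry<R>(ringFingerprint, fresh));
        return fresh;
    }
}
{code}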

  was (Author: scode):
I am attaching {{CASSANDRA\-3831\-memoization\-not\-for\-inclusion.txt}} as 
an FYI and in case it helps others. It's against 0.8, and implements 
memoization of calculate pending ranges.

The correct/clean fix is probably to change behavior so that it doesn't get 
called unnecessarily to begin with. This patch was made specifically to address 
the production issue we are having in a minimally dangerous fashion, and is not 
to be taken as a suggested fix.
  
 scaling to large clusters in GossipStage impossible due to 
 calculatePendingRanges 
 --

 Key: CASSANDRA-3831
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3831
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Critical
 Attachments: CASSANDRA-3831-memoization-not-for-inclusion.txt


 (most observations below are from 0.8, but I just now tested on
 trunk and I can trigger this problem *just* by bootstrapping a ~180
 node cluster concurrently, presumably due to the number of nodes that
 are simultaneously in bootstrap state)
 It turns out that:
 * (1) calculatePendingRanges is not just expensive, it's computationally 
 complex - cubic or worse
 * (2) it gets called *NOT* just once per node being bootstrapped/leaving etc, 
 but is called repeatedly *while* nodes are in these states
 As a result, clusters start exploding when you start reaching 100-300
 nodes. The GossipStage will get backed up because a single
 calculatePendingRanges takes seconds, and depending on what the
 average heartbeat interval is in relation to this, this can lead to
 *massive* cluster-wide flapping.
 This all started because we hit this in production; several nodes
 would start flapping several other nodes as down, with many nodes
 seeing the entire cluster, or a large portion of it, as down. Logging
 in to some of these nodes you would see that they would be constantly
 flapping up/down for minutes at a time until one became lucky and it
 stabilized.
 In the end we had to perform an emergency full-cluster restart with
 gossip patched to force-forget certain nodes in bootstrapping state.
 I can't go into all details here from the post-mortem (just the
 write-up would take a day), but in short:
 * We graphed the number of hosts in the cluster that had more than 5
   Down (in a cluster that should have 0 down) on a minutely timeline.
 * We also graphed the number of hosts in the cluster that had GossipStage 
 backed up.
 * The two graphs correlated *extremely* well
 * jstack sampling showed it being CPU bound doing mostly sorting under 
 calculatePendingRanges
 * We were never able to exactly reproduce it with normal RING_DELAY and 
 gossip intervals, even on a 184 node cluster (the production cluster is 
 around 180).
 * Dropping RING_DELAY and in particular dropping gossip interval to 10 ms 
 instead of 1000 ms, we were able to observe all of the behavior we saw in 
 production.
 So our steps to reproduce are:
 * Launch 184 node cluster w/ gossip interval at 10ms and RING_DELAY at 1 
 second.
 * Do something like: {{while [ 1 ] ; do date ; echo decom ; nodetool 
 decommission ; date ; echo done leaving decommed for a while ; sleep 3 ; date 
 ; echo done restarting; sudo rm -rf /data/disk1/commitlog/* ; sudo rm -rf 
 /data/diskarray/tables/* ; sudo monit restart cassandra ;date ; echo 
 restarted waiting for a while ; sleep 40; done}} (or just do a manual 
 decom/bootstrap once, it triggers every time)
 * Watch all nodes flap massively and not recover at all, or maybe after a 
 *long* time.
 I observed the flapping using a python script that every 5 seconds
 (randomly spread out) asked for unreachable nodes from *all* nodes in
 the cluster, and printed any nodes and their counts when they had
 unreachables > 5. The cluster can be observed instantly going into
 massive flapping when leaving/bootstrap is initiated. Script 

[jira] [Issue Comment Edited] (CASSANDRA-3820) Columns missing after upgrade from 0.8.5 to 1.0.7.

2012-01-31 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197496#comment-13197496
 ] 

Peter Schuller edited comment on CASSANDRA-3820 at 2/1/12 1:53 AM:
---

Check whether the .bf files contain all zeroes above roughly 235 MB or so. If 
you have lots of rows, your BF will be that large.

We encountered a bug internally whereby all bloom filters larger than 2^31 bits 
were large on disk, but everything after the first 2^31 bits was all zeroes.

Unfortunately I don't know whether this is specific to patches made to our 
branch, and I have been so busy I haven't been able to follow up to figure out 
whether it affects the upstream version.

But - just tail -c 1000 | hexdump. If you only have zeroes, this is the bug. 
Make sure to tail on a large .bf file (take the largest, easiest).
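
The same check as a small Java sketch, equivalent to the tail/hexdump approach (pass the path to a large .bf file):

{code}
import java.io.IOException;
import java.io.RandomAccessFile;

// Read the last 1000 bytes of a .bf file and report whether any byte is non-zero.
public class BloomTailCheck
{
    public static void main(String[] args) throws IOException
    {
        RandomAccessFile bf = new RandomAccessFile(args[0], "r");
        try
        {
            int tail = (int) Math.min(1000, bf.length());
            byte[] buf = new byte[tail];
            bf.seek(bf.length() - tail);
            bf.readFully(buf);
            boolean allZero = true;
            for (byte b : buf)
                if (b != 0) { allZero = false; break; }
            System.out.println(allZero ? "tail is all zeroes (suspect the bug)" : "tail has data");
        }
        finally
        {
            bf.close();
        }
    }
}
{code}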



  was (Author: scode):
Check whether the .bf files contain all zeroes above roughly 235 mb or so. 
If you have lots of rows, your BF will be that large.

We encountered a bug internally whereby all bloom filters larger than 2^31 bits 
were large on disk, but everything afger the first 2^31 bits were all zeroes.

Unfortunately I don't know whether this is specific to patches made to our 
branch, and I have been so busy I haven't been able to follow up to figure out 
whether it affects the upstream version.

But - just tail -c 1000 | hexdump. If you only have zeroes, this is the bug. 
Make sure to tail on a large .bf file (take the largest, easiest).


  
 Columns missing after upgrade from 0.8.5 to 1.0.7.
 --

 Key: CASSANDRA-3820
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3820
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 1.0.7
Reporter: Jason Harvey

 After an upgrade, one of our CFs had a lot of rows with missing columns. I've 
 been able to reproduce in test conditions. Working on getting the tables to 
 DataStax(data is private).
 0.8 results:
 {code}
 [default@reddit] get CommentVote[36353467625f6837336f32];
 = (column=date, value=313332333932323930392e3531, timestamp=1323922909506508)
 = (column=ip, value=REDACTED, timestamp=1327048432717348, ttl=2592000)
 = (column=name, value=31, timestamp=1327048433000740)
 = (column=REDACTED, value=30, timestamp=1323922909506432)
 = (column=thing1_id, value=REDACTED, timestamp=1323922909506475)
 = (column=thing2_id, value=REDACTED, timestamp=1323922909506486)
 = (column=REDACTED, value=31, timestamp=1323922909506518)
 = (column=REDACTED, value=30, timestamp=1323922909506497)
 {code}
 1.0 results:
 {code}
 [default@reddit] get CommentVote[36353467625f6837336f32];
 = (column=ip, value=REDACTED, timestamp=1327048432717348, ttl=2592000)
 = (column=name, value=31, timestamp=1327048433000740)
 {code}
 A few notes:
 * The rows with missing data were fully restored after scrubbing the sstables.
 * The row which I reproduced on happened to be split across multiple sstables.
 * When I copied the first sstable I found the row on, I was able to 'list' 
 rows from the sstable, but any and all 'get' calls failed.
 * These SStables were natively created on 0.8.5; they did not come from any 
 previous upgrade.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-3670) provide red flags JMX instrumentation

2012-01-27 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195216#comment-13195216
 ] 

Peter Schuller edited comment on CASSANDRA-3670 at 1/27/12 11:13 PM:
-

CodaHale Metrics being evaluated in CASSANDRA-3671. If there's a +1 there, will 
go for same here.

  was (Author: scode):
CodaHale Metrics being evaluated in CASSANDRA-3671. If there's a +1 here, 
will go for same here.
  
 provide red flags JMX instrumentation
 ---

 Key: CASSANDRA-3670
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3670
 Project: Cassandra
  Issue Type: Improvement
Reporter: Peter Schuller
Assignee: Peter Schuller
Priority: Minor

 As discussed in CASSANDRA-3641, it would be nice to expose through JMX 
 certain information which is almost without exception indicative of something 
 being wrong with the node or cluster.
 In the CASSANDRA-3641 case, it was the detection of corrupt counter shards. 
 Other examples include:
 * Number of times the selection of files to compact was adjusted due to disk 
 space heuristics
 * Number of times compaction has failed
 * Any I/O error reading from or writing to disk (the work here is collecting, 
 not exposing, so maybe not in an initial version)
 * Any data skipped due to checksum mismatches (when checksumming is being 
 used); e.g., number of skips.
 * Any arbitrary exception at least in certain code paths (compaction, scrub, 
 cleanup for starters)
 Probably other things.
 The motivation is that if we have clear and obvious indications that 
 something truly is wrong, it seems suboptimal to just leave that information 
 in the log somewhere, for someone to discover later when something else broke 
 as a result and a human investigates. You might argue that one should use 
 non-trivial log analysis to detect these things, but I highly doubt a lot of 
 people do this and it seems very wasteful to require that in comparison to 
 just providing the MBean.
 It is important to note that the *lack* of a certain problem being advertised 
 in this MBean is not supposed to be indicative of a *lack* of a problem. 
 Rather, the point is that to the extent we can easily do so, it is nice to 
 have a clear method of communicating to monitoring systems where there *is* a 
 clear indication of something being wrong.
 The main part of this ticket is not to cover everything under the sun, but 
 rather to reach agreement on adding an MBean where these types of indicators 
 can be collected. Individual counters can then be added over time as one 
 thinks of them.
 I propose:
 * Create an org.apache.cassandra.db.RedFlags MBean
 * Populate with a few things to begin with.
 I'll submit the patch if there is agreement.
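 A hypothetical sketch of what such an MBean could look like (nothing here is committed; the counters and names are only examples of the kind of "red flag" indicators listed above):
{code}
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.ObjectName;

// Sketch only: a few monotonically increasing counters exposed under a single
// well-known name so monitoring systems have one obvious place to alert on.
public class RedFlags implements RedFlagsMBean
{
    private final AtomicLong compactionFailures = new AtomicLong();
    private final AtomicLong checksumMismatchSkips = new AtomicLong();

    public void registerMBean() throws Exception
    {
        ManagementFactory.getPlatformMBeanServer()
            .registerMBean(this, new ObjectName("org.apache.cassandra.db:type=RedFlags"));
    }

    public void recordCompactionFailure() { compactionFailures.incrementAndGet(); }
    public void recordChecksumMismatchSkip() { checksumMismatchSkips.incrementAndGet(); }

    public long getCompactionFailures() { return compactionFailures.get(); }
    public long getChecksumMismatchSkips() { return checksumMismatchSkips.get(); }
}

interface RedFlagsMBean
{
    long getCompactionFailures();
    long getChecksumMismatchSkips();
}
{code}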

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-3070) counter repair

2011-12-09 Thread Peter Schuller (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13166617#comment-13166617
 ] 

Peter Schuller edited comment on CASSANDRA-3070 at 12/9/11 10:19 PM:
-

This may be relevant, quoting myself from IRC:

{code}
21:20:01 <scode> pcmanus: Hey, are you there?
21:20:21 <scode> pcmanus: I am investigating something which might be https://issues.apache.org/jira/browse/CASSANDRA-3070
21:20:37 <scode> pcmanus: And I could use the help of someone with his brain all over counters, and Stu isn't here atm. :)
21:21:16 <scode> pcmanus: https://gist.github.com/8202cb46c8bd00c8391b
21:21:37 <scode> pcmanus: I am investigating why with CL.ALL and CL.QUORUM, I get seemingly random/varying results when I read a counter.
21:21:53 <scode> pcmanus: I have the offending sstables on a three-node test setup and am inserting debug printouts in the code to trace the reconciliation.
21:21:57 <scode> pcmanus: The gist above shows what's happening.
21:22:11 <scode> pcmanus: The latter is the wrong one, and the former is the correct one.
21:22:28 <scode> pcmanus: The interesting bit is that I see shards with the same node_id *AND* clock, but *DIFFERENT* counts.
21:22:53 <scode> pcmanus: My understanding of counters is that there should never (globally across an entire cluster in all sstables) exist two shards for the same node_id+clock but with different counts.
21:22:57 <scode> pcmanus: Is my understanding correct there?
21:25:10 <scode> pcmanus: There is one node out of the three that has the offending shard (with a count of 2 instead of 1). Like with 3070, we observed this after having expanded a cluster (though I'm not sure how that would cause it, and we don't know if there existed a problem before the expansion).
 {code}


  was (Author: scode):
This may be relevant, quoting myself from IRC:

{quote}
21:20:01  scode pcmanus: Hey, are you there?  

21:20:21  scode pcmanus: I am 
investigating something which might be 
https://issues.apache.org/jira/browse/CASSANDRA-3070
 21:20:37  scode pcmanus: And 
I could use the help of someone with his brain all over counters, and Stu isn't 
here atm. :)
 21:21:16  scode pcmanus: 
https://gist.github.com/8202cb46c8bd00c8391b

 21:21:37  scode pcmanus: I am investigating why with CL.ALL and 
CL.QUORUM, I get seemingly random/varying results when I read a counter.
  21:21:53  scode 
pcmanus: I have the offending sstables on a three-node test setup and am 
inserting debug printouts in the code to trace the reconiliation.   
 21:21:57  scode pcmanus: The gist above shows 
what's happening.   

21:22:11  scode pcmanus: The latter is the wrong one, and the former is the 
correct one.
  21:22:28  scode pcmanus: The 
interesting bit is that I see shards with the same node_id *AND* clock, but 
*DIFFERENT*