Re: Hinted handoff throttled even after "nodetool sethintedhandoffthrottlekb 0"

2017-10-27 Thread Andrew Bialecki
Bit more information. Using jmxterm and inspecting the state of a node when
it's "slow" playing hints, I can see the following from the node that has
hints to play:

$>get MaxHintsInProgress
#mbean = org.apache.cassandra.db:type=StorageProxy:
MaxHintsInProgress = 2048;

$>get HintsInProgress
#mbean = org.apache.cassandra.db:type=StorageProxy:
HintsInProgress = 0;

$>get TotalHints
#mbean = org.apache.cassandra.db:type=StorageProxy:
TotalHints = 129687;

Is there some throttling that would cause hints not to be played at all if,
for instance, the cluster is under enough load, or something related to a
timeout setting?
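
For reference, the hint-related settings we're aware of, in case we've missed a
knob (values shown are the defaults as far as I can tell, and the comments are
my reading of cassandra.yaml, so corrections welcome):

# cassandra.yaml
hinted_handoff_throttle_in_kb: 1024  # 0 is supposed to disable throttling; the
                                     # yaml comment also says the rate is reduced
                                     # proportionally to the number of nodes in the
                                     # cluster, which on 96 nodes would put the
                                     # default in the ~10 kB/s range we're seeing
max_hints_delivery_threads: 2        # hint dispatch threads per node
hints_flush_period_in_ms: 10000
max_hint_window_in_ms: 10800000      # 3 hours

# at runtime
nodetool sethintedhandoffthrottlekb 0
nodetool statushandoff
nodetool pausehandoff / nodetool resumehandoff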

On Fri, Oct 27, 2017 at 1:49 AM, Andrew Bialecki <
andrew.biale...@klaviyo.com> wrote:

> We have a 96 node cluster running 3.11 with 256 vnodes each. We're running
> a rolling restart. As we restart nodes, we notice that each node takes a
> while before all other nodes are marked as up, and this corresponds to nodes
> that haven't finished playing hints.
>
> We looked at the hinted handoff throttling, noticed it was still the
> default of 1024, so we tried to turn it off by setting it to zero. Reading
> the source, it looks like that rate limiting won't take effect until the
> current set of hints has finished. So we made that change cluster-wide and
> then restarted the next node. However, we still saw the same issue.
>
> Looking at iftop, network throughput is very low (~10 kB/s), and therefore
> the few hundred thousand hints that accumulate while the node is restarting
> end up taking several minutes to get sent.
>
> Any other knobs we should be tuning to increase hinted handoff throughput?
> Or other reasons why hinted handoff runs so slowly?
>
> --
> Andrew Bialecki
>



-- 
Andrew Bialecki



Hinted handoff throttled even after "nodetool sethintedhandoffthrottlekb 0"

2017-10-26 Thread Andrew Bialecki
We have a 96 node cluster running 3.11 with 256 vnodes each. We're running
a rolling restart. As we restart nodes, we notice that each node takes a
while before all other nodes are marked as up, and this corresponds to nodes
that haven't finished playing hints.

We looked at the hinted handoff throttling, noticed it was still the
default of 1024, so we tried to turn it off by setting it to zero. Reading
the source, it looks like that rate limiting won't take effect until the
current set of hints has finished. So we made that change cluster-wide and
then restarted the next node. However, we still saw the same issue.

Looking at iftop, network throughput is very low (~10 kB/s), and therefore
the few hundred thousand hints that accumulate while the node is restarting
end up taking several minutes to get sent.

Any other knobs we should be tuning to increase hinted handoff throughput?
Or other reasons why hinted handoff runs so slowly?

-- 
Andrew Bialecki


Re: cassandra python driver routing requests to one node?

2016-11-14 Thread Andrew Bialecki
Is the node selection based on key deterministic across multiple clients?
If it is, that sounds plausible. For this particular workload it's
definitely possible to have a hot key / spot, but it was surprising it
wasn't three nodes that got hot, just one.
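
For anyone following along, this is my understanding of what the "default"
policy is equivalent to (contact points, keyspace, and the table in the
prepared statement below are placeholders, not our real config). If the
routing really is a pure function of the routing key, I'd expect every client
to favor the same replicas for the same hot key, which is what prompted the
question:

import uuid

from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

# Placeholder contact points and keyspace.
cluster = Cluster(
    ['10.0.0.1', '10.0.0.2', '10.0.0.3'],
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
)
session = cluster.connect('my_keyspace')

# Prepared statements carry a routing key; TokenAwarePolicy uses it to put
# that partition's replicas first in the query plan.
insert = session.prepare(
    "INSERT INTO events (id, payload) VALUES (?, ?)")  # hypothetical table
session.execute(insert, (uuid.uuid4(), 'example payload'))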

On Mon, Nov 14, 2016 at 6:26 PM, Alex Popescu <al...@datastax.com> wrote:

> I'm wondering if what you are seeing is
> https://datastax-oss.atlassian.net/browse/PYTHON-643 (that could still be a sign of a
> potential data hotspot)
>
> On Sun, Nov 13, 2016 at 10:57 PM, Andrew Bialecki <
> andrew.biale...@klaviyo.com> wrote:
>
>> We're using the "default" TokenAwarePolicy. Our nodes are spread across
>> different racks within one datacenter. I've turned on debug logging for the
>> Python driver, but it doesn't look like it logs which Cassandra node each
>> request goes to, but maybe I haven't got the right logging set to debug.
>>
>> On Mon, Nov 14, 2016 at 12:39 AM, Ben Slater <ben.sla...@instaclustr.com>
>> wrote:
>>
>>> What load balancing policies are you using in your client code (
>>> https://datastax.github.io/python-driver/api/cassandra/policies.html)?
>>>
>>> Cheers
>>> Ben
>>>
>>> On Mon, 14 Nov 2016 at 16:22 Andrew Bialecki <
>>> andrew.biale...@klaviyo.com> wrote:
>>>
>>>> We have an odd situation where all of a sudden our cluster started
>>>> seeing a disproportionate number of writes go to one node. We're using the
>>>> Python driver version 3.7.1. I'm not sure if this is a driver issue or
>>>> possibly a network issue causing requests to get routed in an odd way. It's
>>>> not absolute, there are requests going to all nodes.
>>>>
>>>> Tried restarting the problematic node, no luck (those are the quiet
>>>> periods). Tried restarting the clients, also no luck. Checked nodetool
>>>> status and ownership is even across the cluster.
>>>>
>>>> Curious if anyone's seen this behavior before. Seems like the next step
>>>> will be to debug the client and see why it's choosing that node.
>>>>
>>>> [image: Inline image 1]
>>>>
>>>>
>>>> --
>>>> AB
>>>>
>>>
>>
>>
>> --
>> AB
>>
>
>
>
> --
> Bests,
>
> Alex Popescu | @al3xandru
> Sen. Product Manager @ DataStax
>
>
>
>


-- 
AB


Re: cassandra python driver routing requests to one node?

2016-11-13 Thread Andrew Bialecki
We're using the "default" TokenAwarePolicy. Our nodes are spread across
different racks within one datacenter. I've turned on debug logging for the
Python driver, but it doesn't look like it logs which Cassandra node each
request goes to, but maybe I haven't got the right logging set to debug.
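
For completeness, the logging setup I'm using is roughly this (standard
library logging; the driver logs under the 'cassandra' logger, with
sub-loggers like cassandra.cluster and cassandra.pool):

import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger('cassandra').setLevel(logging.DEBUG)

# Host state changes and pool activity show up under cassandra.cluster and
# cassandra.pool, but I haven't found a line that names the coordinator
# chosen for each individual request, which is what I was hoping to see.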

On Mon, Nov 14, 2016 at 12:39 AM, Ben Slater <ben.sla...@instaclustr.com>
wrote:

> What load balancing policies are you using in your client code (
> https://datastax.github.io/python-driver/api/cassandra/policies.html)?
>
> Cheers
> Ben
>
> On Mon, 14 Nov 2016 at 16:22 Andrew Bialecki <andrew.biale...@klaviyo.com>
> wrote:
>
>> We have an odd situation where all of a sudden our cluster started
>> seeing a disproportionate number of writes go to one node. We're using the
>> Python driver version 3.7.1. I'm not sure if this is a driver issue or
>> possibly a network issue causing requests to get routed in an odd way. It's
>> not absolute, there are requests going to all nodes.
>>
>> Tried restarting the problematic node, no luck (those are the quiet
>> periods). Tried restarting the clients, also no luck. Checked nodetool
>> status and ownership is even across the cluster.
>>
>> Curious if anyone's seen this behavior before. Seems like the next step
>> will be to debug the client and see why it's choosing that node.
>>
>> [image: Inline image 1]
>>
>>
>> --
>> AB
>>
>


-- 
AB


cassandra python driver routing requests to one node?

2016-11-13 Thread Andrew Bialecki
We have an odd situation where all of a sudden our cluster started
seeing a disproportionate number of writes go to one node. We're using the
Python driver version 3.7.1. I'm not sure if this is a driver issue or
possibly a network issue causing requests to get routed in an odd way. It's
not absolute, there are requests going to all nodes.

Tried restarting the problematic node, no luck (those are the quiet
periods). Tried restarting the clients, also no luck. Checked nodetool
status and ownership is even across the cluster.

Curious if anyone's seen this behavior before. Seems like the next step
will be to debug the client and see why it's choosing that node.

[image: Inline image 1]

-- 
AB


High number of ReplicateOnWriteStage All timed blocked, counter CF

2013-10-22 Thread Andrew Bialecki
Hey everyone,

We're stress testing writes for a few counter CFs and noticed that on one node
we got to the point where the ReplicateOnWriteStage thread pool was backed
up and it started blocking those tasks. This cluster is six nodes, RF=3,
running 1.2.9. All CFs have LCS with 160 MB sstables. All writes were
CL.ONE.

Few questions:

   1. What causes a RoW (replicate on write) task to be blocked? The queue
   maxes out at 4128, which seems to be 32 * (128 + 1). 32 is the number of
   concurrent_writers we have.

   2. Given this is a counter CF, can those dropped RoWs be repaired with a
   nodetool repair? From my understanding of how counter writes work, until
   we run that repair, if we're not using CL.ALL / read_repair_chance = 1, we
   will get some incorrect reads, but a repair will fix things. Is that right?

   3. The CPU on the node where we started seeing the number of blocked
   tasks increase was pegged, but I/O was not saturated. There were
   compactions running on those column families as well. Is there a setting we
   could consider altering that might prevent that backup, or is the answer
   likely to be increasing the number of nodes to get more throughput?


Thanks in advance for any insights!

Andrew


Re: Counters and replication

2013-08-05 Thread Andrew Bialecki
We've seen high CPU in stress tests with counters. With our
workload, we had some hot counters (e.g. ones with hundreds of increments/sec)
with RF = 3, which caused the load to spike and replicate on write tasks to
back up on those three nodes. Richard already gave a good overview of why
this happens. As he said, changing the consistency level won't help you.
It'll decrease the write latency because the write will ack once it queues
the replicate on write task, but the node will still queue a task to
replicate the write to the other replicas. As mentioned, the only Cassandra
config fix is setting replicate_on_write to false, but that's definitely
not recommended unless you don't mind losing your counter values if a node
goes down.

Other options which will require work outside of Cassandra:

1. Partition hot counters into shards, write to a random shard on each
increment, and then aggregate the shards together at read time (see the
sketch after this list). This is basically the same trick as writing to
a hot time series.
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
2. Absorb and aggregate increments in your application and only increment in
Cassandra every so often. For instance, if a counter needs to be incremented
100 times/sec, increment an in-memory counter and then flush those increments
at once by issuing one increment/sec that carries the sum for that time
period. I believe Twitter does something like this (
http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011,
slide 26).
3. Faster disks.
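
To make option 1 concrete, here's a rough sketch in Python with the DataStax
driver; the table, keyspace, contact point, and shard count are hypothetical,
so adjust to your schema:

import random

from cassandra.cluster import Cluster

NUM_SHARDS = 16  # hypothetical shard count; size it to your write rate

# Assumed schema, created once out of band:
#   CREATE TABLE counts_sharded (
#       name text,
#       shard int,
#       ct counter,
#       PRIMARY KEY (name, shard));

cluster = Cluster(['10.0.0.1'])           # placeholder contact point
session = cluster.connect('my_keyspace')  # placeholder keyspace

incr_stmt = session.prepare(
    "UPDATE counts_sharded SET ct = ct + ? WHERE name = ? AND shard = ?")
read_stmt = session.prepare(
    "SELECT ct FROM counts_sharded WHERE name = ?")

def increment(name, delta=1):
    # Spread writes for a hot counter across NUM_SHARDS partitions so no
    # single replica set takes all of the replicate-on-write load.
    session.execute(incr_stmt, (delta, name, random.randrange(NUM_SHARDS)))

def total(name):
    # Re-aggregate the shards at read time.
    return sum(row.ct for row in session.execute(read_stmt, (name,)))

The trade-off is a NUM_SHARDS-row read per lookup, which is usually much
cheaper than serializing every increment through one hot partition.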

All that said, you say you're seeing low disk utilization, which is
inconsistent with what we saw. In our tests, we saw ~100% disk
utilization on the nodes for the hot counters, which made it easy to
determine what was going on. If disk isn't your bottleneck, then you
probably have a different issue.


On Mon, Aug 5, 2013 at 3:30 PM, Richard Low rich...@wentnet.com wrote:

 On 5 August 2013 20:04, Christopher Wirt chris.w...@struq.com wrote:

 Hello,

 Question about counters, replication and the ReplicateOnWriteStage

 I've recently turned on a new CF which uses a counter column.

 We have a three DC setup running Cassandra 1.2.4 with vNodes, hex core
 processors, 32 GB memory.

 DC 1 - 9 nodes with RF 3
 DC 2 - 3 nodes with RF 2
 DC 3 - 3 nodes with RF 2

 DC 1 receives most of the updates to this counter column. ~3k per sec.

 I’ve disabled any client reads while I sort out this issue.

 Disk utilization is very low

 Memory is aplenty (while not reading)

 Schema:

 CREATE TABLE cf1 (
   uid uuid,
   id1 int,
   id2 int,
   id3 int,
   ct counter,
   PRIMARY KEY (uid, id1, id2, id3)
 ) WITH …

 Three of the machines in DC 1 are reporting very high CPU load.

 Looking at tpstats, there is a large number of pending
 ReplicateOnWriteStage tasks just on those machines.

 Why would only three of the machines be reporting this?

 Assuming it's distributed by uuid value, there should be an even load
 across the cluster, yeah?

 Am I missing something about how distributed counters work?


 If you have many different uid values and your cluster is balanced then
 you should see even load.  Were your tokens chosen randomly?  Did you start
 out with num_tokens set high or upgrade from num_tokens=1 or an earlier
 Cassandra version?  Is it possible your workload is incrementing the
 counter for one particular uid much more than the others?

 The distribution of counters works the same as for non-counters in terms
 of which nodes receive which values.  However, there is a read on the
 coordinator (randomly chosen for each inc) to read the current value and
 replicate it to the remaining replicas.  This makes counter increments much
 more expensive than normal inserts, even if all your counters fit in cache.
  This is done in the ReplicateOnWriteStage, which is why you are seeing
 that queue build up.



 Is changing CL to ONE fine if I’m not too worried about 100% consistency?


 Yes, but to make the biggest difference you will need to turn off
 replicate_on_write (alter table cf1 with replicate_on_write = false;) but
 this *guarantees* your counts aren't replicated, even if all replicas are
 up.  It avoids doing the read, so makes a huge difference to performance,
 but means that if a node is unavailable later on, you *will* read
 inconsistent counts.  (Or, worse, if a node fails, you will lose counts
 forever.)  This is in contrast to CL.ONE inserts for normal values when
 inserts are still attempted on all replicas, but only one is required to
 succeed.

 So you might be able to get a temporary performance boost by changing
 replicate_on_write if your counter values aren't important.  But this won't
 solve the root of the problem.

 Richard.



Re: Deletion use more space.

2013-07-16 Thread Andrew Bialecki
I don't think setting gc_grace_seconds to an hour is going to do what you'd
expect. After gc_grace_seconds, if you haven't run a repair within that
hour, the data you deleted will seem to have been undeleted.

Someone correct me if I'm wrong, but in order to completely delete
data and regain the space it takes up, you need to delete it, which
creates tombstones, and then run a repair on that column family within
gc_grace_seconds. After that the data is actually gone and the space
reclaimed.


On Tue, Jul 16, 2013 at 6:20 AM, 杨辉强 huiqiangy...@yunrang.com wrote:

 Thank you!
 It should be update column family ScheduleInfoCF with gc_grace = 3600;
 Faint.

 - Original Message -
 From: 杨辉强 huiqiangy...@yunrang.com
 To: user@cassandra.apache.org
 Sent: Tuesday, July 16, 2013 6:15:12 PM
 Subject: Re: Deletion use more space.

 Hi,
   I use the follow cmd to update gc_grace_seconds. It reports error! Why?

 [default@WebSearch] update column family ScheduleInfoCF with
 gc_grace_seconds = 3600;
 java.lang.IllegalArgumentException: No enum const class
 org.apache.cassandra.cli.CliClient$ColumnFamilyArgument.GC_GRACE_SECONDS


 - Original Message -
 From: Michał Michalski mich...@opera.com
 To: user@cassandra.apache.org
 Sent: Tuesday, July 16, 2013 5:51:49 PM
 Subject: Re: Deletion use more space.

 Deletion is not really removing data, but it's adding tombstones
 (markers) of deletion. They'll be later merged with existing data during
 compaction and - in the end (see: gc_grace_seconds) - removed, but by
 this time they'll take some space.

 http://wiki.apache.org/cassandra/DistributedDeletes

 M.

 On 16.07.2013 11:46, 杨辉强 wrote:
  Hi, all:
 I use Cassandra 1.2.4, I have a 4-node ring, and I use the byte-ordered
 partitioner.
 I had inserted about 200 GB of data into the ring over the previous days.

 Today I wrote a program to scan the ring and, at the same time,
 delete the items that are scanned.
 To my surprise, Cassandra used more disk space.

  Can anybody tell me why? Thanks.
 



Re: node tool ring displays 33.33% owns on 3 node cluster with replication

2013-07-12 Thread Andrew Bialecki
Not sure if it's the best/intended behavior, but you should see it go back
to 100% if you run: nodetool -h 127.0.0.1 -p 8080 ring keyspace.

I think the rationale for showing 33% is that different keyspaces might
have different RFs, so it's unclear what to show for ownership. However, if
you include the keyspace as part of your query, you'll get it weighted by
the RF of that keyspace. I believe the same logic applies for nodetool
status.

Andrew


On Thu, Jul 11, 2013 at 12:58 PM, Jason Tyler jaty...@yahoo-inc.com wrote:

  Thanks Rob!  I was able to confirm with getendpoints.

  Cheers,

  ~Jason

   From: Robert Coli rc...@eventbrite.com
 Reply-To: user@cassandra.apache.org user@cassandra.apache.org
 Date: Wednesday, July 10, 2013 4:09 PM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Cc: Francois Richard frich...@yahoo-inc.com
 Subject: Re: node tool ring displays 33.33% owns on 3 node cluster with
 replication

   On Wed, Jul 10, 2013 at 4:04 PM, Jason Tyler jaty...@yahoo-inc.com wrote:

  Is this simply a display issue, or have I lost replication?


  Almost certainly just a display issue. Do nodetool -h localhost
 getendpoints keyspace columnfamily 0, which will tell you the
 endpoints for the non-transformed key 0. It should give you 3 endpoints.
 You could also do this test with a known existing key and then go to those
 nodes and verify that they have that data on disk via sstable2json.

  (FWIW, it is an odd display issue/bug if it is one. Because it has
 reverted to pre-1.1 behavior...)

  =Rob



Lots of replicate on write tasks pending, want to investigate

2013-07-03 Thread Andrew Bialecki
In one of our load tests, we're incrementing a single counter column as
well as appending columns to a single row (essentially a timeline). You can
think of it as counting the instances of an event and then keeping a
timeline of those events. The ratio of increments to appends is 1:1.

When we run this on a test cluster with RF = 3, one node gets backed up
with a lot of replicate on write tasks pending, eventually maxing out at
4128. We think it's a disk I/O issue that's causing the slowdown (lot of
reads), but we're still investigating. A few questions that might speed up
understanding the issue:

1. Is there any way to see metadata about the replicate on write tasks
pending? We're splitting apart the load test to pinpoint which of those
operations is causing an issue, but if there's a way to see that queue,
that might save us some work.

2. I'm assuming in our case the cause is incrementing counters because disk
reads are part of the write path for counters and are not for appending
columns to a row. Does that logic make sense?

Thanks in advance,
Andrew


Re: Lots of replicate on write tasks pending, want to investigate

2013-07-03 Thread Andrew Bialecki
Can someone remind me why replicate on write tasks might be related to the
high disk I/O? My understanding is the replicate on write involves sending
the update to other nodes, so it shouldn't involve any disk activity --
disk activity would be during the mutation/write phase.

The write path (not replicate on write) for counters involves a read, so
that explains the high disk I/O, but for that I'd expect to see many write
requests pending (which we see a bit), but not replicate on writes backing
up. What am I missing?

Andrew


On Wed, Jul 3, 2013 at 1:03 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Jul 3, 2013 at 9:59 AM, Andrew Bialecki andrew.biale...@gmail.com
  wrote:

 2. I'm assuming in our case the cause is incrementing counters because
 disk reads are part of the write path for counters and are not for
 appending columns to a row. Does that logic make sense?


 That's a pretty reasonable assumption if you are not doing any other reads
 and you see your disk busy doing non-compaction related reads. :)

 =Rob



Re: Counter value becomes incorrect after several dozen reads writes

2013-06-25 Thread Andrew Bialecki
If you can reproduce the invalid behavior 10+% of the time with steps to
repro that take 5-10s/iteration, that sounds extremely interesting for
getting to the bottom of the invalid shard issue (if that's what the root
cause ends up being). Would be very interested in the set up to see if the
behavior can be duplicated.

Andrew


On Tue, Jun 25, 2013 at 2:18 PM, Robert Coli rc...@eventbrite.com wrote:

 On Mon, Jun 24, 2013 at 6:42 PM, Josh Dzielak j...@keen.io wrote:
  There is only 1 thread running this sequence, and consistency levels are
 set
  to ALL. The behavior is fairly repeatable - the unexpected mutation
 will
  happen at least 10% of the time I run this program, but at different
 points.
  When it does not go awry, I can run this loop many thousands of times and
  keep the counter exact. But if it starts happening to a specific counter,
  the counter will never recover and will continue to maintain its
  incorrect value even after successful subsequent writes.

 Sounds like a corrupt counter shard. Hard to understand how it can
 happen at ALL. If I were you I would file a JIRA including your repro
 path...

 =Rob



Updated sstable size for LCS, ran upgradesstables, file sizes didn't change

2013-06-21 Thread Andrew Bialecki
We're considering increasing the size of our sstables for some
column families from 10 MB to something larger.

In test, we've been trying to verify that the sstable file sizes change and
then do a bit of benchmarking. However, when we alter the column
family and then run nodetool upgradesstables -a keyspace columnfamily,
the files in the data directory are re-written, but the file sizes
are the same.

Is this the expected behavior? If not, what's the right way to upgrade
them? If this is expected, how can we benchmark the read/write performance
with varying sstable sizes?
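
For reference, the change we're testing is along these lines (keyspace and
table names below are placeholders, and 160 MB is just the example size
we're trying):

ALTER TABLE my_keyspace.my_cf
  WITH compaction = {'class': 'LeveledCompactionStrategy',
                     'sstable_size_in_mb': 160};

nodetool upgradesstables -a my_keyspace my_cf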

Thanks in advance!

Andrew


Re: Does replicate_on_write=true imply that CL.QUORUM for reads is unnecessary?

2013-06-02 Thread Andrew Bialecki
Thanks for the clarifications. For future readers, the details of write
requests are well documented at
http://www.datastax.com/docs/1.2/cluster_architecture/about_client_requests#about-write-requests
.


On Fri, May 31, 2013 at 4:20 AM, Sylvain Lebresne sylv...@datastax.com wrote:

 I agree, the page is clearly misleading in its formulation.

 However, for the sake of being precise, I'll note that it is not untrue
 strictly speaking.
 If replicate_on_write is true (the default, which you should probably not
 change unless you consider yourself an expert in the Cassandra counters
 implementation), then a write will be written to all replicas, and that does
 not depend on the consistency level of the operation.
 *But*, please note that this is also true for *every* other write in
 Cassandra. I.e. for non-counter writes, we *always* replicate the write to
 every replica regardless of the consistency level. The only thing the CL
 changes is how many acks from
 said replicas we wait for before returning a success to the client. And it
 works the exact same way for counters with replicate_on_write.

 Or put another way, by default, counters work exactly as normal writes as
 far as CL is concerned. So no, replicate_on_write does *not* set the CL to ALL
 regardless of what you set.
 However, if you set replicate_on_write to false, we will only write the
 counter to 1 replica. Which means that the only CL that you will be able to
 use for writes is ONE (we don't allow ANY for counters).

 --
 Sylvain


 On Fri, May 31, 2013 at 9:20 AM, Peter Schuller 
 peter.schul...@infidyne.com wrote:

 This is incorrect. IMO that page is misleading.

 replicate on write should normally always be turned on, or the change
 will only be recorded on one node. Replicate on write is asynchronous
 with respect to the request and doesn't affect consistency level at
 all.


 On Wed, May 29, 2013 at 7:32 PM, Andrew Bialecki
 andrew.biale...@gmail.com wrote:
  To answer my own question, directly from the docs:
 
 http://www.datastax.com/docs/1.0/configuration/storage_configuration#replicate-on-write
 .
  It appears the answer to this is: Yes, CL.QUORUM isn't necessary for
  reads. Essentially, replicate_on_write sets the CL to ALL regardless of
  what you actually set it to (and for good reason).
 
 
  On Wed, May 29, 2013 at 9:47 AM, Andrew Bialecki 
 andrew.biale...@gmail.com
  wrote:
 
  Quick question about counter columns. In looking at the
 replicate_on_write
  setting, assuming you go with the default of true, my understanding
 is it
  writes the increment to all replicas on any increment.
 
  If that's the case, doesn't that mean there's no point in using
 CL.QUORUM
  for reads because all replicas have the same values?
 
  Similarly, what effect does the read_repair_chance have on counter
 columns
  since they should need to read repair on write.
 
   In anticipation of a possible answer, that both CL.QUORUM for reads and
  read_repair_chance only end up mattering for counter deletions, it's
 safe to
  only use CL.ONE and disable the read repair if we're never deleting
  counters. (And, of course, if we did start deleting counters, we'd
 need to
  revert those client and column family changes.)
 
 



 --
 / Peter Schuller (@scode, http://worldmodscode.wordpress.com)





Does replicate_on_write=true imply that CL.QUORUM for reads is unnecessary?

2013-05-29 Thread Andrew Bialecki
Quick question about counter columns. In looking at the replicate_on_write
setting, assuming you go with the default of true, my understanding is it
writes the increment to all replicas on any increment.

If that's the case, doesn't that mean there's no point in using CL.QUORUM
for reads because all replicas have the same values?

Similarly, what effect does the read_repair_chance have on counter columns
since they should need to read repair on write.

In anticipation of a possible answer, that both CL.QUORUM for reads and
read_repair_chance only end up mattering for counter deletions, it's safe
to only use CL.ONE and disable the read repair if we're never deleting
counters. (And, of course, if we did start deleting counters, we'd need to
revert those client and column family changes.)


Re: Does replicate_on_write=true imply that CL.QUORUM for reads is unnecessary?

2013-05-29 Thread Andrew Bialecki
To answer my own question, directly from the docs:
http://www.datastax.com/docs/1.0/configuration/storage_configuration#replicate-on-write.
It appears the answer to this is: Yes, CL.QUORUM isn't necessary for
reads. Essentially, replicate_on_write sets the CL to ALL regardless of
what you actually set it to (and for good reason).


On Wed, May 29, 2013 at 9:47 AM, Andrew Bialecki
andrew.biale...@gmail.com wrote:

 Quick question about counter columns. In looking at the replicate_on_write
 setting, assuming you go with the default of true, my understanding is it
 writes the increment to all replicas on any increment.

 If that's the case, doesn't that mean there's no point in using CL.QUORUM
 for reads because all replicas have the same values?

 Similarly, what effect does the read_repair_chance have on counter columns
 since they should need to read repair on write.

 In anticipation of a possible answer, that both CL.QUORUM for reads and
 read_repair_chance only end up mattering for counter deletions, it's safe
 to only use CL.ONE and disable the read repair if we're never deleting
 counters. (And, of course, if we did start deleting counters, we'd need to
 revert those client and column family changes.)



Re: Observation on shuffling vs adding/removing nodes

2013-03-24 Thread Andrew Bialecki
Wouldn't shock me if shuffle isn't all that performant (and that's no knock on
shuffle; our case is somewhat specific).

We added 3 nodes with num_tokens=256 and worked great, the load was evenly
spread.

On Sun, Mar 24, 2013 at 1:14 PM, aaron morton aa...@thelastpickle.com wrote:

 We initially tried to run a shuffle, however it seemed to be going really
 slowly (very little progress by watching cassandra-shuffle ls | wc -l
 after 5-6 hours and no errors in logs),

 My guess is that shuffle not designed to be as efficient as possible as it
 is only used once. Was it continuing to make progress?

 so we cancelled it and instead added 3 nodes to the cluster, waited for
 them to bootstrap, and then decommissioned the first 3 nodes.

 You added 3 nodes with num_tokens set in the yaml file ?
 What does nodetool status say ?

 Cheers
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 24/03/2013, at 9:41 AM, Andrew Bialecki andrew.biale...@gmail.com
 wrote:

 Just curious if anyone has any thoughts on something we've observed in a
 small test cluster.

 We had around 100 GB of data on a 3 node cluster (RF=2) and wanted to
 start using vnodes. We upgraded the cluster to 1.2.2 and then followed the
 instructions for using vnodes. We initially tried to run a shuffle, however
 it seemed to be going really slowly (very little progress by watching
 cassandra-shuffle ls | wc -l after 5-6 hours and no errors in logs), so
 we cancelled it and instead added 3 nodes to the cluster, waited for them
 to bootstrap, and then decommissioned the first 3 nodes. Total process took
 about 3 hours. My assumption is that the final result is the same in terms
 of data distributed somewhat randomly across nodes now (assuming no bias in
 the token ranges selected when bootstrapping a node).

 If that assumption is correct, the observation would be, if possible,
 adding nodes and then removing nodes appears to be a faster way to shuffle
 data for small clusters. Obviously not always possible, but I thought I'd
 just throw this out there in case anyone runs into a similar situation.
 This cluster is unsurprisingly on EC2 instances, which made provisioning
 and shutting down nodes extremely easy.

 Cheers,
 Andrew





Observation on shuffling vs adding/removing nodes

2013-03-23 Thread Andrew Bialecki
Just curious if anyone has any thoughts on something we've observed in a
small test cluster.

We had around 100 GB of data on a 3 node cluster (RF=2) and wanted to start
using vnodes. We upgraded the cluster to 1.2.2 and then followed the
instructions for using vnodes. We initially tried to run a shuffle, however
it seemed to be going really slowly (very little progress by watching
cassandra-shuffle ls | wc -l after 5-6 hours and no errors in logs), so
we cancelled it and instead added 3 nodes to the cluster, waited for them
to bootstrap, and then decommissioned the first 3 nodes. Total process took
about 3 hours. My assumption is that the final result is the same in terms
of data distributed somewhat randomly across nodes now (assuming no bias in
the token ranges selected when bootstrapping a node).

If that assumption is correct, the observation would be, if possible,
adding nodes and then removing nodes appears to be a faster way to shuffle
data for small clusters. Obviously not always possible, but I thought I'd
just throw this out there in case anyone runs into a similar situation.
This cluster is unsurprisingly on EC2 instances, which made provisioning
and shutting down nodes extremely easy.

Cheers,
Andrew


Bootstrapping a node in 1.2.2

2013-03-19 Thread Andrew Bialecki
I've got a 3 node cluster in 1.2.2 and just bootstrapped a new node into
it. For each of the existing nodes, I had num_tokens set to 256 and for the
new node I also had it set to 256; however, after bootstrapping into the
cluster, nodetool status keyspace for my main keyspace, which has RF=2,
now reports:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load   Tokens  Owns   Host ID
  Rack
UN  10.0.0.100   51.74 GB   256 34.1%  xxx  rack1
UN  10.0.0.103  75.04 GB   256 97.5%  yyy  rack1
UN  10.0.0.101  77.61 GB   256 34.1%  zzz  rack1
UN  10.0.0.102 126.93 GB  256 34.3%  www  rack1

Why does the bootstrapped node now own half the data? I would've expected
66.6% each. Any idea why the bootstrapped node is taking on a larger share
and how to spread the load evenly?

By the way, this test cluster is using the SimpleSnitch, so it shouldn't be
a topology issue.


Re: Nodetool drain automatically shutting down node?

2013-03-08 Thread Andrew Bialecki
If it helps, here's the log with debug log statements. Possibly an issue
with that exception?

INFO [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:32,402
StorageService.java (line 774) DRAINING: starting drain process
 INFO [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:32,403
CassandraDaemon.java (line 218) Stop listening to thrift clients
 INFO [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:32,404
Gossiper.java (line 1133) Announcing shutdown
DEBUG [GossipTasks:1] 2013-03-09 03:54:33,328
DebuggableThreadPoolExecutor.java (line 190) Task cancelled
java.util.concurrent.CancellationException
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:220)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at
org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.extractThrowable(DebuggableThreadPoolExecutor.java:182)
at
org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.logExceptionsAfterExecute(DebuggableThreadPoolExecutor.java:146)
at
org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor.afterExecute(DebuggableScheduledThreadPoolExecutor.java:50)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,406
StorageService.java (line 776) DRAINING: shutting down MessageService
 INFO [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,406
MessagingService.java (line 534) Waiting for messaging service to quiesce
 INFO [ACCEPT-ip-10-116-111-143.ec2.internal/10.116.111.143] 2013-03-09
03:54:33,407 MessagingService.java (line 690) MessagingService shutting
down server thread.
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,408
StorageService.java (line 776) DRAINING: waiting for streaming
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,408
StorageService.java (line 776) DRAINING: clearing mutation stage
DEBUG [Thread-5] 2013-03-09 03:54:33,408 Gossiper.java (line 221) Reseting
version for /10.83.55.44
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,409
StorageService.java (line 776) DRAINING: flushing column families
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,409
ColumnFamilyStore.java (line 713) forceFlush requested but everything is
clean in Counter1
DEBUG [Thread-6] 2013-03-09 03:54:33,410 Gossiper.java (line 221) Reseting
version for /10.80.187.124
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,410
ColumnFamilyStore.java (line 713) forceFlush requested but everything is
clean in Super1
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,410
ColumnFamilyStore.java (line 713) forceFlush requested but everything is
clean in SuperCounter1
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,410
ColumnFamilyStore.java (line 713) forceFlush requested but everything is
clean in Standard1
 INFO [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,510
StorageService.java (line 774) DRAINED


On Fri, Mar 8, 2013 at 10:36 PM, Andrew Bialecki
andrew.biale...@gmail.com wrote:

 Hey all,

 We're getting ready to upgrade our cluster to 1.2.2 from 1.1.5 and we're
 testing the upgrade process on our dev cluster. We turned off all client
 access to the cluster and then ran nodetool drain on the first instance
 with the intention of running nodetool snapshot once it finished.
 However, after running the drain, we didn't see any errors, but the Cassandra
 process was no longer running. Is that expected? From everything I've read
 it doesn't seem like it, but maybe I'm mistaken.

 Here's the relevant portion of the log from that node (notice it says it's
 shutting down the server thread in there):

 INFO [RMI TCP Connection(38)-10.116.111.143] 2013-03-09 03:26:48,288
 StorageService.java (line 774) DRAINING: starting drain process
  INFO [RMI TCP Connection(38)-10.116.111.143] 2013-03-09 03:26:48,288
 CassandraDaemon.java (line 218) Stop listening to thrift clients
  INFO [RMI TCP Connection(38)-10.116.111.143] 2013-03-09 03:26:48,315
 Gossiper.java (line 1133) Announcing shutdown
  INFO [RMI TCP Connection(38)-10.116.111.143] 2013-03-09 03:26:49,318
 MessagingService.java (line 534) Waiting for messaging service to quiesce
  INFO [ACCEPT-ip-10-116-111-143.ec2.internal/10.116.111.143] 2013-03-09
 03:26:49,319 MessagingService.java (line 690) MessagingService shutting
 down server thread.
  INFO [RMI TCP Connection(38)-10.116.111.143] 2013-03-09 03:26:49,338
 ColumnFamilyStore.java (line 659) Enqueuing flush of
 Memtable-Counter1@177255852(14810190/60139556 serialized/live bytes,
 243550 ops)
  INFO [FlushWriter:7] 2013-03-09 03:26:49,338 Memtable.java (line 264)
 Writing Memtable-Counter1@177255852(14810190/60139556 serialized/live
 bytes, 243550 ops)
  INFO [FlushWriter:7] 2013-03-09 03:26:49,899 Memtable.java (line 305

Running Cassandra 1.1, how can I see the efficiency of the key cache?

2012-12-22 Thread Andrew Bialecki
Since it's not in cfstats anymore, is there another way to monitor this?

I'm working with a dev cluster and I've got Opscenter set up, so I tried
taking a look through that, but it just shows NO DATA. Does that mean the
key cache isn't enabled? I haven't changed the defaults there, so the key
cache setting in cassandra.yaml is still blank.
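
One thing I'll try in the meantime, in case it helps anyone else: my
understanding is that in 1.1 the key cache is global rather than per-CF, so
it should show up in:

nodetool -h localhost info

which, if I'm reading the docs right, reports the key cache size, capacity,
hits, requests, and recent hit rate. I believe the same numbers are also
exposed over JMX under org.apache.cassandra.db:type=Caches.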

Thanks for any help and happy holidays,
Andrew


Need to run nodetool repair on a cluster running 1.1.6 if no deletes

2012-12-22 Thread Andrew Bialecki
Hey everyone,

I'm seeing some conflicting advice out there about whether you need to run
nodetool repair within GCGraceSeconds with 1.x. Can someone clarify two
things:

(1) Do I need to run repair if I'm running 1.x?
(2) Should I bother running repair if I don't have any deletes? Any
drawbacks to not running it?


Thanks,
Andrew


Re: Simulating a failed node

2012-10-29 Thread Andrew Bialecki
Thanks, extremely helpful. The key bit was I wasn't flushing the old
Keyspace before re-running the stress test, so I was stuck at RF = 1 from a
previous run despite passing RF = 2 to the stress tool.
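
(For anyone who hits the same thing: a quick sanity check is to verify the
keyspace's replication settings before the run, e.g. in cassandra-cli:

show keyspaces;

assuming the stress tool's default Keyspace1 schema; the output lists each
keyspace's replication strategy and options, so an RF stuck at 1 is easy to
spot.)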

On Sun, Oct 28, 2012 at 2:49 AM, Peter Schuller peter.schul...@infidyne.com
 wrote:

  Operation [158320] retried 10 times - error inserting key 0158320
 ((UnavailableException))

 This means that at the point where the thrift request to write data
 was handled, the co-ordinator node (the one your client is connected
 to) believed that, among the replicas responsible for the key, too
 many were down to satisfy the consistency level. Most likely causes
  would be that you're in fact not using RF >= 2 (e.g., is the RF
  really > 1 for the keyspace you're inserting into), or you're in fact not
 using ONE.

  I'm sure my naive setup is flawed in some way, but what I was hoping for
 was when the node went down it would fail to write to the downed node and
 instead write to one of the other nodes in the clusters. So question is why
 are writes failing even after a retry? It might be the stress client
 doesn't pool connections (I took

 Write always go to all responsible replicas that are up, and when
 enough return (according to consistency level), the insert succeeds.

 If replicas fail to respond you may get a TimeoutException.

 UnavailableException means it didn't even try because it didn't have
 enough replicas to even try to write to.

 (Note though: Reads are a bit of a different story and if you want to
 test behavior when nodes go down I suggest including that. See
 CASSANDRA-2540 and CASSANDRA-3927.)

 --
 / Peter Schuller (@scode, http://worldmodscode.wordpress.com)



Simulating a failed node

2012-10-27 Thread Andrew Bialecki
Hey everyone,

I'm trying to simulate what happens when a node goes down to make sure my
cluster can gracefully handle node failures. For my setup I have a 3 node
cluster running 1.1.5. I'm then using the stress tool included in 1.1.5
coming from an external server and running it with the following arguments:

tools/bin/cassandra-stress -d server1,server2,server3 -n 100


I start up the stress test and then down one of the nodes. The stress test
instantly fails with the following errors (which of course are the same
error from different threads) looking like:

  ...

Operation [158320] retried 10 times - error inserting key 0158320
((UnavailableException))
Operation [158429] retried 10 times - error inserting key 0158429
((UnavailableException))
Operation [158439] retried 10 times - error inserting key 0158439
((UnavailableException))
Operation [158470] retried 10 times - error inserting key 0158470
((UnavailableException))
158534,0,0,NaN,43
FAILURE


I'm sure my naive setup is flawed in some way, but what I was hoping for
was that when the node went down, writes would fail on the downed node and
instead go to one of the other nodes in the cluster. So the question is: why
are writes failing even after a retry? It might be that the stress client
doesn't pool connections (I took a quick look, but might not have looked
deeply enough); however, I also tried specifying only the first two server
nodes and then downing the third, with the same failure.

Thanks in advance.

Andrew


Re: Simulating a failed node

2012-10-27 Thread Andrew Bialecki
The default replication factor and consistency level for the stress tool is
one, so that's what I'm using. I've also experimented and seen the same
behavior with RF=2, but I haven't tried a different CL.

On Sun, Oct 28, 2012 at 12:36 AM, Watanabe Maki watanabe.m...@gmail.com wrote:

 What RF and CL are you using?


 On 2012/10/28, at 13:13, Andrew Bialecki andrew.biale...@gmail.com
 wrote:

 Hey everyone,

 I'm trying to simulate what happens when a node goes down to make sure my
 cluster can gracefully handle node failures. For my setup I have a 3 node
 cluster running 1.1.5. I'm then using the stress tool included in 1.1.5
 coming from an external server and running it with the following arguments:

 tools/bin/cassandra-stress -d server1,server2,server3 -n 100


 I start up the stress test and then down one of the nodes. The stress test
 instantly fails with the following errors (which of course are the same
 error from different threads) looking like:

   ...

 Operation [158320] retried 10 times - error inserting key 0158320
 ((UnavailableException))
 Operation [158429] retried 10 times - error inserting key 0158429
 ((UnavailableException))
 Operation [158439] retried 10 times - error inserting key 0158439
 ((UnavailableException))
 Operation [158470] retried 10 times - error inserting key 0158470
 ((UnavailableException))
 158534,0,0,NaN,43
 FAILURE


 I'm sure my naive setup is flawed in some way, but what I was hoping for
 was that when the node went down, writes would fail on the downed node and
 instead go to one of the other nodes in the cluster. So the question is: why
 are writes failing even after a retry? It might be that the stress client
 doesn't pool connections (I took a quick look, but might not have looked
 deeply enough); however, I also tried specifying only the first two server
 nodes and then downing the third, with the same failure.

 Thanks in advance.

 Andrew