Re: Hinted handoff throttled even after "nodetool sethintedhandoffthrottlekb 0"
Bit more information. Using jmxterm and inspecting the state of a node when it's "slow" playing hints, I can see the following from the node that has hints to play:

$>get MaxHintsInProgress
#mbean = org.apache.cassandra.db:type=StorageProxy:
MaxHintsInProgress = 2048;

$>get HintsInProgress
#mbean = org.apache.cassandra.db:type=StorageProxy:
HintsInProgress = 0;

$>get TotalHints
#mbean = org.apache.cassandra.db:type=StorageProxy:
TotalHints = 129687;

Is there some throttling that would cause hints to not be played at all if, for instance, the cluster is under enough load, or something related to a timeout setting?

On Fri, Oct 27, 2017 at 1:49 AM, Andrew Bialecki <andrew.biale...@klaviyo.com> wrote:

> We have a 96 node cluster running 3.11 with 256 vnodes each. We're running
> a rolling restart. As we restart nodes, we notice that each node takes a
> while to have all other nodes be marked as up, and this corresponds to nodes
> that haven't finished playing hints.
>
> We looked at the hinted handoff throttling, noticed it was still the
> default of 1024, so we tried to turn it off by setting it to zero. Reading
> the source, it looks like that rate limiting won't take effect until the
> current set of hints has finished. So we made that change cluster-wide and
> then restarted the next node. However, we still saw the same issue.
>
> Looking at iftop and network throughput, it's very low (~10 kB/s), and
> therefore the few hundred thousand hints that accumulate while the node is
> restarting end up taking several minutes to get sent.
>
> Any other knobs we should be tuning to increase hinted handoff throughput?
> Or other reasons why hinted handoff runs so slowly?
>
> --
> Andrew Bialecki

--
Andrew Bialecki
<https://www.klaviyo.com/>
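For anyone who wants to repeat this check across a whole cluster, jmxterm can be scripted non-interactively. A minimal sketch from Python; the jar path, JMX port (7199), and host names are assumptions, so adjust for your environment:

import subprocess

JMXTERM_JAR = "jmxterm-1.0.0-uber.jar"  # assumed path to the jmxterm jar
HOSTS = ["node1", "node2", "node3"]     # hypothetical node names
ATTRS = ["MaxHintsInProgress", "HintsInProgress", "TotalHints"]

for host in HOSTS:
    # Build one "get" command per attribute against the StorageProxy mbean.
    cmds = "\n".join(
        "get -b org.apache.cassandra.db:type=StorageProxy %s" % a for a in ATTRS
    )
    # -l selects the JMX endpoint, -n runs non-interactively reading stdin.
    out = subprocess.run(
        ["java", "-jar", JMXTERM_JAR, "-l", "%s:7199" % host, "-n"],
        input=cmds, capture_output=True, text=True,
    )
    print(host)
    print(out.stdout)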
Hinted handoff throttled even after "nodetool sethintedhandoffthrottlekb 0"
We have a 96 node cluster running 3.11 with 256 vnodes each. We're running a rolling restart. As we restart nodes, we notice that each node takes a while to have all other nodes be marked as up, and this corresponds to nodes that haven't finished playing hints.

We looked at the hinted handoff throttling, noticed it was still the default of 1024, so we tried to turn it off by setting it to zero. Reading the source, it looks like that rate limiting won't take effect until the current set of hints has finished. So we made that change cluster-wide and then restarted the next node. However, we still saw the same issue.

Looking at iftop and network throughput, it's very low (~10 kB/s), and therefore the few hundred thousand hints that accumulate while the node is restarting end up taking several minutes to get sent.

Any other knobs we should be tuning to increase hinted handoff throughput? Or other reasons why hinted handoff runs so slowly?

--
Andrew Bialecki
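For reference, the cluster-wide change described above can be scripted along these lines; the host list and SSH setup are assumptions, not the exact script we used:

import subprocess

HOSTS = ["cass-01", "cass-02", "cass-03"]  # hypothetical host names

for host in HOSTS:
    # sethintedhandoffthrottlekb 0 disables hinted handoff throttling;
    # per the source, the new rate only applies once the current set of
    # hints finishes playing.
    subprocess.run(
        ["ssh", host, "nodetool", "sethintedhandoffthrottlekb", "0"],
        check=True,
    )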
Re: cassandra python driver routing requests to one node?
Is the node selection based on key deterministic across multiple clients? If it is, that sounds plausible. For this particular workload it's definitely possible to have a hot key / spot, but it was surprising that it wasn't three nodes that got hot, it was just one.

On Mon, Nov 14, 2016 at 6:26 PM, Alex Popescu <al...@datastax.com> wrote:

> I'm wondering if what you are seeing is
> https://datastax-oss.atlassian.net/browse/PYTHON-643 (that could still be
> a sign of a potential data hotspot)
>
> On Sun, Nov 13, 2016 at 10:57 PM, Andrew Bialecki <
> andrew.biale...@klaviyo.com> wrote:
>
>> We're using the "default" TokenAwarePolicy. Our nodes are spread across
>> different racks within one datacenter. I've turned on debug logging for the
>> Python driver, but it doesn't look like it logs which Cassandra node each
>> request goes to, but maybe I haven't got the right logger set to debug.
>>
>> On Mon, Nov 14, 2016 at 12:39 AM, Ben Slater <ben.sla...@instaclustr.com>
>> wrote:
>>
>>> What load balancing policies are you using in your client code
>>> (https://datastax.github.io/python-driver/api/cassandra/policies.html)?
>>>
>>> Cheers
>>> Ben
>>>
>>> On Mon, 14 Nov 2016 at 16:22 Andrew Bialecki <
>>> andrew.biale...@klaviyo.com> wrote:
>>>
>>>> We have an odd situation where all of a sudden our cluster started
>>>> seeing a disproportionate number of writes go to one node. We're using the
>>>> Python driver, version 3.7.1. I'm not sure if this is a driver issue or
>>>> possibly a network issue causing requests to get routed in an odd way. It's
>>>> not absolute; there are requests going to all nodes.
>>>>
>>>> Tried restarting the problematic node, no luck (those are the quiet
>>>> periods). Tried restarting the clients, also no luck. Checked nodetool
>>>> status and ownership is even across the cluster.
>>>>
>>>> Curious if anyone's seen this behavior before. Seems like the next step
>>>> will be to debug the client and see why it's choosing that node.
>>>>
>>>> [image: Inline image 1]
>>>>
>>>> --
>>>> AB
>>>
>>
>> --
>> AB
>
> --
> Bests,
>
> Alex Popescu | @al3xandru
> Sen. Product Manager @ DataStax

--
AB
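One way to check the routing empirically from each client: the driver's ResponseFuture records which host(s) a request was actually sent to. A minimal sketch, assuming driver 3.x; contact points, keyspace, and query are hypothetical:

from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])  # hypothetical
session = cluster.connect("my_keyspace")

future = session.execute_async(
    "SELECT * FROM my_table WHERE id = %s", ["some-key"]
)
future.result()
# attempted_hosts lists the hosts tried for this request, in order; with a
# token-aware policy the first entry is the replica the policy chose.
print(future.attempted_hosts)

Running this for the same key from two different clients would show directly whether they both pick the same replica.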
Re: cassandra python driver routing requests to one node?
We're using the "default" TokenAwarePolicy. Our nodes are spread across different racks within one datacenter. I've turned on debug logging for the Python driver, but it doesn't look like it logs which Cassandra node each request goes to, but maybe I haven't got the right logger set to debug.

On Mon, Nov 14, 2016 at 12:39 AM, Ben Slater <ben.sla...@instaclustr.com> wrote:

> What load balancing policies are you using in your client code
> (https://datastax.github.io/python-driver/api/cassandra/policies.html)?
>
> Cheers
> Ben
>
> On Mon, 14 Nov 2016 at 16:22 Andrew Bialecki <andrew.biale...@klaviyo.com>
> wrote:
>
>> We have an odd situation where all of a sudden our cluster started
>> seeing a disproportionate number of writes go to one node. We're using the
>> Python driver, version 3.7.1. I'm not sure if this is a driver issue or
>> possibly a network issue causing requests to get routed in an odd way. It's
>> not absolute; there are requests going to all nodes.
>>
>> Tried restarting the problematic node, no luck (those are the quiet
>> periods). Tried restarting the clients, also no luck. Checked nodetool
>> status and ownership is even across the cluster.
>>
>> Curious if anyone's seen this behavior before. Seems like the next step
>> will be to debug the client and see why it's choosing that node.
>>
>> [image: Inline image 1]
>>
>> --
>> AB
>

--
AB
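For reference, here's roughly how the policy gets configured explicitly in driver 3.x, along with turning the driver's loggers up to DEBUG; the contact points are hypothetical:

import logging
from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

# Route DEBUG output from the driver's loggers to stderr.
logging.basicConfig(level=logging.DEBUG)

# TokenAwarePolicy wraps a child policy; the child supplies the fallback
# host ordering when replica information isn't available.
cluster = Cluster(
    ["10.0.0.1", "10.0.0.2"],  # hypothetical contact points
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
)
session = cluster.connect()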
cassandra python driver routing requests to one node?
We have an odd situation where all of a sudden our cluster started seeing a disproportionate number of writes go to one node. We're using the Python driver, version 3.7.1. I'm not sure if this is a driver issue or possibly a network issue causing requests to get routed in an odd way. It's not absolute; there are requests going to all nodes.

Tried restarting the problematic node, no luck (those are the quiet periods). Tried restarting the clients, also no luck. Checked nodetool status and ownership is even across the cluster.

Curious if anyone's seen this behavior before. Seems like the next step will be to debug the client and see why it's choosing that node.

[image: Inline image 1]

--
AB
High number of ReplicateOnWriteStage All timed blocked, counter CF
Hey everyone,

We're stress testing writes for a few counter CFs and noticed that on one node we got to the point where the ReplicateOnWriteStage thread pool was backed up and it started blocking those tasks. This cluster is six nodes, RF=3, running 1.2.9. All CFs have LCS with 160 MB sstables. All writes were CL.ONE. Few questions:

1. What causes a RoW (replicate on write) task to be blocked? The queue maxes out at 4128, which seems to be 32 * (128 + 1). 32 is the number of concurrent_writes we have.

2. Given this is a counter CF, can those dropped RoWs be repaired with a nodetool repair? From my understanding of how counter writes work, until we run that repair, if we're not using CL.ALL / read_repair_chance = 1, we will get some incorrect reads, but a repair will fix things. Is that right?

3. The CPU on the node where we started seeing the number of blocked tasks increase was pegged, but I/O was not saturated. There were compactions running on those column families as well. Is there a setting we could consider altering that might prevent that backup, or is the answer likely to increase the number of nodes to get more throughput?

Thanks in advance for any insights!

Andrew
Re: Counters and replication
We've seen high CPU in stress tests with counters. With our workload, we had some hot counters (e.g. ones with 100s of increments/sec) with RF = 3, which caused the load to spike and replicate on write tasks to back up on those three nodes.

Richard already gave a good overview of why this happens. As he said, changing the consistency level won't help you. It'll decrease the write latency, because the write will ack once it queues the replicate on write task, but the node will still queue a task to replicate the write to the other replicas. As mentioned, the only Cassandra config fix is setting replicate_on_write to false, but that's definitely not recommended unless you don't mind losing your counter values if a node goes down.

Other options, which will require work outside of Cassandra:

1. Partition hot counters and write to each partition randomly, then aggregate them together at read time (see the sketch after this message). This is basically the same trick as writing to a hot time series. http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra

2. Absorb and aggregate increments and only increment in Cassandra every so often. For instance, if a counter needs to be incremented 100 times/sec, increment an in-memory counter and then flush those increments at once by issuing one increment/sec that has the sum of all the increments for that time period. I believe Twitter does something like this (http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011, slide 26).

3. Faster disks.

All that said, you say you're seeing low disk utilization, which is inconsistent with what we saw. In our tests, we saw ~100% disk utilization on the nodes for the hot counters, which made it easy to determine what was going on. If disk isn't your bottleneck, then you probably have a different issue.

On Mon, Aug 5, 2013 at 3:30 PM, Richard Low rich...@wentnet.com wrote:

On 5 August 2013 20:04, Christopher Wirt chris.w...@struq.com wrote:

Hello,

Question about counters, replication and the ReplicateOnWriteStage.

I’ve recently turned on a new CF which uses a counter column.

We have a three DC setup running Cassandra 1.2.4 with vNodes, hex core processors, 32Gb memory.
DC 1 - 9 nodes with RF 3
DC 2 - 3 nodes with RF 2
DC 3 - 3 nodes with RF 2

DC 1 receives most of the updates to this counter column, ~3k per sec.

I’ve disabled any client reads while I sort out this issue. Disk utilization is very low. Memory is aplenty (while not reading).

Schema:
CREATE TABLE cf1 (
  uid uuid,
  id1 int,
  id2 int,
  id3 int,
  ct counter,
  PRIMARY KEY (uid, id1, id2, id3)
) WITH …

Three of the machines in DC 1 are reporting very high CPU load. Looking at tpstats there is a large number of pending ReplicateOnWriteStage just on those machines.

Why would only three of the machines be reporting this? Assuming it's distributed by uuid value, there should be an even load across the cluster, yeah? Am I missing something about how distributed counters work?

If you have many different uid values and your cluster is balanced then you should see even load. Were your tokens chosen randomly? Did you start out with num_tokens set high or upgrade from num_tokens=1 or an earlier Cassandra version? Is it possible your workload is incrementing the counter for one particular uid much more than the others?

The distribution of counters works the same as for non-counters in terms of which nodes receive which values.
However, there is a read on the coordinator (randomly chosen for each inc) to read the current value and replicate it to the remaining replicas. This makes counter increments much more expensive than normal inserts, even if all your counters fit in cache. This is done in the ReplicateOnWriteStage, which is why you are seeing that queue build up.

Is changing CL to ONE fine if I’m not too worried about 100% consistency?

Yes, but to make the biggest difference you will need to turn off replicate_on_write (alter table cf1 with replicate_on_write = false;), but this *guarantees* your counts aren't replicated, even if all replicas are up. It avoids doing the read, so it makes a huge difference to performance, but it means that if a node is unavailable later on, you *will* read inconsistent counts. (Or, worse, if a node fails, you will lose counts forever.) This is in contrast to CL.ONE inserts for normal values, where inserts are still attempted on all replicas, but only one is required to succeed.

So you might be able to get a temporary performance boost by changing replicate_on_write if your counter values aren't important. But this won't solve the root of the problem.

Richard.
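A minimal sketch of option 1 above (sharding a hot counter and aggregating at read time) using the Python driver; the table layout, shard count, and names are assumptions:

import random
from cassandra.cluster import Cluster

NUM_SHARDS = 16  # assumed; size to your increment rate

cluster = Cluster(["10.0.0.1"])  # hypothetical contact point
session = cluster.connect("my_keyspace")

# Assumed table:
#   CREATE TABLE sharded_counts (name text, shard int, ct counter,
#                                PRIMARY KEY (name, shard));

def increment(name, delta=1):
    # Spread increments across shards so no single partition is hot.
    shard = random.randrange(NUM_SHARDS)
    session.execute(
        "UPDATE sharded_counts SET ct = ct + %s WHERE name = %s AND shard = %s",
        (delta, name, shard),
    )

def read(name):
    # Aggregate the shards back together at read time.
    rows = session.execute(
        "SELECT ct FROM sharded_counts WHERE name = %s", (name,)
    )
    return sum(row.ct for row in rows)

The trade-off is one extra partition scan per read in exchange for spreading the write (and its replicate-on-write read) across NUM_SHARDS partitions.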
Re: Deletion use more space.
I don't think setting gc_grace_seconds to an hour is going to do what you'd expect. After gc_grace_seconds, if you haven't run a repair within that hour, the data you deleted will seem to have been undeleted.

Someone correct me if I'm wrong, but in order to completely delete data and regain the space it takes up, you need to delete it, which creates tombstones, and then run a repair on that column family within gc_grace_seconds. After that the data is actually gone and the space reclaimed.

On Tue, Jul 16, 2013 at 6:20 AM, 杨辉强 huiqiangy...@yunrang.com wrote:

Thank you! It should be: update column family ScheduleInfoCF with gc_grace = 3600; Faint.

----- Original Message -----
From: 杨辉强 huiqiangy...@yunrang.com
To: user@cassandra.apache.org
Sent: Tuesday, July 16, 2013, 6:15:12 PM
Subject: Re: Deletion use more space.

Hi, I use the following cmd to update gc_grace_seconds. It reports an error! Why?

[default@WebSearch] update column family ScheduleInfoCF with gc_grace_seconds = 3600;
java.lang.IllegalArgumentException: No enum const class org.apache.cassandra.cli.CliClient$ColumnFamilyArgument.GC_GRACE_SECONDS

----- Original Message -----
From: Michał Michalski mich...@opera.com
To: user@cassandra.apache.org
Sent: Tuesday, July 16, 2013, 5:51:49 PM
Subject: Re: Deletion use more space.

Deletion is not really removing data; it's adding tombstones (markers) of deletion. They'll later be merged with existing data during compaction and, in the end (see: gc_grace_seconds), removed, but until then they'll take some space. http://wiki.apache.org/cassandra/DistributedDeletes

M.

On 16.07.2013 11:46, 杨辉强 wrote:

Hi, all:
I use cassandra 1.2.4 and I have a 4 node ring and use the byte order partitioner. I had inserted about 200G of data into the ring over the previous days.

Today I wrote a program to scan the ring and at the same time delete the items that are scanned. To my surprise, Cassandra used more disk space. Anybody can tell me why? Thanks.
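For anyone on CQL rather than cassandra-cli, the same knob is set per table; a minimal sketch via the Python driver, with hypothetical keyspace/table names:

from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])  # hypothetical contact point
session = cluster.connect()

# gc_grace_seconds controls how long tombstones are kept before they become
# eligible to be dropped in compaction; repairs must complete within it,
# or deleted data can reappear.
session.execute(
    "ALTER TABLE websearch.schedule_info WITH gc_grace_seconds = 3600"
)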
Re: node tool ring displays 33.33% owns on 3 node cluster with replication
Not sure if it's the best/intended behavior, but you should see it go back to 100% if you run: nodetool -h 127.0.0.1 -p 8080 ring keyspace.

I think the rationale for showing 33% is that different keyspaces might have different RFs, so it's unclear what to show for ownership. However, if you include the keyspace as part of your query, you'll get it weighted by the RF of that keyspace. I believe the same logic applies for nodetool status.

Andrew

On Thu, Jul 11, 2013 at 12:58 PM, Jason Tyler jaty...@yahoo-inc.com wrote:

Thanks Rob! I was able to confirm with getendpoints.

Cheers,

~Jason

From: Robert Coli rc...@eventbrite.com
Reply-To: user@cassandra.apache.org
Date: Wednesday, July 10, 2013 4:09 PM
To: user@cassandra.apache.org
Cc: Francois Richard frich...@yahoo-inc.com
Subject: Re: node tool ring displays 33.33% owns on 3 node cluster with replication

On Wed, Jul 10, 2013 at 4:04 PM, Jason Tyler jaty...@yahoo-inc.com wrote:

Is this simply a display issue, or have I lost replication?

Almost certainly just a display issue. Do nodetool -h localhost getendpoints keyspace columnfamily 0, which will tell you the endpoints for the non-transformed key 0. It should give you 3 endpoints. You could also do this test with a known existing key and then go to those nodes and verify that they have that data on disk via sstable2json.

(FWIW, it is an odd display issue/bug if it is one. Because it has reverted to pre-1.1 behavior...)

=Rob
Lots of replicate on write tasks pending, want to investigate
In one of our load tests, we're incrementing a single counter column as well as appending columns to a single row (essentially a timeline). You can think of it as counting the instances of an event and then keeping a timeline of those events. The ratio of increments to appends is 1:1.

When we run this on a test cluster with RF = 3, one node gets backed up with a lot of replicate on write tasks pending, eventually maxing out at 4128. We think it's a disk I/O issue that's causing the slowdown (lots of reads), but we're still investigating. A few questions that might speed up understanding the issue:

1. Is there any way to see metadata about the replicate on write tasks pending? We're splitting apart the load test to pinpoint which of those operations is causing an issue, but if there's a way to see that queue, that might save us some work.

2. I'm assuming in our case the cause is incrementing counters, because disk reads are part of the write path for counters and are not for appending columns to a row. Does that logic make sense?

Thanks in advance,
Andrew
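For concreteness, the two writes per event look roughly like this; a sketch with the Python driver, and the schema and names are hypothetical:

from uuid import uuid1
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])  # hypothetical contact point
session = cluster.connect("my_keyspace")

def record_event(event_type, payload):
    # 1:1 ratio: each event bumps a counter and appends to a timeline row.
    session.execute(
        "UPDATE event_counts SET ct = ct + 1 WHERE type = %s",
        (event_type,),
    )
    # The counter write path (pre-2.1) includes a read of the current shard
    # values; the plain append below does not, so it stays write-only.
    session.execute(
        "INSERT INTO event_timeline (type, id, payload) VALUES (%s, %s, %s)",
        (event_type, uuid1(), payload),
    )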
Re: Lots of replicate on write tasks pending, want to investigate
Can someone remind me why replicate on write tasks might be related to the high disk I/O? My understanding is that replicate on write involves sending the update to other nodes, so it shouldn't involve any disk activity; disk activity would happen during the mutation/write phase.

The write path (not replicate on write) for counters involves a read, so that explains the high disk I/O, but in that case I'd expect to see many write requests pending (which we see a bit), not replicate on writes backing up. What am I missing?

Andrew

On Wed, Jul 3, 2013 at 1:03 PM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Jul 3, 2013 at 9:59 AM, Andrew Bialecki andrew.biale...@gmail.com wrote:

2. I'm assuming in our case the cause is incrementing counters because disk reads are part of the write path for counters and are not for appending columns to a row. Does that logic make sense?

That's a pretty reasonable assumption if you are not doing any other reads and you see your disk busy doing non-compaction related reads. :)

=Rob
Re: Counter value becomes incorrect after several dozen reads writes
If you can reproduce the invalid behavior 10+% of the time with steps to repro that take 5-10s per iteration, that sounds extremely interesting for getting to the bottom of the invalid shard issue (if that's what the root cause ends up being). I'd be very interested in the setup to see if the behavior can be duplicated.

Andrew

On Tue, Jun 25, 2013 at 2:18 PM, Robert Coli rc...@eventbrite.com wrote:

On Mon, Jun 24, 2013 at 6:42 PM, Josh Dzielak j...@keen.io wrote:

There is only 1 thread running this sequence, and consistency levels are set to ALL. The behavior is fairly repeatable - the unexpected mutation will happen at least 10% of the time I run this program, but at different points. When it does not go awry, I can run this loop many thousands of times and keep the counter exact. But if it starts happening to a specific counter, the counter will never recover and will continue to maintain its incorrect value even after successful subsequent writes.

Sounds like a corrupt counter shard. Hard to understand how it can happen at ALL. If I were you I would file a JIRA including your repro path...

=Rob
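A sketch of the kind of repro loop described above (increment, read back at CL.ALL, compare against an expected value); the schema and names are hypothetical:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["10.0.0.1"])  # hypothetical contact point
session = cluster.connect("my_keyspace")

inc = SimpleStatement(
    "UPDATE counts SET ct = ct + 1 WHERE id = %s",
    consistency_level=ConsistencyLevel.ALL,
)
read = SimpleStatement(
    "SELECT ct FROM counts WHERE id = %s",
    consistency_level=ConsistencyLevel.ALL,
)

expected = 0
for i in range(10000):
    session.execute(inc, ("test-counter",))
    expected += 1
    actual = session.execute(read, ("test-counter",)).one().ct
    if actual != expected:
        # A mismatch that persists across subsequent reads is the
        # "corrupt shard" symptom discussed in the thread.
        print("drift at iteration %d: expected %d, got %d"
              % (i, expected, actual))
        break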
Updated sstable size for LCS, ran upgradesstables, file sizes didn't change
We're potentially considering increasing the size of our sstables for some column families from 10MB to something larger. In test, we've been trying to verify that the sstable file sizes change and then doing a bit of benchmarking. However, when we alter the column family and then run nodetool upgradesstables -a keyspace columnfamily, the files in the data directory have been re-written, but the file sizes are the same.

Is this the expected behavior? If not, what's the right way to upgrade them? If this is expected, how can we benchmark the read/write performance with varying sstable sizes?

Thanks in advance!
Andrew
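For reference, the change being described is roughly the following; the keyspace/table names are hypothetical, and on 1.x the option lives under the compaction settings:

from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])  # hypothetical contact point
session = cluster.connect()

# Bump the LCS sstable target size from the 10 MB default to 160 MB.
session.execute(
    "ALTER TABLE my_keyspace.my_table WITH compaction = "
    "{'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160}"
)

# Then, on each node:
#   nodetool upgradesstables -a my_keyspace my_table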
Re: Does replicate_on_write=true imply that CL.QUORUM for reads is unnecessary?
Thanks for the clarifications. For future readers, the details of write requests are well documented at http://www.datastax.com/docs/1.2/cluster_architecture/about_client_requests#about-write-requests.

On Fri, May 31, 2013 at 4:20 AM, Sylvain Lebresne sylv...@datastax.com wrote:

I agree, the page is clearly misleading in its formulation. However, for the sake of being precise, I'll note that it is not untrue strictly speaking.

If replicate_on_write is true (the default, which you should probably not change unless you consider yourself an expert in the Cassandra counters implementation), then a write will be written to all replicas, and that does not depend on the consistency level of the operation. *But*, please note that this is also true for *every* other write in Cassandra. I.e. for non-counter writes, we *always* replicate the write to every replica regardless of the consistency level. The only thing the CL changes is how many acks from said replicas we wait for before returning a success to the client. And it works the exact same way for counters with replicate_on_write. Or, put another way, by default counters work exactly like normal writes as far as CL is concerned.

So no, replicate_on_write does *not* set the CL to ALL regardless of what you set. However, if you set replicate_on_write to false, we will only write the counter to 1 replica. Which means that the only CL you will be able to use for writes is ONE (we don't allow ANY for counters).

--
Sylvain

On Fri, May 31, 2013 at 9:20 AM, Peter Schuller peter.schul...@infidyne.com wrote:

This is incorrect. IMO that page is misleading. replicate_on_write should normally always be turned on, or the change will only be recorded on one node. Replicate on write is asynchronous with respect to the request and doesn't affect consistency level at all.

On Wed, May 29, 2013 at 7:32 PM, Andrew Bialecki andrew.biale...@gmail.com wrote:

To answer my own question, directly from the docs: http://www.datastax.com/docs/1.0/configuration/storage_configuration#replicate-on-write. It appears the answer to this is: yes, CL.QUORUM isn't necessary for reads. Essentially, replicate_on_write sets the CL to ALL regardless of what you actually set it to (and for good reason).

On Wed, May 29, 2013 at 9:47 AM, Andrew Bialecki andrew.biale...@gmail.com wrote:

Quick question about counter columns. In looking at the replicate_on_write setting, assuming you go with the default of true, my understanding is it writes the increment to all replicas on any increment. If that's the case, doesn't that mean there's no point in using CL.QUORUM for reads, because all replicas have the same values?

Similarly, what effect does the read_repair_chance have on counter columns, since they should need to read repair on write?

In anticipation of a possible answer, if both CL.QUORUM for reads and read_repair_chance only end up mattering for counter deletions, then it's safe to only use CL.ONE and disable the read repair if we're never deleting counters. (And, of course, if we did start deleting counters, we'd need to revert those client and column family changes.)

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
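To make the acks-awaited point concrete: in the Python driver the consistency level is set per statement, and it only changes how many replica acks the coordinator waits for, not how many replicas receive the write. A minimal sketch with hypothetical names:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["10.0.0.1"])  # hypothetical contact point
session = cluster.connect("my_keyspace")

# With replicate_on_write = true (the default), this increment is sent to
# every replica either way; CL.ONE just means we return after one ack.
stmt = SimpleStatement(
    "UPDATE counts SET ct = ct + 1 WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(stmt, ("some-key",))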
Does replicate_on_write=true imply that CL.QUORUM for reads is unnecessary?
Quick question about counter columns. In looking at the replicate_on_write setting, assuming you go with the default of true, my understanding is it writes the increment to all replicas on any increment. If that's the case, doesn't that mean there's no point in using CL.QUORUM for reads, because all replicas have the same values?

Similarly, what effect does the read_repair_chance have on counter columns, since they should need to read repair on write?

In anticipation of a possible answer, if both CL.QUORUM for reads and read_repair_chance only end up mattering for counter deletions, then it's safe to only use CL.ONE and disable the read repair if we're never deleting counters. (And, of course, if we did start deleting counters, we'd need to revert those client and column family changes.)
Re: Does replicate_on_write=true imply that CL.QUORUM for reads is unnecessary?
To answer my own question, directly from the docs: http://www.datastax.com/docs/1.0/configuration/storage_configuration#replicate-on-write.

It appears the answer to this is: yes, CL.QUORUM isn't necessary for reads. Essentially, replicate_on_write sets the CL to ALL regardless of what you actually set it to (and for good reason).

On Wed, May 29, 2013 at 9:47 AM, Andrew Bialecki andrew.biale...@gmail.com wrote:

Quick question about counter columns. In looking at the replicate_on_write setting, assuming you go with the default of true, my understanding is it writes the increment to all replicas on any increment. If that's the case, doesn't that mean there's no point in using CL.QUORUM for reads, because all replicas have the same values?

Similarly, what effect does the read_repair_chance have on counter columns, since they should need to read repair on write?

In anticipation of a possible answer, if both CL.QUORUM for reads and read_repair_chance only end up mattering for counter deletions, then it's safe to only use CL.ONE and disable the read repair if we're never deleting counters. (And, of course, if we did start deleting counters, we'd need to revert those client and column family changes.)
Re: Observation on shuffling vs adding/removing nodes
Wouldn't shock me if shuffle isn't all that performant (and not a knock on shuffle; our case is somewhat specific). We added 3 nodes with num_tokens=256 and it worked great; the load was evenly spread.

On Sun, Mar 24, 2013 at 1:14 PM, aaron morton aa...@thelastpickle.com wrote:

We initially tried to run a shuffle, however it seemed to be going really slowly (very little progress by watching cassandra-shuffle ls | wc -l after 5-6 hours and no errors in logs),

My guess is that shuffle was not designed to be as efficient as possible, as it is only used once. Was it continuing to make progress?

so we cancelled it and instead added 3 nodes to the cluster, waited for them to bootstrap, and then decommissioned the first 3 nodes.

You added 3 nodes with num_tokens set in the yaml file? What does nodetool status say?

Cheers

-----
Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 24/03/2013, at 9:41 AM, Andrew Bialecki andrew.biale...@gmail.com wrote:

Just curious if anyone has any thoughts on something we've observed in a small test cluster. We had around 100 GB of data on a 3 node cluster (RF=2) and wanted to start using vnodes. We upgraded the cluster to 1.2.2 and then followed the instructions for using vnodes. We initially tried to run a shuffle, however it seemed to be going really slowly (very little progress by watching cassandra-shuffle ls | wc -l after 5-6 hours and no errors in logs), so we cancelled it and instead added 3 nodes to the cluster, waited for them to bootstrap, and then decommissioned the first 3 nodes. Total process took about 3 hours.

My assumption is that the final result is the same in terms of data being distributed somewhat randomly across nodes (assuming no bias in the token ranges selected when bootstrapping a node). If that assumption is correct, the observation would be: if possible, adding nodes and then removing nodes appears to be a faster way to shuffle data for small clusters. Obviously not always possible, but I thought I'd throw this out there in case anyone runs into a similar situation. This cluster is, unsurprisingly, on EC2 instances, which made provisioning and shutting down nodes extremely easy.

Cheers,
Andrew
Observation on shuffling vs adding/removing nodes
Just curious if anyone has any thoughts on something we've observed in a small test cluster. We had around 100 GB of data on a 3 node cluster (RF=2) and wanted to start using vnodes. We upgraded the cluster to 1.2.2 and then followed the instructions for using vnodes. We initially tried to run a shuffle, however it seemed to be going really slowly (very little progress by watching cassandra-shuffle ls | wc -l after 5-6 hours and no errors in logs), so we cancelled it and instead added 3 nodes to the cluster, waited for them to bootstrap, and then decommissioned the first 3 nodes. Total process took about 3 hours.

My assumption is that the final result is the same in terms of data being distributed somewhat randomly across nodes (assuming no bias in the token ranges selected when bootstrapping a node). If that assumption is correct, the observation would be: if possible, adding nodes and then removing nodes appears to be a faster way to shuffle data for small clusters. Obviously not always possible, but I thought I'd throw this out there in case anyone runs into a similar situation. This cluster is, unsurprisingly, on EC2 instances, which made provisioning and shutting down nodes extremely easy.

Cheers,
Andrew
Bootstrapping a node in 1.2.2
I've got a 3 node cluster on 1.2.2 and just bootstrapped a new node into it. For each of the existing nodes, I had num_tokens set to 256, and for the new node I also had it set to 256. However, after bootstrapping into the cluster, nodetool status keyspace for my main keyspace, which has RF=2, now reports:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns   Host ID  Rack
UN  10.0.0.100  51.74 GB   256     34.1%  xxx      rack1
UN  10.0.0.103  75.04 GB   256     97.5%  yyy      rack1
UN  10.0.0.101  77.61 GB   256     34.1%  zzz      rack1
UN  10.0.0.102  126.93 GB  256     34.3%  www      rack1

Why does the bootstrapped node now own half the data? I would've expected 66.6% each. Any idea why the bootstrapped node is taking on a larger share and how to spread the load evenly?

By the way, this test cluster is using the SimpleSnitch, so it shouldn't be a topology issue.
Re: Nodetool drain automatically shutting down node?
If it helps, here's the log with debug log statements. Possibly an issue with that exception?

INFO [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:32,402 StorageService.java (line 774) DRAINING: starting drain process
INFO [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:32,403 CassandraDaemon.java (line 218) Stop listening to thrift clients
INFO [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:32,404 Gossiper.java (line 1133) Announcing shutdown
DEBUG [GossipTasks:1] 2013-03-09 03:54:33,328 DebuggableThreadPoolExecutor.java (line 190) Task cancelled
java.util.concurrent.CancellationException
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:220)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.extractThrowable(DebuggableThreadPoolExecutor.java:182)
    at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.logExceptionsAfterExecute(DebuggableThreadPoolExecutor.java:146)
    at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor.afterExecute(DebuggableScheduledThreadPoolExecutor.java:50)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,406 StorageService.java (line 776) DRAINING: shutting down MessageService
INFO [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,406 MessagingService.java (line 534) Waiting for messaging service to quiesce
INFO [ACCEPT-ip-10-116-111-143.ec2.internal/10.116.111.143] 2013-03-09 03:54:33,407 MessagingService.java (line 690) MessagingService shutting down server thread.
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,408 StorageService.java (line 776) DRAINING: waiting for streaming
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,408 StorageService.java (line 776) DRAINING: clearing mutation stage
DEBUG [Thread-5] 2013-03-09 03:54:33,408 Gossiper.java (line 221) Reseting version for /10.83.55.44
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,409 StorageService.java (line 776) DRAINING: flushing column families
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,409 ColumnFamilyStore.java (line 713) forceFlush requested but everything is clean in Counter1
DEBUG [Thread-6] 2013-03-09 03:54:33,410 Gossiper.java (line 221) Reseting version for /10.80.187.124
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,410 ColumnFamilyStore.java (line 713) forceFlush requested but everything is clean in Super1
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,410 ColumnFamilyStore.java (line 713) forceFlush requested but everything is clean in SuperCounter1
DEBUG [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,410 ColumnFamilyStore.java (line 713) forceFlush requested but everything is clean in Standard1
INFO [RMI TCP Connection(2)-10.116.111.143] 2013-03-09 03:54:33,510 StorageService.java (line 774) DRAINED

On Fri, Mar 8, 2013 at 10:36 PM, Andrew Bialecki andrew.biale...@gmail.com wrote:

Hey all,

We're getting ready to upgrade our cluster to 1.2.2 from 1.1.5 and we're testing the upgrade process on our dev cluster. We turned off all client access to the cluster and then ran nodetool drain on the first instance with the intention of running nodetool snapshot once it finished.
However, after running the drain we didn't see any errors, but the Cassandra process was no longer running. Is that expected? From everything I've read it doesn't seem like it, but maybe I'm mistaken. Here's the relevant portion of the log from that node (notice it says it's shutting down the server thread in there):

INFO [RMI TCP Connection(38)-10.116.111.143] 2013-03-09 03:26:48,288 StorageService.java (line 774) DRAINING: starting drain process
INFO [RMI TCP Connection(38)-10.116.111.143] 2013-03-09 03:26:48,288 CassandraDaemon.java (line 218) Stop listening to thrift clients
INFO [RMI TCP Connection(38)-10.116.111.143] 2013-03-09 03:26:48,315 Gossiper.java (line 1133) Announcing shutdown
INFO [RMI TCP Connection(38)-10.116.111.143] 2013-03-09 03:26:49,318 MessagingService.java (line 534) Waiting for messaging service to quiesce
INFO [ACCEPT-ip-10-116-111-143.ec2.internal/10.116.111.143] 2013-03-09 03:26:49,319 MessagingService.java (line 690) MessagingService shutting down server thread.
INFO [RMI TCP Connection(38)-10.116.111.143] 2013-03-09 03:26:49,338 ColumnFamilyStore.java (line 659) Enqueuing flush of Memtable-Counter1@177255852(14810190/60139556 serialized/live bytes, 243550 ops)
INFO [FlushWriter:7] 2013-03-09 03:26:49,338 Memtable.java (line 264) Writing Memtable-Counter1@177255852(14810190/60139556 serialized/live bytes, 243550 ops)
INFO [FlushWriter:7] 2013-03-09 03:26:49,899 Memtable.java (line 305
Running Cassandra 1.1, how can I see the efficiency of the key cache?
Since it's not in cfstats anymore, is there another way to monitor this? I'm working with a dev cluster and I've got OpsCenter set up, so I tried taking a look through that, but it just shows NO DATA. Does that mean the key cache isn't enabled? I haven't changed the defaults there, so the key cache setting in cassandra.yaml is still blank.

Thanks for any help, and happy holidays,
Andrew
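If OpsCenter shows nothing, one fallback worth trying: in 1.1 the key cache became a global cache, and its hit stats should show up in nodetool info rather than cfstats. A sketch that scrapes it; the exact line format varies by version, so treat the parsing as an assumption:

import subprocess

# Scrape key cache stats from `nodetool info`.
out = subprocess.run(
    ["nodetool", "-h", "localhost", "info"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    if line.startswith("Key Cache"):
        # Expected shape (an assumption; check your version's output):
        # Key Cache : size ... capacity ..., N hits, M requests, R recent hit rate, ...
        print(line.strip())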
Need to run nodetool repair on a cluster running 1.1.6 if no deletes
Hey everyone,

I'm seeing some conflicting advice out there about whether you need to run nodetool repair within GCGraceSeconds on 1.x. Can someone clarify two things:

(1) Do I need to run repair if I'm running 1.x?
(2) Should I bother running repair if I don't have any deletes? Any drawbacks to not running it?

Thanks,
Andrew
Re: Simulating a failed node
Thanks, extremely helpful. The key bit was I wasn't flushing the old Keyspace before re-running the stress test, so I was stuck at RF = 1 from a previous run despite passing RF = 2 to the stress tool.

On Sun, Oct 28, 2012 at 2:49 AM, Peter Schuller peter.schul...@infidyne.com wrote:

Operation [158320] retried 10 times - error inserting key 0158320 ((UnavailableException))

This means that at the point where the thrift request to write data was handled, the co-ordinator node (the one your client is connected to) believed that, among the replicas responsible for the key, too many were down to satisfy the consistency level. Most likely causes would be that you're in fact not using RF 2 (e.g., is the RF really 1 for the keyspace you're inserting into), or you're in fact not using ONE.

I'm sure my naive setup is flawed in some way, but what I was hoping for was when the node went down it would fail to write to the downed node and instead write to one of the other nodes in the cluster. So the question is why are writes failing even after a retry? It might be that the stress client doesn't pool connections (I took

Writes always go to all responsible replicas that are up, and when enough return (according to consistency level), the insert succeeds. If replicas fail to respond you may get a TimeoutException. UnavailableException means it didn't even try, because it didn't have enough replicas to even try to write to.

(Note though: reads are a bit of a different story, and if you want to test behavior when nodes go down I suggest including that. See CASSANDRA-2540 and CASSANDRA-3927.)

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
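For anyone hitting the same thing: the fix was to drop the keyspace left over from the earlier run (the old stress tool's default keyspace is Keyspace1, if I recall) before re-running stress, since the tool won't recreate a keyspace that already exists with new settings. In cassandra-cli that's just: drop keyspace Keyspace1; and then, if memory serves, the 1.1-era stress tool takes the replication factor and consistency level as flags, something like:

tools/bin/cassandra-stress -d server1,server2,server3 -n 100 -l 2 -e QUORUM

Double-check the flag names against your version's --help before relying on them.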
Simulating a failed node
Hey everyone,

I'm trying to simulate what happens when a node goes down to make sure my cluster can gracefully handle node failures. For my setup I have a 3 node cluster running 1.1.5. I'm then using the stress tool included in 1.1.5, running from an external server with the following arguments:

tools/bin/cassandra-stress -d server1,server2,server3 -n 100

I start up the stress test and then down one of the nodes. The stress test instantly fails with the following errors (which of course are the same error from different threads):

...
Operation [158320] retried 10 times - error inserting key 0158320 ((UnavailableException))
Operation [158429] retried 10 times - error inserting key 0158429 ((UnavailableException))
Operation [158439] retried 10 times - error inserting key 0158439 ((UnavailableException))
Operation [158470] retried 10 times - error inserting key 0158470 ((UnavailableException))
158534,0,0,NaN,43
FAILURE

I'm sure my naive setup is flawed in some way, but what I was hoping for was when the node went down it would fail to write to the downed node and instead write to one of the other nodes in the cluster. So the question is why are writes failing even after a retry? It might be that the stress client doesn't pool connections (I took a quick look, but might not have looked deeply enough), however I also tried only specifying the first two server nodes and then downing the third, with the same failure.

Thanks in advance.
Andrew
Re: Simulating a failed node
The default replication factor and consistency level for the stress tool are one, so that's what I'm using. I've also experimented and seen the same behavior with RF=2, but I haven't tried a different CL.

On Sun, Oct 28, 2012 at 12:36 AM, Watanabe Maki watanabe.m...@gmail.com wrote:

What RF and CL are you using?

On 2012/10/28, at 13:13, Andrew Bialecki andrew.biale...@gmail.com wrote:

Hey everyone,

I'm trying to simulate what happens when a node goes down to make sure my cluster can gracefully handle node failures. For my setup I have a 3 node cluster running 1.1.5. I'm then using the stress tool included in 1.1.5, running from an external server with the following arguments:

tools/bin/cassandra-stress -d server1,server2,server3 -n 100

I start up the stress test and then down one of the nodes. The stress test instantly fails with the following errors (which of course are the same error from different threads):

...
Operation [158320] retried 10 times - error inserting key 0158320 ((UnavailableException))
Operation [158429] retried 10 times - error inserting key 0158429 ((UnavailableException))
Operation [158439] retried 10 times - error inserting key 0158439 ((UnavailableException))
Operation [158470] retried 10 times - error inserting key 0158470 ((UnavailableException))
158534,0,0,NaN,43
FAILURE

I'm sure my naive setup is flawed in some way, but what I was hoping for was when the node went down it would fail to write to the downed node and instead write to one of the other nodes in the cluster. So the question is why are writes failing even after a retry? It might be that the stress client doesn't pool connections (I took a quick look, but might not have looked deeply enough), however I also tried only specifying the first two server nodes and then downing the third, with the same failure.

Thanks in advance.
Andrew