1. Check for Nagle/delayed-ack, but probably nodelay is getting set by the
driver so it shouldn't be a problem.
2. Check for network latency (just regular old ping among hosts, during
traffic)
3. Check your GC metrics and see if garbage collections line up with
outliers. Some tuning can help
On Wed, Apr 12, 2023 at 11:36 AM Elliott Sims wrote:
> A few weeks ago, we rolled out TLS among hosts in our clusters (running
> 4.0.7). More recently we also rolled out TLS between Cassandra clients and
> the cluster. Today, we started seeing a lot of dropped actions in one
> cluster th
A few weeks ago, we rolled out TLS among hosts in our clusters (running
4.0.7). More recently we also rolled out TLS between Cassandra clients and
the cluster. Today, we started seeing a lot of dropped actions in one
cluster that correlate with warnings like this:
WARN
A quick search shows SLES 15 provides Java 11 (java-11-openjdk), which is
just fine for Cassandra 4.x.
On Wed, Mar 8, 2023 at 2:56 PM Eric Ferrenbach <
eric.ferrenb...@milliporesigma.com> wrote:
> We are running Cassandra 4.0.7.
>
> We are preparing to migrate our nodes from Centos to SUSE
For dealing with allocate_tokens_for_keyspace in datacenter migrations,
I've just created a dummy keyspace in the new DC with the desired topology,
then removed it once everything's done.
On Mon, Jan 30, 2023 at 3:36 PM Doug Whitfield
wrote:
> Hi folks,
>
> In our 3.11 deployments we are using
Consistently 200ms, during the back-and-forth negotiation rather than the
handshake? That sounds suspiciously like Nagle interacting with Delayed
ACK.
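For anyone who wants to rule this out at the socket level, here is a minimal sketch of disabling Nagle on a plain TCP socket. The helper names are made up for illustration; most database drivers expose their own nodelay option, so treat this as a picture of what that option does rather than something to paste into an application.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 sends small writes immediately instead of holding them
    # until the peer's (possibly delayed) ACK of earlier data arrives.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nodelay_enabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

If the 200ms goes away with nodelay set on both ends, that points fairly strongly at the Nagle/delayed-ACK interaction.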
On Wed, Jan 11, 2023 at 8:41 AM MyWorld wrote:
> Hi all,
> We are facing a connection latency of 200ms between API server and db
> server
If multiple things are dying under load, you'll want to check "dmesg" and
see if the oom-killer is getting triggered. Something like "atop" can be
good for figuring out what was using all of the memory when it was
triggered if the kernel logs don't have enough info.
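A rough sketch of the dmesg check, in case it helps to script it. The function names are hypothetical, and the pattern only looks for the two phrases the kernel normally logs when the oom-killer runs; adjust it if your kernel words things differently.

```python
import re
import subprocess
from typing import Iterable, List

# Phrases the Linux kernel normally logs when the OOM killer fires.
# "Kill" matches both the older "Kill process" and newer "Killed process".
OOM_PATTERN = re.compile(r"invoked oom-killer|Out of memory: Kill", re.IGNORECASE)


def find_oom_events(log_lines: Iterable[str]) -> List[str]:
    """Return the kernel log lines that mention the OOM killer."""
    return [line.rstrip("\n") for line in log_lines if OOM_PATTERN.search(line)]


def dmesg_lines() -> List[str]:
    """Read the kernel ring buffer; may need root on some systems."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout.splitlines()
```

Calling find_oom_events(dmesg_lines()) on the affected host should show which process was killed, if any.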
On Thu, Dec 15, 2022 at 12:41
s *not recommended for clusters over 50
> nodes*.
>
> 16
>
> Best for heavily elastic clusters which expand and shrink regularly, but
> may have issues availability with larger clusters. Not recommended for
> clusters over 50 nodes.
>
> On Sun, Mar 13, 2022 at 11:34 PM Ell
fect
> efficiency if the token figure were the same across all nodes?
>
>
>
> *From:* Elliott Sims
> *Sent:* Thursday, June 16, 2022 12:24 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Configuration for new(expanding) cluster and new admins.
>
>
>
> EXTERNAL
>
If you set a different num_tokens value for new hosts (the value should
never be changed on an existing host), the amount of data moved to that
host will be proportional to the num_tokens value. So, if the new hosts
are set to 32 when they're added to the cluster, those hosts will get twice
as
In terms of turning it into Ansible, it's going to depend a lot on how you
manage the physical layer as well as replication/consistency. Currently, I
just use groups per "rack". If you have an API-accessible CMDB you could
probably pull the physical location from there and translate that to
I think this has a much simpler answer: GNU tar interprets inode changes
as "changes" as well as block contents. This includes the hardlink count.
I actually ended up working around it by using bsdtar, which doesn't
interpret hardlink count changes as a change to be concerned about.
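To make the hardlink point concrete, here is a small sketch that shows the link count of a file changing while its contents and mtime stay the same. The file names are invented for the example; this only illustrates the filesystem behavior, not tar itself.

```python
import os
import tempfile


def link_count_change(directory: str):
    """Create a file, hard-link it, and return stat results before and after."""
    src = os.path.join(directory, "data.db")  # hypothetical sstable name
    with open(src, "wb") as f:
        f.write(b"unchanged contents")
    before = os.stat(src)
    # A snapshot is a hard link to the same inode; adding or removing one
    # changes st_nlink (and ctime) without touching the file's data.
    os.link(src, os.path.join(directory, "snapshot-data.db"))
    after = os.stat(src)
    return before, after


with tempfile.TemporaryDirectory() as d:
    before, after = link_count_change(d)
    print("link count:", before.st_nlink, "->", after.st_nlink)
    print("mtime unchanged:", before.st_mtime_ns == after.st_mtime_ns)
    print("size unchanged:", before.st_size == after.st_size)
```

The same thing happens in reverse when compaction removes the original file and only the snapshot's link remains.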
On Tue, Mar
More tokens: better data distribution, more expensive repairs, higher
probability of a multi-host outage taking some data offline and affecting
availability.
I think with >100 nodes the repair times and availability improvements make
a strong case for 16 tokens even though it means you'll need
CMS has a higher risk of a long stop-the-world full GC that will cause a
burst of timeouts, but if you're not getting that or don't mind if it
happens now and then CMS is probably the way to go. It's generally
lower-overhead than G1. If you really don't care about latency it might
even be worth
Ansible here as well with a similar setup. A play at the end of the
playbook that waits until all nodes in the cluster are "UN" before moving
on to the next node to change.
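The "wait for UN" check is simple enough to sketch outside of Ansible. This assumes the usual "nodetool status" layout where each node line starts with a two-letter status/state code; the function names are hypothetical.

```python
import re
from typing import List, Tuple

# Node lines start with U/D (up/down) followed by N/L/J/M
# (normal/leaving/joining/moving), then whitespace and the address.
NODE_LINE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def node_states(status_output: str) -> List[Tuple[str, str]]:
    """Return (address, code) pairs parsed from 'nodetool status' text."""
    states = []
    for line in status_output.splitlines():
        match = NODE_LINE.match(line)
        if match:
            states.append((match.group(2), match.group(1)))
    return states


def all_up_normal(status_output: str) -> bool:
    """True only if at least one node was parsed and every node is UN."""
    states = node_states(status_output)
    return bool(states) and all(code == "UN" for _, code in states)
```

A play (or a small loop) would run "nodetool status", feed the output to all_up_normal, and sleep and retry until it returns True before touching the next node.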
On Mon, Oct 18, 2021 at 10:01 AM vytenis silgalis
wrote:
> Yep, also use Ansible with configs living in git here.
>
> On
Won't option 2 in that list potentially cause some pretty severe load
imbalance in most cases? The last node with 256 tokens will end up with
16x as much data on it as the 16 token nodes, right?
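A back-of-the-envelope sketch of that imbalance, assuming a node's share of the data is roughly proportional to its token count (it ignores replication and the randomness of token placement). The node names and counts are made up.

```python
from typing import Dict


def expected_share(num_tokens_by_node: Dict[str, int]) -> Dict[str, float]:
    """Approximate each node's fraction of the data from its token count."""
    total = sum(num_tokens_by_node.values())
    return {node: tokens / total for node, tokens in num_tokens_by_node.items()}


# Five replaced nodes at 16 tokens plus one remaining node at 256 tokens.
nodes = {f"new{i}": 16 for i in range(1, 6)}
nodes["old"] = 256
shares = expected_share(nodes)
ratio = shares["old"] / shares["new1"]
print(f"old node holds about {ratio:.0f}x the data of a 16-token node")
```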
You'd have to mitigate it either by adding 16 new nodes for every one you
replace except the last
Depends on your availability requirements, but in general I'd say if you're
going with N replicas, you'd want N failure domains (where one blade
chassis is a failure domain).
On Tue, Aug 10, 2021 at 11:16 PM Erick Ramirez
wrote:
> That's 430TB of eggs in the one 4U basket so consider that
Shouldn't cause GCs.
You can usually think of heap memory separately from the rest. It's
already allocated as far as the OS is concerned, and it doesn't know
anything about GC going on inside of that allocation. You can set
"-XX:+AlwaysPreTouch" to make sure it's physically allocated on
Your partition key determines your partition size. Reducing retention
sounds like it would help some in your case, but really you'd have to split
it up somehow. If it fits your query pattern, you could potentially have a
compound key of userid+datetime, or some other time-based split. You could
As more general advice, I'd strongly encourage you to update to 3.11.x from
2.2.8. My personal experience is that it's significantly faster and more
space-efficient, and the garbage collection behavior under pressure is
drastically better. There's also improved tooling for diagnosing
performance
I'm not sure I'd suggest building a single DIY Backblaze pod. The SATA
port multipliers are a pain both from a supply chain and systems management
perspective. Can be worth it when you're amortizing that across a lot of
servers and can exert some leverage over wholesale suppliers, but less so
I'm a big fan of this one about LWTs:
https://www.youtube.com/watch?v=wcxQM3ZN20c
Not only if you want to understand LWTs, but also to get a better
understanding of the sometimes-unintuitive consistency promises made and
not made for non-LWT queries.
On Tue, Mar 16, 2021 at 11:53 PM wrote:
> I
I'm not too familiar with the details on what's happened more recently, but
I do remember that while Rocksandra was very favorably compared to
Cassandra 2.x, the improvements looked fairly similar in nature and
magnitude to what Cassandra got from the move to the 3.x sstable format and
increased
To start, I'd try to figure out what your slowdown is. Surely GCP has far,
far more than 17Mbps available.
You don't want to cut it close on this, because for stuff like repairs,
rebuilds, interruptions, etc you'll want to be able to catch up and not
just keep up.
Generally speaking, Cassandra
To start with, maybe update to beta4. There's an absolutely massive list of
fixes since alpha4. I don't think the alphas are expected to be in a
usable/low-bug state necessarily, where beta4 is approaching RC status.
On Tue, Jan 26, 2021, 10:44 PM Attila Wind wrote:
> Hey All,
>
> I'm coming
The main downside I see is that you're hitting a less-tested codepath. I
think very few installations have compression disabled today.
On Mon, Jan 25, 2021 at 7:06 AM Lapo Luchini wrote:
> Hi,
> I'm using a fairly standard install of Cassandra 3.11 on FreeBSD
> 12, by default filesystem
1% packet loss can definitely lead to drops. At higher speeds, that's
enough to limit TCP throughput to the point that cross-node communication
can't keep up. TCP_BBR will do better than other strategies at maintaining
high throughput despite single-digit packet loss, but you'll also want to
At least by default, Cassandra has pretty short timeouts. I don't know of
a way to kill an in-flight query, but by the time you did it would have
timed out anyways. I don't know of any way to stop it from repeating other
than tracking down the source and stopping it.
On Wed, Jan 6, 2021, 5:41
Are you running with RF=3 and QUORUM on both read and write?
If so, I think as long as your fill job reports errors and retries you can
probably get away without repairing.
You can also hedge your bets by doing the data load with ALL, though of
course that has an availability tradeoff.
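The reasoning behind that, as a quick sketch: a read is guaranteed to touch at least one replica that acknowledged the write whenever the write acks plus read acks exceed the replication factor. The function names are just for illustration.

```python
def quorum(rf: int) -> int:
    """Number of replicas that make up a quorum for a replication factor."""
    return rf // 2 + 1


def reads_overlap_writes(rf: int, write_acks: int, read_acks: int) -> bool:
    """True if every read set must intersect every acknowledged write set."""
    return write_acks + read_acks > rf


rf = 3
q = quorum(rf)
print("QUORUM/QUORUM overlaps:", reads_overlap_writes(rf, q, q))
print("ONE/ONE overlaps:", reads_overlap_writes(rf, 1, 1))
print("ALL/ONE overlaps:", reads_overlap_writes(rf, rf, 1))
```

This only covers writes that were actually acknowledged, which is why the fill job needs to notice errors and retry.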
Is the heap larger on the M5.4x instance?
Are you sure it's Cassandra generating the read traffic vs just evicting
files read by other systems?
In general, I'd call "more RAM means fewer drive reads" a very expected
result regardless of the details, especially when it's the difference
between
Tracing fully on rather than sampling will definitely add substantial load,
even with shorter TTLs. That's a lot of extra writes.
If it's just on for specific sessions, or is enabled but with low sampling,
that's not bad in terms of load.
On Mon, Nov 16, 2020 at 6:25 AM Shalom Sagges
wrote:
>
You want to look for full or long GCs in the logs, as well as how much
total time it's spending on GCing as a percentage. Probably more the
latter, since you're not seeing long pauses with one core pegged and the
rest idle. G1 handles oversized heaps well, so it's worth bumping to
20-27GB just
I've found there to be some behavior differences in practice as well going
from 2.2 to 3.11 with a high token count, but all differences for the
better. 3.x seems noticeably less likely to crater or GC-thrash during
repairs compared to 2.x, probably due to the sum of small changes rather
than any
There's also a slightly older mailing list discussion on this subject that
goes into detail on this sort of strategy:
https://www.mail-archive.com/user@cassandra.apache.org/msg60006.html
I've been approximately following it, repeating steps 3-6 for the first
host in each "rack(replica, since I
The short answer is that CQL isn't SQL. It looks a bit like it, but the
structure of the data is totally different. Essentially (ignoring
secondary indexes, which have some issues in practice and I think are
generally not recommended) the only way to look the data up is by the
partition key.
If you're upgrading the whole cluster, I'd recommend going ahead and
upgrading all the way to 3.11.6 if possible. In my experience it's been
noticeably faster, more reliable, and easier to manage compared to 3.0.x.
On Thu, Apr 16, 2020 at 6:37 PM Ashika Umagiliya
wrote:
> Thank you for the
I definitely saw a noticeable decrease in GC activity somewhere between
3.11.0 and 3.11.4. I'm not sure which change did it, but I can't think of
any good reason to use 3.11.0 vs 3.11.6.
I would enable and look through GC logs (or just the slow-GC entries in the
default log) to see if the
Async-profiler (https://github.com/jvm-profiling-tools/async-profiler)
flamegraphs can also be a really good tool to figure out the exact
callgraph that's leading to the futex_wait, both in and out of the JVM.
In addition to extra space, queries can potentially be more expensive
because more dead rows and tombstones will need to be scanned. How much of
a difference this makes will depend drastically on the schema and access
pattern, but I wouldn't expect going from 5 days to 8 to be very noticeable.
On the systems side of things, I've found that using the new BBR TCP
congestion algorithm results in far better behavior in cases of low to
moderate packet loss compared to any of the older strategies. It can't fix
totally broken, but it takes good advantage of "usable but lossy". 0.5-2%
loss
On Mon, Oct 21, 2019 at 1:53 PM Sergio wrote:
> Thanks Elliott!
>
> How do you know if there is too much RAM used for those settings?
>
> Which metrics do you keep track of?
>
> What would you recommend instead?
>
> Best,
>
> Sergio
>
> On Mon, Oct 21, 2019
Based on my experiences, if you have a new enough kernel I'd strongly
suggest switching the TCP congestion control algorithm to BBR. I've found the rest
tend to be extremely sensitive to even small amounts of packet loss among
cluster members where BBR holds up well.
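A small sketch for checking which congestion control algorithm a Linux host is currently using. It reads the standard procfs entry; the function names are hypothetical and the path is a parameter only so it can be pointed elsewhere for testing.

```python
from pathlib import Path

# Standard Linux procfs location for the system-wide default algorithm.
DEFAULT_PATH = "/proc/sys/net/ipv4/tcp_congestion_control"


def current_congestion_control(path: str = DEFAULT_PATH) -> str:
    """Return the name of the active TCP congestion control algorithm."""
    return Path(path).read_text().strip()


def uses_bbr(path: str = DEFAULT_PATH) -> bool:
    """True if the host's default congestion control is BBR."""
    return current_congestion_control(path) == "bbr"
```

Switching is normally done through the net.ipv4.tcp_congestion_control sysctl, and it needs a kernel with the BBR module available.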
High ulimits for basically everything
The tar error is because tar also looks for metadata changes. In this
case, it's the refcount that's changing and causing the error. I just
switched to using bsdtar instead as a workaround.
On Tue, Oct 1, 2019, 5:37 PM James A. Robinson
wrote:
> Hi folks,
>
>
> I took a nodetool snapshot of a
There's a concurrent_compactors parameter in cassandra.yaml that does
exactly what the name says. You may also find
compaction_throughput_mb_per_sec useful.
On Tue, Oct 1, 2019 at 8:16 AM Matthias Pfau
wrote:
> Hi there,
> we recently upgraded from 2.2 to 3.11.4.
>
> Unfortunately, we are
Datastax might be a better resource for this. This mailing list is pretty
good about questions that apply to DSE and Apache Cassandra, but the SOLR
integration is pretty specific to DSE.
On Wed, Sep 25, 2019 at 7:15 PM kumar bharath
wrote:
> Hi All,
>
> We are having a 6 node cluster with two
A container of some sort gives you better isolation and less risk of a
mistake that could cause the instances to conflict in some way. Might be
better for balancing resources between them as well, though using cgroups
directly can also accomplish that.
On Fri, Sep 20, 2019, 8:27 AM Nitan Kainth
It may also be worth upgrading to Cassandra 3.11.4. There's some changes
in 3.6+ that significantly reduce heap pressure from very large partitions.
On Mon, Aug 12, 2019 at 9:13 AM Gabriel Giussi
wrote:
> I've found a huge partition (~9GB) in my cassandra cluster because I'm
> losing 3 nodes
It's not really something that can be easily calculated based on write
rate, but more something you have to find empirically and adjust
periodically.
Generally speaking, I'd start by running "nodetool gcstats" or similar and
just see what the GC pause stats look like. If it's not pausing much or
1. Do the same people where you work operate the cluster and write
the code to develop the application?
Mostly. Ops vs dev, although there's some overlap
2. Do you have a metrics stack that allows you to see graphs of
various metrics with all the nodes displayed together?
Yes,
I use G1, and I think it's actually the default now for newer Cassandra
versions. For G1, I've done very little custom config/tuning. I increased
heap to 16GB (out of 64GB physical), but most of the rest is at or near
default. For the most part, it's been "feed it more RAM, and it works"
When a snapshot is taken, it includes a "schema.cql" file. That should be
sufficient to restore whatever you need to restore. I'd argue that neither
automatically resurrecting a dropped table nor silently failing to restore
it is a good behavior, so it's not unreasonable to have the user
I would strongly suggest you consider an upgrade to 3.11.x. I found it
decreased space needed by about 30% in addition to significantly lowering
GC.
As a first step, though, why not just revert to CMS for now if that was
working ok for you? Then you can convert one host for diagnosis/tuning so
:05 PM Eunsu Kim wrote:
> Thank you for your response.
>
> I will run repair from datacenter2 with your advice. Do I have to run
> repair on every node in datacenter2?
>
> There is no snapshot when checked with nodetool listsnapshots.
>
> Thank you.
>
> On 29 Nov 2018, at 4:31 AM,
I think you answered your own question, sort of.
When you expand a cluster, it copies the appropriate rows to the new
node(s) but doesn't automatically remove them from the old nodes. When you
ran cleanup on datacenter1, it cleared out those old extra copies. I would
suggest running a repair
As far as I know, it's not possible to change it live. You have to create
a new "datacenter" with new hosts using the new num_tokens value, then
switch everything to use the new DC and tear down the old.
On Thu, Nov 1, 2018 at 6:16 PM Goutham reddy
wrote:
> Hi team,
> Can someone help me out I
I'll second that - we had some weird inconsistent reads for a long time
that we finally tracked to a small number of clients with significant clock
skew. Make very sure all your client (not just C*) machines have
tightly-synced clocks.
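A toy sketch of why client clock skew causes this, assuming write timestamps are assigned on the client and conflicts are resolved last-write-wins by timestamp. The class and function names are invented, and tie-breaking is simplified.

```python
from dataclasses import dataclass


@dataclass
class Write:
    value: str
    timestamp_us: int  # write timestamp in microseconds


def resolve(a: Write, b: Write) -> Write:
    """Last-write-wins: keep the write with the higher timestamp.

    Ties go to b here; that is a simplification for the sketch.
    """
    return a if a.timestamp_us > b.timestamp_us else b


def client_write(value: str, true_time_us: int, clock_skew_us: int) -> Write:
    """Stamp a write with the client's (possibly skewed) clock."""
    return Write(value, true_time_us + clock_skew_us)


# The first client's clock runs 5 seconds fast; the second is accurate.
first = client_write("old", true_time_us=1_000_000, clock_skew_us=5_000_000)
second = client_write("new", true_time_us=2_000_000, clock_skew_us=0)
winner = resolve(first, second)
print("value that survives:", winner.value)
```

The write that happened later in real time loses, so readers keep seeing the older value until the skewed timestamp is overtaken.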
On Fri, Oct 12, 2018 at 7:40 PM maitrayee shah
wrote:
> We
A few reasons I can think of offhand why your test setup might not see
problems from large readahead:
Your sstables are <4MB or your reads are typically <4MB from the end of the
file
Your queries tend to use the 4MB of data anyways
Your dataset is small enough that most of it fits in the VM cache,
It's interesting and a bit surprising that 256 write threads isn't enough.
Even with a lot of cores, I'd expect you to be able to saturate CPU with
that many threads. I'd make sure you don't have other bottlenecks, like
GC, IOPs, network, or "microbursts" where your load is actually fluctuating
At the time that Facebook chose HBase, Cassandra was drastically less
mature than it is now and I think the original creators had already left.
There were already various Hadoop variants running for data analytics etc,
so lots of operational and engineering experience around it available. So,
>> On Thursday, August 16, 2018, 12:02:55 AM PDT, Behnam B.Marandi <
>> behnam.b.mara...@gmail.com> wrote:
>>
>>
>> Actually I did. It seems this is a cross node traffic from one node to
>> port 7000 (storage_port) of the other node.
>>
>> On Sun, Aug 12,
>> Is this a one-time or occasional load or more frequently?
>>
>> Is the data located in the same physical data center as the cluster? (any
>> network latency?)
>>
>>
>>
>> On the client side, prepared statements and ExecuteAsync can really speed
>
Assuming this isn't an existing cluster, the easiest method is probably to
use logical "racks" to explicitly control which hosts have a full replica
of the data. With RF=3 and 3 "racks", each "rack" has one complete replica.
If you're not using the logical racks, I think the replicas are spread
Step one is always to measure your bottlenecks. Are you spending a lot of
time compacting? Garbage collecting? Are you saturating CPU? Or just a
few cores? Or I/O? Are repairs using all your I/O? Are you just running
out of write threads?
On Wed, Aug 15, 2018 at 5:48 AM, Abdul Patel
Might be a silly question, but did you run "nodetool upgradesstables" and
convert to the 3.0 format? Also, which 3.0? Newest, or an earlier 3.0.x?
On Fri, Aug 10, 2018 at 3:05 PM, kooljava2
wrote:
> Hello,
>
> We recently upgraded C* from 2.1 to 3.0. After the upgrade we are seeing
> increase
Since it's at a consistent time, maybe just look at it with iftop to see
where the traffic's going and what port it's coming from?
On Fri, Aug 10, 2018 at 1:48 AM, Behnam B.Marandi <
behnam.b.mara...@gmail.com> wrote:
> I don't have any external process or planned repair in that time period.
> In
Deflate instead of LZ4 will probably give you somewhat better compression
at the cost of a lot of CPU. Larger chunk length might also help, but in
most cases you probably won't see much benefit above 64K (and it will
increase I/O load).
On Wed, Aug 8, 2018 at 11:18 PM, Eunsu Kim wrote:
> Hi
You might have more luck trying to analyze at the Java level, either via a
(Java) stack dump and the "ttop" tool from Swiss Java Knife, or Cassandra
tools like "nodetool tpstats"
On Wed, Aug 1, 2018 at 2:08 AM, nokia ceph wrote:
> Hi,
>
> I have a 5-node cluster with Cassandra 3.0.13.
>
> i
Among the hosts in a cluster? It depends on how much data you're trying to
read and write. In general, you're going to want a lot more bandwidth
among hosts in the cluster than you have external-facing. Otherwise things
like repairs and bootstrapping new nodes can get slow/difficult. To put it
> All queries use cluster key, so I'm not accidentally reading a whole
> partition.
> The last place I'm looking - which maybe should be the first - is
> tombstones.
>
> sorry for the afternoon rant! thanks for your eyes!
>
> On Thu, Jun 28, 2018 at 5:54 PM, Elliott Sims
>
It depends a bit on which collector you're using, but fairly normal. Heap
grows for a while, then the JVM decides via a variety of metrics that it's
time to run a collection. G1GC is usually a bit steadier and less sawtooth
than the Parallel Mark Sweep, but if your heap's a lot bigger than
Do you have an actual performance issue anywhere at the application level?
If not, I wouldn't spend too much time on it - load avg is a sort of odd
indirect metric that may or may not mean anything depending on the
situation.
On Fri, Jun 15, 2018 at 6:49 AM, Igor Leão wrote:
> Hi there,
>
> I
If this is data that expires after a certain amount of time, you probably
want to look into using TWCS and TTLs to minimize the number of tombstones.
Decreasing gc_grace_seconds then compacting will reduce the number of
tombstones, but at the cost of potentially resurrecting deleted data if the
It's possible that it's something more subtle, but keep in mind that
sstables don't include the schema. If you've made schema changes, you need
to apply/revert those first or C* probably doesn't know what to do with
those columns in the sstable.
On Sun, Jun 10, 2018 at 11:38 PM, wrote:
> Dear
Are you seeing significant issues in terms of performance? Increased
garbage collection, long pauses, or even OutOfMemory? Which garbage
collector are you using and with what settings/thresholds? Since the JVM's
garbage-collected, a bigger heap can mean a problem or it can just mean
"hasn't
I'd say for a large write-heavy workload like this, Cassandra is a pretty clear
winner over MongoDB. I agree with the commenters about understanding your
query patterns a bit better before choosing, though. Cassandra's queries
are a bit limited, and if you're loading all new data every day and
:-/
>
> Thanks Jeff & others for your responses.
>
> - Max
>
> On May 25, 2018, at 5:05pm, Elliott Sims wrote:
>
> I've run across this problem before - it seems like GNU tar interprets
> changes in the link count as changes to the file, so if the file gets
> compacted mid
I've run across this problem before - it seems like GNU tar interprets
changes in the link count as changes to the file, so if the file gets
compacted mid-backup it freaks out even if the file contents are
unchanged. I worked around it by just using bsdtar instead.
On Thu, May 24, 2018 at 6:08
JVM GC tuning can be pretty complex, but the simplest solution to OOM is
probably switching to G1GC and feeding it a rather large heap.
Theoretically a smaller heap and carefully-tuned CMS collector is more
efficient, but CMS is kind of fragile and tuning it is more of a black art,
where you can
Looks like no major table version changes since 3.0, and a couple of minor
changes in 3.0.7/3.7 and 3.0.8/3.8:
https://github.com/apache/cassandra/blob/48a539142e9e318f9177ad8cec47819d1adc3df9/doc/source/architecture/storage_engine.rst
So, I suppose whether a revert is safe or not depends on
79 matches