Upgrading Cassandra 3.11.14 → 4.1

2023-01-16 Thread Lapo Luchini

Hi all,
is upgrading Cassandra 3.11.14 → 4.1 supported, or is it better to 
follow the 3.11.14 → 4.0 → 4.1 path?


(I think it is okay, as I found no record of deprecated old SSTable 
formats, but I couldn't find any official documentation regarding 
upgrade paths… forgive me if it is around)
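
Whatever the supported path turns out to be, the per-node rolling 
sequence I'd follow is roughly this (just a sketch of the usual 
procedure; service/package handling depends on the OS):

nodetool drain              # flush memtables, stop accepting traffic on this node
service cassandra stop      # stop the old version
# install the new binaries and merge the cassandra.yaml changes by hand
service cassandra start     # start the new version, wait for Up/Normal
nodetool upgradesstables    # rewrite SSTables in the new format (can be run later)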


--
Lapo Luchini
l...@lapo.it



Re: Reading data from disk

2023-01-04 Thread Lapo Luchini

Hi,
I'm not part of the team, I reply as a fellow user.

Columns which are part of the PRIMARY KEY are always indexed and used to 
optimize the query, but it also depends on how the partition key is defined.


Details here in the docs:
https://cassandra.apache.org/doc/latest/cassandra/cql/ddl.html#primary-key

In the example given:

CREATE TABLE t (
a int,
b int,
c int,
d int,
PRIMARY KEY ((a, b), c, d)
);

…this means that (a, b) is the partition key (always needed in WHERE, as 
it is used to calculate the hash and thus the node that owns the 
specific row) and (c, d) are the clustering columns, which are not needed 
in the WHERE clause, but can be used in both WHERE and ORDER BY.
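
For example, with the table above all of these are valid and fast 
(values are just placeholders):

SELECT * FROM t WHERE a = 1 AND b = 2;                      -- whole partition
SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;            -- slice of the partition
SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3 AND d > 10; -- range on the last clustering column
SELECT * FROM t WHERE a = 1 AND b = 2 ORDER BY c DESC;      -- reverse clustering order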


Generally speaking Cassandra forbids "slow queries" (they need an extra 
parameter to be used), so every query you can run is either fast (and 
using indexes) or forbidden (rejected with an error), so you don't need 
to worry about slow queries.
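
The extra parameter is ALLOW FILTERING; a sketch of what happens without it:

SELECT * FROM t WHERE d = 4;                  -- rejected: would require scanning data
SELECT * FROM t WHERE d = 4 ALLOW FILTERING;  -- accepted, but can be very slow on big tables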


On 2023-01-03 13:07, Inquistive allen wrote:

Hello Team,

Here is a simple query. Whenever a select query is run with clustering 
columns in the where clause, does it happen that the entire partition is 
read from disk into memory and then iterated over to fetch the required 
result set?


Or are there indexes in place which help read only the specific data 
from disk?


Thanks
Allen





Re: Best compaction strategy for rarely used data

2022-12-30 Thread Lapo Luchini

On 2022-12-29 21:54, Durity, Sean R via user wrote:
At some point you will end up with large sstables (like 1 TB) that won’t 
compact because there are not 4 similar-sized ones able to be compacted 


Yes, that's exactly what's happening.

I'll probably see just one more compaction, since the biggest SSTable is 
already more than 20% of the remaining free space.



For me, the backup strategy shouldn’t drive the rest.


Mhh, yes, that makes sense.

And if your data is ever-growing 
and never deleted, you will be adding nodes to handle the extra data as 
time goes by (and running clean-up on the existing nodes).


What will happen when adding new nodes, as you say, though?
If I have a 1 TB SSTable in which 250 GB of data will no longer be useful 
(as a new node will be the new owner), will that SSTable be reduced to 
750 GB by "cleanup", or will it retain the old data?
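
(For context, the command I have in mind is simply this, run on each 
pre-existing node once the new node has joined — keyspace name is a 
placeholder:)

nodetool cleanup my_keyspace    # rewrites SSTables, dropping the ranges this node no longer owns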


Thanks,

--
Lapo Luchini
l...@lapo.it



Best compaction strategy for rarely used data

2022-12-29 Thread Lapo Luchini
Hi, I have a table which gets (a lot of) data that is written once and 
very rarely read (it is used for data that is mandatory for regulatory 
reasons), and almost never deleted.


I'm using the default STCS as at the time I didn't know any better, but 
SSTable sizes are getting huge, which is a problem both because they are 
approaching the size of the available disk and because I'm using a 
snapshot-based system to back up the node (and thus compacting a huge 
SSTable into an even bigger one generates a lot of backup traffic for 
mostly-old data).


I'm thinking about switching to LCS (mainly to solve the size issue), 
but I read that it is "optimized for read heavy workloads […] not a good 
choice for immutable time series data". Given that I don't really care 
about write or read speed, but would like SSTable sizes to have an upper 
limit, would this strategy still be the best?
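
For reference, the change I'm considering would be something like this 
(a sketch; table name and the 160 MiB target size are placeholders):

ALTER TABLE my_keyspace.archive
WITH compaction = {'class': 'LeveledCompactionStrategy',
                   'sstable_size_in_mb': 160};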


PS: Googling around, a strategy called "incremental compaction" (ICS) 
keeps showing up in the results, but that's only available in ScyllaDB, right?


--
Lapo Luchini
l...@lapo.it



Re: Change IP address (on 3.11.14)

2022-12-06 Thread Lapo Luchini

On 2022-12-06 14:21, Gábor Auth wrote:
No! Just start it and the other nodes in the cluster will acknowledge 
the new IP, they recognize the node by id, stored in the data folder of 
the node.


Thanks Gábor and Erick!

It worked flawlessly.
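
For the archives, the whole thing boiled down to something like this (a 
sketch; the service name and which address fields are set depend on the 
setup):

service cassandra stop
# edit cassandra.yaml: update listen_address / rpc_address (and
# broadcast_address, if set) to the new IP; also update any seed lists
# on the other nodes that mention the old IP
service cassandra start
nodetool status    # the node should come back Up/Normal on the new address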

--
Lapo Luchini
l...@lapo.it



Change IP address (on 3.11.14)

2022-12-06 Thread Lapo Luchini

Hi all,
I'm trying to change the IP address of an existing live node (possibly 
without deleting data and streaming terabytes all over again) following 
these steps:


https://stackoverflow.com/a/57455035/166524
1. echo 'auto_bootstrap: false' >> cassandra.yaml
2. add "-Dcassandra.replace_address=oldAddress" in cassandra-env.sh
3. restart node

But I get this error:

  Cannot replace address with a node that is already bootstrapped

So I guess that answer is outdated for 3.11.
(or was always wrong, given it is from 2019?)

Is there a way to do it?
Or should I delete all the DB on disk and bootstrap from scratch?

thanks,

--
Lapo Luchini
l...@lapo.it



Re: Adding an IPv6-only server to a dual-stack cluster

2022-11-18 Thread Lapo Luchini
So basically listen_address=:: (which should accept both IPv4 and IPv6) 
is fine, as long as broadcast_address reports the same single IPv4 
address that the node always reported previously?


The presence of broadcast_address removes the "different nodes in the 
cluster pick different addresses for you" case?


On 2022-11-16 14:03, Bowen Song via user wrote:
I would expect that you'll need NAT64 in order to have a cluster with 
mixed nodes between IPv6-only servers and dual-stack servers that's 
broadcasting their IPv4 addresses. Once all IPv4-broadcasting dual-stack 
nodes are replaced with nodes either IPv6-only or dual-stack but 
broadcasting IPv6 instead, the NAT64 can be removed.



On 09/11/2022 17:27, Lapo Luchini wrote:
I have a (3.11) cluster running on IPv4 addresses on a set of 
dual-stack servers; I'd like to add a new IPv6-only server to the 
cluster… is it possible to have the dual-stack ones answer on IPv6 
addresses as well (while keeping the single IPv4 address as 
broadcast_address, I guess)?


This sentence in cassandra.yaml suggests it's impossible:

    Setting listen_address to 0.0.0.0 is always wrong.

FAQ #1 also confirms that (is this true also with broadcast_address?):

    if different nodes in the cluster pick different addresses for you,
    Bad Things happen.

Is it possible to do this, or is my only chance to shut down the entire 
cluster and launch it again as IPv6-only?

(IPv6 is available on each and every host)

And even in that case, is it possible for a cluster to go down from a 
set of IPv4 addresses and be recovered on a parallel set of IPv6 
addresses? (I guess gossip does not expect that)


thanks in advance for any suggestion,








Adding an IPv6-only server to a dual-stack cluster

2022-11-09 Thread Lapo Luchini
I have a (3.11) cluster running on IPv4 addresses on a set of dual-stack 
servers; I'd like to add a new IPv6-only server to the cluster… is it 
possible to have the dual-stack ones answer on IPv6 addresses as well 
(while keeping the single IPv4 address as broadcast_address, I guess)?


This sentence in cassandra.yaml suggests it's impossible:

Setting listen_address to 0.0.0.0 is always wrong.

FAQ #1 also confirms that (is this true also with broadcast_address?):

if different nodes in the cluster pick different addresses for you,
Bad Things happen.

Is it possible to do this, or is my only chance to shut down the entire 
cluster and launch it again as IPv6-only?

(IPv6 is available on each and every host)

And even in that case, is it possible for a cluster to go down from a 
set of IPv4 addresses and be recovered on a parallel set of IPv6 
addresses? (I guess gossip does not expect that)


thanks in advance for any suggestion,

--
Lapo Luchini
l...@lapo.it



Huge single-node DCs (?)

2021-04-08 Thread Lapo Luchini
Hi, one project I wrote uses Cassandra to back the huge amount of data 
it needs (data is written only once and read very rarely, but needs to 
be accessible for years, so the storage needs grow huge over time; I 
chose Cassandra mainly for its horizontal scalability regarding disk 
size), and a client of mine needs to install it on his hosts.


Problem is, while I usually use a cluster of 6 "smallish" nodes (which 
can grow in time), he only has big ESX servers with huge disk space 
(which is already RAID-6 redundant) and wouldn't be able to provide 
3+ nodes per DC.


This is outside my usual experience with Cassandra and, as far as I've 
read, outside most use-cases found on the website or this mailing list, 
so the question is:
does it make sense to use Cassandra with a big (let's say 6 TB today, up 
to 20 TB in a few years) single-node DataCenter, and another single-node 
DataCenter (to act as disaster recovery)?


Thanks in advance for any suggestion or comment!

--
Lapo Luchini
l...@lapo.it



Re: Repair on a slow node (or is it?)

2021-03-31 Thread Lapo Luchini

Thanks for all your suggestions!

I'm looking into it and so far it seems to be mainly a problem of disk 
I/O, as the host is running on spindle disks and being the DR of an 
entire cluster gives it a lot of changes to keep up with.
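
(I'll also check the compaction throughput cap as you suggest — 
something like this, where 64 MB/s is just an example value:)

nodetool getcompactionthroughput       # show the current throttle
nodetool setcompactionthroughput 64    # change it; 0 disables throttling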


First (easy) try will be to add an SSD as ZFS cache (ZIL + L2ARC).
That should make a huge difference already.
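
(Something along these lines, if I read the zpool man page right — pool 
and device names are placeholders:)

zpool add tank log ada4p1     # SLOG device, speeds up synchronous writes (ZIL)
zpool add tank cache ada4p2   # L2ARC device, extends the read cache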

I will also look into Medusa/tablesnap later on, thanks.

cheers,
Lapo

On 2021-03-29 12:32, Kane Wilson wrote:
Check what your compactionthroughput is set to, as it will impact the 
validation compactions. Also, what kind of disks does the DR node have? 
The validation compaction sizes are likely fine; I'm not sure of the 
exact details, but it's normal to expect very large validations.


Rebuilding would not be an ideal mechanism for repairing, and would 
likely be slower and chew up a lot of disk space. It's also not 
guaranteed to give you data that will be consistent with the other DC, 
as replicas will only be streamed from one node.


  I think you're better off looking at setting up regular backups and, 
if you really need it, commitlog backups. The storage would be cheaper 
and more reliable, plus less impactful on your production DC. Restoring 
will also be a lot easier and faster, as restoring from a single-node DC 
will be network-bottlenecked. There are various tools around that do 
this for you, such as Medusa or tablesnap.




Repair on a slow node (or is it?)

2021-03-29 Thread Lapo Luchini

Hi all,
I have a 6-node production cluster with 1.5 TiB load (RF=3) and a 
single-node DC with 2.7 TiB dedicated as a "remote disaster recovery copy".


Doing repairs only on the production cluster takes a semi-decent time 
(24h for the biggest keyspace, which takes 90% of the space), but 
repairing across the two DCs takes forever, and segments often fail 
even though I increased Reaper's segment time limit to 2h.


In trying to debug the issue, I noticed that "compactionstats -H" on the 
DR node shows huge (and very very slow) validations:


compaction   completed    total      unit    progress
Validation   2.78 GiB     8.11 GiB   bytes   34.33%
Validation   0 bytes      2.67 TiB   bytes   0.00%
Validation   1.7 TiB      2.43 TiB   bytes   69.75%
Validation   124.26 GiB   2.67 TiB   bytes   4.55%
Validation   536.67 GiB   2.67 TiB   bytes   19.63%

Such validations take a few hours to complete, and as far as I 
understood segment repair always fails on the first try due to those, 
and only succeeds after a few retries, once the validation started by 
the first try has ended.


My question is this: is it normal to have to validate all of the 
keyspace content on each segment's validation?

Is the DB in a "strange" state?
Would it be useful to issue a "rebuild" on that node, in order to send 
all the missing data anyway, thus skipping the lengthy validations?


thanks!

--
Lapo Luchini
l...@lapo.it



Re: Changing num_tokens and migrating to 4.0

2021-03-22 Thread Lapo Luchini

On 2021-03-22 01:27, Kane Wilson wrote:
You should be able to get repairs working fine if you use a tool such as 
cassandra-reaper to manage it for you for such a small cluster. I would 
look into that before doing major cluster topology changes, as these can 
be complex and risky.


I was looking into a migration as I'm already using Cassandra Reaper, 
and the biggest keyspace often takes more than 7 days to complete (I 
set the segment timeout up to 2h, and most of the 138 segments take more 
than 1h, sometimes even failing the 2h limit due to, AFAICT, lengthy 
compactions).


--
Lapo Luchini
l...@lapo.it





Re: Changing num_tokens and migrating to 4.0

2021-03-20 Thread Lapo Luchini

Hi, thanks for suggestions!
I'll definitely migrate to 4.0 after all this is done, then.

I fear the old prod DC can't afford losing a node right now (a few nodes 
have their disks 70% full), but I can maybe find a third node for the 
new DC right away.


BTW the new nodes have 3× the disk space, but are not that different 
regarding CPU and RAM: does it make any sense to give them slightly more 
num_tokens (maybe 20-30 instead of 16) than the rest of the old DC 
hosts, or do "asymmetrical" clusters lead to problems?


No real need to do that anyway: moving from 6 nodes to (eventually) 8 
should be enough to lessen the load on the disks, and before more space 
is needed I will probably have more nodes.


Lapo

On 2021-03-20 16:23, Alex Ott wrote:
I personally would maybe go the following way (need to calculate how 
many joins/decommissions there will be at the end):


  * Decommission one node from the prod DC
  * Form a new DC from the two new machines and the decommissioned one.
  * Rebuild the new DC from the existing one, make sure that repair finished, etc.
  * Switch traffic
  * Remove the old DC
  * Add the nodes from the old DC one by one into the new DC








Changing num_tokens and migrating to 4.0

2021-03-20 Thread Lapo Luchini
I have a 6-node production cluster running 3.11.9 with the default 
num_tokens=256… which is fine, but I later discovered that it makes 
repairs a bit of a hassle and that it's probably better to lower it to 16.


I'm adding two new nodes with much more storage space and I was 
wondering which migration strategy is better.


If I got it correctly, I was thinking about this (rough commands 
sketched below):
1. add the 2 new nodes as a new "temporary DC", with num_tokens=16 RF=3
2. repair it all, then test it a bit
3. switch production applications to "DC-temp"
4. drop the old 6-node DC
5. re-create it from scratch with num_tokens=16 RF=3
6. switch production applications to "main DC" again
7. drop "DC-temp", eventually integrate its nodes into "main DC"

I'd also like to migrate from 3.11.9 to 4.0-beta2 (I'm running on 
FreeBSD, so those are the options). Does it make sense to do it during 
the mentioned "num_tokens migration" (at step 1, or 5), or does it make 
more sense to do it afterwards as a step 8, i.e. an in-place rolling 
upgrade of each of the 6 (or 8) nodes?


Did I get it correctly?
Can it be done "better"?

Thanks in advance for any suggestion or correction!

--
Lapo Luchini
l...@lapo.it





Re: How to debug node load unbalance

2021-03-05 Thread Lapo Luchini

Thanks for the explanation, Kane!

In case anyone is curious I decommissioned node7 and things re-balanced 
themselves automatically: https://i.imgur.com/EOxzJu9.png

(node8 received 422 GiB, while the others did receive 82-153 GiB,
as reported by "nodetool netstats -H")

Lapo

On 2021-03-03 23:59, Kane Wilson wrote:
Well, that looks like your problem. They are logical racks and they come 
into play when NetworkTopologyStrategy is deciding which replicas to put 
data on. NTS will ensure a replica goes on the first node in a different 
rack when traversing the ring, with the idea of keeping only one set of 
replicas on a rack (so that a whole rack can go down without you losing 
QUORUM).






Re: How to debug node load unbalance

2021-03-03 Thread Lapo Luchini

Hi! The nodes are all in different racks… except for node7 and node8!
That's one more thing that makes them similar (which I didn't notice at 
first), besides the timeline of being added to the cluster.


About the token ring calculation… I re-did it in NodeJS instead of awk 
as a double check… and yes, it checks out:

0.14603899130502265 node1 HEL1-DC2
0.14885298279986256 node2 FSN1-DC6
0.13917538352395356 node3 FSN1-DC12
0.13593194981676893 node4 FSN1-DC10
0.14054248949667470 node5 FSN1-DC11
0.14387515909570683 node7 FSN1-DC7
0.14558304396201086 node8 FSN1-DC7

(rack names are actually Hetzner datacenters, which is the most accurate 
info I have… maybe those two FSN1-DC7 nodes are in different racks too)


Yeah, I'm willing to "live with it"; my question was more along the 
lines of "is there something I didn't understand well in Cassandra?" 
than being worried about the difference in usage.


All the involved keyspaces have an identical RF of:
{'class': 'NetworkTopologyStrategy', 'Hetzner': 3, 'DR': 1}

Hypothetically speaking, how would I obtain "the tokens that will 
balance the ring"? Are there tools out there?


thanks in advance!
cheers,
Lapo

On 2021-03-03 11:41, Kane Wilson wrote:
The load calculation always has issues so I wouldn't count on it, 
although in this case it does seem to roughly line up. Are you sure your 
ring calculation was accurate? It doesn't really seem to line up with 
the owns % for the 33% node, and it is feasible (although unlikely) that 
you could roll a node with a bunch of useless tokens and end up in this 
scenario.






How to debug node load unbalance

2021-03-02 Thread Lapo Luchini
I had a 5-node cluster, then increased to 6, then to 7, then to 8, then 
went back to 7. I installed 3.11.6 back when num_tokens defaulted to 256, 
so as far as I understand, at the expense of long repairs, it should have 
an excellent capacity to scale to new nodes, but I get this status:


Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  AddressLoad Tokens  Owns
UN  node1  1.08 TiB 256 46.4%
UN  node2  1.06 TiB 256 45.8%
UN  node3  1.02 TiB 256 45.1%
UN  node4  1.01 TiB 256 46.6%
UN  node5  994.92 GiB   256 44.0%
UN  node7  1.04 TiB 256 38.1%
UN  node8  882.03 GiB   256 33.9%

(I renamed nodes and sorted them to represent the date they entered the 
cluster; notice node6 was decommissioned and later replaced by node8)


This is a Prometheus+Grafana graph of the population of a new table 
(created when the cluster was already stable with node8):


https://i.imgur.com/CLDLENU.png

I don't understand why node7 (in blue) and node8 (in red) are way less 
loaded with data than the others.

(as correctly reported both by "owns" and the graph)

PS: the purple node at the top is the disaster recovery node in a remote 
location, and is a single node instead of a cluster, so it's expected 
that it has way more load than the others.


I tried summing all the token ranges from "nodetool ring" and they are 
quite balanced (as expected with 256 virtual tokens, I guess):


% nodetool ring | awk '
    /^=/     { prev = -1 }
    /^[0-9]/ { ip = $1; pos = $8;
               if (prev != -1) host[ip] += pos - prev;
               prev = pos }
    END      { tot = 0;
               for (ip in host) if (ip != "nodeDR") tot += host[ip];
               for (ip in host) print host[ip] / tot, ip }'

0.992797 nodeDR
0.146039 node1
0.148853 node2
0.139175 node3
0.135932 node4
0.140542 node5
0.143875 node7
0.145583 node8
(yes, I know it has a slight bias because it doesn't handle the first 
line correctly, but that's less than 0.8%)


It's true that node8 being newer probably has less "extra data", but 
after adding it and after waiting for Reaper to repair all tables, I did 
"nodetool cleanup" on all other nodes, so that shouldn't be it.


Oh, the tables that account for 99.9% of the used space (including the 
one in the graph above) have millions of records and have a timeuuid 
inside the partition key, so they should distribute perfectly well among 
all tokens.


Is there any other reason for the load unbalance I didn't think of?
Is there a way to force things back to normal?

--
Lapo Luchini
l...@lapo.it





Cassandra on ZFS: disable compression?

2021-01-25 Thread Lapo Luchini

Hi,
I'm using a fairly standard install of Cassandra 3.11 on FreeBSD 12; 
by default the filesystem is compressed using LZ4, and Cassandra tables 
are compressed using LZ4 as well.


I was wondering if anybody has data about this already (otherwise I will 
probably do some tests myself, eventually): would it be a good idea to 
disable Cassandra compression and rely only on the ZFS one?


In principle I can see some pros:
1. it's done in kernel, so it might be slightly faster
2. it can (probably) compress more data, as I see a 1.02 compression factor
   on the filesystem even though the data in the tables is already compressed
3. in the upcoming ZFS version I will be able to use Zstd compression
   (probably before Cassandra 4.0 is gold)
4. compression ratios can be inspected directly at the filesystem level

But on the other hand application-level compression could have its 
advantages.
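
The test I have in mind is roughly this (a sketch; table and dataset 
names are placeholders):

# disable Cassandra-level compression on a test table (existing SSTables
# are only rewritten on compaction or with "nodetool upgradesstables -a"):
cqlsh -e "ALTER TABLE my_keyspace.my_table WITH compression = {'enabled': 'false'};"
# then compare what ZFS alone achieves on the underlying dataset:
zfs get compressratio zroot/var/db/cassandra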


cheers,

--
Lapo Luchini
l...@lapo.it





Re: Cassandra 4.0-beta1 available on FreeBSD

2020-10-27 Thread Lapo Luchini

Angelo Polo wrote:

Cassandra 4.0-beta1 is now available on FreeBSD.


By the way, I'm running a 6-node production cluster using 3.11.6 on 
FreeBSD 12.1/amd64 and I'm very happy with it (thanks Angelo for 
maintaining the FreeBSD Port!).


I hope your 4.0-beta2 patch will be accepted soon (and that 4.0 goes out 
of beta soon too) so that I'll be able to upgrade that cluster.


--
Lapo Luchini - http://lapo.it/





Repair shell script

2020-10-27 Thread Lapo Luchini
I'd like to set up Cassandra Reaper soon enough, but right now I'm 
keeping my cluster repaired with the following self-made script.


I created it by reading the official wiki and some of this ML's 
messages, but can anyone confirm I chose the correct options?

(i.e. running "repair -pr" in turn on each and every host)

#!/bin/sh
# Run "nodetool repair -pr" on each host in turn, keeping only the
# start/result summary lines of each repair command in the output.
USER=admin
PASS=password
for h in host1 host2 host3 host4 host5; do
    echo ""
    echo "** Host: $h"
    echo ""
    nodetool -h "$h" -u "$USER" -pw "$PASS" repair -pr | \
        egrep -B1 '(Starting repair|Repair command)'
done

--
Lapo Luchini - http://lapo.it/

