A few thoughts on this:

– 80TB per machine is pretty dense. Consider the amount of data you'd need to re-replicate in the event of a hardware failure that takes down all 80TB (a DIMM failure requiring replacement, a non-redundant PSU failure, a NIC, etc.).

– 24GB of heap is also pretty generous. Depending on how you're using Cassandra, you may be able to get by with roughly half of that (though keep in mind the additional direct memory/off-heap space required if you're using off-heap merkle trees).

– 40 instances per machine can be a lot to manage. You can reduce this and address multiple physical drives per instance either by RAID-0'ing them together, or by running Cassandra in a JBOD configuration (multiple data directories per instance).

– Remember to consider the ratio of available CPU to the amount of storage you're addressing per machine. It's easy to spec a box that maxes out on disk without enough oomph to serve user queries and compaction over that much storage.

– You'll want to run some smaller-scale perf testing to determine this ratio. The good news is that you mostly need to stress the throughput of a replica set rather than an entire cluster. Small-scale POCs will generally map well to larger clusters, so long as the total count of Cassandra processes isn't more than a couple thousand.

– At this scale, small improvements can go a very long way. If your data is compressible (i.e., not pre-compressed/encrypted before being stored in Cassandra), you'll likely want to use ZStandard rather than LZ4, possibly at a higher compression level than the default. Test a set of input data with different ZStandard compression levels.
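A quick sketch of that level-sweep experiment. The sample payload is made up, and stdlib zlib stands in for zstd so the sketch runs anywhere; for real numbers, repeat it with the third-party `zstandard` bindings against representative chunks of your actual data:

```python
# Sketch: sweep compression levels over a sample of your data.
# zlib (stdlib) stands in for zstd here; the methodology is the same.
import zlib

# Hypothetical sample payload; substitute real chunks of your own data.
sample = b"event_id=42 ts=2023-08-17T07:46:00Z status=OK payload=abcdef\n" * 2000

for level in (1, 3, 6, 9):
    size = len(zlib.compress(sample, level))
    print(f"level {level}: {size} bytes ({size / len(sample):.1%} of original)")
```

Cassandra 4.0 and later ship a `ZstdCompressor` whose `compression_level` can be set per table in the compression options, so the winning level from an experiment like this can be applied directly.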
You may save more than 10% of storage relative to LZ4 by doing so, without sacrificing much in terms of CPU.

On Aug 17, 2023, at 7:46 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:

Thanks for this - yeah - duh - forgot about replication in my example!

So - is 2TBytes per Cassandra instance advisable? Better to use more/less? Modern 2U servers can be had with 24 3.8TByte SSDs; so assuming 80TBytes per server, you could do: (1024*3)/80 = 39 servers, but you'd have to run 40 instances of Cassandra on each server; maybe 24G of heap per instance, so a server with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user
wrote:

Just pointing out the obvious: for 1PB of data on nodes with 2TB of disk each, you will need far more than 500 nodes.

1. It is unwise to run Cassandra with replication factor 1. It usually makes sense to use RF=3, so 1PB of data will cost 3PB of storage space, a minimum of 1500 such nodes.

2. Depending on the compaction strategy you use and the write access pattern, there's disk space amplification to consider. For example, with STCS, the disk usage can be many times the actual live data size.

3. You will need some extra free disk space as temporary space for running compactions.

4. The data is rarely going to be perfectly evenly distributed among all nodes; you need to take that into consideration and size for the node with the most data.

5. Enough bad news, here's a good one: compression will save you a lot of disk space!

With all the above considered, you will probably end up with a lot more than the 500 nodes you initially thought. Your choice of compaction strategy and compression ratio can dramatically affect this calculation.

On 16/08/2023 16:33, Joe Obernberger
wrote:

General question on how to configure Cassandra. Say I have 1PByte of data to store. The general rule of thumb is that each node (or at least each instance of Cassandra) shouldn't handle more than 2TBytes of disk. That means 500 instances of Cassandra.

Assuming you have very fast persistent storage (such as NetApp, Portworx, etc.), would using Kubernetes or some orchestration layer to handle those nodes be a viable approach? Perhaps the worker nodes would have enough RAM to run 4 instances (pods) of Cassandra each; you would need 125 servers.

Another approach is to build your servers with 5 (or more) SSD devices - one for the OS, four for each instance of Cassandra running on that server. Then build some scripts/Ansible/Puppet that would manage Cassandra starts/stops and other maintenance items.

Where I think this runs into problems is with repairs, or sstable scrubs that can take days to run on a single instance. How is that handled 'in the real world'? With seed nodes, how many would you have in such a configuration?

Thanks for any thoughts!

-Joe
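Tying the thread's sizing points together, a back-of-envelope sketch of the node-count arithmetic. Every factor below is an illustrative assumption to replace with measured numbers, not a recommendation:

```python
# Back-of-envelope node-count estimate for the numbers in this thread.
import math

live_data_tb = 1024       # 1 PB of logical data
rf = 3                    # replication factor (point 1 above)
compression_ratio = 0.5   # assumed 2:1 on-disk compression (point 5)
headroom = 0.5            # keep ~50% free for compaction (points 2-3)
imbalance = 1.2           # assumed busiest node carries ~20% extra (point 4)
per_node_tb = 2           # rule-of-thumb disk per instance

on_disk_tb = live_data_tb * rf * compression_ratio
usable_per_node_tb = per_node_tb * headroom / imbalance
nodes = math.ceil(on_disk_tb / usable_per_node_tb)
print(nodes)  # 1844 with these assumptions
```

Even with generous 2:1 compression, the compaction headroom and imbalance factors push the count well past the naive 1536, which is why small per-node wins (better compression, more usable disk) matter so much at this scale.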