There's a lot of questionable advice scattered in this thread. Set aside
most of the rules of thumb like 2TB/node; that guidance is old and far more
nuanced than it sounds.

If you're on bare metal, do what your organization is good at. If you have
millions of dollars in SAN equipment and you know how SANs work, fail, and
get backed up, then run on a SAN, provided your organization knows how to
operate one properly. Just make sure you understand that it's a single
point of failure.

If you're in the cloud, EBS is basically the same concept. You can lose EBS
in an AZ, just like you can lose a SAN in a DC. Persist outside of that.
Have backups. Know how to restore them.
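To make "have backups" concrete, here's a minimal sketch of the
taking-and-shipping half; "nodetool snapshot" is real, while the data
directory, bucket name, and use of awscli are placeholder assumptions:

    import datetime
    import subprocess

    # Take a point-in-time snapshot on this node, then copy the data
    # directory off-node. "nodetool snapshot" hard-links live SSTables,
    # so the local step is cheap. Bucket and paths are placeholders.
    tag = "backup-" + datetime.date.today().isoformat()
    subprocess.run(["nodetool", "snapshot", "-t", tag], check=True)
    subprocess.run(["aws", "s3", "sync",
                    "/var/lib/cassandra/data",   # default data dir; adjust
                    f"s3://example-cassandra-backups/{tag}/"], check=True)

And rehearse the restore path too; copying the files back is the easy half.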

The reason the "2TB/node" limit was a thing was mostly about time to
recover from failure. I described this in detail here, in 2015, before
faster streaming in 4.0 was a thing:
https://stackoverflow.com/questions/31563447/cassandra-cluster-data-density-data-size-per-node-looking-for-feedback-and/31690279
With faster streaming, IF you use LCS (which is what lets faster streaming
kick in), you can probably go at least 4-5x denser than before, provided
you understand how likely your disks are to fail and you can ensure you
don't have correlated failures when they age out (that means, on bare
metal, measuring flash life and ideally mixing vendors to avoid firmware
bugs).
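Back-of-envelope on why density trades off against streaming rate; the
throughput numbers below are illustrative placeholders, not measurements:

    # How long does it take to re-replicate one dead node's data?
    def hours_to_restream(density_tb: float, rate_mb_per_s: float) -> float:
        return density_tb * 1024 * 1024 / rate_mb_per_s / 3600

    # ~50 MB/s effective pre-4.0 streaming vs. a few hundred MB/s when 4.0
    # can zero-copy whole LCS sstables (both numbers are assumptions):
    print(hours_to_restream(2, 50))    # ~11.7 hours for a 2 TB node
    print(hours_to_restream(10, 400))  # ~7.3 hours for a 10 TB node

Same math, 4-5x the density, a similar recovery window - that's the whole
argument.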

You'll still run into the risks of huge clusters, largely in gossip and
schema propagation. Upcoming CEPs address those. 4.0 is better there
(especially with schema) than 3.0 was, but for "max nodes in a cluster",
what you're really comparing is "how many gossip speakers and tokens are in
the cluster" (which means your vnode settings matter, e.g. for the pending
range calculator).
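To put rough numbers on that - this is pure arithmetic, and the num_tokens
values are just common cassandra.yaml settings, not recommendations:

    # What grows isn't node count alone, it's nodes * num_tokens; the
    # pending-range calculator works over every token in the cluster.
    nodes = 500
    for num_tokens in (256, 16, 4):   # old default, 4.0 default, manual
        print(f"num_tokens={num_tokens}: "
              f"{nodes * num_tokens:,} tokens cluster-wide")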

Looking at the roadmap, your real question comes down to:
- If you expect to use the transactional features in Accord/5.0 to transact
across rows/keys, you probably want to keep one cluster.
- If you don't ever expect to use multi-key transactions, de-risk by
sharding your cluster into many smaller clusters now, using consistent
hashing to map keys to clusters (a minimal sketch follows this list) -
e.g. 4 clusters of the same smaller size, with whatever node density you
think you can sustain given your compaction strategy, streaming rate, and
disk type.
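That key-to-cluster mapping can be a thin client-side shim. A minimal
sketch, assuming four hypothetical cluster names, an md5-based hash, and 64
ring points per cluster (all of those choices are illustrative):

    import hashlib
    from bisect import bisect

    CLUSTERS = ["cass-a", "cass-b", "cass-c", "cass-d"]  # hypothetical names

    def _hash(s: str) -> int:
        # Any stable hash all clients agree on works; md5 is just convenient.
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    # 64 virtual points per cluster smooths out the key distribution.
    RING = sorted((_hash(f"{name}:{i}"), name)
                  for name in CLUSTERS for i in range(64))
    TOKENS = [token for token, _ in RING]

    def cluster_for(key: str) -> str:
        """Map a partition key to the cluster that owns it."""
        return RING[bisect(TOKENS, _hash(key)) % len(RING)][1]

The reason to prefer this over a plain key % 4 is growth: adding a fifth
cluster later remaps only about a fifth of the keys instead of nearly all
of them.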

If you have time and budget, create a 3-node cluster with whatever disks
you have, fill them, and start operating on them - expand to 4, treat one
node as failed and replace it - simulate the operations you'll actually do
at that size. It's expensive to mimic a 500-host cluster, but if you've got
the budget, try it in AWS and see what happens when you apply your real
schema, and then do a schema change.

On Mon, Aug 21, 2023 at 7:31 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> For our scenario, the goal is to minimize downtime for a single (at least
> initially) data center system.  Data loss is basically unacceptable.  I
> wouldn't say we have a "rusty slow data center" - we can certainly use SSDs
> and have servers connected via 10G copper to a fast back-plane.  For our
> specific use case with Cassandra (lots of writes, small number of reads),
> the network load is usually pretty low.  I suspect that would change if we
> used Kubernetes + central persistent storage.
> Good discussion.
>
> -Joe
> On 8/17/2023 7:37 PM, daemeon reiydelle wrote:
>
> I started to respond, then realized the other posters and I are not
> thinking about the same thing: what is the business case for availability
> and data loss/reload/recoverability? You all argue for higher availability
> and damn the cost. But no one asked "can you lose access to a portion of
> the data for 20 minutes, 10 times a year, on a 250 node cluster in AWS, if
> none of it is actually lost?" Or would you rather pay for a 500 node
> cluster holding the same data to lose access only 1-2 times a year?
>
> Then we can discuss 32/64 GB JVM heaps and SSDs.
>
>
> On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
>> Was assuming Reaper did incremental?  That was probably a bad assumption.
>>
>> nodetool repair -pr
>> I know it well now!
>>
>> :)
>>
>> -Joe
>>
>> On 8/17/2023 4:47 PM, Bowen Song via user wrote:
>> > I don't have experience with Cassandra on Kubernetes, so I can't
>> > comment on that.
>> >
>> > For repairs, may I interest you in incremental repairs? They will make
>> > repairs a hell of a lot faster. Of course, an occasional full repair is
>> > still needed, but that's another story.
>> >
>> >
>> > On 17/08/2023 21:36, Joe Obernberger wrote:
>> >> Thank you.  Enjoying this conversation.
>> >> Agree on blade servers, where each blade has a small number of SSDs.
>> >> Yea or nay to a Kubernetes approach, assuming fast persistent storage?
>> >> I think that might be easier to manage.
>> >>
>> >> In my current benchmarks, the performance is excellent, but the
>> >> repairs are painful.  I come from the Hadoop world where it was all
>> >> about large servers with lots of disk.
>> >> Relatively small number of tables, but some have a high number of
>> >> rows - 10 billion+ - and we use Spark to run across all the data.
>> >>
>> >> -Joe
>> >>
>> >> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
>> >>> The optimal node size largely depends on the table schema and
>> >>> read/write pattern. In some cases 500 GB per node is too large, but
>> >>> in some other cases 10TB per node works totally fine. It's hard to
>> >>> estimate that without benchmarking.
>> >>>
>> >>> Again, just pointing out the obvious, you did not count the off-heap
>> >>> memory and page cache. 1TB of RAM for 24GB heap * 40 instances is
>> >>> definitely not enough. You'll most likely need between 1.5 and 2 TB
>> >>> memory for 40x 24GB heap nodes. You may be better off with blade
>> >>> servers than single server with gigantic memory and disk sizes.
>> >>>
>> >>>
>> >>> On 17/08/2023 15:46, Joe Obernberger wrote:
>> >>>> Thanks for this - yeah - duh - forgot about replication in my
>> >>>> example!
>> >>>> So - is 2 TBytes per Cassandra instance advisable?  Better to use
>> >>>> more/less?  Modern 2U servers can be had with 24 3.8 TByte SSDs, so
>> >>>> assume 80 TBytes per server; then you could do:
>> >>>> (1024*3)/80 = 39 servers, but you'd have to run 40 instances of
>> >>>> Cassandra on each server; maybe 24G of heap per instance, so a
>> >>>> server with 1TByte of RAM would work.
>> >>>> Is this what folks would do?
>> >>>>
>> >>>> -Joe
>> >>>>
>> >>>> On 8/17/2023 9:13 AM, Bowen Song via user wrote:
>> >>>>> Just pointing out the obvious, for 1PB of data on nodes with 2TB
>> >>>>> disk each, you will need far more than 500 nodes.
>> >>>>>
>> >>>>> 1, it is unwise to run Cassandra with replication factor 1. It
>> >>>>> usually makes sense to use RF=3, so 1PB data will cost 3PB of
>> >>>>> storage space, a minimum of 1500 such nodes.
>> >>>>>
>> >>>>> 2, depending on the compaction strategy you use and the write
>> >>>>> access pattern, there's a disk space amplification to consider.
>> >>>>> For example, with STCS, the disk usage can be many times of the
>> >>>>> actual live data size.
>> >>>>>
>> >>>>> 3, you will need some extra free disk space as temporary space for
>> >>>>> running compactions.
>> >>>>>
>> >>>>> 4, the data is rarely going to be perfectly evenly distributed
>> >>>>> among all nodes, and you need to take that into consideration and
>> >>>>> size the nodes based on the node with the most data.
>> >>>>>
>> >>>>> 5, enough of the bad news; here's a good one. Compression will save
>> >>>>> you (a lot of) disk space!
>> >>>>>
>> >>>>> With all the above considered, you probably will end up with a lot
>> >>>>> more than the 500 nodes you initially thought. Your choice of
>> >>>>> compaction strategy and compression ratio can dramatically affect
>> >>>>> this calculation.
>> >>>>>
>> >>>>>
>> >>>>> On 16/08/2023 16:33, Joe Obernberger wrote:
>> >>>>>> General question on how to configure Cassandra.  Say I have
>> >>>>>> 1PByte of data to store.  The general rule of thumb is that each
>> >>>>>> node (or at least instance of Cassandra) shouldn't handle more
>> >>>>>> than 2TBytes of disk.  That means 500 instances of Cassandra.
>> >>>>>>
>> >>>>>> Assuming you have very fast persistent storage (such as a NetApp,
>> >>>>>> Portworx, etc.), would using Kubernetes or some orchestration
>> >>>>>> layer to handle those nodes be a viable approach?  If the
>> >>>>>> worker nodes had enough RAM to run 4 instances (pods) of
>> >>>>>> Cassandra each, you would need 125 servers.
>> >>>>>> Another approach is to build your servers with 5 (or more) SSD
>> >>>>>> devices - one for OS, four for each instance of Cassandra running
>> >>>>>> on that server.  Then build some scripts/ansible/puppet that
>> >>>>>> would manage Cassandra start/stops, and other maintenance items.
>> >>>>>>
>> >>>>>> Where I think this runs into problems is with repairs, or
>> >>>>>> sstablescrubs that can take days to run on a single instance. How
>> >>>>>> is that handled 'in the real world'?  With seed nodes, how many
>> >>>>>> would you have in such a configuration?
>> >>>>>> Thanks for any thoughts!
>> >>>>>>
>> >>>>>> -Joe
>> >>>>>>
>> >>>>>>
>> >>>>
>> >>
>>
