...and a shameless plug for the Cassandra Summit in December. We have a
talk from somebody who is running 70TB per node and will be digging into all
the aspects that make that work for them. I hope everyone in this thread is
at that talk! I can't wait to hear all the questions.

Patrick

On Mon, Aug 21, 2023 at 8:01 AM Jeff Jirsa <jji...@gmail.com> wrote:

> There's a lot of questionable advice scattered in this thread. Set aside
> most of the guidance like the 2TB/node rule; it's old and highly nuanced.
>
> If you're on bare metal, do what your organization is good at. If you have
> millions of dollars in SAN equipment, you know how SANs work, fail, and get
> backed up, and your organization knows how to operate one properly, then
> run on a SAN. Just make sure you understand it's a single point of failure.
>
> If you're in the cloud, EBS is basically the same concept. You can lose
> EBS in an AZ, just like you can lose SAN in a DC. Persist outside of that.
> Have backups. Know how to restore them.
>
> The reason the "2TB/node" limit was a thing was, more than anything else,
> the time it takes to recover from failure. I described this in detail here,
> in 2015, before faster streaming in 4.0 was a thing:
> https://stackoverflow.com/questions/31563447/cassandra-cluster-data-density-data-size-per-node-looking-for-feedback-and/31690279
> . With faster streaming, IF you use LCS (so faster streaming works), you can
> probably go at least 4-5x more dense than before, provided you understand
> how likely your disks are to fail and can ensure you don't have correlated
> failures as they age out (that means, if you're on bare metal, measuring
> flash life and ideally mixing vendors to avoid firmware bugs).
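>
> As a rough sketch of why recovery time bounds density (back-of-the-envelope
> only, with made-up throughput numbers - measure your own streaming rate
> before trusting any of this):
>
> def rebuild_hours(node_data_tb, stream_mb_per_s):
>     """Rough time to restream one node's data onto a replacement."""
>     total_mb = node_data_tb * 1024 * 1024       # TB -> MB (binary units)
>     return total_mb / stream_mb_per_s / 3600    # seconds -> hours
>
> # Illustrative only:
> print(rebuild_hours(2, 50))    # ~11.7h: slow pre-4.0-style streaming, 2 TB/node
> print(rebuild_hours(10, 500))  # ~5.8h: zero-copy streaming of whole SSTables, 10 TB/node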
>
> You'll still see risks with huge clusters, largely in gossip and schema
> propagation. Upcoming CEPs address those. 4.0 is better there (with schema,
> especially) than 3.0 was, but for "max nodes in a cluster", what you're
> really comparing is how many gossip speakers and tokens are in the cluster
> (which means your vnode settings matter, for things like the pending range
> calculator).
>
> Looking at the roadmap, your real question comes down to:
> - If you expect to use the transactional features in Accord/5.0 to
> transact across rows/keys, you probably want to keep one cluster.
> - If you don't ever expect to use multi-key transactions, de-risk by
> sharding your cluster into many smaller clusters now, with consistent
> hashing to map keys to clusters (a sketch of that mapping is below) - e.g.
> 4 clusters of the same smaller size, with whatever node density you think
> you can sustain given your compaction strategy, streaming rate and disk type.
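>
> A minimal sketch of that client-side key-to-cluster routing (the cluster
> names are hypothetical and md5 is used only as a stable hash; this is
> application code you would own, not a Cassandra feature):
>
> import bisect, hashlib
>
> CLUSTERS = ["cass-a", "cass-b", "cass-c", "cass-d"]  # placeholder names
> POINTS_PER_CLUSTER = 64                              # smooths the distribution
>
> # Hash ring: many virtual points per cluster, sorted by ring position.
> _ring = sorted(
>     (int(hashlib.md5(f"{c}:{i}".encode()).hexdigest(), 16), c)
>     for c in CLUSTERS for i in range(POINTS_PER_CLUSTER)
> )
> _positions = [p for p, _ in _ring]
>
> def cluster_for(partition_key: str) -> str:
>     """Map a partition key to the first ring point at or after its hash."""
>     h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
>     idx = bisect.bisect(_positions, h) % len(_ring)
>     return _ring[idx][1]
>
> print(cluster_for("customer:12345"))  # always routes to the same cluster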
>
> If you have time and budget, create a 3-node cluster with whatever disks
> you have, fill them, and start working on them - expand to 4, treat one as
> failed and replace it - simulate the operations you'll do at that size.
> It's expensive to mimic a 500-host cluster, but if you've got the budget,
> try it in AWS and see what happens when you apply your real schema, and
> then do a schema change.
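>
> For the "treat one as failed and replace it" drill, the rough shape of the
> exercise (placeholder IP, and double-check the flag against the docs for
> your version) is something like:
>
> # on a surviving node: confirm the "failed" node shows as DN
> nodetool status
>
> # on the fresh replacement host, start Cassandra with
> #   -Dcassandra.replace_address_first_boot=<ip_of_dead_node>
> # (e.g. in jvm-server.options) and watch it stream the dead node's ranges
>
> # once the replacement shows UN, verify ownership and time the whole thing
> nodetool status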
>
>
> On Mon, Aug 21, 2023 at 7:31 AM Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
>> For our scenario, the goal is to minimize downtime for a single (at
>> least initially) data center system. Data loss is basically unacceptable.
>> I wouldn't say we have a "rusty slow data center" - we can certainly use
>> SSDs and have servers connected via 10G copper to a fast backplane. For
>> our specific use case with Cassandra (lots of writes, small number of
>> reads), the network load is usually pretty low. I suspect that would
>> change if we used Kubernetes + central persistent storage.
>> Good discussion.
>>
>> -Joe
>> On 8/17/2023 7:37 PM, daemeon reiydelle wrote:
>>
>> I started to respond, then realized the other posters and I are not
>> thinking about the same thing: what is the business case for availability
>> and data loss/reload/recoverability? You all argue for higher availability
>> and damn the cost. But no one asked "can you lose access to a portion of
>> the data for 20 minutes, 10 times a year, on a 250-node cluster in AWS,
>> provided the data itself is not lost?" Or is losing access only 1-2 times
>> a year worth the cost of a 500-node cluster holding the same data?
>>
>> Then we can discuss 32/64 GB JVM heaps and SSDs.
>>
>> Arthur C. Clarke famously said that "technology sufficiently advanced is
>> indistinguishable from magic." Magic is coming, and it's coming for all of
>> us....
>>
>> Daemeon Reiydelle
>> email: daeme...@gmail.com
>> LI: https://www.linkedin.com/in/daemeonreiydelle/
>> San Francisco 1.415.501.0198 / Skype daemeon.c.m.reiydelle
>>
>>
>> On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <
>> joseph.obernber...@gmail.com> wrote:
>>
>>> I was assuming Reaper did incremental repairs? That was probably a bad assumption.
>>>
>>> nodetool repair -pr
>>> I know it well now!
>>>
>>> :)
>>>
>>> -Joe
>>>
>>> On 8/17/2023 4:47 PM, Bowen Song via user wrote:
>>> > I don't have experience with Cassandra on Kubernetes, so I can't
>>> > comment on that.
>>> >
>>> > For repairs, may I interest you in incremental repairs? They will make
>>> > repairs a hell of a lot faster. Of course, an occasional full repair is
>>> > still needed, but that's another story.
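>>> >
>>> > A rough example of the distinction (the keyspace name is a placeholder;
>>> > check "nodetool help repair" for the exact flags on your version):
>>> >
>>> > # incremental repair (the default on recent versions): only touches
>>> > # data not already marked as repaired
>>> > nodetool repair <keyspace>
>>> >
>>> > # occasional full repair, limited to this node's primary ranges
>>> > nodetool repair -full -pr <keyspace>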
>>> >
>>> >
>>> > On 17/08/2023 21:36, Joe Obernberger wrote:
>>> >> Thank you. Enjoying this conversation.
>>> >> Agreed on blade servers, where each blade has a small number of SSDs.
>>> >> Yea or nay to a Kubernetes approach, assuming fast persistent storage? I
>>> >> think that might be easier to manage.
>>> >>
>>> >> In my current benchmarks, the performance is excellent, but the
>>> >> repairs are painful. I come from the Hadoop world, where it was all
>>> >> about large servers with lots of disk.
>>> >> We have a relatively small number of tables, but some have a high
>>> >> number of rows, 10 billion+ - we use Spark to run across all the data.
>>> >>
>>> >> -Joe
>>> >>
>>> >> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
>>> >>> The optimal node size largely depends on the table schema and
>>> >>> read/write pattern. In some cases 500 GB per node is too large, but
>>> >>> in some other cases 10TB per node works totally fine. It's hard to
>>> >>> estimate that without benchmarking.
>>> >>>
>>> >>> Again, just pointing out the obvious: you did not count the off-heap
>>> >>> memory and the page cache. 1TB of RAM for a 24GB heap * 40 instances is
>>> >>> definitely not enough. You'll most likely need between 1.5 and 2 TB of
>>> >>> memory for 40x 24GB-heap nodes. You may be better off with blade
>>> >>> servers than a single server with gigantic memory and disk sizes.
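>>> >>>
>>> >>> Back-of-the-envelope (the off-heap figure is a guess - measure it for
>>> >>> your own schema and workload):
>>> >>>
>>> >>> instances = 40
>>> >>> heap_gb = 24
>>> >>> offheap_gb = 12          # memtables, bloom filters, index summaries,
>>> >>>                          # compression metadata - varies per schema
>>> >>> os_page_cache_gb = 256   # leave real room for the OS page cache
>>> >>> total_gb = instances * (heap_gb + offheap_gb) + os_page_cache_gb
>>> >>> print(total_gb)          # 1696, i.e. ~1.7 TB for this particular guess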
>>> >>>
>>> >>>
>>> >>> On 17/08/2023 15:46, Joe Obernberger wrote:
>>> >>>> Thanks for this - yeah - duh - forgot about replication in my example!
>>> >>>> So - is 2 TB per Cassandra instance advisable? Better to use more, or
>>> >>>> less? Modern 2U servers can be had with 24 3.8 TB SSDs, so assuming
>>> >>>> 80 TB per server, you could do:
>>> >>>> (1024*3)/80 = 39 servers, but you'd have to run 40 instances of
>>> >>>> Cassandra on each server; maybe 24G of heap per instance, so a
>>> >>>> server with 1 TB of RAM would work.
>>> >>>> Is this what folks would do?
>>> >>>>
>>> >>>> -Joe
>>> >>>>
>>> >>>> On 8/17/2023 9:13 AM, Bowen Song via user wrote:
>>> >>>>> Just pointing out the obvious, for 1PB of data on nodes with 2TB
>>> >>>>> disk each, you will need far more than 500 nodes.
>>> >>>>>
>>> >>>>> 1, it is unwise to run Cassandra with a replication factor of 1. It
>>> >>>>> usually makes sense to use RF=3, so 1PB of data will cost 3PB of
>>> >>>>> storage space, i.e. a minimum of 1500 such nodes.
>>> >>>>>
>>> >>>>> 2, depending on the compaction strategy you use and the write
>>> >>>>> access pattern, there's disk space amplification to consider.
>>> >>>>> For example, with STCS, the disk usage can be many times the
>>> >>>>> actual live data size.
>>> >>>>>
>>> >>>>> 3, you will need some extra free disk space as temporary space for
>>> >>>>> running compactions.
>>> >>>>>
>>> >>>>> 4, the data is rarely going to be perfectly evenly distributed
>>> >>>>> among all nodes, and you need to take that into consideration and
>>> >>>>> size the nodes based on the node with the most data.
>>> >>>>>
>>> >>>>> 5, enough bad news, here's some good news: compression will save
>>> >>>>> you a lot of disk space!
>>> >>>>>
>>> >>>>> With all the above considered, you probably will end up with a lot
>>> >>>>> more than the 500 nodes you initially thought. Your choice of
>>> >>>>> compaction strategy and compression ratio can dramatically affect
>>> >>>>> this calculation.
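>>> >>>>>
>>> >>>>> A rough illustration of how those factors multiply (every ratio
>>> >>>>> here is a placeholder - plug in numbers measured from your own
>>> >>>>> data and compaction strategy):
>>> >>>>>
>>> >>>>> live_tb       = 1024   # 1 PB of logical data
>>> >>>>> rf            = 3
>>> >>>>> compression   = 0.5    # on-disk/raw ratio after compression (guess)
>>> >>>>> space_amp     = 1.5    # compaction/overwrite amplification (guess)
>>> >>>>> max_disk_util = 0.6    # headroom for compaction temp space, imbalance
>>> >>>>> node_disk_tb  = 2
>>> >>>>>
>>> >>>>> on_disk_tb = live_tb * rf * compression * space_amp
>>> >>>>> nodes = on_disk_tb / (node_disk_tb * max_disk_util)
>>> >>>>> print(round(on_disk_tb), round(nodes))   # ~2304 TB on disk, ~1920 nodes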
>>> >>>>>
>>> >>>>>
>>> >>>>> On 16/08/2023 16:33, Joe Obernberger wrote:
>>> >>>>>> General question on how to configure Cassandra. Say I have
>>> >>>>>> 1 PB of data to store. The general rule of thumb is that each
>>> >>>>>> node (or at least each instance of Cassandra) shouldn't handle
>>> >>>>>> more than 2 TB of disk. That means 500 instances of Cassandra.
>>> >>>>>>
>>> >>>>>> Assuming you have very fast persistent storage (such as NetApp,
>>> >>>>>> Portworx, etc.), would using Kubernetes or some orchestration
>>> >>>>>> layer to handle those nodes be a viable approach? If each worker
>>> >>>>>> node had enough RAM to run 4 instances (pods) of Cassandra, you
>>> >>>>>> would need 125 servers.
>>> >>>>>> Another approach is to build your servers with 5 (or more) SSD
>>> >>>>>> devices - one for the OS and one for each of four Cassandra
>>> >>>>>> instances running on that server. Then build some
>>> >>>>>> scripts/Ansible/Puppet to manage Cassandra starts/stops and other
>>> >>>>>> maintenance items.
>>> >>>>>>
>>> >>>>>> Where I think this runs into problems is with repairs, or
>>> >>>>>> sstablescrub runs that can take days on a single instance. How
>>> >>>>>> is that handled 'in the real world'? And with seed nodes, how
>>> >>>>>> many would you have in such a configuration?
>>> >>>>>> Thanks for any thoughts!
>>> >>>>>>
>>> >>>>>> -Joe
>>> >>>>>>
>>> >>>>>>
>>> >>>>
>>> >>
>>>
>>>
>>
>>
>>
>
