- k8s

   1. Depending on the version, networking setup, number of containers per
   node, node pooling, etc., you can expect to see 1-2% additional storage
   IO latency (it depends on whether everything shares one network vs. a
   separate TCP network dedicated to storage IO).
   2. System overhead may be 3-15% depending on which security mitigations
   are in place (if you own the systems and the workload is dedicated, turn
   them off!).
   3. C* pod loss recovery is the big win here. Pod failure and recovery
   (e.g. rescheduling to another node) brings up the SAME C* node as it was
   at the time of the failure, so only a small delta needs to catch up. You
   can perhaps run 2x replication, or none at all if the storage layer
   itself is replicated.

I wonder if you folks have already set out OLAs for "minimum outage" with
no data loss? Write amplification is mostly a problem only when the network
is heavily used, so it may not even be an issue in your case.
*.*
*Arthur C. Clarke famously said that "technology sufficiently advanced is
indistinguishable from magic." Magic is coming, and it's coming for all of
us....*

*Daemeon Reiydelle*
*email: daeme...@gmail.com <daeme...@gmail.com>*
*LI: https://www.linkedin.com/in/daemeonreiydelle/
<https://www.linkedin.com/in/daemeonreiydelle/>*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*


On Mon, Aug 21, 2023 at 8:49 AM Patrick McFadin <pmcfa...@gmail.com> wrote:

> ...and a shameless plug for the Cassandra Summit in December. We have a
> talk from somebody who is doing 70TB per node and will be digging into all
> the aspects that make that work for them. I hope everyone in this thread is
> at that talk! I can't wait to hear all the questions.
>
> Patrick
>
> On Mon, Aug 21, 2023 at 8:01 AM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> There's a lot of questionable advice scattered in this thread. Set aside
>> most of the guidance like 2TB/node; it's old and super nuanced.
>>
>> If you're on bare metal, do what your organization is good at. If you have
>> millions of dollars in SAN equipment and your organization knows how SANs
>> work, fail, get backed up, and should be operated, then run on a SAN. Just
>> make sure you understand it's a single point of failure.
>>
>> If you're in the cloud, EBS is basically the same concept. You can lose
>> EBS in an AZ, just like you can lose a SAN in a DC. Persist outside of
>> that. Have backups. Know how to restore them.
>>
>> The reason the "2TB/node" limit was a thing was time to recover from
>> failure more than anything else. I described this in detail here, in 2015,
>> before faster streaming in 4.0 was a thing:
>> https://stackoverflow.com/questions/31563447/cassandra-cluster-data-density-data-size-per-node-looking-for-feedback-and/31690279
>> With faster streaming, IF you use LCS (so that faster streaming actually
>> kicks in), you can probably go at least 4-5x more dense than before,
>> provided you understand how likely your disks are to fail and you can
>> ensure you don't have correlated failures when they age out (that means,
>> if you're on bare metal, measuring flash life and ideally mixing vendors
>> to avoid firmware bugs).
>>
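To put numbers on that recovery-time argument, here is a rough
back-of-the-envelope sketch in Python; the densities and streaming rates are
made-up placeholders, not figures from this thread:

    # Rough sketch: hours to re-stream a replacement node at a given
    # sustained streaming throughput. All numbers are illustrative guesses.
    def rebuild_hours(node_density_tb: float, stream_gbit_per_s: float) -> float:
        node_bytes = node_density_tb * 1e12
        bytes_per_second = stream_gbit_per_s * 1e9 / 8
        return node_bytes / bytes_per_second / 3600

    for density_tb in (2, 4, 10, 20):
        for gbit_s in (0.5, 2.5):
            print(f"{density_tb:>2} TB at {gbit_s} Gbit/s ~ "
                  f"{rebuild_hours(density_tb, gbit_s):5.1f} h to rebuild")
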
>> You'll still see the risks of huge clusters, largely in gossip and schema
>> propagation. Upcoming CEPs address those. 4.0 is better there (especially
>> with schema) than 3.0 was, but for "max nodes in a cluster", what you're
>> really comparing is "how many gossip speakers and tokens are in the
>> cluster" (which means your vnode settings matter, for things like the
>> pending range calculator).
>>
>> Looking at the roadmap, your real question comes down to:
>> - If you expect to use the transactional features in Accord/5.0 to
>> transact across rows/keys, you probably want to keep one cluster.
>> - If you don't ever expect to use multi-key transactions, de-risk by
>> sharding your cluster into many smaller clusters now, using consistent
>> hashing to map keys to clusters (a rough routing sketch follows below) -
>> say 4 clusters of the same smaller size, with whatever node density you
>> think you can do based on your compaction strategy, streaming rate and
>> disk type.
>>
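As a rough illustration of that key-to-cluster routing, here is a minimal
consistent-hashing sketch in Python; the cluster names, vnode count, and
md5-based ring are all illustrative choices, not anything the thread or
Cassandra itself prescribes:

    import hashlib
    from bisect import bisect

    # Hypothetical cluster names; in practice these map to contact points of
    # four separate, equally sized Cassandra clusters.
    CLUSTERS = ["cluster-a", "cluster-b", "cluster-c", "cluster-d"]
    VNODES_PER_CLUSTER = 64  # arbitrary; more vnodes = smoother key spread

    def build_ring(clusters, vnodes=VNODES_PER_CLUSTER):
        """Place each cluster at many pseudo-random points on a hash ring."""
        ring = []
        for name in clusters:
            for i in range(vnodes):
                point = int(hashlib.md5(f"{name}:{i}".encode()).hexdigest(), 16)
                ring.append((point, name))
        return sorted(ring)

    RING = build_ring(CLUSTERS)

    def cluster_for_key(partition_key: str) -> str:
        """Route a partition key to a cluster. Keys keep their mapping as
        long as the cluster list stays stable."""
        h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
        idx = bisect(RING, (h,)) % len(RING)
        return RING[idx][1]

    # e.g. cluster_for_key("customer:12345") deterministically picks one of
    # the four clusters for that key.

The trade-off is the one already noted: once keys are spread across clusters
this way, multi-key transactions can only span keys that land in the same
cluster.
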
>> If you have time and budget, create a 3-node cluster with whatever disks
>> you have, fill them, start working on them - expand to 4, treat one as
>> failed and replace it - simulate the operations you'll do at that size.
>> It's expensive to mimic a 500-host cluster, but if you've got the budget,
>> try it in AWS and see what happens when you apply your real schema, and
>> then do a schema change.
>>
>>
>>
>>
>>
>> On Mon, Aug 21, 2023 at 7:31 AM Joe Obernberger <
>> joseph.obernber...@gmail.com> wrote:
>>
>>> For our scenario, the goal is to minimize downtime for a single (at least
>>> initially) data center system.  Data loss is basically unacceptable.  I
>>> wouldn't say we have a "rusty slow data center" - we can certainly use
>>> SSDs and have servers connected via 10G copper to a fast backplane.  For
>>> our specific use case with Cassandra (lots of writes, small number of
>>> reads), the network load is usually pretty low.  I suspect that would
>>> change if we used Kubernetes + central persistent storage.
>>> Good discussion.
>>>
>>> -Joe
>>> On 8/17/2023 7:37 PM, daemeon reiydelle wrote:
>>>
>>> I started to respond, then realized the other posters and I are not
>>> thinking about the same thing: what is the business case for availability
>>> and data loss/reload/recoverability? You all argue for higher availability
>>> and damn the cost. But no one asked "can you lose access to a portion of
>>> the data for 20 minutes, 10 times a year, on a 250-node cluster in AWS, if
>>> the data itself is not lost?" Or can you afford to lose access only 1-2
>>> times a year, at the cost of a 500-node cluster holding the same data?
>>>
>>> Then we can discuss 32/64GB JVM heaps and SSDs.
>>> *.*
>>> *Arthur C. Clarke famously said that "technology sufficiently advanced
>>> is indistinguishable from magic." Magic is coming, and it's coming for all
>>> of us....*
>>>
>>> *Daemeon Reiydelle*
>>> *email: daeme...@gmail.com <daeme...@gmail.com>*
>>> *LI: https://www.linkedin.com/in/daemeonreiydelle/
>>> <https://www.linkedin.com/in/daemeonreiydelle/>*
>>> *San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*
>>>
>>>
>>> On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <
>>> joseph.obernber...@gmail.com> wrote:
>>>
>>>> I was assuming Reaper did incremental repairs?  That was probably a bad
>>>> assumption.
>>>>
>>>> nodetool repair -pr
>>>> I know it well now!
>>>>
>>>> :)
>>>>
>>>> -Joe
>>>>
>>>> On 8/17/2023 4:47 PM, Bowen Song via user wrote:
>>>> > I don't have experience with Cassandra on Kubernetes, so I can't
>>>> > comment on that.
>>>> >
>>>> > For repairs, may I interest you in incremental repairs? They will make
>>>> > repairs a hell of a lot faster. Of course, an occasional full repair is
>>>> > still needed, but that's another story.
>>>> >
>>>> >
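As a rough sketch of that incremental-plus-occasional-full cadence, a small
wrapper like the following could be run from cron; the keyspace name is a
placeholder, and exact nodetool flags vary by Cassandra version:

    import subprocess

    KEYSPACES = ["my_keyspace"]  # placeholder keyspace names

    def repair(keyspace: str, full: bool = False) -> None:
        """Repair this node's primary ranges for one keyspace. On modern
        Cassandra, repair without -full is incremental; pass full=True for
        the occasional full repair."""
        cmd = ["nodetool", "repair", "-pr"]
        if full:
            cmd.append("-full")
        cmd.append(keyspace)
        subprocess.run(cmd, check=True)

    for ks in KEYSPACES:
        repair(ks)               # frequent incremental pass
        # repair(ks, full=True)  # run occasionally, e.g. once per gc_grace window
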
>>>> > On 17/08/2023 21:36, Joe Obernberger wrote:
>>>> >> Thank you.  Enjoying this conversation.
>>>> >> Agreed on blade servers, where each blade has a small number of SSDs.
>>>> >> Yea or nay to a Kubernetes approach, assuming fast persistent storage?
>>>> >> I think that might be easier to manage.
>>>> >>
>>>> >> In my current benchmarks, the performance is excellent, but the
>>>> >> repairs are painful.  I come from the Hadoop world where it was all
>>>> >> about large servers with lots of disk.
>>>> >> Relatively small number of tables, but some have a high number of
>>>> >> rows, 10 billion+ - we use Spark to run across all the data.
>>>> >>
>>>> >> -Joe
>>>> >>
>>>> >> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
>>>> >>> The optimal node size largely depends on the table schema and
>>>> >>> read/write pattern. In some cases 500 GB per node is too large, but
>>>> >>> in others 10TB per node works totally fine. It's hard to estimate
>>>> >>> that without benchmarking.
>>>> >>>
>>>> >>> Again, just pointing out the obvious: you did not count the off-heap
>>>> >>> memory and page cache. 1TB of RAM for 24GB heap * 40 instances is
>>>> >>> definitely not enough. You'll most likely need between 1.5 and 2 TB
>>>> >>> of memory for 40x 24GB-heap instances. You may be better off with
>>>> >>> blade servers than a single server with gigantic memory and disk
>>>> >>> sizes.
>>>> >>>
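A quick back-of-the-envelope for that memory point; the off-heap and
page-cache allowances below are assumptions, not measurements:

    # Back-of-the-envelope memory budget for many Cassandra instances on one
    # server. Off-heap and page-cache figures are assumed placeholders.
    instances = 40
    heap_gb = 24
    offheap_gb = 8      # memtables, bloom filters, compression metadata, ...
    pagecache_gb = 8    # OS page cache per instance for hot SSTable pages

    total_gb = instances * (heap_gb + offheap_gb + pagecache_gb)
    print(f"~{total_gb} GB total")   # 40 * 40 GB = 1600 GB, i.e. ~1.6 TB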
>>>> >>>
>>>> >>> On 17/08/2023 15:46, Joe Obernberger wrote:
>>>> >>>> Thanks for this - yeah - duh - forgot about replication in my
>>>> >>>> example!
>>>> >>>> So - is 2TBytes per Cassandra instance advisable?  Better to use
>>>> >>>> more/less?  Modern 2U servers can be had with 24 3.8TByte SSDs, so
>>>> >>>> assume 80TBytes of usable space per server; you could do:
>>>> >>>> (1024*3)/80 = ~39 servers, but you'd have to run 40 instances of
>>>> >>>> Cassandra on each server; maybe 24GB of heap per instance, so a
>>>> >>>> server with 1TByte of RAM would work.
>>>> >>>> Is this what folks would do?
>>>> >>>>
>>>> >>>> -Joe
>>>> >>>>
>>>> >>>> On 8/17/2023 9:13 AM, Bowen Song via user wrote:
>>>> >>>>> Just pointing out the obvious: for 1PB of data on nodes with 2TB
>>>> >>>>> of disk each, you will need far more than 500 nodes.
>>>> >>>>>
>>>> >>>>> 1, it is unwise to run Cassandra with replication factor 1. It
>>>> >>>>> usually makes sense to use RF=3, so 1PB of data will cost 3PB of
>>>> >>>>> storage space, a minimum of 1500 such nodes.
>>>> >>>>>
>>>> >>>>> 2, depending on the compaction strategy you use and the write
>>>> >>>>> access pattern, there's disk space amplification to consider.
>>>> >>>>> For example, with STCS, the disk usage can be many times the
>>>> >>>>> actual live data size.
>>>> >>>>>
>>>> >>>>> 3, you will need some extra free disk space as temporary space for
>>>> >>>>> running compactions.
>>>> >>>>>
>>>> >>>>> 4, the data is rarely going to be perfectly evenly distributed
>>>> >>>>> among all nodes, so you need to take that into consideration and
>>>> >>>>> size the nodes based on the node with the most data.
>>>> >>>>>
>>>> >>>>> 5, enough bad news, here's a good one: compression will save you
>>>> >>>>> (a lot of) disk space!
>>>> >>>>>
>>>> >>>>> With all of the above considered, you will probably end up with a
>>>> >>>>> lot more than the 500 nodes you initially thought (a rough
>>>> >>>>> back-of-the-envelope sketch follows below). Your choice of
>>>> >>>>> compaction strategy and compression ratio can dramatically affect
>>>> >>>>> this calculation.
>>>> >>>>>
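Pulling those five points into one rough calculation; every ratio below is
an assumed placeholder to be replaced by your own measurements:

    import math

    live_data_tb = 1024   # 1 PB of live data, before replication
    rf = 3                # point 1: replication factor
    space_amp = 1.5       # point 2: compaction amplification (STCS can be worse)
    headroom = 1.3        # point 3: free space for compactions in flight
    imbalance = 1.2       # point 4: hottest node holds ~20% more than average
    compression = 0.5     # point 5: assumed 2:1 compression ratio
    per_node_tb = 2       # target data per node

    needed_tb = (live_data_tb * rf * space_amp * headroom
                 * imbalance * compression)
    print(math.ceil(needed_tb / per_node_tb), "nodes")  # ~1800 with these guesses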
>>>> >>>>>
>>>> >>>>> On 16/08/2023 16:33, Joe Obernberger wrote:
>>>> >>>>>> General question on how to configure Cassandra.  Say I have
>>>> >>>>>> 1PByte of data to store.  The general rule of thumb is that each
>>>> >>>>>> node (or at least each instance of Cassandra) shouldn't handle
>>>> >>>>>> more than 2TBytes of disk.  That means 500 instances of Cassandra.
>>>> >>>>>>
>>>> >>>>>> Assuming you have very fast persistent storage (such as a NetApp,
>>>> >>>>>> Portworx, etc.), would using Kubernetes or some orchestration
>>>> >>>>>> layer to handle those nodes be a viable approach?  If the worker
>>>> >>>>>> nodes had enough RAM to run 4 instances (pods) of Cassandra each,
>>>> >>>>>> you would need 125 servers.
>>>> >>>>>> Another approach is to build your servers with 5 (or more) SSD
>>>> >>>>>> devices - one for the OS, and one for each of four instances of
>>>> >>>>>> Cassandra running on that server.  Then build some
>>>> >>>>>> scripts/Ansible/Puppet that would manage Cassandra starts/stops
>>>> >>>>>> and other maintenance items.
>>>> >>>>>>
>>>> >>>>>> Where I think this runs into problems is with repairs, or
>>>> >>>>>> sstablescrubs, which can take days to run on a single instance.
>>>> >>>>>> How is that handled 'in the real world'?  With seed nodes, how
>>>> >>>>>> many would you have in such a configuration?
>>>> >>>>>> Thanks for any thoughts!
>>>> >>>>>>
>>>> >>>>>> -Joe
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>
>>>> >>
>>>>
>>>
>>>
>>>
>>
