Re k8s:

1. Depending on the version, networking, number of containers per node, node pooling, etc., you can expect to see 1-2% additional storage I/O latency (it depends on whether everything shares the same network vs. a separate storage I/O TCP network).

2. System overhead may be 3-15% depending on what security mitigations are in place (if you own the systems and the workload is dedicated, turn them off!).

3. C* pod loss recovery is the big win here. Pod failure and recovery (e.g. rescheduling to another node) will bring up the SAME C* node as of the time of the node failure, so only a few updates need to catch up (rough numbers sketched below). Perhaps 2x replication, or none if the storage itself is replicated.
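To put rough numbers on point 3, here is a back-of-the-envelope sketch comparing the two recovery paths: a rescheduled pod that reattaches its persistent volume and only replays the writes it missed, vs. a brand-new node that must re-stream its entire data set. Every rate and size below is an invented assumption for illustration, not a measurement.

```python
def hint_replay_seconds(write_rate_mb_s: float, outage_s: float,
                        replay_rate_mb_s: float) -> float:
    """Missed writes accumulate for the duration of the outage; assumes
    the outage is shorter than the hinted handoff window."""
    backlog_mb = write_rate_mb_s * outage_s
    return backlog_mb / replay_rate_mb_s

def full_rebuild_seconds(node_data_tb: float, stream_rate_mb_s: float) -> float:
    """A brand-new node has to stream its whole ownership range."""
    return node_data_tb * 1024 * 1024 / stream_rate_mb_s

# Assumed: 20 MB/s of writes, 10-minute pod reschedule, 200 MB/s replay,
# 4 TB per node, 500 MB/s aggregate streaming.
print(f"hint replay:  {hint_replay_seconds(20, 600, 200):,.0f} s")   # ~60 s
print(f"full rebuild: {full_rebuild_seconds(4, 500) / 3600:,.1f} h") # ~2.3 h
```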
I wonder if you folks have already set out OLAs for "minimum outage" with no data loss? Write amplification is mostly only a problem when networks are heavily used, so it may not even be an issue in your case.

*.*
*Arthur C. Clarke famously said that "technology sufficiently advanced is indistinguishable from magic." Magic is coming, and it's coming for all of us....*

*Daemeon Reiydelle*
*email: daeme...@gmail.com <daeme...@gmail.com>*
*LI: https://www.linkedin.com/in/daemeonreiydelle/ <https://www.linkedin.com/in/daemeonreiydelle/>*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*

On Mon, Aug 21, 2023 at 8:49 AM Patrick McFadin <pmcfa...@gmail.com> wrote:

> ...and a shameless plug for the Cassandra Summit in December. We have a talk from somebody who is doing 70TB per node and will be digging into all the aspects that make that work for them. I hope everyone in this thread is at that talk! I can't wait to hear all the questions.
>
> Patrick
>
> On Mon, Aug 21, 2023 at 8:01 AM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> There's a lot of questionable advice scattered in this thread. Set aside most of the guidance like 2TB/node; it's old and super nuanced.
>>
>> If you're on bare metal, do what your organization is good at. If you have millions of dollars in SAN equipment and you know how SANs work, fail, and get backed up, run on a SAN, provided your organization knows how to properly operate one. Just make sure you understand it's a single point of failure.
>>
>> If you're in the cloud, EBS is basically the same concept. You can lose EBS in an AZ, just like you can lose a SAN in a DC. Persist outside of that. Have backups. Know how to restore them.
>>
>> The reason the "2TB/node" limit was a thing was time to recover from failure more than anything else. I described this in detail here, in 2015, before faster streaming in 4.0 was a thing: https://stackoverflow.com/questions/31563447/cassandra-cluster-data-density-data-size-per-node-looking-for-feedback-and/31690279. With faster streaming, IF you use LCS (so faster streaming works), you can probably go at least 4-5x more dense than before, if you understand how likely your disks are to fail and you can ensure you don't have correlated failures when they age out (that means, if you're on bare metal, measuring flash life and ideally mixing vendors to avoid firmware bugs).
>>
>> You'll still see risks with huge clusters, largely in gossip and schema propagation. Upcoming CEPs address those. 4.0 is better there (with schema, especially) than 3.0 was, but for "max nodes in a cluster", what you're really comparing is "how many gossip speakers and tokens are in the cluster" (which means your vnode settings matter, for things like the pending range calculator).
>>
>> Looking at the roadmap, your real question comes down to:
>> - If you expect to use the transactional features in Accord/5.0 to transact across rows/keys, you probably want to keep one cluster.
>> - If you don't ever expect to use multi-key transactions, just de-risk by sharding your cluster into many smaller clusters now, with consistent hashing to map keys to clusters (a sketch of that routing follows below), and have 4 clusters of the same smaller size, with whatever node density you think you can do based on your compaction strategy and streaming rate (and disk type).
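For anyone who hasn't built the key-to-cluster routing Jeff describes, a minimal sketch might look like the following. The cluster names, vnode count, and choice of hash are all illustrative assumptions, not a prescription; the point is only that keys route deterministically, and adding a cluster remaps roughly 1/N of them.

```python
import bisect
import hashlib

class ClusterRouter:
    """Consistent-hash ring mapping partition keys to whole clusters."""

    def __init__(self, clusters, vnodes=64):
        # Place several virtual nodes per cluster on the ring for balance.
        self._ring = sorted(
            (self._hash(f"{name}#{i}"), name)
            for name in clusters for i in range(vnodes)
        )
        self._ring_hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def cluster_for(self, partition_key: str) -> str:
        # First vnode clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._ring_hashes, self._hash(partition_key)) % len(self._ring)
        return self._ring[idx][1]

router = ClusterRouter(["cassandra-a", "cassandra-b", "cassandra-c", "cassandra-d"])
print(router.cluster_for("customer:42"))  # always the same cluster for this key
```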
>> If you have time and budget, create a 3-node cluster with whatever disks you have, fill them, start working on them: expand to 4, treat one as failed and replace it; simulate the operations you'll do at that size. It's expensive to mimic a 500-host cluster, but if you've got budget, try it in AWS and see what happens when you apply your real schema, and then do a schema change.
>>
>> On Mon, Aug 21, 2023 at 7:31 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>>> For our scenario, the goal is to minimize downtime for a single (at least initially) data center system. Data loss is basically unacceptable. I wouldn't say we have a "rusty slow data center"; we can certainly use SSDs and have servers connected via 10G copper to a fast back-plane. For our specific use case with Cassandra (lots of writes, small number of reads), the network load is usually pretty low. I suspect that would change if we used Kubernetes + central persistent storage.
>>> Good discussion.
>>>
>>> -Joe
>>>
>>> On 8/17/2023 7:37 PM, daemeon reiydelle wrote:
>>>
>>> I started to respond, then realized I and the other OP posters are not thinking the same way: what is the business case for availability and data loss/reload/recoverability? You all argue for higher availability and damn the cost. But no one asked "can you lose access, for 20 minutes, to a portion of the data, 10 times a year, on a 250 node cluster in AWS, if it is not lost?" Can you lose access 1-2 times a year for the cost of a 500 node cluster holding the same data?
>>>
>>> Then we can discuss 32/64g JVMs and SSDs.
>>>
>>> On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>>
>>>> I was assuming Reaper did incremental? That was probably a bad assumption.
>>>>
>>>> nodetool repair -pr
>>>> I know it well now!
>>>>
>>>> :)
>>>>
>>>> -Joe
>>>>
>>>> On 8/17/2023 4:47 PM, Bowen Song via user wrote:
>>>>> I don't have experience with Cassandra on Kubernetes, so I can't comment on that.
>>>>>
>>>>> For repairs, may I interest you in incremental repairs? They will make repairs a hell of a lot faster. Of course, an occasional full repair is still needed, but that's another story.
>>>>>
>>>>> On 17/08/2023 21:36, Joe Obernberger wrote:
>>>>>> Thank you. Enjoying this conversation.
>>>>>> Agreed on blade servers, where each blade has a small number of SSDs.
>>>>>> Yea/nay to a Kubernetes approach assuming fast persistent storage? I think that might be easier to manage.
>>>>>>
>>>>>> In my current benchmarks, the performance is excellent, but the repairs are painful. I come from the Hadoop world where it was all about large servers with lots of disk.
>>>>>> Relatively small number of tables, but some have a high number of rows, 10 billion+; we use Spark to run across all the data.
>>>>>>
>>>>>> -Joe
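Since repair pain keeps coming up: the trick that tools like Reaper apply is subrange repair, splitting the token ring into small slices so no single repair session has to walk an entire node's data. A rough sketch of the idea follows; the keyspace name and slice count are illustrative assumptions, and error handling and scheduling (which are most of what Reaper actually does) are omitted.

```python
import subprocess

# Full token range of the default Murmur3Partitioner.
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

def repair_in_slices(keyspace: str, slices: int = 256) -> None:
    """Repair the ring one small token slice at a time via nodetool -st/-et."""
    span = (MAX_TOKEN - MIN_TOKEN) // slices
    for i in range(slices):
        start = MIN_TOKEN + i * span
        end = MAX_TOKEN if i == slices - 1 else start + span
        subprocess.run(
            ["nodetool", "repair", "-st", str(start), "-et", str(end), keyspace],
            check=True,
        )

repair_in_slices("my_keyspace")  # keyspace name is a placeholder
```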
>>>>>> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
>>>>>>> The optimal node size largely depends on the table schema and read/write pattern. In some cases 500 GB per node is too large, but in some other cases 10 TB per node works totally fine. It's hard to estimate that without benchmarking.
>>>>>>>
>>>>>>> Again, just pointing out the obvious: you did not count the off-heap memory and page cache. 1 TB of RAM for 24 GB heap * 40 instances is definitely not enough. You'll most likely need between 1.5 and 2 TB of memory for 40x 24 GB-heap nodes. You may be better off with blade servers than a single server with gigantic memory and disk sizes.
>>>>>>>
>>>>>>> On 17/08/2023 15:46, Joe Obernberger wrote:
>>>>>>>> Thanks for this - yeah - duh - forgot about replication in my example!
>>>>>>>> So, is 2 TBytes per Cassandra instance advisable? Better to use more/less? Modern 2U servers can be had with 24x 3.8 TByte SSDs, so assume 80 TBytes per server. You could do: (1024*3)/80 = 39 servers, but you'd have to run 40 instances of Cassandra on each server; maybe 24G of heap per instance, so a server with 1 TByte of RAM would work.
>>>>>>>> Is this what folks would do?
>>>>>>>>
>>>>>>>> -Joe
>>>>>>>>
>>>>>>>> On 8/17/2023 9:13 AM, Bowen Song via user wrote:
>>>>>>>>> Just pointing out the obvious: for 1 PB of data on nodes with 2 TB of disk each, you will need far more than 500 nodes.
>>>>>>>>>
>>>>>>>>> 1. It is unwise to run Cassandra with replication factor 1. It usually makes sense to use RF=3, so 1 PB of data will cost 3 PB of storage space, a minimum of 1,500 such nodes.
>>>>>>>>>
>>>>>>>>> 2. Depending on the compaction strategy you use and the write access pattern, there's a disk space amplification to consider. For example, with STCS, the disk usage can be many times the actual live data size.
>>>>>>>>>
>>>>>>>>> 3. You will need some extra free disk space as temporary space for running compactions.
>>>>>>>>>
>>>>>>>>> 4. The data is rarely going to be perfectly evenly distributed among all nodes; you need to take that into consideration and size the nodes based on the node with the most data.
>>>>>>>>>
>>>>>>>>> 5. Enough of the bad news; here's a good one. Compression will save you a lot of disk space!
>>>>>>>>>
>>>>>>>>> With all the above considered, you probably will end up with a lot more than the 500 nodes you initially thought. Your choice of compaction strategy and compression ratio can dramatically affect this calculation.
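Bowen's five points reduce to a simple piece of arithmetic. Here is a sketch of the calculation; every constant (compression ratio, space amplification, compaction headroom, imbalance) is a made-up placeholder that you would replace with measured values for your own schema and workload.

```python
import math

def nodes_needed(live_data_tb: float,
                 rf: int = 3,                       # point 1: replication factor
                 compression_ratio: float = 0.5,    # point 5: on-disk bytes / raw bytes
                 space_amplification: float = 1.5,  # point 2: STCS worst cases are higher
                 compaction_headroom: float = 0.5,  # point 3: keep ~50% free for compaction
                 imbalance: float = 1.2,            # point 4: hottest node vs. the average
                 disk_per_node_tb: float = 2.0) -> int:
    """Back-of-the-envelope node count following Bowen's five points."""
    on_disk_tb = live_data_tb * rf * compression_ratio * space_amplification
    usable_per_node_tb = disk_per_node_tb / (1 + compaction_headroom) / imbalance
    return math.ceil(on_disk_tb / usable_per_node_tb)

# 1 PB of live data on 2 TB/node disks under these assumptions:
print(nodes_needed(1024))  # ~2074 nodes, far more than the naive 500
```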
>>>>>>>>> On 16/08/2023 16:33, Joe Obernberger wrote:
>>>>>>>>>> General question on how to configure Cassandra. Say I have 1 PByte of data to store. The general rule of thumb is that each node (or at least each instance of Cassandra) shouldn't handle more than 2 TBytes of disk. That means 500 instances of Cassandra.
>>>>>>>>>>
>>>>>>>>>> Assuming you have very fast persistent storage (such as NetApp, Portworx, etc.), would using Kubernetes or some orchestration layer to handle those nodes be a viable approach? Perhaps the worker nodes would have enough RAM to run 4 instances (pods) of Cassandra each; you would need 125 servers.
>>>>>>>>>> Another approach is to build your servers with 5 (or more) SSD devices: one for the OS and four for the Cassandra instances (one SSD per instance) running on that server. Then build some scripts/Ansible/Puppet to manage Cassandra starts/stops and other maintenance items (a sketch of the per-instance layout follows below).
>>>>>>>>>>
>>>>>>>>>> Where I think this runs into problems is with repairs, or sstable scrubs, which can take days to run on a single instance. How is that handled 'in the real world'? With seed nodes, how many would you have in such a configuration?
>>>>>>>>>> Thanks for any thoughts!
>>>>>>>>>>
>>>>>>>>>> -Joe
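On the several-instances-per-server idea Joe raises: one way to keep it manageable is to generate the per-instance settings, with one dedicated SSD and one IP alias per instance so the default ports never collide. A minimal sketch follows; every path, address, cluster name, and the seed list are made-up examples, and the seeds value would ultimately go under the seed_provider parameters in each instance's cassandra.yaml.

```python
# Assumed: 2-3 seeds per DC, spread across racks; addresses are placeholders.
SEEDS = "10.0.0.11,10.0.0.21,10.0.0.31"

def instance_overrides(host_prefix: str, instances: int = 4):
    """Yield per-instance cassandra.yaml overrides for one physical server."""
    for i in range(1, instances + 1):
        yield {
            "cluster_name": "prod",
            "listen_address": f"{host_prefix}.{i}",  # one IP alias per instance
            "rpc_address": f"{host_prefix}.{i}",
            "data_file_directories": [f"/mnt/ssd{i}/cassandra/data"],  # one SSD each
            "commitlog_directory": f"/mnt/ssd{i}/cassandra/commitlog",
            "hints_directory": f"/mnt/ssd{i}/cassandra/hints",
            "seeds": SEEDS,  # maps to seed_provider parameters in cassandra.yaml
        }

for cfg in instance_overrides("10.0.1", 4):
    print(cfg)
```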