...and a shameless plug for the Cassandra Summit in December. We have a talk from somebody who is doing 70TB per node and will be digging into all the aspects that make that work for them. I hope everyone in this thread is at that talk! I can't wait to hear all the questions.
Patrick

On Mon, Aug 21, 2023 at 8:01 AM Jeff Jirsa <jji...@gmail.com> wrote:

> There's a lot of questionable advice scattered in this thread. Set aside most of the guidance like 2TB/node, it's old and super nuanced.
>
> If you're bare metal, do what your organization is good at. If you have millions of dollars in SAN equipment and you know how SANs work and fail and get backed up, run on a SAN if your organization knows how to properly operate a SAN. Just make sure you understand it's a single point of failure.
>
> If you're in the cloud, EBS is basically the same concept. You can lose EBS in an AZ, just like you can lose a SAN in a DC. Persist outside of that. Have backups. Know how to restore them.
>
> The reason the "2TB/node" limit was a thing was around time to recover from failure more than anything else. I described this in detail here, in 2015, before faster streaming in 4.0 was a thing:
> https://stackoverflow.com/questions/31563447/cassandra-cluster-data-density-data-size-per-node-looking-for-feedback-and/31690279
> With faster streaming, IF you use LCS (so faster streaming works), you can probably go at least 4-5x more dense than before, if you understand how likely your disks are to fail and you can ensure you don't have correlated failures when they age out (that means if you're on bare metal, measuring flash life, and ideally mixing vendors to avoid firmware bugs).
>
> You'll still see risks with huge clusters, largely in gossip and schema propagation. Upcoming CEPs address those. 4.0 is better there (with schema, especially) than 3.0 was, but for "max nodes in a cluster", what you're really comparing is "how many gossip speakers and tokens are in the cluster" (which means your vnode settings matter, for things like pending range calculators).
>
> Looking at the roadmap, your real question comes down to:
> - If you expect to use the transactional features in Accord/5.0 to transact across rows/keys, you probably want to keep one cluster.
> - If you don't ever expect to use multi-key transactions, just de-risk by sharding your cluster into many smaller clusters now, with consistent hashing to map keys to clusters, and have 4 clusters of the same smaller size, with whatever node density you think you can do based on your compaction strategy and streaming rate (and disk type).
>
> If you have time and budget, create a 3 node cluster with whatever disks you have, fill them, start working on them - expand to 4, treat one as failed and replace it - simulate the operations you'll do at that size. It's expensive to mimic a 500 host cluster, but if you've got budget, try it in AWS and see what happens when you apply your real schema, and then do a schema change.
>
> On Mon, Aug 21, 2023 at 7:31 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>
>> For our scenario, the goal is to minimize down-time for a single (at least initially) data center system. Data loss is basically unacceptable. I wouldn't say we have a "rusty slow data center" - we can certainly use SSDs and have servers connected via 10G copper to a fast back-plane. For our specific use case with Cassandra (lots of writes, small number of reads), the network load is usually pretty low. I suspect that would change if we used Kubernetes + central persistent storage.
>>
>> Good discussion.
>>
>> -Joe
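To make the "consistent hashing to map keys to clusters" suggestion in Jeff's reply above concrete, here is a minimal application-side sketch in Python. It is only an illustration of the technique, not anything prescribed in the thread: the cluster names, the vnode count, and the use of MD5 as the hash are all placeholder assumptions.

import bisect
import hashlib

class ClusterRing:
    """Hash ring that maps a partition key to one of several Cassandra clusters."""

    def __init__(self, clusters, vnodes=128):
        # Each cluster gets `vnodes` positions on the ring so the key space
        # is split evenly and re-splits gracefully when a cluster is added.
        self._ring = sorted(
            (self._token(f"{name}:{i}"), name)
            for name in clusters
            for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(value: str) -> int:
        # Any stable hash works; MD5 is used here purely for illustration.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def cluster_for(self, partition_key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's token.
        idx = bisect.bisect(self._tokens, self._token(partition_key)) % len(self._tokens)
        return self._ring[idx][1]

# Example: four equally sized clusters, as in Jeff's "4 clusters" scenario.
ring = ClusterRing(["cassandra-a", "cassandra-b", "cassandra-c", "cassandra-d"])
print(ring.cluster_for("customer:12345"))  # decides which cluster's contact points to use

The reason to use a ring rather than a plain key % 4 is that adding a fifth cluster later only remaps roughly a fifth of the keys instead of nearly all of them.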
>> On 8/17/2023 7:37 PM, daemeon reiydelle wrote:
>>
>> I started to respond, then realized I and the other OP posters are not thinking the same: what is the business case for availability and data loss/reload/recoverability? You all argue for higher availability and damn the cost. But no one asked "can you lose access, for 20 minutes, to a portion of the data, 10 times a year, on a 250 node cluster in AWS, if it is not lost?" Can you lose access 1-2 times a year for the cost of a 500 node cluster holding the same data?
>>
>> Then we can discuss 32/64G JVMs and SSDs.
>>
>> Arthur C. Clarke famously said that "technology sufficiently advanced is indistinguishable from magic." Magic is coming, and it's coming for all of us....
>>
>> Daemeon Reiydelle
>> email: daeme...@gmail.com
>> LI: https://www.linkedin.com/in/daemeonreiydelle/
>> San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle
>>
>> On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>>> Was assuming reaper did incremental? That was probably a bad assumption.
>>>
>>> nodetool repair -pr
>>> I know it well now!
>>>
>>> :)
>>>
>>> -Joe
>>>
>>> On 8/17/2023 4:47 PM, Bowen Song via user wrote:
>>> > I don't have experience with Cassandra on Kubernetes, so I can't comment on that.
>>> >
>>> > For repairs, may I interest you with incremental repairs? It will make repairs a hell of a lot faster. Of course, an occasional full repair is still needed, but that's another story.
>>> >
>>> > On 17/08/2023 21:36, Joe Obernberger wrote:
>>> >> Thank you. Enjoying this conversation.
>>> >> Agree on blade servers, where each blade has a small number of SSDs. Yeh/Nah to a Kubernetes approach assuming fast persistent storage? I think that might be easier to manage.
>>> >>
>>> >> In my current benchmarks, the performance is excellent, but the repairs are painful. I come from the Hadoop world where it was all about large servers with lots of disk. Relatively small number of tables, but some have a high number of rows, 10 bil+ - we use Spark to run across all the data.
>>> >>
>>> >> -Joe
>>> >>
>>> >> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
>>> >>> The optimal node size largely depends on the table schema and read/write pattern. In some cases 500 GB per node is too large, but in some other cases 10TB per node works totally fine. It's hard to estimate that without benchmarking.
>>> >>>
>>> >>> Again, just pointing out the obvious, you did not count the off-heap memory and page cache. 1TB of RAM for a 24GB heap * 40 instances is definitely not enough. You'll most likely need between 1.5 and 2 TB of memory for 40x 24GB-heap nodes. You may be better off with blade servers than a single server with gigantic memory and disk sizes.
>>> >>>
>>> >>> On 17/08/2023 15:46, Joe Obernberger wrote:
>>> >>>> Thanks for this - yeah - duh - forgot about replication in my example!
>>> >>>> So - is 2TBytes per Cassandra instance advisable? Better to use more/less? Modern 2U servers can be had with 24 3.8TByte SSDs; so assume 80TBytes per server, you could do:
>>> >>>> (1024*3)/80 = 39 servers, but you'd have to run 40 instances of Cassandra on each server; maybe 24G of heap per instance, so a server with 1TByte of RAM would work.
>>> >>>> Is this what folks would do?
>>> >>>>
>>> >>>> -Joe
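Since the numbers in this exchange are spread across several replies, here is the same arithmetic in one place as a back-of-the-envelope sketch. The 1.5x-2x overhead multipliers stand in for Bowen's point about off-heap structures and page cache; they are assumptions to be replaced with measurements, not figures from the thread.

import math

raw_data_tb          = 1024   # ~1 PB of data to store
replication_factor   = 3      # RF=3, so ~3072 TB lands on disk
usable_tb_per_server = 80     # 24 x 3.8 TB SSDs is ~91 TB raw; call it 80 TB usable
tb_per_instance      = 2      # target density per Cassandra instance
heap_gb              = 24     # heap per instance

servers   = math.ceil(raw_data_tb * replication_factor / usable_tb_per_server)  # 39 servers
instances = usable_tb_per_server // tb_per_instance                             # 40 instances per server
heap_total_gb = instances * heap_gb                                             # 960 GB of heap alone

# Heap is only part of the footprint: off-heap structures (bloom filters,
# index summaries, off-heap memtables) plus the OS page cache come on top.
for overhead in (1.5, 2.0):
    print(f"at {overhead}x heap: ~{heap_total_gb * overhead / 1024:.1f} TB RAM per server")
# -> roughly 1.4 to 1.9 TB per server

That matches Bowen's caveat above: the 39-server / 40-instance layout adds up on disk, but the memory budget needs to be closer to 1.5-2 TB per server than 1 TB.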
>>> >>>> On 8/17/2023 9:13 AM, Bowen Song via user wrote:
>>> >>>>> Just pointing out the obvious: for 1PB of data on nodes with 2TB of disk each, you will need far more than 500 nodes.
>>> >>>>>
>>> >>>>> 1, it is unwise to run Cassandra with replication factor 1. It usually makes sense to use RF=3, so 1PB of data will cost 3PB of storage space, a minimum of 1,500 such nodes.
>>> >>>>>
>>> >>>>> 2, depending on the compaction strategy you use and the write access pattern, there's disk space amplification to consider. For example, with STCS, the disk usage can be many times the actual live data size.
>>> >>>>>
>>> >>>>> 3, you will need some extra free disk space as temporary space for running compactions.
>>> >>>>>
>>> >>>>> 4, the data is rarely going to be perfectly evenly distributed among all nodes, and you need to take that into consideration and size the nodes based on the node with the most data.
>>> >>>>>
>>> >>>>> 5, enough of the bad news, here's a good one: compression will save you (a lot of) disk space!
>>> >>>>>
>>> >>>>> With all of the above considered, you probably will end up with a lot more than the 500 nodes you initially thought. Your choice of compaction strategy and compression ratio can dramatically affect this calculation.
>>> >>>>>
>>> >>>>> On 16/08/2023 16:33, Joe Obernberger wrote:
>>> >>>>>> General question on how to configure Cassandra. Say I have 1PByte of data to store. The general rule of thumb is that each node (or at least each instance of Cassandra) shouldn't handle more than 2TBytes of disk. That means 500 instances of Cassandra.
>>> >>>>>>
>>> >>>>>> Assuming you have very fast persistent storage (such as a NetApp, PorterWorx, etc.), would using Kubernetes or some orchestration layer to handle those nodes be a viable approach? Perhaps if the worker nodes had enough RAM to run 4 instances (pods) of Cassandra each, you would need 125 servers.
>>> >>>>>> Another approach is to build your servers with 5 (or more) SSD devices - one for the OS, four for each instance of Cassandra running on that server. Then build some scripts/Ansible/Puppet that would manage Cassandra starts/stops and other maintenance items.
>>> >>>>>>
>>> >>>>>> Where I think this runs into problems is with repairs, or sstablescrubs, that can take days to run on a single instance. How is that handled 'in the real world'? With seed nodes, how many would you have in such a configuration?
>>> >>>>>> Thanks for any thoughts!
>>> >>>>>>
>>> >>>>>> -Joe
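Closing the loop on the original question, here is a rough node-count sketch that walks through Bowen's checklist above (RF, space amplification, compaction headroom, imbalance, compression). Every multiplier below is a placeholder assumption to be replaced with numbers from benchmarks of your own schema and compaction strategy; none of them comes from the thread.

import math

live_data_tb         = 1024   # 1 PB of live, uncompressed data
replication_factor   = 3      # point 1: RF=3
space_amplification  = 1.5    # point 2: depends on compaction strategy (STCS can be far worse)
compaction_headroom  = 1.2    # point 3: free space reserved for running compactions
imbalance_factor     = 1.15   # point 4: size for the most-loaded node, not the average
compression_ratio    = 0.4    # point 5: on-disk bytes per byte of raw data (table/workload dependent)
disk_per_instance_tb = 2      # whatever density you settle on per instance

needed_tb = (live_data_tb
             * replication_factor
             * compression_ratio
             * space_amplification
             * compaction_headroom
             * imbalance_factor)

instances = math.ceil(needed_tb / disk_per_instance_tb)
print(f"~{needed_tb:.0f} TB on disk -> at least {instances} instances at {disk_per_instance_tb} TB each")

With these made-up inputs the answer comes out well above the original 500-instance estimate; with kinder assumptions (better compression, LCS-style amplification, denser nodes as Jeff describes) it drops sharply, which is exactly why the thread keeps coming back to benchmarking with your own schema.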