For our scenario, the goal is to minimize downtime for a single (at least initially) data center system.  Data loss is essentially unacceptable.  I wouldn't say we have a "rusty slow data center" - we can certainly use SSDs and have servers connected via 10G copper to a fast backplane.  For our specific use case with Cassandra (lots of writes, a small number of reads), the network load is usually pretty low.  I suspect that would change if we used Kubernetes + central persistent storage.
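
A minimal sketch of what that posture can look like (run here via cqlsh; the keyspace and data center names are placeholders, not anything from our actual setup): a single-DC keyspace at RF=3, queried at LOCAL_QUORUM, stays readable and writable while any one replica is down and doesn't lose acknowledged writes.

    # assumption: one data center named "dc1"; "example_ks" is a placeholder
    cqlsh -e "CREATE KEYSPACE IF NOT EXISTS example_ks
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
    # With RF=3, reads/writes at consistency level LOCAL_QUORUM need 2 of the
    # 3 replicas, so a single node being down (or being repaired) does not
    # cause downtime, and every acknowledged write exists on at least 2 nodes.
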
Good discussion.

-Joe

On 8/17/2023 7:37 PM, daemeon reiydelle wrote:
I started to respond, then realized the other posters and I are not thinking about the same thing: what is the business case for availability and for data loss/reload/recoverability? You all argue for higher availability and damn the cost. But no one asked, "Can you lose access to a portion of the data for 20 minutes, 10 times a year, on a 250-node cluster in AWS, if the data is not actually lost?" Or can you tolerate losing access only 1-2 times a year, at the cost of a 500-node cluster holding the same data?

Then we can discuss 32/64 GB JVM heaps and SSDs.
Arthur C. Clarke famously said that "technology sufficiently advanced is indistinguishable from magic." Magic is coming, and it's coming for all of us....

Daemeon Reiydelle
email: daeme...@gmail.com
LI: https://www.linkedin.com/in/daemeonreiydelle/
San Francisco 1.415.501.0198 / Skype daemeon.c.m.reiydelle
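
On the 32/64 GB JVM point, heap size is set in Cassandra's configuration files rather than left to JVM defaults. A hedged sketch (exact file names vary by version: 3.x commonly uses conf/cassandra-env.sh, 4.x uses conf/jvm*-server.options; the 24 GB value is only illustrative):

    # conf/cassandra-env.sh (3.x style) -- keep the heap under ~32 GB so the
    # JVM can keep using compressed object pointers
    MAX_HEAP_SIZE="24G"

    # conf/jvm11-server.options (4.x style) -- the equivalent explicit flags:
    # -Xms24G
    # -Xmx24G
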


On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <joseph.obernber...@gmail.com> wrote:

    I was assuming Reaper did incremental repairs?  That was probably a bad assumption.

    nodetool repair -pr
    I know it well now!

    :)

    -Joe
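
For reference, a hedged sketch of the repair variants under discussion; the commands below are standard nodetool usage, though defaults depend on the Cassandra version (plain repair has been incremental by default since 2.2):

    # incremental repair of the ranges this node replicates
    nodetool repair

    # full repair restricted to this node's primary ranges; run on every node
    # in turn so the whole ring is covered exactly once
    nodetool repair --full -pr

    # occasional full repair of everything this node replicates
    nodetool repair --full
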

    On 8/17/2023 4:47 PM, Bowen Song via user wrote:
    > I don't have experience with Cassandra on Kubernetes, so I can't
    > comment on that.
    >
    > For repairs, may I interest you in incremental repairs? They will make
    > repairs a hell of a lot faster. Of course, an occasional full repair is
    > still needed, but that's another story.
    >
    >
    > On 17/08/2023 21:36, Joe Obernberger wrote:
    >> Thank you.  Enjoying this conversation.
    >> Agree on blade servers, where each blade has a small number of SSDs.
    >> Yea or nay to a Kubernetes approach, assuming fast persistent
    >> storage?  I think that might be easier to manage.
    >>
    >> In my current benchmarks, the performance is excellent, but the
    >> repairs are painful.  I come from the Hadoop world, where it was all
    >> about large servers with lots of disk.
    >> Relatively small number of tables, but some have a high number of
    >> rows, 10 billion+ - we use Spark to run across all the data.
    >>
    >> -Joe
    >>
    >> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
    >>> The optimal node size largely depends on the table schema and
    >>> read/write pattern. In some cases 500 GB per node is too large, but
    >>> in some other cases 10 TB per node works totally fine. It's hard to
    >>> estimate that without benchmarking.
    >>>
    >>> Again, just pointing out the obvious: you did not count the off-heap
    >>> memory and page cache. 1 TB of RAM for a 24 GB heap * 40 instances is
    >>> definitely not enough. You'll most likely need between 1.5 and 2 TB of
    >>> memory for 40x 24 GB heap nodes. You may be better off with blade
    >>> servers than a single server with gigantic memory and disk sizes.
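
A rough back-of-the-envelope for that sizing point (the per-instance off-heap figure below is an assumption for illustration, not a measurement):

    40 instances x 24 GB heap                        =  960 GB
    40 instances x ~8-16 GB off-heap (memtables,
      bloom filters, index/compression metadata)     =  320-640 GB
    plus OS page cache for hot SSTable data
    ------------------------------------------------------------
    roughly 1.5-2 TB of RAM per server, matching the estimate above
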
    >>>
    >>>
    >>> On 17/08/2023 15:46, Joe Obernberger wrote:
    >>>> Thanks for this - yeah - duh - forgot about replication in my example!
    >>>> So - is 2 TB per Cassandra instance advisable?  Better to use
    >>>> more/less?  Modern 2U servers can be had with 24 3.8 TB SSDs, so
    >>>> assuming 80 TB per server, you could do:
    >>>> (1024*3)/80 = 39 servers, but you'd have to run 40 instances of
    >>>> Cassandra on each server; maybe 24 GB of heap per instance, so a
    >>>> server with 1 TB of RAM would work.
    >>>> Is this what folks would do?
    >>>>
    >>>> -Joe
    >>>>
    >>>> On 8/17/2023 9:13 AM, Bowen Song via user wrote:
    >>>>> Just pointing out the obvious: for 1 PB of data on nodes with 2 TB
    >>>>> of disk each, you will need far more than 500 nodes.
    >>>>>
    >>>>> 1. It is unwise to run Cassandra with replication factor 1. It
    >>>>> usually makes sense to use RF=3, so 1 PB of data will cost 3 PB of
    >>>>> storage space, a minimum of 1,500 such nodes.
    >>>>>
    >>>>> 2. Depending on the compaction strategy you use and the write
    >>>>> access pattern, there's disk space amplification to consider.
    >>>>> For example, with STCS, the disk usage can be many times the
    >>>>> actual live data size.
    >>>>>
    >>>>> 3. You will need some extra free disk space as temporary space for
    >>>>> running compactions.
    >>>>>
    >>>>> 4. The data is rarely going to be perfectly evenly distributed
    >>>>> among all nodes, and you need to take that into consideration and
    >>>>> size the nodes based on the node with the most data.
    >>>>>
    >>>>> 5. Enough bad news; here's a good one. Compression will save
    >>>>> you a lot of disk space!
    >>>>>
    >>>>> With all the above considered, you probably will end up with a lot
    >>>>> more than the 500 nodes you initially thought. Your choice of
    >>>>> compaction strategy and compression ratio can dramatically affect
    >>>>> this calculation.
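
To make that concrete, a worked example with assumed (not measured) numbers:

    1 PB of live data x RF=3                         = 3 PB replicated
    / 2 (assuming a ~2:1 compression ratio)          = 1.5 PB on disk
    x 2 (headroom for compaction, e.g. the STCS
      worst case, plus uneven data distribution)     = 3 PB of raw disk
    / 2 TB usable per node                           = ~1,500 nodes

Swap in your own compression ratio and compaction headroom and the node count moves dramatically, which is exactly the point above.
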
    >>>>>
    >>>>>
    >>>>> On 16/08/2023 16:33, Joe Obernberger wrote:
    >>>>>> General question on how to configure Cassandra.  Say I have
    >>>>>> 1 PB of data to store.  The general rule of thumb is that each
    >>>>>> node (or at least each instance of Cassandra) shouldn't handle more
    >>>>>> than 2 TB of disk.  That means 500 instances of Cassandra.
    >>>>>>
    >>>>>> Assuming you have very fast persistent storage (such as NetApp,
    >>>>>> PorterWorx, etc.), would using Kubernetes or some orchestration
    >>>>>> layer to handle those nodes be a viable approach? Perhaps the
    >>>>>> worker nodes would have enough RAM to run 4 instances (pods) of
    >>>>>> Cassandra each; then you would need 125 servers.
    >>>>>> Another approach is to build your servers with 5 (or more) SSD
    >>>>>> devices - one for the OS and one for each of the four instances of
    >>>>>> Cassandra running on that server.  Then build some scripts/Ansible/Puppet
    >>>>>> that would manage Cassandra starts/stops and other maintenance items.
    >>>>>>
    >>>>>> Where I think this runs into problems is with repairs, or
    >>>>>> sstablescrubs, which can take days to run on a single instance. How
    >>>>>> is that handled 'in the real world'? With seed nodes, how many
    >>>>>> would you have in such a configuration?
    >>>>>> Thanks for any thoughts!
    >>>>>>
    >>>>>> -Joe
    >>>>>>
    >>>>>>
    >>>>
    >>

