Re: Big Data Question

2023-08-21 Thread Jeff Jirsa
(Yes, just somewhat less likely to see the same order of speed-up with STCS,
where sstables are more likely to cross token boundaries, modulo some work
around sstable splitting at token ranges à la CASSANDRA-6696.)

On Mon, Aug 21, 2023 at 11:35 AM Dinesh Joshi  wrote:

> Minor correction, zero copy streaming aka faster streaming also works for
> STCS.
>
> Dinesh
>
> On Aug 21, 2023, at 8:01 AM, Jeff Jirsa  wrote:
>
> 
> There's a lot of questionable advice scattered in this thread. Set aside
> most of the guidance like 2TB/node, it's old and super nuanced.
>
> If you're bare metal, do what your organization is good at. If you have
> millions of dollars in SAN equipment and you know how SANs work and fail
> and get backed up, run on a SAN if your organization knows how to properly
> operate a SAN. Just make sure you understand it's a single point of failure.
>
> If you're in the cloud, EBS is basically the same concept. You can lose
> EBS in an AZ, just like you can lose SAN in a DC. Persist outside of that.
> Have backups. Know how to restore them.
>
> The reason the "2TB/node" limit was a thing was around time to recover
> from failure more than anything else. I described this in detail here, in
> 2015, before faster-streaming in 4.0 was a thing :
> https://stackoverflow.com/questions/31563447/cassandra-cluster-data-density-data-size-per-node-looking-for-feedback-and/31690279
> . With faster streaming, IF you use LCS (so faster streaming works), you
> can probably go at least 4-5x more dense than before, if you understand how
> likely your disks are to fail and you can ensure you don't have correlated
> failures when they age out (that means if you're on bare metal, measuring
> flash life, and ideally mixing vendors to avoid firmware bugs).
>
> You'll still see risks of huge clusters, largely in gossip and schema
> propagation. Upcoming CEPs address those. 4.0 is better there (with schema,
> especially) than 3.0 was, but for "max nodes in a cluster", what you're
> really comparing is "how many gossip speakers and tokens are in the
> cluster" (which means your vnode settings matter, for things like pending
> range calculators).
>
> Looking at the roadmap, your real question comes down to :
> - If you expect to use the transactional features in Accord/5.0 to
> transact across rows/keys, you probably want to keep one cluster
> - If you don't ever expect to use multi-key transactions, just de-risk by
> sharding your cluster into many smaller clusters now, with consistent
> hashing to map keys to clusters, and have 4 clusters of the same smaller
> size, with whatever node density you think you can do based on your
> compaction strategy and streaming rate (and disk type).
>
> If you have time and budget, create a 3 node cluster with whatever disks
> you have, fill them, start working on them - expand to 4, treat one as
> failed and replace it - simulate the operations you'll do at that size.
> It's expensive to mimic a 500 host cluster, but if you've got budget, try
> it in AWS and see what happens when you apply your real schema, and then do
> a schema change.
>
>
>
>
>
> On Mon, Aug 21, 2023 at 7:31 AM Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
>> For our scenario, the goal is to minimize down-time for a single (at
>> least initially) data center system.  Data-loss is basically unacceptable.
>> I wouldn't say we have a "rusty slow data center" - we can certainly use
>> SSDs and have servers connected via 10G copper to a fast back-plane.  For
>> our specific use case with Cassandra (lots of writes, small number of
>> reads), the network load is usually pretty low.  I suspect that would
>> change if we used Kubernetes + central persistent storage.
>> Good discussion.
>>
>> -Joe
>> On 8/17/2023 7:37 PM, daemeon reiydelle wrote:
>>
>> I started to respond, then realized I and the other OP posters are not
>> thinking the same: What is the business case for availability, data
>> loss/reload/recoverability? You all argue for higher availability and damn
>> the cost. But no one asked "can you lose access, for 20 minutes, to a
>> portion of the data, 10 times a year, on a 250 node cluster in AWS, if it
>> is not lost"? Can you lose access 1-2 times a year for the cost of a 500
>> node cluster holding the same data?
>>
>> Then we can discuss 32/64g JVM and SSD's.
>> *.*
>> *Arthur C. Clarke famously said that "technology sufficiently advanced is
>> indistinguishable from magic." Magic is coming, and it's coming for all of
>> us*
>>
>> *Daemeon Reiydelle*
>> *email: daeme...@gmail.com *
>> *LI: https://www.linkedin.com/in/daemeonreiydelle/
>> *
>> *San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*
>>
>>
>> On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <
>> joseph.obernber...@gmail.com> wrote:
>>
>>> Was assuming reaper did incremental?  That was probably a bad assumption.
>>>
>>> nodetool repair -pr
>>> I know it well now!
>>>
>>> :)
>>>
>>> -Joe
>>>
>>> On 

Re: Big Data Question

2023-08-21 Thread Dinesh Joshi
Minor correction, zero copy streaming aka faster streaming also works for STCS.

Dinesh

On Aug 21, 2023, at 8:01 AM, Jeff Jirsa wrote:

> There's a lot of questionable advice scattered in this thread. Set aside
> most of the guidance like 2TB/node, it's old and super nuanced.

Re: Big Data Question

2023-08-21 Thread daemeon reiydelle
- k8s

   1. Depending on the version and networking, the number of containers per
   node, node pooling, etc., you can expect to see 1-2% additional storage IO
   latency (it depends on whether everything shares one network vs. a separate
   storage IO TCP network).
   2. System overhead may be 3-15% depending on what security mitigations
   are in place (if you own the systems and the workload is dedicated, turn them
   off!).
   3. C* pod loss recovery is the big win here. Pod failure and recovery
   (e.g. to another node) will bring up the SAME C* node as of the moment of the
   node failure (so only a few updates to catch up). Perhaps 2x replication, or
   none if the storage itself is replicated.

I wonder if you folks have already set out OLAs for "minimum outage" with
no data loss? Write amplification is mostly only a problem when networks
are heavily used, so it may not even be an issue in your case.
*.*
*Arthur C. Clarke famously said that "technology sufficiently advanced is
indistinguishable from magic." Magic is coming, and it's coming for all of
us*

*Daemeon Reiydelle*
*email: daeme...@gmail.com *
*LI: https://www.linkedin.com/in/daemeonreiydelle/
*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*


On Mon, Aug 21, 2023 at 8:49 AM Patrick McFadin  wrote:

> ...and a shameless plug for the Cassandra Summit in December. We have a
> talk from somebody that is doing 70TB per node and will be digging into all
> the aspects that make that work for them. I hope everyone in this thread is
> at that talk! I can't wait to hear all the questions.
>
> Patrick
>
> On Mon, Aug 21, 2023 at 8:01 AM Jeff Jirsa  wrote:
>
>> There's a lot of questionable advice scattered in this thread. Set aside
>> most of the guidance like 2TB/node, it's old and super nuanced.
>>
>> If you're bare metal, do what your organization is good at. If you have
>> millions of dollars in SAN equipment and you know how SANs work and fail
>> and get backed up, run on a SAN if your organization knows how to properly
>> operate a SAN. Just make sure you understand it's a single point of failure.
>>
>> If you're in the cloud, EBS is basically the same concept. You can lose
>> EBS in an AZ, just like you can lose SAN in a DC. Persist outside of that.
>> Have backups. Know how to restore them.
>>
>> The reason the "2TB/node" limit was a thing was around time to recover
>> from failure more than anything else. I described this in detail here, in
>> 2015, before faster-streaming in 4.0 was a thing :
>> https://stackoverflow.com/questions/31563447/cassandra-cluster-data-density-data-size-per-node-looking-for-feedback-and/31690279
>> . With faster streaming, IF you use LCS (so faster streaming works), you
>> can probably go at least 4-5x more dense than before, if you understand how
>> likely your disks are to fail and you can ensure you don't have correlated
>> failures when they age out (that means if you're on bare metal, measuring
>> flash life, and ideally mixing vendors to avoid firmware bugs).
>>
>> You'll still see risks of huge clusters, largely in gossip and schema
>> propagation. Upcoming CEPs address those. 4.0 is better there (with schema,
>> especially) than 3.0 was, but for "max nodes in a cluster", what you're
>> really comparing is "how many gossip speakers and tokens are in the
>> cluster" (which means your vnode settings matter, for things like pending
>> range calculators).
>>
>> Looking at the roadmap, your real question comes down to :
>> - If you expect to use the transactional features in Accord/5.0 to
>> transact across rows/keys, you probably want to keep one cluster
>> - If you don't ever expect to use multi-key transactions, just de-risk by
>> sharding your cluster into many smaller clusters now, with consistent
>> hashing to map keys to clusters, and have 4 clusters of the same smaller
>> size, with whatever node density you think you can do based on your
>> compaction strategy and streaming rate (and disk type).
>>
>> If you have time and budget, create a 3 node cluster with whatever disks
>> you have, fill them, start working on them - expand to 4, treat one as
>> failed and replace it - simulate the operations you'll do at that size.
>> It's expensive to mimic a 500 host cluster, but if you've got budget, try
>> it in AWS and see what happens when you apply your real schema, and then do
>> a schema change.
>>
>>
>>
>>
>>
>> On Mon, Aug 21, 2023 at 7:31 AM Joe Obernberger <
>> joseph.obernber...@gmail.com> wrote:
>>
>>> For our scenario, the goal is to minimize down-time for a single (at
>>> least initially) data center system.  Data-loss is basically unacceptable.
>>> I wouldn't say we have a "rusty slow data center" - we can certainly use
>>> SSDs and have servers connected via 10G copper to a fast back-plane.  For
>>> our specific use case with Cassandra (lots of writes, small number of
>>> reads), the network load is usually pretty low.  I suspect that would
>>> change if we used Kubernetes + central 

Re: Big Data Question

2023-08-21 Thread Patrick McFadin
...and a shameless plug for the Cassandra Summit in December. We have a
talk from somebody that is doing 70TB per node and will be digging into all
the aspects that make that work for them. I hope everyone in this thread is
at that talk! I can't wait to hear all the questions.

Patrick

On Mon, Aug 21, 2023 at 8:01 AM Jeff Jirsa  wrote:

> There's a lot of questionable advice scattered in this thread. Set aside
> most of the guidance like 2TB/node, it's old and super nuanced.
>
> If you're bare metal, do what your organization is good at. If you have
> millions of dollars in SAN equipment and you know how SANs work and fail
> and get backed up, run on a SAN if your organization knows how to properly
> operate a SAN. Just make sure you understand it's a single point of failure.
>
> If you're in the cloud, EBS is basically the same concept. You can lose
> EBS in an AZ, just like you can lose SAN in a DC. Persist outside of that.
> Have backups. Know how to restore them.
>
> The reason the "2TB/node" limit was a thing was around time to recover
> from failure more than anything else. I described this in detail here, in
> 2015, before faster-streaming in 4.0 was a thing :
> https://stackoverflow.com/questions/31563447/cassandra-cluster-data-density-data-size-per-node-looking-for-feedback-and/31690279
> . With faster streaming, IF you use LCS (so faster streaming works), you
> can probably go at least 4-5x more dense than before, if you understand how
> likely your disks are to fail and you can ensure you don't have correlated
> failures when they age out (that means if you're on bare metal, measuring
> flash life, and ideally mixing vendors to avoid firmware bugs).
>
> You'll still see risks of huge clusters, largely in gossip and schema
> propagation. Upcoming CEPs address those. 4.0 is better there (with schema,
> especially) than 3.0 was, but for "max nodes in a cluster", what you're
> really comparing is "how many gossip speakers and tokens are in the
> cluster" (which means your vnode settings matter, for things like pending
> range calculators).
>
> Looking at the roadmap, your real question comes down to :
> - If you expect to use the transactional features in Accord/5.0 to
> transact across rows/keys, you probably want to keep one cluster
> - If you don't ever expect to use multi-key transactions, just de-risk by
> sharding your cluster into many smaller clusters now, with consistent
> hashing to map keys to clusters, and have 4 clusters of the same smaller
> size, with whatever node density you think you can do based on your
> compaction strategy and streaming rate (and disk type).
>
> If you have time and budget, create a 3 node cluster with whatever disks
> you have, fill them, start working on them - expand to 4, treat one as
> failed and replace it - simulate the operations you'll do at that size.
> It's expensive to mimic a 500 host cluster, but if you've got budget, try
> it in AWS and see what happens when you apply your real schema, and then do
> a schema change.
>
>
>
>
>
> On Mon, Aug 21, 2023 at 7:31 AM Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
>> For our scenario, the goal is to minimize down-time for a single (at
>> least initially) data center system.  Data-loss is basically unacceptable.
>> I wouldn't say we have a "rusty slow data center" - we can certainly use
>> SSDs and have servers connected via 10G copper to a fast back-plane.  For
>> our specific use case with Cassandra (lots of writes, small number of
>> reads), the network load is usually pretty low.  I suspect that would
>> change if we used Kubernetes + central persistent storage.
>> Good discussion.
>>
>> -Joe
>> On 8/17/2023 7:37 PM, daemeon reiydelle wrote:
>>
>> I started to respond, then realized I and the other OP posters are not
>> thinking the same: What is the business case for availability, data
>> loss/reload/recoverability? You all argue for higher availability and damn
>> the cost. But no one asked "can you lose access, for 20 minutes, to a
>> portion of the data, 10 times a year, on a 250 node cluster in AWS, if it
>> is not lost"? Can you lose access 1-2 times a year for the cost of a 500
>> node cluster holding the same data?
>>
>> Then we can discuss 32/64g JVM and SSD's.
>> *.*
>> *Arthur C. Clarke famously said that "technology sufficiently advanced is
>> indistinguishable from magic." Magic is coming, and it's coming for all of
>> us*
>>
>> *Daemeon Reiydelle*
>> *email: daeme...@gmail.com *
>> *LI: https://www.linkedin.com/in/daemeonreiydelle/
>> *
>> *San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*
>>
>>
>> On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <
>> joseph.obernber...@gmail.com> wrote:
>>
>>> Was assuming reaper did incremental?  That was probably a bad assumption.
>>>
>>> nodetool repair -pr
>>> I know it well now!
>>>
>>> :)
>>>
>>> -Joe
>>>
>>> On 8/17/2023 4:47 PM, Bowen Song via user wrote:
>>> > I don't have 

Re: Big Data Question

2023-08-21 Thread Jeff Jirsa
There's a lot of questionable advice scattered in this thread. Set aside
most of the guidance like 2TB/node; it's old and super nuanced.

If you're bare metal, do what your organization is good at. If you have
millions of dollars in SAN equipment and you know how SANs work and fail
and get backed up, run on a SAN if your organization knows how to properly
operate a SAN. Just make sure you understand it's a single point of failure.

If you're in the cloud, EBS is basically the same concept. You can lose EBS
in an AZ, just like you can lose SAN in a DC. Persist outside of that. Have
backups. Know how to restore them.

The reason the "2TB/node" limit was a thing was around time to recover from
failure more than anything else. I described this in detail here, in 2015,
before faster-streaming in 4.0 was a thing:
https://stackoverflow.com/questions/31563447/cassandra-cluster-data-density-data-size-per-node-looking-for-feedback-and/31690279
. With faster streaming, IF you use LCS (so faster streaming works), you
can probably go at least 4-5x more dense than before, if you understand how
likely your disks are to fail and you can ensure you don't have correlated
failures when they age out (that means if you're on bare metal, measuring
flash life, and ideally mixing vendors to avoid firmware bugs).
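
To put rough numbers on the recovery-time argument, here is a minimal
back-of-the-envelope sketch; the throughput and efficiency figures are
assumptions, not measurements from this thread, so replace them with what you
actually observe from your hardware and streaming settings.

# Rough, illustrative estimate of how long re-streaming a failed node takes.
# All inputs are assumptions; measure your own streaming throughput.

def rebuild_hours(data_per_node_tb, stream_gbit_per_s, efficiency=0.5):
    """Hours to re-stream one node's data at an assumed effective rate.
    'efficiency' stands in for validation/compaction overhead that keeps
    you from saturating the link."""
    data_bits = data_per_node_tb * 8e12               # decimal TB -> bits
    usable_bits_per_s = stream_gbit_per_s * 1e9 * efficiency
    return data_bits / usable_bits_per_s / 3600

print(round(rebuild_hours(2, 1), 1))    # ~8.9 h  (2 TB node, ~0.5 Gbit/s effective)
print(round(rebuild_hours(10, 25), 1))  # ~1.8 h  (10 TB node, zero-copy on a 25 Gbit link)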

You'll still see risks of huge clusters, largely in gossip and schema
propagation. Upcoming CEPs address those. 4.0 is better there (with schema,
especially) than 3.0 was, but for "max nodes in a cluster", what you're
really comparing is "how many gossip speakers and tokens are in the
cluster" (which means your vnode settings matter, for things like pending
range calculators).
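
As a trivial illustration of why the vnode settings matter at that scale (my
arithmetic, not part of the original message), the number of tokens the gossip
and pending-range machinery has to reason about grows with nodes times
num_tokens:

def total_tokens(nodes, num_tokens_per_node):
    # Total tokens the pending-range calculations have to consider.
    return nodes * num_tokens_per_node

print(total_tokens(500, 256))  # 128000 with the old default of 256 vnodes
print(total_tokens(500, 16))   # 8000 with the newer default of 16 vnodes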

Looking at the roadmap, your real question comes down to:
- If you expect to use the transactional features in Accord/5.0 to transact
across rows/keys, you probably want to keep one cluster
- If you don't ever expect to use multi-key transactions, just de-risk by
sharding your cluster into many smaller clusters now, with consistent
hashing to map keys to clusters, and have 4 clusters of the same smaller
size, with whatever node density you think you can do based on your
compaction strategy and streaming rate (and disk type).
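
A minimal sketch of the "consistent hashing to map keys to clusters" idea,
assuming the routing happens in your application before it picks a driver
session; the ring construction and cluster names below are illustrative, not
anything from the thread.

import bisect
import hashlib

class ClusterRing:
    """Map a partition key to one of several Cassandra clusters using
    classic consistent hashing with virtual nodes, so adding or removing
    a cluster only remaps roughly 1/N of the keys."""

    def __init__(self, clusters, vnodes=64):
        self._ring = sorted(
            (self._hash(f"{name}#{i}"), name)
            for name in clusters
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        # 64-bit slice of MD5; any stable hash works here.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def cluster_for(self, partition_key):
        idx = bisect.bisect(self._hashes, self._hash(partition_key)) % len(self._ring)
        return self._ring[idx][1]

ring = ClusterRing(["cluster-a", "cluster-b", "cluster-c", "cluster-d"])
print(ring.cluster_for("customer:12345"))  # the same key always routes to the same cluster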

If you have time and budget, create a 3 node cluster with whatever disks
you have, fill them, start working on them - expand to 4, treat one as
failed and replace it - simulate the operations you'll do at that size.
It's expensive to mimic a 500 host cluster, but if you've got budget, try
it in AWS and see what happens when you apply your real schema, and then do
a schema change.





On Mon, Aug 21, 2023 at 7:31 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> For our scenario, the goal is to minimize down-time for a single (at least
> initially) data center system.  Data-loss is basically unacceptable.  I
> wouldn't say we have a "rusty slow data center" - we can certainly use SSDs
> and have servers connected via 10G copper to a fast back-plane.  For our
> specific use case with Cassandra (lots of writes, small number of reads),
> the network load is usually pretty low.  I suspect that would change if we
> used Kubernetes + central persistent storage.
> Good discussion.
>
> -Joe
> On 8/17/2023 7:37 PM, daemeon reiydelle wrote:
>
> I started to respond, then realized I and the other OP posters are not
> thinking the same: What is the business case for availability, data
> loss/reload/recoverability? You all argue for higher availability and damn
> the cost. But no one asked "can you lose access, for 20 minutes, to a
> portion of the data, 10 times a year, on a 250 node cluster in AWS, if it
> is not lost"? Can you lose access 1-2 times a year for the cost of a 500
> node cluster holding the same data?
>
> Then we can discuss 32/64g JVM and SSD's.
> *.*
> *Arthur C. Clarke famously said that "technology sufficiently advanced is
> indistinguishable from magic." Magic is coming, and it's coming for all of
> us*
>
> *Daemeon Reiydelle*
> *email: daeme...@gmail.com *
> *LI: https://www.linkedin.com/in/daemeonreiydelle/
> *
> *San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*
>
>
> On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
>> Was assuming reaper did incremental?  That was probably a bad assumption.
>>
>> nodetool repair -pr
>> I know it well now!
>>
>> :)
>>
>> -Joe
>>
>> On 8/17/2023 4:47 PM, Bowen Song via user wrote:
>> > I don't have experience with Cassandra on Kubernetes, so I can't
>> > comment on that.
>> >
>> > For repairs, may I interest you with incremental repairs? It will make
>> > repairs hell of a lot faster. Of course, occasional full repair is
>> > still needed, but that's another story.
>> >
>> >
>> > On 17/08/2023 21:36, Joe Obernberger wrote:
>> >> Thank you.  Enjoying this conversation.
>> >> Agree on blade servers, where each blade has a small number of SSDs.
>> >> Yeh/Nah to a kubernetes 

Re: Big Data Question

2023-08-21 Thread Joe Obernberger
For our scenario, the goal is to minimize down-time for a single (at 
least initially) data center system.  Data-loss is basically 
unacceptable.  I wouldn't say we have a "rusty slow data center" - we 
can certainly use SSDs and have servers connected via 10G copper to a 
fast back-plane.  For our specific use case with Cassandra (lots of 
writes, small number of reads), the network load is usually pretty low.  
I suspect that would change if we used Kubernetes + central persistent 
storage.

Good discussion.

-Joe

On 8/17/2023 7:37 PM, daemeon reiydelle wrote:
I started to respond, then realized I and the other OP posters are not 
thinking the same: What is the business case for availability, data 
loss/reload/recoverability? You all argue for higher availability and 
damn the cost. But no one asked "can you lose access, for 20 minutes, 
to a portion of the data, 10 times a year, on a 250 node cluster in 
AWS, if it is not lost"? Can you lose access 1-2 times a year for the 
cost of a 500 node cluster holding the same data?


Then we can discuss 32/64g JVM and SSD's.
/./
/Arthur C. Clarke famously said that "technology sufficiently advanced 
is indistinguishable from magic." Magic is coming, and it's coming for 
all of us/

*Daemeon Reiydelle*
*email: daeme...@gmail.com*
*LI: https://www.linkedin.com/in/daemeonreiydelle/*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*


On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger 
 wrote:


Was assuming reaper did incremental?  That was probably a bad
assumption.

nodetool repair -pr
I know it well now!

:)

-Joe

On 8/17/2023 4:47 PM, Bowen Song via user wrote:
> I don't have experience with Cassandra on Kubernetes, so I can't
> comment on that.
>
> For repairs, may I interest you with incremental repairs? It
will make
> repairs hell of a lot faster. Of course, occasional full repair is
> still needed, but that's another story.
>
>
> On 17/08/2023 21:36, Joe Obernberger wrote:
>> Thank you.  Enjoying this conversation.
>> Agree on blade servers, where each blade has a small number of
SSDs.
>> Yeh/Nah to a kubernetes approach assuming fast persistent
storage?  I
>> think that might be easier to manage.
>>
>> In my current benchmarks, the performance is excellent, but the
>> repairs are painful.  I come from the Hadoop world where it was
all
>> about large servers with lots of disk.
>> Relatively small number of tables, but some have a high number of
>> rows, 10bil + - we use spark to run across all the data.
>>
>> -Joe
>>
>> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
>>> The optimal node size largely depends on the table schema and
>>> read/write pattern. In some cases 500 GB per node is too
large, but
>>> in some other cases 10TB per node works totally fine. It's
hard to
>>> estimate that without benchmarking.
>>>
>>> Again, just pointing out the obvious, you did not count the
off-heap
>>> memory and page cache. 1TB of RAM for 24GB heap * 40 instances is
>>> definitely not enough. You'll most likely need between 1.5 and
2 TB
>>> memory for 40x 24GB heap nodes. You may be better off with blade
>>> servers than single server with gigantic memory and disk sizes.
>>>
>>>
>>> On 17/08/2023 15:46, Joe Obernberger wrote:
 Thanks for this - yeah - duh - forgot about replication in my
example!
 So - is 2TBytes per Cassandra instance advisable?  Better to use
more/less?  Modern 2u servers can be had with 24 3.8TByte
SSDs; so
 assume 80Tbytes per server, you could do:
 (1024*3)/80 = 39 servers, but you'd have to run 40 instances of
 Cassandra on each server; maybe 24G of heap per instance, so a
 server with 1TByte of RAM would work.
 Is this what folks would do?

 -Joe

 On 8/17/2023 9:13 AM, Bowen Song via user wrote:
> Just pointing out the obvious, for 1PB of data on nodes with
2TB
> disk each, you will need far more than 500 nodes.
>
> 1, it is unwise to run Cassandra with replication factor 1. It
> usually makes sense to use RF=3, so 1PB data will cost 3PB of
> storage space, minimal of 1500 such nodes.
>
> 2, depending on the compaction strategy you use and the write
> access pattern, there's a disk space amplification to consider.
> For example, with STCS, the disk usage can be many times of the
> actual live data size.
>
> 3, you will need some extra free disk space as temporary
space for
> running compactions.
>
> 4, the data is rarely going to be perfectly evenly distributed
> among all nodes, and you need to take that into
consideration and
> size the nodes based on the node with 

RE: Big Data Question

2023-08-18 Thread Durity, Sean R via user
Cost of availability is a fair question at some level of the discussion. In my 
experience, high availability is one of the top 2 or 3 reasons why Cassandra is 
chosen as the data solution. So, if I am given a Cassandra use case to build 
out, I would normally assume high availability is needed, even in a single data 
center scenario. Otherwise, there are other data options.


Sean R. Durity
DB Solutions
Staff Systems Engineer – Cassandra



From: daemeon reiydelle 
Sent: Thursday, August 17, 2023 7:38 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Big Data Question

I started to respond, then realized I and the other OP posters are not thinking 
the same: What is the business case for availability, data 
loss/reload/recoverability? You all argue for higher availability and damn the 
cost. But no one asked "can you lose access, for 20 minutes, to a portion of the 
data, 10 times a year, on a 250 node cluster in AWS, if it is not lost"? Can 
you lose access 1-2 times a year for the cost of a 500 node cluster holding the 
same data?

Then we can discuss 32/64g JVM and SSD's.
.
Arthur C. Clarke famously said that "technology sufficiently advanced is 
indistinguishable from magic." Magic is coming, and it's coming for all of 
us

Daemeon Reiydelle
email: daeme...@gmail.com
LI: https://www.linkedin.com/in/daemeonreiydelle/
San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle


On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger 
joseph.obernber...@gmail.com> wrote:
Was assuming reaper did incremental?  That was probably a bad assumption.

nodetool repair -pr
I know it well now!

:)

-Joe

On 8/17/2023 4:47 PM, Bowen Song via user wrote:
> I don't have experience with Cassandra on Kubernetes, so I can't
> comment on that.
>
> For repairs, may I interest you with incremental repairs? It will make
> repairs hell of a lot faster. Of course, occasional full repair is
> still needed, but that's another story.
>
>
> On 17/08/2023 21:36, Joe Obernberger wrote:
>> Thank you.  Enjoying this conversation.
>> Agree on blade servers, where each blade has a small number of SSDs.
>> Yeh/Nah to a kubernetes approach assuming fast persistent storage?  I
>> think that might be easier to manage.
>>
>> In my current benchmarks, the performance is excellent, but the
>> repairs are painful.  I come from the Hadoop world where it was all
>> about large servers with lots of disk.
>> Relatively small number of tables, but some have a high number of
>> rows, 10bil + - we use spark to run across all the data.
>>
>> -Joe
>>
>> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
>>> The optimal node size largely depends on the table schema and
>>> read/write pattern. In some cases 500 GB per node is too large, but
>>> in some other cases 10TB per node works totally fine. It's hard to
>>> estimate that without benchmarking.
>>>
>>> Again, just pointing out the obvious, you did not count the off-heap
>>> memory and page cache. 1TB of RAM for 24GB heap * 40 instances is
>>> definitely not enough. You'll most likely need between 1.5 and 2 TB
>>> memory for 40x 24GB heap nodes. You may be better off with blade
>>> servers than single server with gigantic memory and disk sizes.
>>>
>>>
>>> On 17/08/2023 15:46, Joe Obernberger wrote:
>>>> Thanks for this - yeah - duh - forgot about replication in my example!
>>>> So - is 2TBytes per Cassandra instance advisable?  Better to use
>>>> more/less?  Modern 2u servers can be had with 24 3.8TByte SSDs; so
>>>> assume 80Tbytes per server, you could do:
>>>> (1024*3)/80 = 39 servers, but you'd have to run 40 instances of
>>>> Cassandra on each server; maybe 24G of heap per instance, so a
>>>> server with 1TByte of RAM would work.
>>>> Is this what folks would do?
>>>>
>>>> -Joe
>>>>
>>>> On 8/17/2023 9:13 AM, Bowen Song via user wrote:
>>>>> Just pointing out the obvious, for 1PB of data on nodes with 2TB
>>>>> disk each, you will need far more than 500 nodes.
>>>>>
>>>>> 1, it is unwise to run Cassandra with replication factor 1. It

Re: Big Data Question

2023-08-17 Thread daemeon reiydelle
I started to respond, then realized I and the other OP posters are not
thinking the same: What is the business case for availability, data
loss/reload/recoverability? You all argue for higher availability and damn
the cost. But no one asked "can you lose access, for 20 minutes, to a
portion of the data, 10 times a year, on a 250 node cluster in AWS, if it
is not lost"? Can you lose access 1-2 times a year for the cost of a 500
node cluster holding the same data?

Then we can discuss 32/64g JVM and SSD's.
*.*
*Arthur C. Clarke famously said that "technology sufficiently advanced is
indistinguishable from magic." Magic is coming, and it's coming for all of
us*

*Daemeon Reiydelle*
*email: daeme...@gmail.com *
*LI: https://www.linkedin.com/in/daemeonreiydelle/
*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*


On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Was assuming reaper did incremental?  That was probably a bad assumption.
>
> nodetool repair -pr
> I know it well now!
>
> :)
>
> -Joe
>
> On 8/17/2023 4:47 PM, Bowen Song via user wrote:
> > I don't have experience with Cassandra on Kubernetes, so I can't
> > comment on that.
> >
> > For repairs, may I interest you with incremental repairs? It will make
> > repairs hell of a lot faster. Of course, occasional full repair is
> > still needed, but that's another story.
> >
> >
> > On 17/08/2023 21:36, Joe Obernberger wrote:
> >> Thank you.  Enjoying this conversation.
> >> Agree on blade servers, where each blade has a small number of SSDs.
> >> Yeh/Nah to a kubernetes approach assuming fast persistent storage?  I
> >> think that might be easier to manage.
> >>
> >> In my current benchmarks, the performance is excellent, but the
> >> repairs are painful.  I come from the Hadoop world where it was all
> >> about large servers with lots of disk.
> >> Relatively small number of tables, but some have a high number of
> >> rows, 10bil + - we use spark to run across all the data.
> >>
> >> -Joe
> >>
> >> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
> >>> The optimal node size largely depends on the table schema and
> >>> read/write pattern. In some cases 500 GB per node is too large, but
> >>> in some other cases 10TB per node works totally fine. It's hard to
> >>> estimate that without benchmarking.
> >>>
> >>> Again, just pointing out the obvious, you did not count the off-heap
> >>> memory and page cache. 1TB of RAM for 24GB heap * 40 instances is
> >>> definitely not enough. You'll most likely need between 1.5 and 2 TB
> >>> memory for 40x 24GB heap nodes. You may be better off with blade
> >>> servers than single server with gigantic memory and disk sizes.
> >>>
> >>>
> >>> On 17/08/2023 15:46, Joe Obernberger wrote:
>  Thanks for this - yeah - duh - forgot about replication in my example!
>  So - is 2TBytes per Cassandra instance advisable?  Better to use
>  more/less?  Modern 2u servers can be had with 24 3.8TByte SSDs; so
>  assume 80Tbytes per server, you could do:
>  (1024*3)/80 = 39 servers, but you'd have to run 40 instances of
>  Cassandra on each server; maybe 24G of heap per instance, so a
>  server with 1TByte of RAM would work.
>  Is this what folks would do?
> 
>  -Joe
> 
>  On 8/17/2023 9:13 AM, Bowen Song via user wrote:
> > Just pointing out the obvious, for 1PB of data on nodes with 2TB
> > disk each, you will need far more than 500 nodes.
> >
> > 1, it is unwise to run Cassandra with replication factor 1. It
> > usually makes sense to use RF=3, so 1PB data will cost 3PB of
> > storage space, minimal of 1500 such nodes.
> >
> > 2, depending on the compaction strategy you use and the write
> > access pattern, there's a disk space amplification to consider.
> > For example, with STCS, the disk usage can be many times of the
> > actual live data size.
> >
> > 3, you will need some extra free disk space as temporary space for
> > running compactions.
> >
> > 4, the data is rarely going to be perfectly evenly distributed
> > among all nodes, and you need to take that into consideration and
> > size the nodes based on the node with the most data.
> >
> > 5, enough of bad news, here's a good one. Compression will save
> > you (a lot) of disk space!
> >
> > With all the above considered, you probably will end up with a lot
> > more than the 500 nodes you initially thought. Your choice of
> > compaction strategy and compression ratio can dramatically affect
> > this calculation.
> >
> >
> > On 16/08/2023 16:33, Joe Obernberger wrote:
> >> General question on how to configure Cassandra.  Say I have
> >> 1PByte of data to store.  The general rule of thumb is that each
> >> node (or at least instance of Cassandra) shouldn't handle more
> >> than 2TBytes of disk.  That 

Re: Big Data Question

2023-08-17 Thread Joe Obernberger

Was assuming reaper did incremental?  That was probably a bad assumption.

nodetool repair -pr
I know it well now!

:)

-Joe

On 8/17/2023 4:47 PM, Bowen Song via user wrote:
I don't have experience with Cassandra on Kubernetes, so I can't 
comment on that.


For repairs, may I interest you with incremental repairs? It will make 
repairs hell of a lot faster. Of course, occasional full repair is 
still needed, but that's another story.



On 17/08/2023 21:36, Joe Obernberger wrote:

Thank you.  Enjoying this conversation.
Agree on blade servers, where each blade has a small number of SSDs.  
Yeh/Nah to a kubernetes approach assuming fast persistent storage?  I 
think that might be easier to manage.


In my current benchmarks, the performance is excellent, but the 
repairs are painful.  I come from the Hadoop world where it was all 
about large servers with lots of disk.
Relatively small number of tables, but some have a high number of 
rows, 10bil + - we use spark to run across all the data.


-Joe

On 8/17/2023 12:13 PM, Bowen Song via user wrote:
The optimal node size largely depends on the table schema and 
read/write pattern. In some cases 500 GB per node is too large, but 
in some other cases 10TB per node works totally fine. It's hard to 
estimate that without benchmarking.


Again, just pointing out the obvious, you did not count the off-heap 
memory and page cache. 1TB of RAM for 24GB heap * 40 instances is 
definitely not enough. You'll most likely need between 1.5 and 2 TB 
memory for 40x 24GB heap nodes. You may be better off with blade 
servers than single server with gigantic memory and disk sizes.



On 17/08/2023 15:46, Joe Obernberger wrote:

Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use 
more/less?  Modern 2u servers can be had with 24 3.8TByte SSDs; so 
assume 80Tbytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of 
Cassandra on each server; maybe 24G of heap per instance, so a 
server with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:
Just pointing out the obvious, for 1PB of data on nodes with 2TB 
disk each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It 
usually makes sense to use RF=3, so 1PB data will cost 3PB of 
storage space, a minimum of 1,500 such nodes.


2, depending on the compaction strategy you use and the write 
access pattern, there's a disk space amplification to consider. 
For example, with STCS, the disk usage can be many times the 
actual live data size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed 
among all nodes, and you need to take that into consideration and 
size the nodes based on the node with the most data.


5, enough of the bad news; here's a good one. Compression will save 
you (a lot of) disk space!


With all the above considered, you probably will end up with a lot 
more than the 500 nodes you initially thought. Your choice of 
compaction strategy and compression ratio can dramatically affect 
this calculation.
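
To tie the five points above together, a hedged sizing sketch; every factor
below is an assumption to be replaced with numbers from your own benchmarks
(compaction amplification in particular varies a lot between STCS and LCS):

import math

def nodes_needed(live_data_tb,
                 rf=3,
                 compression_ratio=0.5,           # assumed on-disk / raw ratio
                 compaction_amplification=1.5,    # assumed; worse for STCS
                 compaction_headroom=1.2,         # assumed free space for compactions
                 imbalance=1.2,                   # assumed hottest node vs. average
                 usable_disk_per_node_tb=2.0):
    """Very rough node count for a target live data set size."""
    on_disk_tb = (live_data_tb * rf * compression_ratio
                  * compaction_amplification * compaction_headroom * imbalance)
    return math.ceil(on_disk_tb / usable_disk_per_node_tb)

print(nodes_needed(1024))  # 1659 nodes for ~1 PB with these illustrative factors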



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 
1PByte of data to store.  The general rule of thumb is that each 
node (or at least instance of Cassandra) shouldn't handle more 
than 2TBytes of disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
Portworx etc.), would using Kubernetes or some orchestration 
layer to handle those nodes be a viable approach? Perhaps the 
worker nodes would have enough RAM to run 4 instances (pods) of 
Cassandra, you would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD 
devices - one for OS, four for each instance of Cassandra running 
on that server.  Then build some scripts/ansible/puppet that 
would manage Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance. How 
is that handled 'in the real world'?  With seed nodes, how many 
would you have in such a configuration?

Thanks for any thoughts!

-Joe










Re: Big Data Question

2023-08-17 Thread Bowen Song via user
I don't have experience with Cassandra on Kubernetes, so I can't comment 
on that.


For repairs, may I interest you with incremental repairs? It will make 
repairs hell of a lot faster. Of course, occasional full repair is still 
needed, but that's another story.
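
For what it's worth, a minimal sketch of driving repairs from a script, one
node at a time, in the spirit of the "scripts/ansible/puppet" idea mentioned
earlier; it just shells out to the same nodetool repair -pr command discussed
in this thread. The hostnames, keyspace and JMX port are placeholders, and
whether you want incremental, full or -pr repairs depends on your Cassandra
version and repair strategy (tools like Reaper schedule this for you).

import subprocess

HOSTS = ["cassandra-01", "cassandra-02", "cassandra-03"]   # placeholder hostnames
KEYSPACE = "my_keyspace"                                   # placeholder keyspace
JMX_PORT = "7199"

def repair_primary_ranges():
    """Run 'nodetool repair -pr' against one node at a time, so only one
    node's primary ranges are being repaired at any moment."""
    for host in HOSTS:
        cmd = ["nodetool", "-h", host, "-p", JMX_PORT, "repair", "-pr", KEYSPACE]
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)   # stop loudly if a repair fails

if __name__ == "__main__":
    repair_primary_ranges()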



On 17/08/2023 21:36, Joe Obernberger wrote:

Thank you.  Enjoying this conversation.
Agree on blade servers, where each blade has a small number of SSDs.  
Yeh/Nah to a kubernetes approach assuming fast persistent storage?  I 
think that might be easier to manage.


In my current benchmarks, the performance is excellent, but the 
repairs are painful.  I come from the Hadoop world where it was all 
about large servers with lots of disk.
Relatively small number of tables, but some have a high number of 
rows, 10bil + - we use spark to run across all the data.


-Joe

On 8/17/2023 12:13 PM, Bowen Song via user wrote:
The optimal node size largely depends on the table schema and 
read/write pattern. In some cases 500 GB per node is too large, but 
in some other cases 10TB per node works totally fine. It's hard to 
estimate that without benchmarking.


Again, just pointing out the obvious, you did not count the off-heap 
memory and page cache. 1TB of RAM for 24GB heap * 40 instances is 
definitely not enough. You'll most likely need between 1.5 and 2 TB 
memory for 40x 24GB heap nodes. You may be better off with blade 
servers than single server with gigantic memory and disk sizes.



On 17/08/2023 15:46, Joe Obernberger wrote:

Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use 
more/less?  Modern 2u servers can be had with 24 3.8TByte SSDs; so 
assume 80Tbytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of 
Cassandra on each server; maybe 24G of heap per instance, so a 
server with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:
Just pointing out the obvious, for 1PB of data on nodes with 2TB 
disk each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It 
usually makes sense to use RF=3, so 1PB data will cost 3PB of 
storage space, a minimum of 1,500 such nodes.


2, depending on the compaction strategy you use and the write 
access pattern, there's a disk space amplification to consider. For 
example, with STCS, the disk usage can be many times the actual 
live data size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed 
among all nodes, and you need to take that into consideration and 
size the nodes based on the node with the most data.


5, enough of the bad news; here's a good one. Compression will save you 
(a lot of) disk space!


With all the above considered, you probably will end up with a lot 
more than the 500 nodes you initially thought. Your choice of 
compaction strategy and compression ratio can dramatically affect 
this calculation.



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 1PByte 
of data to store.  The general rule of thumb is that each node (or 
at least instance of Cassandra) shouldn't handle more than 2TBytes 
of disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
Portworx etc.), would using Kubernetes or some orchestration 
layer to handle those nodes be a viable approach? Perhaps the 
worker nodes would have enough RAM to run 4 instances (pods) of 
Cassandra, you would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD 
devices - one for OS, four for each instance of Cassandra running 
on that server.  Then build some scripts/ansible/puppet that would 
manage Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance. How 
is that handled 'in the real world'?  With seed nodes, how many 
would you have in such a configuration?

Thanks for any thoughts!

-Joe








Re: Big Data Question

2023-08-17 Thread Joe Obernberger

Thank you.  Enjoying this conversation.
Agree on blade servers, where each blade has a small number of SSDs.  
Yeh/Nah to a kubernetes approach assuming fast persistent storage?  I 
think that might be easier to manage.


In my current benchmarks, the performance is excellent, but the repairs 
are painful.  I come from the Hadoop world where it was all about large 
servers with lots of disk.
Relatively small number of tables, but some have a high number of rows, 
10bil + - we use spark to run across all the data.


-Joe

On 8/17/2023 12:13 PM, Bowen Song via user wrote:
The optimal node size largely depends on the table schema and 
read/write pattern. In some cases 500 GB per node is too large, but in 
some other cases 10TB per node works totally fine. It's hard to 
estimate that without benchmarking.


Again, just pointing out the obvious, you did not count the off-heap 
memory and page cache. 1TB of RAM for 24GB heap * 40 instances is 
definitely not enough. You'll most likely need between 1.5 and 2 TB 
memory for 40x 24GB heap nodes. You may be better off with blade 
servers than single server with gigantic memory and disk sizes.



On 17/08/2023 15:46, Joe Obernberger wrote:

Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use 
more/less?  Modern 2u servers can be had with 24 3.8TByte SSDs; so 
assume 80Tbytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of 
Cassandra on each server; maybe 24G of heap per instance, so a server 
with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:
Just pointing out the obvious, for 1PB of data on nodes with 2TB 
disk each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It 
usually makes sense to use RF=3, so 1PB data will cost 3PB of 
storage space, a minimum of 1,500 such nodes.


2, depending on the compaction strategy you use and the write access 
pattern, there's a disk space amplification to consider. For 
example, with STCS, the disk usage can be many times the actual 
live data size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed among 
all nodes, and you need to take that into consideration and size the 
nodes based on the node with the most data.


5, enough of the bad news; here's a good one. Compression will save you 
(a lot of) disk space!


With all the above considered, you probably will end up with a lot 
more than the 500 nodes you initially thought. Your choice of 
compaction strategy and compression ratio can dramatically affect 
this calculation.



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 1PByte 
of data to store.  The general rule of thumb is that each node (or 
at least instance of Cassandra) shouldn't handle more than 2TBytes 
of disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
Portworx etc.), would using Kubernetes or some orchestration 
layer to handle those nodes be a viable approach? Perhaps the 
worker nodes would have enough RAM to run 4 instances (pods) of 
Cassandra, you would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD 
devices - one for OS, four for each instance of Cassandra running 
on that server.  Then build some scripts/ansible/puppet that would 
manage Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance. How 
is that handled 'in the real world'?  With seed nodes, how many 
would you have in such a configuration?

Thanks for any thoughts!

-Joe








Re: Big Data Question

2023-08-17 Thread Bowen Song via user
From my experience, that's not entirely true. For large nodes, the 
bottleneck is usually the JVM garbage collector. The GC pauses can 
easily get out of control on very large heaps, and long STW pauses may 
also result in nodes flapping up and down from other nodes' perspective, 
which often renders the entire cluster unstable.


Using RF=1 is also strongly discouraged, even with reliable and durable 
storage. By going with RF=1, you don't only lose the data replication, 
but also the high-availability. If any node becomes unavailable in the 
cluster, it will render the entire token range(s) owned by that node 
inaccessible, causing (some or all) CQL queries to fail. This means many 
routine maintenance tasks, such as upgrading and restarting nodes, are 
going to introduce downtime for the cluster. To ensure strong 
consistency and HA, RF=3 is recommended.
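
A small sketch of the arithmetic behind that recommendation (my framing, not
Bowen's): with RF=3, a QUORUM read or write still succeeds with one replica of
a range down, while RF=1 cannot tolerate losing anything.

def replicas_you_can_lose(rf, consistency):
    """How many replicas of a token range can be down while requests at the
    given consistency level still succeed (simplified single-DC view)."""
    required = {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[consistency]
    return rf - required

print(replicas_you_can_lose(1, "ONE"))     # 0 -> any node outage means unavailability
print(replicas_you_can_lose(3, "QUORUM"))  # 1 -> a node can be restarted or replaced safely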



On 17/08/2023 20:40, daemeon reiydelle wrote:
A lot of (actually all) seem to be based on local nodes with 1gb 
networks of spinning rust. Much of what is mentioned below is TOTALLY 
wrong for cloud. So clarify whether you are "real world" or rusty slow 
data center world (definitely not modern DC either).


E.g. should not handle more than 2tb of ACTIVE disk, and that was for 
spinning rust with maybe 1gb networks. 10tb of modern high speed SSD 
is more typical with 10 or 40gb networks. If data is persisted to 
cloud storage, replication should be 1, vm's fail over to new 
hardware. Obviously if your storage is ephemeral, you have a different 
discussion. More of a monologue with an idiot in Finance, but 

/./
/Arthur C. Clarke famously said that "technology sufficiently advanced 
is indistinguishable from magic." Magic is coming, and it's coming for 
all of us/

*Daemeon Reiydelle*
*email: daeme...@gmail.com*
*LI: https://www.linkedin.com/in/daemeonreiydelle/*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*


On Thu, Aug 17, 2023 at 6:13 AM Bowen Song via user 
 wrote:


Just pointing out the obvious, for 1PB of data on nodes with 2TB disk
each, you will need far more than 500 nodes.

1, it is unwise to run Cassandra with replication factor 1. It
usually
makes sense to use RF=3, so 1PB data will cost 3PB of storage space,
minimal of 1500 such nodes.

2, depending on the compaction strategy you use and the write access
pattern, there's a disk space amplification to consider. For example,
with STCS, the disk usage can be many times of the actual live
data size.

3, you will need some extra free disk space as temporary space for
running compactions.

4, the data is rarely going to be perfectly evenly distributed
among all
nodes, and you need to take that into consideration and size the
nodes
based on the node with the most data.

5, enough of bad news, here's a good one. Compression will save
you (a
lot) of disk space!

With all the above considered, you probably will end up with a lot
more
than the 500 nodes you initially thought. Your choice of compaction
strategy and compression ratio can dramatically affect this
calculation.


On 16/08/2023 16:33, Joe Obernberger wrote:
> General question on how to configure Cassandra.  Say I have
1PByte of
> data to store.  The general rule of thumb is that each node (or at
> least instance of Cassandra) shouldn't handle more than 2TBytes of
> disk.  That means 500 instances of Cassandra.
>
> Assuming you have very fast persistent storage (such as a NetApp,
> PorterWorx etc.), would using Kubernetes or some orchestration
layer
> to handle those nodes be a viable approach?  Perhaps the worker
nodes
> would have enough RAM to run 4 instances (pods) of Cassandra, you
> would need 125 servers.
> Another approach is to build your servers with 5 (or more) SSD
devices
> - one for OS, four for each instance of Cassandra running on that
> server.  Then build some scripts/ansible/puppet that would manage
> Cassandra start/stops, and other maintenance items.
>
> Where I think this runs into problems is with repairs, or
> sstablescrubs that can take days to run on a single instance. 
How is
> that handled 'in the real world'?  With seed nodes, how many
would you
> have in such a configuration?
> Thanks for any thoughts!
>
> -Joe
>
>


Re: Big Data Question

2023-08-17 Thread daemeon reiydelle
A lot (actually all) of these replies seem to be based on local nodes with 1Gb
networks of spinning rust. Much of what is mentioned below is TOTALLY wrong for
cloud. So clarify whether you are "real world" or rusty slow data center
world (definitely not a modern DC either).

E.g., "should not handle more than 2TB of ACTIVE disk" was guidance for
spinning rust with maybe 1Gb networks; 10TB of modern high-speed SSD is
more typical with 10 or 40Gb networks. If data is persisted to cloud
storage, replication should be 1, and VMs fail over to new hardware. Obviously
if your storage is ephemeral, you have a different discussion. More of a
monologue with an idiot in Finance, but 
*.*
*Arthur C. Clarke famously said that "technology sufficiently advanced is
indistinguishable from magic." Magic is coming, and it's coming for all of
us*

*Daemeon Reiydelle*
*email: daeme...@gmail.com *
*LI: https://www.linkedin.com/in/daemeonreiydelle/
*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*


On Thu, Aug 17, 2023 at 6:13 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Just pointing out the obvious, for 1PB of data on nodes with 2TB disk
> each, you will need far more than 500 nodes.
>
> 1, it is unwise to run Cassandra with replication factor 1. It usually
> makes sense to use RF=3, so 1PB data will cost 3PB of storage space,
> minimal of 1500 such nodes.
>
> 2, depending on the compaction strategy you use and the write access
> pattern, there's a disk space amplification to consider. For example,
> with STCS, the disk usage can be many times of the actual live data size.
>
> 3, you will need some extra free disk space as temporary space for
> running compactions.
>
> 4, the data is rarely going to be perfectly evenly distributed among all
> nodes, and you need to take that into consideration and size the nodes
> based on the node with the most data.
>
> 5, enough of bad news, here's a good one. Compression will save you (a
> lot) of disk space!
>
> With all the above considered, you probably will end up with a lot more
> than the 500 nodes you initially thought. Your choice of compaction
> strategy and compression ratio can dramatically affect this calculation.
>
>
> On 16/08/2023 16:33, Joe Obernberger wrote:
> > General question on how to configure Cassandra.  Say I have 1PByte of
> > data to store.  The general rule of thumb is that each node (or at
> > least instance of Cassandra) shouldn't handle more than 2TBytes of
> > disk.  That means 500 instances of Cassandra.
> >
> > Assuming you have very fast persistent storage (such as a NetApp,
> > PorterWorx etc.), would using Kubernetes or some orchestration layer
> > to handle those nodes be a viable approach?  Perhaps the worker nodes
> > would have enough RAM to run 4 instances (pods) of Cassandra, you
> > would need 125 servers.
> > Another approach is to build your servers with 5 (or more) SSD devices
> > - one for OS, four for each instance of Cassandra running on that
> > server.  Then build some scripts/ansible/puppet that would manage
> > Cassandra start/stops, and other maintenance items.
> >
> > Where I think this runs into problems is with repairs, or
> > sstablescrubs that can take days to run on a single instance.  How is
> > that handled 'in the real world'?  With seed nodes, how many would you
> > have in such a configuration?
> > Thanks for any thoughts!
> >
> > -Joe
> >
> >
>


Re: Big Data Question

2023-08-17 Thread Bowen Song via user
The optimal node size largely depends on the table schema and read/write
pattern. In some cases 500 GB per node is too large, but in other cases
10TB per node works totally fine. It's hard to estimate that without
benchmarking.


Again, just pointing out the obvious: you did not count the off-heap
memory and the page cache. 1TB of RAM for a 24GB heap * 40 instances is
definitely not enough. You'll most likely need between 1.5 and 2 TB of
memory for 40x 24GB-heap nodes. You may be better off with blade servers
than a single server with gigantic memory and disk sizes.
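For a rough sense of where that 1.5-2 TB figure comes from, here is a
minimal sketch; the per-instance off-heap and page-cache allowances below
are assumptions for illustration, not measured values.

    # Rough per-server memory budget for N co-located Cassandra instances.
    # Off-heap, page-cache and OS figures are illustrative assumptions.
    instances      = 40
    heap_gb        = 24   # proposed JVM heap per instance
    offheap_gb     = 8    # direct memory, bloom filters, index summaries,
                          # compression metadata, off-heap memtables (assumed)
    page_cache_gb  = 8    # OS page cache desired per instance (assumed)
    os_overhead_gb = 16   # kernel, agents, monitoring for the whole box (assumed)

    total_gb = instances * (heap_gb + offheap_gb + page_cache_gb) + os_overhead_gb
    print(f"~{total_gb / 1024:.1f} TB of RAM for {instances} instances")
    # -> ~1.6 TB with these assumptions, in line with the 1.5-2 TB estimate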



On 17/08/2023 15:46, Joe Obernberger wrote:

Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use 
more/less?  Modern 2u servers can be had with 24 3.8TBtyte SSDs; so 
assume 80Tbytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of 
Cassandra on each server; maybe 24G of heap per instance, so a server 
with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:
Just pointing out the obvious, for 1PB of data on nodes with 2TB disk 
each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It 
usually makes sense to use RF=3, so 1PB data will cost 3PB of storage 
space, minimal of 1500 such nodes.


2, depending on the compaction strategy you use and the write access 
pattern, there's a disk space amplification to consider. For example, 
with STCS, the disk usage can be many times of the actual live data 
size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed among 
all nodes, and you need to take that into consideration and size the 
nodes based on the node with the most data.


5, enough of bad news, here's a good one. Compression will save you 
(a lot) of disk space!


With all the above considered, you probably will end up with a lot 
more than the 500 nodes you initially thought. Your choice of 
compaction strategy and compression ratio can dramatically affect 
this calculation.



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 1PByte 
of data to store.  The general rule of thumb is that each node (or 
at least instance of Cassandra) shouldn't handle more than 2TBytes 
of disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
PorterWorx etc.), would using Kubernetes or some orchestration layer 
to handle those nodes be a viable approach? Perhaps the worker nodes 
would have enough RAM to run 4 instances (pods) of Cassandra, you 
would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD 
devices - one for OS, four for each instance of Cassandra running on 
that server.  Then build some scripts/ansible/puppet that would 
manage Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance. How is 
that handled 'in the real world'?  With seed nodes, how many would 
you have in such a configuration?

Thanks for any thoughts!

-Joe






RE: Big Data Question

2023-08-17 Thread Durity, Sean R via user
For a variety of reasons, we have clusters with 5 TB of disk per host as a
“standard.” In our larger data clusters, it does take longer to add/remove
nodes or do things like upgradesstables after an upgrade. These nodes have
3+TB of actual data on the drive. But we were able to shrink the node count
from our days of using 1 or 2 TB of disk. Lots of potential cost tradeoffs
to consider: licensing/support, server cost, maintenance time, more or
fewer servers that can fail, number of (expensive?!) switch ports used, etc.

NOTE: this is 3.x experience, not 4.x with faster streaming.

Sean R. Durity



From: Joe Obernberger 
Sent: Thursday, August 17, 2023 10:46 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Big Data Question



Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use
more/less?  Modern 2u servers can be had with 24 3.8TBtyte SSDs; so
assume 80Tbytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of
Cassandra on each server; maybe 24G of heap per instance, so a server
with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:

> Just pointing out the obvious, for 1PB of data on nodes with 2TB disk
> each, you will need far more than 500 nodes.
>
> 1, it is unwise to run Cassandra with replication factor 1. It usually
> makes sense to use RF=3, so 1PB data will cost 3PB of storage space,
> minimal of 1500 such nodes.
>
> 2, depending on the compaction strategy you use and the write access
> pattern, there's a disk space amplification to consider. For example,
> with STCS, the disk usage can be many times of the actual live data size.
>
> 3, you will need some extra free disk space as temporary space for
> running compactions.
>
> 4, the data is rarely going to be perfectly evenly distributed among
> all nodes, and you need to take that into consideration and size the
> nodes based on the node with the most data.
>
> 5, enough of bad news, here's a good one. Compression will save you (a
> lot) of disk space!
>
> With all the above considered, you probably will end up with a lot
> more than the 500 nodes you initially thought. Your choice of
> compaction strategy and compression ratio can dramatically affect this
> calculation.
>
>
> On 16/08/2023 16:33, Joe Obernberger wrote:
>> General question on how to configure Cassandra.  Say I have 1PByte of
>> data to store.  The general rule of thumb is that each node (or at
>> least instance of Cassandra) shouldn't handle more than 2TBytes of
>> disk.  That means 500 instances of Cassandra.
>>
>> Assuming you have very fast persistent storage (such as a NetApp,
>> PorterWorx etc.), would using Kubernetes or some orchestration layer
>> to handle those nodes be a viable approach? Perhaps the worker nodes
>> would have enough RAM to run 4 instances (pods) of Cassandra, you
>> would need 125 servers.
>> Another approach is to build your servers with 5 (or more) SSD
>> devices - one for OS, four for each instance of Cassandra running on
>> that server.  Then build some scripts/ansible/puppet that would
>> manage Cassandra start/stops, and other maintenance items.
>>
>> Where I think this runs into problems is with repairs, or
>> sstablescrubs that can take days to run on a single instance. How is
>> that handled 'in the real world'?  With seed nodes, how many would
>> you have in such a configuration?
>> Thanks for any thoughts!
>>
>> -Joe
>>
>>







Re: Big Data Question

2023-08-17 Thread C. Scott Andreas

A few thoughts on this:

– 80TB per machine is pretty dense. Consider the amount of data you'd need
to re-replicate in the event of a hardware failure that takes down all 80TB
(a DIMM failure requiring replacement, a non-redundant PSU failure, NIC,
etc.).

– 24GB of heap is also pretty generous. Depending on how you're using
Cassandra, you may be able to get by with ~half of this (though keep in
mind the additional direct memory / off-heap space required if you're using
off-heap merkle trees).

– 40 instances per machine can be a lot to manage. You can reduce this and
address multiple physical drives per instance either by RAID-0'ing them
together, or by using Cassandra in a JBOD configuration (multiple data
directories per instance).

– Remember to consider the ratio of available CPU vs. the amount of storage
you're addressing per machine in your configuration. It's easy to spec a
box that maxes out on disk without enough oomph to serve user queries and
compaction over that amount of storage.

– You'll want to run some smaller-scale perf testing to determine this
ratio. The good news is that you mostly need to stress the throughput of a
replica set rather than an entire cluster. Small-scale POCs will generally
map well to larger clusters, so long as the total count of Cassandra
processes isn't more than a couple thousand.

– At this scale, small improvements can go a very long way. If your data is
compressible (i.e., not pre-compressed/encrypted prior to being stored in
Cassandra), you'll likely want to use ZStandard rather than LZ4 - and
possibly at a higher ratio than the default. Test a set of input data with
different ZStandard compression levels. You may save > 10% of storage
relative to LZ4 by doing so without sacrificing much in terms of CPU.

On Aug 17, 2023, at 7:46 AM, Joe Obernberger wrote:

> Thanks for this - yeah - duh - forgot about replication in my example!
> So - is 2TBytes per Cassandra instance advisable?  Better to use
> more/less?  Modern 2u servers can be had with 24 3.8TBtyte SSDs; so
> assume 80Tbytes per server, you could do:
> (1024*3)/80 = 39 servers, but you'd have to run 40 instances of
> Cassandra on each server; maybe 24G of heap per instance, so a server
> with 1TByte of RAM would work.
> Is this what folks would do?
> -Joe
>
> On 8/17/2023 9:13 AM, Bowen Song via user wrote:
>> Just pointing out the obvious, for 1PB of data on nodes with 2TB disk
>> each, you will need far more than 500 nodes.
>> 1, it is unwise to run Cassandra with replication factor 1. It usually
>> makes sense to use RF=3, so 1PB data will cost 3PB of storage space,
>> minimal of 1500 such nodes.
>> 2, depending on the compaction strategy you use and the write access
>> pattern, there's a disk space amplification to consider. For example,
>> with STCS, the disk usage can be many times of the actual live data size.
>> 3, you will need some extra free disk space as temporary space for
>> running compactions.
>> 4, the data is rarely going to be perfectly evenly distributed among
>> all nodes, and you need to take that into consideration and size the
>> nodes based on the node with the most data.
>> 5, enough of bad news, here's a good one. Compression will save you (a
>> lot) of disk space!
>> With all the above considered, you probably will end up with a lot
>> more than the 500 nodes you initially thought. Your choice of
>> compaction strategy and compression ratio can dramatically affect this
>> calculation.
>>
>> On 16/08/2023 16:33, Joe Obernberger wrote:
>>> General question on how to configure Cassandra.  Say I have 1PByte of
>>> data to store.  The general rule of thumb is that each node (or at
>>> least instance of Cassandra) shouldn't handle more than 2TBytes of
>>> disk.  That means 500 instances of Cassandra.
>>> Assuming you have very fast persistent storage (such as a NetApp,
>>> PorterWorx etc.), would using Kubernetes or some orchestration layer
>>> to handle those nodes be a viable approach? Perhaps the worker nodes
>>> would have enough RAM to run 4 instances (pods) of Cassandra, you
>>> would need 125 servers.
>>> Another approach is to build your servers with 5 (or more) SSD
>>> devices - one for OS, four for each instance of Cassandra running on
>>> that server.  Then build some scripts/ansible/puppet that would
>>> manage Cassandra start/stops, and other maintenance items.
>>> Where I think this runs into problems is with repairs, or
>>> sstablescrubs that can take days to run on a single instance. How is
>>> that handled 'in the real world'?  With seed nodes, how many would
>>> you have in such a configuration?
>>> Thanks for any thoughts!
>>> -Joe
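One way to run that compression-level test offline, as a rough sketch only:
it assumes the third-party Python packages 'lz4' and 'zstandard' are
installed, uses a placeholder file name for a representative sample of your
data, and compresses in 16 KiB chunks to roughly mimic Cassandra's chunked
compression (adjust to your own chunk_length_in_kb). It approximates, rather
than reproduces, what LZ4Compressor/ZstdCompressor would do inside Cassandra.

    # Compare LZ4 against ZStandard at several levels on a data sample.
    import lz4.block
    import zstandard as zstd

    CHUNK = 16 * 1024  # compress in 16 KiB chunks (assumed chunk size)
    data = open("sample_rows.bin", "rb").read()   # placeholder sample file
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

    lz4_total = sum(len(lz4.block.compress(c, store_size=False)) for c in chunks)
    print(f"LZ4:     {lz4_total / len(data):.3f} of original size")

    for level in (1, 3, 6, 9):
        cctx = zstd.ZstdCompressor(level=level)
        z_total = sum(len(cctx.compress(c)) for c in chunks)
        print(f"zstd -{level}: {z_total / len(data):.3f} of original size, "
              f"{100 * (1 - z_total / lz4_total):.1f}% smaller than LZ4")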

Re: Big Data Question

2023-08-17 Thread Joe Obernberger

Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use 
more/less?  Modern 2u servers can be had with 24 3.8TByte SSDs; so 
assume 80Tbytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of 
Cassandra on each server; maybe 24G of heap per instance, so a server 
with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:
Just pointing out the obvious, for 1PB of data on nodes with 2TB disk 
each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It usually 
makes sense to use RF=3, so 1PB data will cost 3PB of storage space, 
minimal of 1500 such nodes.


2, depending on the compaction strategy you use and the write access 
pattern, there's a disk space amplification to consider. For example, 
with STCS, the disk usage can be many times of the actual live data size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed among 
all nodes, and you need to take that into consideration and size the 
nodes based on the node with the most data.


5, enough of bad news, here's a good one. Compression will save you (a 
lot) of disk space!


With all the above considered, you probably will end up with a lot 
more than the 500 nodes you initially thought. Your choice of 
compaction strategy and compression ratio can dramatically affect this 
calculation.



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 1PByte of 
data to store.  The general rule of thumb is that each node (or at 
least instance of Cassandra) shouldn't handle more than 2TBytes of 
disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
PorterWorx etc.), would using Kubernetes or some orchestration layer 
to handle those nodes be a viable approach? Perhaps the worker nodes 
would have enough RAM to run 4 instances (pods) of Cassandra, you 
would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD 
devices - one for OS, four for each instance of Cassandra running on 
that server.  Then build some scripts/ansible/puppet that would 
manage Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance. How is 
that handled 'in the real world'?  With seed nodes, how many would 
you have in such a configuration?

Thanks for any thoughts!

-Joe






Re: Big Data Question

2023-08-17 Thread Bowen Song via user
Just pointing out the obvious, for 1PB of data on nodes with 2TB disk
each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It usually
makes sense to use RF=3, so 1PB of data will cost 3PB of storage space,
a minimum of 1,500 such nodes.


2, depending on the compaction strategy you use and the write access
pattern, there's a disk space amplification to consider. For example,
with STCS, the disk usage can be many times the actual live data size.


3, you will need some extra free disk space as temporary space for
running compactions.


4, the data is rarely going to be perfectly evenly distributed among all
nodes, and you need to take that into consideration and size the nodes
based on the node with the most data.


5, enough bad news; here's a good one. Compression will save you (a lot
of) disk space!


With all the above considered, you probably will end up with a lot more
than the 500 nodes you initially thought. Your choice of compaction
strategy and compression ratio can dramatically affect this calculation.



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 1PByte of 
data to store.  The general rule of thumb is that each node (or at 
least instance of Cassandra) shouldn't handle more than 2TBytes of 
disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
PorterWorx etc.), would using Kubernetes or some orchestration layer 
to handle those nodes be a viable approach?  Perhaps the worker nodes 
would have enough RAM to run 4 instances (pods) of Cassandra, you 
would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD devices 
- one for OS, four for each instance of Cassandra running on that 
server.  Then build some scripts/ansible/puppet that would manage 
Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance.  How is 
that handled 'in the real world'?  With seed nodes, how many would you 
have in such a configuration?

Thanks for any thoughts!

-Joe




Re: Big Data Question

2023-08-16 Thread Jeff Jirsa
A lot of things depend on actual cluster config - compaction settings (LCS
vs STCS vs TWCS) and token allocation (single token, vnodes, etc) matter a
ton.

With 4.0 and LCS, streaming for replacement is MUCH faster, so much so that
most people should be fine with 4-8TB/node, because the rebuild time is
decreased by an order of magnitude.
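To make the recovery-time argument concrete, here is a rough sketch; the
streaming throughput figures are assumptions for illustration, not
benchmark results, so plug in the rates you actually measure.

    # Rough time to restream a replacement node at different densities.
    densities_tb = [2, 4, 8]
    throughput_mb_s = {
        "slower streaming (assumed ~100 MB/s)": 100,
        "4.0 zero-copy w/ LCS (assumed ~1 GB/s)": 1000,
    }

    for label, rate in throughput_mb_s.items():
        for tb in densities_tb:
            hours = tb * 1024 * 1024 / rate / 3600
            print(f"{label}: {tb} TB/node -> ~{hours:.1f} h to restream")
    # e.g. 8 TB at ~100 MB/s is roughly a day of streaming; at ~1 GB/s it is
    # a couple of hours, which is why the density limit can relax.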

If you happen to have large physical machines, running multiple instances
on a machine (each with a single token, and making sure you match rack
awareness) sorta approximates vnodes without some of the unpleasant side
effects.
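If you do run several single-token instances per machine, here is a minimal
sketch of the token-spacing arithmetic, assuming Murmur3Partitioner's
-2^63..2^63-1 range; the instance count and machine layout are made up for
illustration only.

    # Evenly spaced initial_token values for a single-token-per-instance ring.
    def single_tokens(num_instances: int) -> list[int]:
        ring = 2**64  # Murmur3Partitioner token space
        return [-(2**63) + (i * ring) // num_instances
                for i in range(num_instances)]

    # e.g. 8 instances spread over 2 physical machines, 4 instances each;
    # alternate machines/racks so replicas don't land on the same box.
    for i, tok in enumerate(single_tokens(8)):
        print(f"instance {i} (machine {i % 2}): initial_token: {tok}")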

If you happen to run on more-reliable-storage (like EBS, or a SAN, and you
understand what that means from a business continuity perspective), then
you can assume that your rebuild frequency is probably an order of
magnitude less often, so you can adjust your risk calculation based on
measured reliability there (again, EBS and other disaggregated disks still
fail, just less often than single physical flash devices).

Seed nodes never really need to change significantly. You should be fine
with 2-3 per DC no matter the instance count.




On Wed, Aug 16, 2023 at 8:34 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> General question on how to configure Cassandra.  Say I have 1PByte of
> data to store.  The general rule of thumb is that each node (or at least
> instance of Cassandra) shouldn't handle more than 2TBytes of disk.  That
> means 500 instances of Cassandra.
>
> Assuming you have very fast persistent storage (such as a NetApp,
> PorterWorx etc.), would using Kubernetes or some orchestration layer to
> handle those nodes be a viable approach?  Perhaps the worker nodes would
> have enough RAM to run 4 instances (pods) of Cassandra, you would need
> 125 servers.
> Another approach is to build your servers with 5 (or more) SSD devices -
> one for OS, four for each instance of Cassandra running on that server.
> Then build some scripts/ansible/puppet that would manage Cassandra
> start/stops, and other maintenance items.
>
> Where I think this runs into problems is with repairs, or sstablescrubs
> that can take days to run on a single instance.  How is that handled 'in
> the real world'?  With seed nodes, how many would you have in such a
> configuration?
> Thanks for any thoughts!
>
> -Joe
>
>
>


Big Data Question

2023-08-16 Thread Joe Obernberger
General question on how to configure Cassandra.  Say I have 1PByte of 
data to store.  The general rule of thumb is that each node (or at least 
instance of Cassandra) shouldn't handle more than 2TBytes of disk.  That 
means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
PorterWorx etc.), would using Kubernetes or some orchestration layer to 
handle those nodes be a viable approach?  Perhaps the worker nodes would 
have enough RAM to run 4 instances (pods) of Cassandra; you would then need 
125 servers.
Another approach is to build your servers with 5 (or more) SSD devices - 
one for OS, four for each instance of Cassandra running on that server.  
Then build some scripts/ansible/puppet that would manage Cassandra 
start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or sstablescrubs 
that can take days to run on a single instance.  How is that handled 'in 
the real world'?  With seed nodes, how many would you have in such a 
configuration?

Thanks for any thoughts!

-Joe

