Re: Cluster sizing for huge dataset

2019-10-04 Thread DuyHai Doan
The problem is that the user also wants to access the old data using CQL, not popping up a SparkSQL session just to fetch one or two old records. On 4 Oct 2019 12:38, "Cedrick Lunven" wrote: > Hi, > > If you are using DataStax Enterprise, why not offload cold data to DSEFS > (HDFS implementation) with

Re: Cluster sizing for huge dataset

2019-10-04 Thread Cedrick Lunven
Hi, If you are using DataStax Enterprise, why not offload cold data to DSEFS (an HDFS implementation) in an analytics-friendly storage format like Parquet, and keep only the OLTP data in the Cassandra tables? The recommended size for DSEFS can go up to 30TB per node. I am pretty sure you are already aware of this
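For concreteness, a minimal PySpark sketch of that kind of offload is below, assuming a DSE Analytics (or Spark + spark-cassandra-connector) environment; the iot.sensor_readings table, the cutoff date and the dsefs:///cold/... path are made up for illustration.

```python
# Hypothetical sketch: copy cold rows from a Cassandra table to Parquet on DSEFS.
# Assumes the spark-cassandra-connector is on the classpath (it is in DSE Analytics).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cold-data-offload").getOrCreate()

# Read the live table through the spark-cassandra-connector.
readings = (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="iot", table="sensor_readings")
            .load())

# Keep only rows older than the (made-up) cold-data cutoff.
cold = (readings
        .where(F.col("reading_time") < F.lit("2017-10-01"))
        .withColumn("reading_date", F.to_date("reading_time")))

# Partitioning by day keeps the Parquet layout friendly for later analytics scans.
(cold.write
     .mode("append")
     .partitionBy("reading_date")
     .parquet("dsefs:///cold/sensor_readings/"))
```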

Re: Cluster sizing for huge dataset

2019-10-01 Thread DuyHai Doan
The client wants to be able to access cold data (2 years old) in the same cluster, so moving the data to another system is not possible. However, since we're using DataStax Enterprise, we can leverage Tiered Storage and store old data on spinning disks to save on hardware. Regards On Tue, Oct 1, 2019

Re: Cluster sizing for huge dataset

2019-10-01 Thread Julien Laurenceau
Hi, Depending on the use case, you may also consider storage tiering, with fresh data on a hot tier (Cassandra) and older data on a cold tier (Spark/Parquet or Presto/Parquet). It would be a lot more complex, but it may fit the budget more appropriately, and you may reuse some tech already present in your

Re: Cluster sizing for huge dataset

2019-09-30 Thread DuyHai Doan
Thanks all for your replies. The target deployment is on Azure, so with the nice disk snapshot feature, replacing a dead node is easier: no streaming from Cassandra. About compaction overhead, using TWCS with a 1-day bucket and removing read repair and subrange repair should be sufficient. Now the
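As an illustration of those table options, here is a minimal sketch using the Python driver; the contact point and the iot.sensor_readings table are hypothetical, and the read_repair_chance options exist on Cassandra 3.x / DSE (they were removed in 4.0).

```python
# Minimal sketch of the table options described above: TWCS with a 1-day
# bucket and read repair disabled. Names and contact point are made up.
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()

session.execute("""
    ALTER TABLE iot.sensor_readings
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
    }
    AND read_repair_chance = 0.0
    AND dclocal_read_repair_chance = 0.0
""")
```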

Re: Cluster sizing for huge dataset

2019-09-30 Thread Eric Evans
On Sat, Sep 28, 2019 at 8:50 PM Jeff Jirsa wrote: [ ... ] > 2) The 2TB guidance is old and irrelevant for most people, what you really > care about is how fast you can replace the failed machine > > You’d likely be ok going significantly larger than that if you use a few > vnodes, since

Re: Cluster sizing for huge dataset

2019-09-30 Thread Laxmikant Upadhyay
I noticed that compaction overhead has not been taken into account in the capacity planning; I think that is because the compression used is going to compensate for it. Is my assumption correct? On Sun, Sep 29, 2019 at 11:04 PM Jeff Jirsa wrote: > > > > On Sep 29, 2019, at 12:30 AM, DuyHai
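A rough back-of-the-envelope check of that assumption is below; the compression ratio, per-node density and the STCS headroom rule of thumb are assumptions for illustration, not numbers from the thread.

```python
# Back-of-the-envelope check of "compression compensates for compaction overhead".
# All figures below are assumptions for illustration.
raw_per_node_tb = 20.0      # uncompressed data targeted per node (assumed)
zstd_ratio = 0.4            # assume zstd stores data at ~40% of raw size
retention_days = 2 * 365    # two years of roughly constant ingest

on_disk_tb = raw_per_node_tb * zstd_ratio

# With TWCS, a compaction only rewrites the current time window, so the
# transient disk headroom is roughly one day's worth of compressed data,
# not the ~50% free space often kept for size-tiered compaction.
daily_window_gb = on_disk_tb / retention_days * 1024

print(f"on disk per node: ~{on_disk_tb:.1f} TB")
print(f"largest TWCS window to compact: ~{daily_window_gb:.0f} GB")
```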

Re: Cluster sizing for huge dataset

2019-09-29 Thread Jeff Jirsa
> On Sep 29, 2019, at 12:30 AM, DuyHai Doan wrote: > > Thank you Jeff for the hints > > We are targeting to reach 20TB/machine using TWCS and 8 vnodes (using > the new token allocation algo). Also we will try the new zstd > compression. I’d probably still be inclined to run two instances

Re: Cluster sizing for huge dataset

2019-09-29 Thread DuyHai Doan
Thank you Jeff for the hints. We are targeting 20TB/machine using TWCS and 8 vnodes (using the new token allocation algo). Also, we will try the new zstd compression. About transient replication, the underlying trade-offs and semantics are hard for most people to understand (for
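A minimal sketch of switching a (hypothetical) table to that zstd compressor via the Python driver follows; the contact point and table name are made up, and ZstdCompressor is only available on Cassandra 4.0+ / recent DSE builds.

```python
# Hypothetical sketch: enable the zstd compressor mentioned above.
# 64 KB is the default chunk size; shown here only for explicitness.
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()

session.execute("""
    ALTER TABLE iot.sensor_readings
    WITH compression = {
        'class': 'ZstdCompressor',
        'chunk_length_in_kb': 64
    }
""")
```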

Re: Cluster sizing for huge dataset

2019-09-28 Thread Jeff Jirsa
A few random thoughts here: 1) 90 nodes / 900TB in a cluster isn’t that big. A petabyte per cluster is a manageable size. 2) The 2TB guidance is old and irrelevant for most people; what you really care about is how fast you can replace the failed machine. You’d likely be ok going significantly
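To make point 2 concrete, a small sketch of the replacement-time arithmetic is below; the node densities and sustained streaming rates are assumptions, since the real figure depends on stream_throughput_outbound_megabits_per_sec, disks, network and vnode count.

```python
# Rough illustration of "how fast can you replace the failed machine".
# Densities and sustained rates are assumed figures, not measurements.
def replacement_hours(data_tb: float, sustained_mb_per_s: float) -> float:
    """Hours to restream one node's data at a sustained effective rate."""
    return data_tb * 1024 * 1024 / sustained_mb_per_s / 3600

for density_tb in (2, 10, 20):
    for rate in (100, 400):  # MB/s, assumed sustained rates
        print(f"{density_tb:>2} TB/node at {rate} MB/s -> "
              f"~{replacement_hours(density_tb, rate):.0f} h")
```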

Cluster sizing for huge dataset

2019-09-28 Thread DuyHai Doan
Hello users, I'm facing a very challenging exercise: sizing a cluster for a huge dataset. Use case = IoT. Number of sensors: 30 million. Frequency of data: every 10 minutes. Estimated size of a data point: 100 bytes (including clustering columns). Data retention: 2 years. Replication factor: 3 (pretty
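Working through the numbers stated above (uncompressed, before any compaction or other overhead), a quick sketch:

```python
# Quick arithmetic using only the numbers stated above (decimal TB, uncompressed).
sensors = 30_000_000
samples_per_day = 24 * 6          # one reading every 10 minutes
bytes_per_sample = 100
retention_days = 2 * 365
replication_factor = 3

rows = sensors * samples_per_day * retention_days
raw_tb = rows * bytes_per_sample / 1e12
total_tb = raw_tb * replication_factor

print(f"rows stored: {rows:.2e}")                  # ~3.2e12
print(f"one copy, uncompressed: ~{raw_tb:.0f} TB")  # ~315 TB
print(f"with RF=3: ~{total_tb:.0f} TB")             # ~946 TB
```

That is where the roughly 900 TB / 90-node figure discussed upthread comes from, before compression is applied.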