Hello,

I'm currently enrolled in a master's degree and my thesis project involves using
Big Data tools in the context of Smart Grid applications. I explored several
storage solutions and found Cassandra to be a good fit for my problem.
The data is mostly time series data coming from multiple PLCs, currently
captured and stored by proprietary SCADA software connected to an MSSQL server.
Reading about the C* storage engine and how time series should be modelled, it
is clear that I have to use some sort of time bucketing to split the data
across multiple partitions.

Here is the issue: in the MSSQL server, each PLC has very wide tables (5 at the
moment for one building) with around 36 columns of data being collected every
10 seconds. Queries fetch as many as 15 columns at a time, with time ranges
varying between 1 hour and a whole month. A simple one-to-one mapping of the
MSSQL tables to C* is not recommended due to the way C* 2.x stores its data.

I took the DS220: Data Modelling course, which presents two formulas for
estimating a partition size based on the table design.

    Nv = Nr × (Nc − Npk − Ns) + Ns
    Ps ≈ sizeOf(partition key cols) + sizeOf(static cols)
         + Nr × (sizeOf(clustering cols) + sizeOf(regular cols)) + 8 × Nv
Note: This Ps formula does not account for column name length, TTLs, counter 
columns, and additional overhead.

If my calculations are correct, with a table such as the one below and a time
resolution of 10 seconds, the Ps (partition size) would be just shy of 10 MB
(the often-recommended limit) if I partitioned it weekly.

CREATE TABLE TEST (
    BuildingAnalyzer text,
    Time timestamp,
    P1 double,
    P2 double,
    P3 double,
    Acte1 int,
    Acte2 int,
    Acte3 int,
    PRIMARY KEY (BuildingAnalyzer, Time)
);
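For what it's worth, here is how I worked through the numbers for a weekly
bucket. The byte sizes per CQL type and the length of the text key are my own
assumptions:

```python
# DS220-style 2.x partition-size estimate for the TEST table above,
# weekly bucket at 10-second resolution.

ROWS_PER_WEEK = 7 * 24 * 3600 // 10          # Nr = 60,480 rows

# Per-column sizes in bytes (text key length is a guess)
partition_key = 20        # BuildingAnalyzer (text, assumed ~20 chars)
clustering = 8            # Time (timestamp)
regulars = 3 * 8 + 3 * 4  # P1..P3 (double) + Acte1..Acte3 (int)

n_cols, n_pk, n_static = 8, 1, 0
nv = ROWS_PER_WEEK * (n_cols - n_pk - n_static) + n_static   # number of values

ps = (partition_key
      + ROWS_PER_WEEK * (clustering + regulars)
      + 8 * nv)           # one 8-byte write timestamp per value (2.x)

print(f"Nv = {nv:,}, Ps ~= {ps / 1e6:.1f} MB")
```

That lands at roughly 6 MB per weekly partition, hence "shy of 10 MB".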

However, as of C* 3.0, a major refactor of the storage engine brought
significant storage savings. From what I could gather in [1], clustering
columns and column names are no longer repeated for each value in a record
and, among other things, the timestamps for conflict resolution (the 8 × Nv
term of the 2nd formula) can be stored only once per record if all cells share
the same value, and are encoded as varints.
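To get a feel for why that matters, I sketched the timestamp cost under both
schemes. This is a rough illustration of the idea, not the actual 3.0 on-disk
format, and the varint encoding here is the generic 7-bits-per-byte kind:

```python
# Rough illustration of why the 8 x Nv timestamp term shrinks in 3.0:
# one shared timestamp per row, stored as a varint delta against a
# per-partition minimum, instead of 8 bytes per cell.

def varint_len(n: int) -> int:
    """Bytes needed for an unsigned varint (7 payload bits per byte)."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length

rows, cells_per_row = 60_480, 7
old_cost = 8 * rows * cells_per_row      # 2.x: 8-byte timestamp per cell

# 3.0-style: if all cells of a row share one timestamp, it is written once
# per row; the worst-case microsecond delta within a one-week partition
# still fits in 6 varint bytes.
worst_delta_us = 7 * 24 * 3600 * 1_000_000
new_cost = rows * varint_len(worst_delta_us)

print(old_cost, new_cost)
```

Even with the worst-case delta for every row, the timestamp overhead drops by
roughly an order of magnitude.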

I also read [2], which explains the storage format in intricate detail, but it
adds too much complexity for a simple estimation formula.

Is there any way to estimate the partition size of a table with formulas
similar to the ones above?
Should I instead model my tables the way metric collection is usually done (a
narrow table with "parametername" and "value" columns)?
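For comparison, I also ran the 2.x estimate against a hypothetical narrow
variant — the schema, key layout, and byte sizes below are all my assumptions,
e.g. PRIMARY KEY ((BuildingAnalyzer, parametername), Time) with a single
"value" column:

```python
# 2.x estimate for an assumed narrow "metric collection" schema:
#   PRIMARY KEY ((BuildingAnalyzer, parametername), Time), value double.
# One partition per (building, parameter) per weekly bucket.

ROWS_PER_WEEK = 7 * 24 * 3600 // 10   # Nr = 60,480

pk_size = 20 + 10     # BuildingAnalyzer (~20 chars) + parametername (~10 chars), assumed
clustering = 8        # Time (timestamp)
value = 8             # value stored as double

n_cols, n_pk, n_static = 4, 2, 0
nv = ROWS_PER_WEEK * (n_cols - n_pk - n_static) + n_static

ps = pk_size + ROWS_PER_WEEK * (clustering + value) + 8 * nv
print(f"Ps ~= {ps / 1e6:.1f} MB per (building, parameter) partition")
```

Under the 2.x formula this splits each wide partition into several much
smaller ones (about 2 MB each here), at the cost of more partitions to read
per query.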


[1]    http://www.datastax.com/2015/12/storage-engine-30

[2]    
http://thelastpickle.com/blog/2016/03/04/introductiont-to-the-apache-cassandra-3-storage-engine.html

Sorry for the long wall of text,
Best regards,
Gil Pinheiro.
