Re: C* data modeling for time series

2018-06-18 Thread mm

Hi,

We're currently evaluating KairosDB for time series, and it looks quite
nice.

https://kairosdb.github.io/

The cool thing about KairosDB is that it uses Cassandra as its storage engine
and provides additional features (mainly a REST-based API for accessing data).

Maybe you can take a look at the schema definition KairosDB uses for
Cassandra and check whether it suits you. (Or use KairosDB directly, as it
stores its data in Cassandra anyway.)


Greetings,
Michael

PS: Oh, and Grafana has a KairosDB connector, so you can test queries and
create dashboards fast.






-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: C* data modeling for time series

2018-06-18 Thread Affan Syed
I have looked at this problem for a good year now. My feeling is that
Cassandra alone, as the sole underlying DB for time series, just does not cut
it.

I am starting to look at C* along with another DB for executing the sort of
queries we want here.

Currently I am evaluating Druid vs. Kudu as this supporting DB. Any
comments from the community? Cassandra would be more for storage and backup,
while the data denormalization effort is taken care of by the other DB.

Thank you

- Affan



Re: C* data modeling for time series

2017-07-26 Thread Jeff Jirsa



The right approach for modeling all queries in Cassandra is to start with the 
SELECTs you'll want to do, and then build a table around them.

If your typical behavior is to query all of the data for a time window, you 
probably want your partition keys to be time windows. For a larger cluster, 
this would give you hotspots (so you may want to rethink it if you grow 
significantly), but for your size where you're already using a majority of the 
cluster for every write, it shouldn't be a big deal.

That, then, would give you a table like:

CREATE TABLE data (
    deviceId int,
    time timestamp,
    field1 text,
    filed2 text,
    field3 text,
    timeBucket text,
    PRIMARY KEY (timeBucket, deviceId, time)
) WITH CLUSTERING ORDER BY (deviceId ASC, time ASC);

Where timeBucket is some date-like string like "2017-07-26-01:00:00", for all 
entries in the first hour of Jul 26 2017.
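
As a minimal sketch of how a writer might derive such a bucket value, the Python below truncates a timestamp to its hour. The function name, the label layout, and the choice to label each bucket by the start of its hour are assumptions for illustration; the label in the message above may follow a different convention. Timestamps are also assumed to be normalized to a single time zone (for example UTC) before bucketing.

```python
from datetime import datetime

# Assumed label layout: one bucket per hour, labelled by the start of the hour.
BUCKET_FORMAT = "%Y-%m-%d-%H:00:00"


def time_bucket(ts: datetime) -> str:
    """Return the hourly bucket label (hypothetical helper) for a timestamp."""
    # strftime drops minutes and seconds because the format hard-codes ":00:00".
    return ts.strftime(BUCKET_FORMAT)


if __name__ == "__main__":
    print(time_bucket(datetime(2017, 7, 26, 1, 30, 15)))
```

Every row written in the same hour gets the same timeBucket value, so all of them land in one partition and can be read with a single-partition query.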

This gives you: 1) a way to query by time (primary use case), and 2) a way to 
query by ID if needed (though you'll need to issue a query for each time bucket 
and aggregate it client side). If 2 is insufficient, you would denormalize and 
create a second table.
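
The "one query per time bucket" pattern for looking up a single device could be sketched roughly as follows. This only builds the statements and their parameters; it does not talk to a cluster. The helper names, the hourly label layout, and the %s placeholder style are assumptions, so adapt them to whatever driver you use.

```python
from datetime import datetime, timedelta

# Assumed hourly bucket layout, labelled by the start of each hour.
BUCKET_FORMAT = "%Y-%m-%d-%H:00:00"

# Placeholder style is an assumption; adjust it to your driver.
QUERY = "SELECT * FROM data WHERE timeBucket = %s AND deviceId = %s"


def buckets_in_range(start: datetime, end: datetime) -> list:
    """Return the hourly bucket labels that cover [start, end]."""
    current = start.replace(minute=0, second=0, microsecond=0)
    labels = []
    while current <= end:
        labels.append(current.strftime(BUCKET_FORMAT))
        current += timedelta(hours=1)
    return labels


def device_queries(device_id: int, start: datetime, end: datetime) -> list:
    """Return one (statement, parameters) pair per bucket in the range."""
    return [(QUERY, (label, device_id)) for label in buckets_in_range(start, end)]


if __name__ == "__main__":
    for statement, params in device_queries(
        42, datetime(2017, 7, 26, 0, 15), datetime(2017, 7, 26, 2, 5)
    ):
        print(statement, params)
```

Each pair would be executed separately and the rows merged on the client, which is the client-side aggregation mentioned above.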








Re: C* data modeling for time series

2017-07-26 Thread CPC
If all of your queries are like this (I mean, get all devices for a given time
range), Hadoop would be more appropriate, since those are analytical queries.

Anyway, to query such data with the Spark Cassandra connector, your partition
key could include the day and a hash of your deviceid as a pseudo partition key
column (this could be abs(murmur(deviceid) % 500); we add this column to
distribute data more evenly). When you want to query a time range, you should
generate an RDD of Tuple2 with all days that intersect that range, and for each
day your RDD should include the range 0..499. Like:

(20170726,0)
(20170726,1)
.
.
.
(20170726,499)

Then you should join this RDD with your table using the
joinWithCassandraTable method.
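
A rough sketch of the key generation described above, in plain Python rather than Spark. The function names are hypothetical, and zlib.crc32 is used only as a stand-in for the murmur hash mentioned above, since the Python standard library has no murmur implementation. In a real job the resulting list would be turned into an RDD and joined with the table.

```python
import zlib
from datetime import date, timedelta

NUM_BUCKETS = 500  # size of the pseudo partition key space suggested above


def pseudo_bucket(device_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a device id to a bucket in 0..num_buckets-1.

    crc32 is a stand-in for murmur here (an assumption for this sketch).
    It is non-negative in Python 3, so no abs() is needed.
    """
    return zlib.crc32(str(device_id).encode("utf-8")) % num_buckets


def join_keys(first_day: date, last_day: date, num_buckets: int = NUM_BUCKETS) -> list:
    """Return every (yyyymmdd, bucket) pair for the days in [first_day, last_day]."""
    keys = []
    day = first_day
    while day <= last_day:
        day_key = int(day.strftime("%Y%m%d"))
        keys.extend((day_key, bucket) for bucket in range(num_buckets))
        day += timedelta(days=1)
    return keys


if __name__ == "__main__":
    keys = join_keys(date(2017, 7, 26), date(2017, 7, 26))
    print(keys[0], keys[-1], len(keys))
```

The writer and the reader must agree on the hash and the bucket count; otherwise the generated keys will not match the stored partitions.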



Re: C* data modeling for time series

2017-07-26 Thread Junaid Nasir
All devices.
After selecting the data, I group it and perform other operations (such as sum
and avg) on the fields, and then display the results to compare how devices are
doing relative to each other.



Re: C* data modeling for time series

2017-07-26 Thread CPC
Hi Junaid,

Given a time range, do you want to fetch all devices or a specific device?

On Jul 26, 2017 3:15 PM, "Junaid Nasir"  wrote:

I have a C* cluster (3 nodes) with some 60 GB of data (replication factor 2).
When I started using C*, coming from a SQL background, I didn't give much
thought to modeling the data correctly. So what I did was:

CREATE TABLE data (
    deviceId int,
    time timestamp,
    field1 text,
    filed2 text,
    field3 text,
    PRIMARY KEY (deviceId, time)
) WITH CLUSTERING ORDER BY (time ASC);

But most of the queries I run (using Spark and the DataStax connector) compare
data from different devices over some time period. For example:

SELECT * FROM data WHERE time > '2017-07-01 12:00:00';

From my understanding, this runs a full table scan, as shown in the Spark UI
(from the DAG visualization: "Scan
org.apache.spark.sql.cassandra.CassandraSourceRelation@32bb7d65"), meaning
C* will read all the data and then filter by time. Spark jobs run for hours,
even for smaller time frames.

What is the right approach to data modeling for such queries? I want to
get a general idea of things to look for when modeling such data.
I really appreciate all the help from this community :). If you need any
extra details, please ask me here.

Regards,
Junaid