Re: Time series data model and tombstones

2017-02-08 Thread DuyHai Doan
Thanks for the update. Good to know that TWCS gives you more stability.

On Wed, Feb 8, 2017 at 6:20 PM, John Sanda  wrote:

> I wanted to provide a quick update. I was able to patch one of the
> environments that is hitting the tombstone problem. It has been running
> TWCS for five days now, and things are stable so far. I also had a patch to
> the application code to implement date partitioning ready to go, but I
> wanted to see how things went with only making the compaction changes.
>
> On Sun, Jan 29, 2017 at 4:05 PM, DuyHai Doan  wrote:
>
>> In theory, you're right and Cassandra should possibly skip reading cells
>> having time < 50. But it's all theory; in practice Cassandra reads chunks of
>> xxx kilobytes worth of data (don't remember the exact value of xxx, maybe
>> 64k or far less) so you may end up reading tombstones.
>>
>> On Sun, Jan 29, 2017 at 9:24 PM, John Sanda  wrote:
>>
>>> Thanks for the clarification. Let's say I have a partition in an SSTable
>>> where the values of time range from 100 to 10 and everything < 50 is
>>> expired. If I do a query with time < 100 and time >= 50, are there
>>> scenarios in which Cassandra will have to read cells where time < 50? In
>>> particular I am wondering if compression might have any effect.
>>>
>>> On Sun, Jan 29, 2017 at 3:01 PM DuyHai Doan 
>>> wrote:
>>>
 "Should the data be sorted by my time column regardless of the
 compaction strategy" --> It does

 What I mean is that an old "chunk" of expired data in SSTABLE-12 may be
 compacted together with a new chunk of SSTABLE-2 containing fresh data so
 the new resulting SSTable will contain tombstones AND fresh data inside
 the same partition, but of course sorted by clustering column "time".

 On Sun, Jan 29, 2017 at 8:55 PM, John Sanda 
 wrote:

 Since STCS does not sort data based on timestamp, your wide partition
 may span over multiple SSTables and inside each SSTable, old data (+
 tombstones) may sit on the same partition as newer data.


 Should the data be sorted by my time column regardless of the
 compaction strategy? I didn't think that the column timestamp came into
 play with respect to sorting. I have been able to review some SSTables with
 sstablemetadata and I can see that old/expired data is definitely living
 with live data.


 On Sun, Jan 29, 2017 at 2:38 PM, DuyHai Doan 
 wrote:

 Ok so give it a try with TWCS. Since STCS does not sort data based on
 timestamp, your wide partition may span over multiple SSTables and inside
 each SSTable, old data (+ tombstones) may sit on the same partition as
 newer data.

 When reading by slice, even if you request fresh data, Cassandra
 has to scan over a lot of tombstones to fetch the correct range of data, thus
 your issue

 On Sun, Jan 29, 2017 at 8:19 PM, John Sanda 
 wrote:

 It was with STCS. It was on a 2.x version before TWCS was available.

 On Sun, Jan 29, 2017 at 10:58 AM DuyHai Doan 
 wrote:

 Did you get this overwhelming tombstone behavior with STCS or with TWCS?

 If you're using DTCS, beware of its weird behavior and tricky
 configuration.

 On Sun, Jan 29, 2017 at 3:52 PM, John Sanda 
 wrote:

 Your partitioning key is text. If you have multiple entries per id you
 are likely hitting older cells that have expired. Descending only affects
 how the data is stored on disk, if you have to read the whole partition to
 find whichever time you are querying for you could potentially hit
 tombstones in other SSTables that contain the same "id". As mentioned
 previously, you need to add a time bucket to your partitioning key and
 definitely use DTCS/TWCS.


 As I mentioned previously, the UI only queries recent data, e.g., the
 past hour, past two hours, past day, past week. The UI does not query for
 anything older than the TTL which is 7 days. My understanding and
 expectation was that Cassandra would only scan live cells. The UI is a
 separate application that I do not maintain, so I am not 100% certain about
 the queries. I have been told that it does not query for anything older
 than 7 days.

 On Sun, Jan 29, 2017 at 4:14 AM, kurt greaves 
 wrote:


 Your partitioning key is text. If you have multiple entries per id you
 are likely hitting older cells that have expired. Descending only affects
 how the data is stored on disk, if you have to read the whole partition to
 find whichever time you are querying for you could potentially hit
 tombstones in other SSTables that contain the same "id". As mentioned

Re: Time series data model and tombstones

2017-02-08 Thread John Sanda
I wanted to provide a quick update. I was able to patch one of the
environments that is hitting the tombstone problem. It has been running
TWCS for five days now, and things are stable so far. I also had a patch to
the application code to implement date partitioning ready to go, but I
wanted to see how things went with only making the compaction changes.

On Sun, Jan 29, 2017 at 4:05 PM, DuyHai Doan  wrote:

> In theory, you're right and Cassandra should possibly skip reading cells
> having time < 50. But it's all theory; in practice Cassandra reads chunks of
> xxx kilobytes worth of data (don't remember the exact value of xxx, maybe
> 64k or far less) so you may end up reading tombstones.
>
> On Sun, Jan 29, 2017 at 9:24 PM, John Sanda  wrote:
>
>> Thanks for the clarification. Let's say I have a partition in an SSTable
>> where the values of time range from 100 to 10 and everything < 50 is
>> expired. If I do a query with time < 100 and time >= 50, are there
>> scenarios in which Cassandra will have to read cells where time < 50? In
>> particular I am wondering if compression might have any effect.
>>
>> On Sun, Jan 29, 2017 at 3:01 PM DuyHai Doan  wrote:
>>
>>> "Should the data be sorted by my time column regardless of the
>>> compaction strategy" --> It does
>>>
>>> What I mean is that an old "chunk" of expired data in SSTABLE-12 may be
>>> compacted together with a new chunk of SSTABLE-2 containing fresh data so
>>> the new resulting SSTable will contain tombstones AND fresh data inside
>>> the same partition, but of course sorted by clustering column "time".
>>>
>>> On Sun, Jan 29, 2017 at 8:55 PM, John Sanda 
>>> wrote:
>>>
>>> Since STCS does not sort data based on timestamp, your wide partition
>>> may span over multiple SSTables and inside each SSTable, old data (+
>>> tombstones) may sit on the same partition as newer data.
>>>
>>>
>>> Should the data be sorted by my time column regardless of the compaction
>>> strategy? I didn't think that the column timestamp came into play with
>>> respect to sorting. I have been able to review some SSTables with
>>> sstablemetadata and I can see that old/expired data is definitely living
>>> with live data.
>>>
>>>
>>> On Sun, Jan 29, 2017 at 2:38 PM, DuyHai Doan 
>>> wrote:
>>>
>>> Ok so give it a try with TWCS. Since STCS does not sort data based on
>>> timestamp, your wide partition may span over multiple SSTables and inside
>>> each SSTable, old data (+ tombstones) may sit on the same partition as
>>> newer data.
>>>
>>> When reading by slice, even if you request fresh data, Cassandra has
>>> to scan over a lot of tombstones to fetch the correct range of data, thus your
>>> issue
>>>
>>> On Sun, Jan 29, 2017 at 8:19 PM, John Sanda 
>>> wrote:
>>>
>>> It was with STCS. It was on a 2.x version before TWCS was available.
>>>
>>> On Sun, Jan 29, 2017 at 10:58 AM DuyHai Doan 
>>> wrote:
>>>
>>> Did you get this overwhelming tombstone behavior with STCS or with TWCS?
>>>
>>> If you're using DTCS, beware of its weird behavior and tricky
>>> configuration.
>>>
>>> On Sun, Jan 29, 2017 at 3:52 PM, John Sanda 
>>> wrote:
>>>
>>> Your partitioning key is text. If you have multiple entries per id you
>>> are likely hitting older cells that have expired. Descending only affects
>>> how the data is stored on disk, if you have to read the whole partition to
>>> find whichever time you are querying for you could potentially hit
>>> tombstones in other SSTables that contain the same "id". As mentioned
>>> previously, you need to add a time bucket to your partitioning key and
>>> definitely use DTCS/TWCS.
>>>
>>>
>>> As I mentioned previously, the UI only queries recent data, e.g., the
>>> past hour, past two hours, past day, past week. The UI does not query for
>>> anything older than the TTL which is 7 days. My understanding and
>>> expectation was that Cassandra would only scan live cells. The UI is a
>>> separate application that I do not maintain, so I am not 100% certain about
>>> the queries. I have been told that it does not query for anything older
>>> than 7 days.
>>>
>>> On Sun, Jan 29, 2017 at 4:14 AM, kurt greaves 
>>> wrote:
>>>
>>>
>>> Your partitioning key is text. If you have multiple entries per id you
>>> are likely hitting older cells that have expired. Descending only affects
>>> how the data is stored on disk, if you have to read the whole partition to
>>> find whichever time you are querying for you could potentially hit
>>> tombstones in other SSTables that contain the same "id". As mentioned
>>> previously, you need to add a time bucket to your partitioning key and
>>> definitely use DTCS/TWCS.
>>>
>>>
>>> --
>>>
>>> - John

Re: Time series data model and tombstones

2017-01-29 Thread DuyHai Doan
In theory, you're right and Cassandra should possibly skip reading cells
having time < 50. But it's all theory; in practice Cassandra reads chunks of
xxx kilobytes worth of data (don't remember the exact value of xxx, maybe
64k or far less) so you may end up reading tombstones.
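
The chunk referred to here is the compression chunk: reads decompress whole chunks,
and the chunk length is a per-table option (64 KB is the usual default). As a sketch
only, it can be lowered for a read-heavy table like this one (option names as in
3.x; 2.x uses 'sstable_compression' and 'chunk_length_kb' instead, and the value 16
is purely illustrative):

ALTER TABLE metrics
  WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 16};

Smaller chunks mean less data decompressed per read, at the cost of a larger
compression offset map; it is a tuning knob, not a fix for the tombstone scanning
itself.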

On Sun, Jan 29, 2017 at 9:24 PM, John Sanda  wrote:

> Thanks for the clarification. Let's say I have a partition in an SSTable
> where the values of time range from 100 to 10 and everything < 50 is
> expired. If I do a query with time < 100 and time >= 50, are there
> scenarios in which Cassandra will have to read cells where time < 50? In
> particular I am wondering if compression might have any effect.
>
> On Sun, Jan 29, 2017 at 3:01 PM DuyHai Doan  wrote:
>
>> "Should the data be sorted by my time column regardless of the
>> compaction strategy" --> It does
>>
>> What I mean is that an old "chunk" of expired data in SSTABLE-12 may be
>> compacted together with a new chunk of SSTABLE-2 containing fresh data so
>> the new resulting SSTable will contain tombstones AND fresh data inside
>> the same partition, but of course sorted by clustering column "time".
>>
>> On Sun, Jan 29, 2017 at 8:55 PM, John Sanda  wrote:
>>
>> Since STCS does not sort data based on timestamp, your wide partition may
>> span over multiple SSTables and inside each SSTable, old data (+
>> tombstones) may sit on the same partition as newer data.
>>
>>
>> Should the data be sorted by my time column regardless of the compaction
>> strategy? I didn't think that the column timestamp came into play with
>> respect to sorting. I have been able to review some SSTables with
>> sstablemetadata and I can see that old/expired data is definitely living
>> with live data.
>>
>>
>> On Sun, Jan 29, 2017 at 2:38 PM, DuyHai Doan 
>> wrote:
>>
>> Ok so give it a try with TWCS. Since STCS does not sort data based on
>> timestamp, your wide partition may span over multiple SSTables and inside
>> each SSTable, old data (+ tombstones) may sit on the same partition as
>> newer data.
>>
>> When reading by slice, even if you request fresh data, Cassandra has
>> to scan over a lot of tombstones to fetch the correct range of data, thus your
>> issue
>>
>> On Sun, Jan 29, 2017 at 8:19 PM, John Sanda  wrote:
>>
>> It was with STCS. It was on a 2.x version before TWCS was available.
>>
>> On Sun, Jan 29, 2017 at 10:58 AM DuyHai Doan 
>> wrote:
>>
>> Did you get this overwhelming tombstone behavior with STCS or with TWCS?
>>
>> If you're using DTCS, beware of its weird behavior and tricky
>> configuration.
>>
>> On Sun, Jan 29, 2017 at 3:52 PM, John Sanda  wrote:
>>
>> Your partitioning key is text. If you have multiple entries per id you
>> are likely hitting older cells that have expired. Descending only affects
>> how the data is stored on disk, if you have to read the whole partition to
>> find whichever time you are querying for you could potentially hit
>> tombstones in other SSTables that contain the same "id". As mentioned
>> previously, you need to add a time bucket to your partitioning key and
>> definitely use DTCS/TWCS.
>>
>>
>> As I mentioned previously, the UI only queries recent data, e.g., the
>> past hour, past two hours, past day, past week. The UI does not query for
>> anything older than the TTL which is 7 days. My understanding and
>> expectation was that Cassandra would only scan live cells. The UI is a
>> separate application that I do not maintain, so I am not 100% certain about
>> the queries. I have been told that it does not query for anything older
>> than 7 days.
>>
>> On Sun, Jan 29, 2017 at 4:14 AM, kurt greaves 
>> wrote:
>>
>>
>> Your partitioning key is text. If you have multiple entries per id you
>> are likely hitting older cells that have expired. Descending only affects
>> how the data is stored on disk, if you have to read the whole partition to
>> find whichever time you are querying for you could potentially hit
>> tombstones in other SSTables that contain the same "id". As mentioned
>> previously, you need to add a time bucket to your partitioning key and
>> definitely use DTCS/TWCS.
>>
>>
>> --
>>
>> - John


Re: Time series data model and tombstones

2017-01-29 Thread Jonathan Haddad
Check out our post on how to use TWCS before 3.0.

http://thelastpickle.com/blog/2017/01/10/twcs-part2.html
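
Once the TWCS classes are available to the nodes (bundled from 3.0 onwards, or built
separately for 2.x as the post above describes), switching an existing table over is
a single ALTER. A sketch, with the window sized to match the daily bucketing
discussed in this thread:

-- on a 2.x build the class may need to be fully qualified to match the
-- package name of the separately built jar
ALTER TABLE metrics
  WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': '1'};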

On Sun, Jan 29, 2017 at 11:20 AM John Sanda  wrote:

> It was with STCS. It was on a 2.x version before TWCS was available.
>
> On Sun, Jan 29, 2017 at 10:58 AM DuyHai Doan  wrote:
>
> Did you get this overwhelming tombstone behavior with STCS or with TWCS?
>
> If you're using DTCS, beware of its weird behavior and tricky
> configuration.
>
> On Sun, Jan 29, 2017 at 3:52 PM, John Sanda  wrote:
>
> Your partitioning key is text. If you have multiple entries per id you are
> likely hitting older cells that have expired. Descending only affects how
> the data is stored on disk, if you have to read the whole partition to find
> whichever time you are querying for you could potentially hit tombstones in
> other SSTables that contain the same "id". As mentioned previously, you
> need to add a time bucket to your partitioning key and definitely use
> DTCS/TWCS.
>
>
> As I mentioned previously, the UI only queries recent data, e.g., the past
> hour, past two hours, past day, past week. The UI does not query for
> anything older than the TTL which is 7 days. My understanding and
> expectation was that Cassandra would only scan live cells. The UI is a
> separate application that I do not maintain, so I am not 100% certain about
> the queries. I have been told that it does not query for anything older
> than 7 days.
>
> On Sun, Jan 29, 2017 at 4:14 AM, kurt greaves 
> wrote:
>
>
> Your partitioning key is text. If you have multiple entries per id you are
> likely hitting older cells that have expired. Descending only affects how
> the data is stored on disk, if you have to read the whole partition to find
> whichever time you are querying for you could potentially hit tombstones in
> other SSTables that contain the same "id". As mentioned previously, you
> need to add a time bucket to your partitioning key and definitely use
> DTCS/TWCS.
>
>
>
>
>
> --
>
> - John
>


Re: Time series data model and tombstones

2017-01-29 Thread John Sanda
Thanks for the clarification. Let's say I have a partition in an SSTable
where the values of time range from 100 to 10 and everything < 50 is
expired. If I do a query with time < 100 and time >= 50, are there
scenarios in which Cassandra will have to read cells where time < 50? In
particular I am wondering if compression might have any effect.

On Sun, Jan 29, 2017 at 3:01 PM DuyHai Doan  wrote:

> "Should the data be sorted by my time column regardless of the compaction
> strategy" --> It does
>
> What I mean is that an old "chunk" of expired data in SSTABLE-12 may be
> compacted together with a new chunk of SSTABLE-2 containing fresh data so
> the new resulting SSTable will contain tombstones AND fresh data inside
> the same partition, but of course sorted by clustering column "time".
>
> On Sun, Jan 29, 2017 at 8:55 PM, John Sanda  wrote:
>
> Since STCS does not sort data based on timestamp, your wide partition may
> span over multiple SSTables and inside each SSTable, old data (+
> tombstones) may sit on the same partition as newer data.
>
>
> Should the data be sorted by my time column regardless of the compaction
> strategy? I didn't think that the column timestamp came into play with
> respect to sorting. I have been able to review some SSTables with
> sstablemetadata and I can see that old/expired data is definitely living
> with live data.
>
>
> On Sun, Jan 29, 2017 at 2:38 PM, DuyHai Doan  wrote:
>
> Ok so give it a try with TWCS. Since STCS does not sort data based on
> timestamp, your wide partition may span over multiple SSTables and inside
> each SSTable, old data (+ tombstones) may sit on the same partition as
> newer data.
>
> When reading by slice, even if you request fresh data, Cassandra has
> to scan over a lot of tombstones to fetch the correct range of data, thus your
> issue
>
> On Sun, Jan 29, 2017 at 8:19 PM, John Sanda  wrote:
>
> It was with STCS. It was on a 2.x version before TWCS was available.
>
> On Sun, Jan 29, 2017 at 10:58 AM DuyHai Doan  wrote:
>
> Did you get this overwhelming tombstone behavior with STCS or with TWCS?
>
> If you're using DTCS, beware of its weird behavior and tricky
> configuration.
>
> On Sun, Jan 29, 2017 at 3:52 PM, John Sanda  wrote:
>
> Your partitioning key is text. If you have multiple entries per id you are
> likely hitting older cells that have expired. Descending only affects how
> the data is stored on disk, if you have to read the whole partition to find
> whichever time you are querying for you could potentially hit tombstones in
> other SSTables that contain the same "id". As mentioned previously, you
> need to add a time bucket to your partitioning key and definitely use
> DTCS/TWCS.
>
>
> As I mentioned previously, the UI only queries recent data, e.g., the past
> hour, past two hours, past day, past week. The UI does not query for
> anything older than the TTL which is 7 days. My understanding and
> expectation was that Cassandra would only scan live cells. The UI is a
> separate application that I do not maintain, so I am not 100% certain about
> the queries. I have been told that it does not query for anything older
> than 7 days.
>
> On Sun, Jan 29, 2017 at 4:14 AM, kurt greaves 
> wrote:
>
>
> Your partitioning key is text. If you have multiple entries per id you are
> likely hitting older cells that have expired. Descending only affects how
> the data is stored on disk, if you have to read the whole partition to find
> whichever time you are querying for you could potentially hit tombstones in
> other SSTables that contain the same "id". As mentioned previously, you
> need to add a time bucket to your partitioning key and definitely use
> DTCS/TWCS.
>
>
> --
>
> - John


Re: Time series data model and tombstones

2017-01-29 Thread DuyHai Doan
"Should the data be sorted by my time column regardless of the compaction
strategy" --> It does

What I mean is that an old "chunk" of expired data in SSTABLE-12 may be
compacted together with a new chunk of SSTABLE-2 containing fresh data so
the new resulting SSTable will contain tombstones AND fresh data inside
the same partition, but of course sorted by clustering column "time".

On Sun, Jan 29, 2017 at 8:55 PM, John Sanda  wrote:

> Since STCS does not sort data based on timestamp, your wide partition may
>> span over multiple SSTables and inside each SSTable, old data (+
>> tombstones) may sit on the same partition as newer data.
>
>
> Should the data be sorted by my time column regardless of the compaction
> strategy? I didn't think that the column timestamp came into play with
> respect to sorting. I have been able to review some SSTables with
> sstablemetadata and I can see that old/expired data is definitely living
> with live data.
>
>
> On Sun, Jan 29, 2017 at 2:38 PM, DuyHai Doan  wrote:
>
>> Ok so give it a try with TWCS. Since STCS does not sort data based on
>> timestamp, your wide partition may span over multiple SSTables and inside
>> each SSTable, old data (+ tombstones) may sit on the same partition as
>> newer data.
>>
>> When reading by slice, even if you request fresh data, Cassandra has
>> to scan over a lot of tombstones to fetch the correct range of data, thus your
>> issue
>>
>> On Sun, Jan 29, 2017 at 8:19 PM, John Sanda  wrote:
>>
>>> It was with STCS. It was on a 2.x version before TWCS was available.
>>>
>>> On Sun, Jan 29, 2017 at 10:58 AM DuyHai Doan 
>>> wrote:
>>>
 Did you get this overwhelming tombstone behavior with STCS or with TWCS?

 If you're using DTCS, beware of its weird behavior and tricky
 configuration.

 On Sun, Jan 29, 2017 at 3:52 PM, John Sanda 
 wrote:

 Your partitioning key is text. If you have multiple entries per id you
 are likely hitting older cells that have expired. Descending only affects
 how the data is stored on disk, if you have to read the whole partition to
 find whichever time you are querying for you could potentially hit
 tombstones in other SSTables that contain the same "id". As mentioned
 previously, you need to add a time bucket to your partitioning key and
 definitely use DTCS/TWCS.


 As I mentioned previously, the UI only queries recent data, e.g., the
 past hour, past two hours, past day, past week. The UI does not query for
 anything older than the TTL which is 7 days. My understanding and
 expectation was that Cassandra would only scan live cells. The UI is a
 separate application that I do not maintain, so I am not 100% certain about
 the queries. I have been told that it does not query for anything older
 than 7 days.

 On Sun, Jan 29, 2017 at 4:14 AM, kurt greaves 
 wrote:


 Your partitioning key is text. If you have multiple entries per id you
 are likely hitting older cells that have expired. Descending only affects
 how the data is stored on disk, if you have to read the whole partition to
 find whichever time you are querying for you could potentially hit
 tombstones in other SSTables that contain the same "id". As mentioned
 previously, you need to add a time bucket to your partitioning key and
 definitely use DTCS/TWCS.





 --

 - John
>>
>
>
> --
>
> - John
>


Re: Time series data model and tombstones

2017-01-29 Thread John Sanda
>
> Since STCS does not sort data based on timestamp, your wide partition may
> span over multiple SSTables and inside each SSTable, old data (+
> tombstones) may sit on the same partition as newer data.


Should the data be sorted by my time column regardless of the compaction
strategy? I didn't think that the column timestamp came into play with
respect to sorting. I have been able to review some SSTables with
sstablemetadata and I can see that old/expired data is definitely living
with live data.


On Sun, Jan 29, 2017 at 2:38 PM, DuyHai Doan  wrote:

> Ok so give it a try with TWCS. Since STCS does not sort data based on
> timestamp, your wide partition may span over multiple SSTables and inside
> each SSTable, old data (+ tombstones) may sit on the same partition as
> newer data.
>
> When reading by slice, even if you request fresh data, Cassandra has
> to scan over a lot of tombstones to fetch the correct range of data, thus your
> issue
>
> On Sun, Jan 29, 2017 at 8:19 PM, John Sanda  wrote:
>
>> It was with STCS. It was on a 2.x version before TWCS was available.
>>
>> On Sun, Jan 29, 2017 at 10:58 AM DuyHai Doan 
>> wrote:
>>
>>> Did you get this overwhelming tombstone behavior with STCS or with TWCS?
>>>
>>> If you're using DTCS, beware of its weird behavior and tricky
>>> configuration.
>>>
>>> On Sun, Jan 29, 2017 at 3:52 PM, John Sanda 
>>> wrote:
>>>
>>> Your partitioning key is text. If you have multiple entries per id you
>>> are likely hitting older cells that have expired. Descending only affects
>>> how the data is stored on disk, if you have to read the whole partition to
>>> find whichever time you are querying for you could potentially hit
>>> tombstones in other SSTables that contain the same "id". As mentioned
>>> previously, you need to add a time bucket to your partitioning key and
>>> definitely use DTCS/TWCS.
>>>
>>>
>>> As I mentioned previously, the UI only queries recent data, e.g., the
>>> past hour, past two hours, past day, past week. The UI does not query for
>>> anything older than the TTL which is 7 days. My understanding and
>>> expectation was that Cassandra would only scan live cells. The UI is a
>>> separate application that I do not maintain, so I am not 100% certain about
>>> the queries. I have been told that it does not query for anything older
>>> than 7 days.
>>>
>>> On Sun, Jan 29, 2017 at 4:14 AM, kurt greaves 
>>> wrote:
>>>
>>>
>>> Your partitioning key is text. If you have multiple entries per id you
>>> are likely hitting older cells that have expired. Descending only affects
>>> how the data is stored on disk, if you have to read the whole partition to
>>> find whichever time you are querying for you could potentially hit
>>> tombstones in other SSTables that contain the same "id". As mentioned
>>> previously, you need to add a time bucket to your partitioning key and
>>> definitely use DTCS/TWCS.
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> - John
>>>
>


-- 

- John


Re: Time series data model and tombstones

2017-01-29 Thread DuyHai Doan
Ok so give it a try with TWCS. Since STCS does not sort data based on
timestamp, your wide partition may span over multiple SSTables and inside
each SSTable, old data (+ tombstones) may sit on the same partition as
newer data.

When reading by slice, even if you request fresh data, Cassandra has to
scan over a lot of tombstones to fetch the correct range of data, thus your
issue

On Sun, Jan 29, 2017 at 8:19 PM, John Sanda  wrote:

> It was with STCS. It was on a 2.x version before TWCS was available.
>
> On Sun, Jan 29, 2017 at 10:58 AM DuyHai Doan  wrote:
>
>> Did you get this overwhelming tombstone behavior with STCS or with TWCS?
>>
>> If you're using DTCS, beware of its weird behavior and tricky
>> configuration.
>>
>> On Sun, Jan 29, 2017 at 3:52 PM, John Sanda  wrote:
>>
>> Your partitioning key is text. If you have multiple entries per id you
>> are likely hitting older cells that have expired. Descending only affects
>> how the data is stored on disk, if you have to read the whole partition to
>> find whichever time you are querying for you could potentially hit
>> tombstones in other SSTables that contain the same "id". As mentioned
>> previously, you need to add a time bucket to your partitioning key and
>> definitely use DTCS/TWCS.
>>
>>
>> As I mentioned previously, the UI only queries recent data, e.g., the
>> past hour, past two hours, past day, past week. The UI does not query for
>> anything older than the TTL which is 7 days. My understanding and
>> expectation was that Cassandra would only scan live cells. The UI is a
>> separate application that I do not maintain, so I am not 100% certain about
>> the queries. I have been told that it does not query for anything older
>> than 7 days.
>>
>> On Sun, Jan 29, 2017 at 4:14 AM, kurt greaves 
>> wrote:
>>
>>
>> Your partitioning key is text. If you have multiple entries per id you
>> are likely hitting older cells that have expired. Descending only affects
>> how the data is stored on disk, if you have to read the whole partition to
>> find whichever time you are querying for you could potentially hit
>> tombstones in other SSTables that contain the same "id". As mentioned
>> previously, you need to add a time bucket to your partitioning key and
>> definitely use DTCS/TWCS.
>>
>>
>>
>>
>>
>> --
>>
>> - John
>>


Re: Time series data model and tombstones

2017-01-29 Thread John Sanda
It was with STCS. It was on a 2.x version before TWCS was available.

On Sun, Jan 29, 2017 at 10:58 AM DuyHai Doan  wrote:

> Did you get this overwhelming tombstone behavior with STCS or with TWCS?
>
> If you're using DTCS, beware of its weird behavior and tricky
> configuration.
>
> On Sun, Jan 29, 2017 at 3:52 PM, John Sanda  wrote:
>
> Your partitioning key is text. If you have multiple entries per id you are
> likely hitting older cells that have expired. Descending only affects how
> the data is stored on disk, if you have to read the whole partition to find
> whichever time you are querying for you could potentially hit tombstones in
> other SSTables that contain the same "id". As mentioned previously, you
> need to add a time bucket to your partitioning key and definitely use
> DTCS/TWCS.
>
>
> As I mentioned previously, the UI only queries recent data, e.g., the past
> hour, past two hours, past day, past week. The UI does not query for
> anything older than the TTL which is 7 days. My understanding and
> expectation was that Cassandra would only scan live cells. The UI is a
> separate application that I do not maintain, so I am not 100% certain about
> the queries. I have been told that it does not query for anything older
> than 7 days.
>
> On Sun, Jan 29, 2017 at 4:14 AM, kurt greaves 
> wrote:
>
>
> Your partitioning key is text. If you have multiple entries per id you are
> likely hitting older cells that have expired. Descending only affects how
> the data is stored on disk, if you have to read the whole partition to find
> whichever time you are querying for you could potentially hit tombstones in
> other SSTables that contain the same "id". As mentioned previously, you
> need to add a time bucket to your partitioning key and definitely use
> DTCS/TWCS.
>
>
>
>
>
> --
>
> - John
>


Re: Time series data model and tombstones

2017-01-29 Thread DuyHai Doan
Did you get this overwhelming tombstone behavior with STCS or with TWCS?

If you're using DTCS, beware of its weird behavior and tricky configuration.

On Sun, Jan 29, 2017 at 3:52 PM, John Sanda  wrote:

> Your partitioning key is text. If you have multiple entries per id you are
>> likely hitting older cells that have expired. Descending only affects how
>> the data is stored on disk, if you have to read the whole partition to find
>> whichever time you are querying for you could potentially hit tombstones in
>> other SSTables that contain the same "id". As mentioned previously, you
>> need to add a time bucket to your partitioning key and definitely use
>> DTCS/TWCS.
>
>
> As I mentioned previously, the UI only queries recent data, e.g., the past
> hour, past two hours, past day, past week. The UI does not query for
> anything older than the TTL which is 7 days. My understanding and
> expectation was that Cassandra would only scan live cells. The UI is a
> separate application that I do not maintain, so I am not 100% certain about
> the queries. I have been told that it does not query for anything older
> than 7 days.
>
> On Sun, Jan 29, 2017 at 4:14 AM, kurt greaves 
> wrote:
>
>>
>> Your partitioning key is text. If you have multiple entries per id you
>> are likely hitting older cells that have expired. Descending only affects
>> how the data is stored on disk, if you have to read the whole partition to
>> find whichever time you are querying for you could potentially hit
>> tombstones in other SSTables that contain the same "id". As mentioned
>> previously, you need to add a time bucket to your partitioning key and
>> definitely use DTCS/TWCS.
>>
>
>
>
> --
>
> - John
>


Re: Time series data model and tombstones

2017-01-29 Thread John Sanda
>
> Your partitioning key is text. If you have multiple entries per id you are
> likely hitting older cells that have expired. Descending only affects how
> the data is stored on disk, if you have to read the whole partition to find
> whichever time you are querying for you could potentially hit tombstones in
> other SSTables that contain the same "id". As mentioned previously, you
> need to add a time bucket to your partitioning key and definitely use
> DTCS/TWCS.


As I mentioned previously, the UI only queries recent data, e.g., the past
hour, past two hours, past day, past week. The UI does not query for
anything older than the TTL which is 7 days. My understanding and
expectation was that Cassandra would only scan live cells. The UI is a
separate application that I do not maintain, so I am not 100% certain about
the queries. I have been told that it does not query for anything older
than 7 days.

On Sun, Jan 29, 2017 at 4:14 AM, kurt greaves  wrote:

>
> Your partitioning key is text. If you have multiple entries per id you are
> likely hitting older cells that have expired. Descending only affects how
> the data is stored on disk, if you have to read the whole partition to find
> whichever time you are querying for you could potentially hit tombstones in
> other SSTables that contain the same "id". As mentioned previously, you
> need to add a time bucket to your partitioning key and definitely use
> DTCS/TWCS.
>



-- 

- John


Re: Time series data model and tombstones

2017-01-29 Thread kurt greaves
Your partitioning key is text. If you have multiple entries per id you are
likely hitting older cells that have expired. Descending only affects how
the data is stored on disk; if you have to read the whole partition to find
whichever time you are querying for, you could potentially hit tombstones in
other SSTables that contain the same "id". As mentioned previously, you
need to add a time bucket to your partitioning key and definitely use
DTCS/TWCS.
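
A sketch of what that bucketing could look like for the metrics table from the
original post (the table name and the integer day format are illustrative, not from
the thread):

CREATE TABLE metrics_by_day (
    id text,
    day int,        -- e.g. 20170129; the time bucket added to the partition key
    time timeuuid,
    value double,
    PRIMARY KEY ((id, day), time)
) WITH CLUSTERING ORDER BY (time DESC);

Queries for the past hour or day then only touch the current bucket (or two around
midnight), and with TWCS the SSTables of fully expired buckets can be dropped whole
once the TTL has passed.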


Re: Time series data model and tombstones

2017-01-28 Thread Benjamin Roth
Maybe trace your queries to see what's happening in detail.

On 28.01.2017 at 21:32, "John Sanda" <john.sa...@gmail.com> wrote:

Thanks for the response. This version of the code is using STCS.
gc_grace_seconds was set to one day and then I changed it to zero since RF
= 1. I understand that expired data will still generate tombstones and that
STCS is not the best. More recent versions of the code use DTCS, and we'll
be switching over to TWCS shortly. The suggestions raised are excellent
ones, but I tend to think of them as optimizations that might not address
my issue which I think may be 1) a problem with my data model, 2) problem
with the queries used or 3) some misunderstanding of Cassandra performs
range scans.

I am doing append-only writes. There is no out of order data. There are no
deletes, just TTLs. Data is stored on disk in descending order, and queries
access recent data and never query past the TTL of seven days. Given this I
would not expect to be reading tombstones, certainly not the large numbers
that I am seeing.

On Sat, Jan 28, 2017 at 12:15 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> Since you didn't specify a compaction strategy I'm guessing you're using
> STCS. Your TTL'ed data is becoming a tombstone. TWCS is a better strategy
> for this type of workload.
> On Sat, Jan 28, 2017 at 8:30 AM John Sanda <john.sa...@gmail.com> wrote:
>
>> I have a time series data model that is basically:
>>
>> CREATE TABLE metrics (
>> id text,
>> time timeuuid,
>> value double,
>> PRIMARY KEY (id, time)
>> ) WITH CLUSTERING ORDER BY (time DESC);
>>
>> I do append-only writes, no deletes, and use a TTL of seven days. Data
>> points are written every second. The UI queries data for the past hour,
>> two hours, day, or week. The UI refreshes and executes queries every 30
>> seconds. In one test environment I am seeing lots of tombstone threshold
>> warnings and Cassandra has even OOME'd. Since I am storing data in
>> descending order and always query for recent data, I do not understand why
>> I am running into this problem.
>>
>> I know that it is recommended to do some date partitioning in part to
>> ensure partitions do not grow too large. I already have some changes in
>> place to partition by day. Before I make those changes I want to
>> understand why I am scanning so many tombstones so that I can be more
>> confident that the date partitioning changes will help.
>>
>> Thanks
>>
>> - John
>>
>


-- 

- John


Re: Time series data model and tombstones

2017-01-28 Thread John Sanda
Thanks for the response. This version of the code is using STCS.
gc_grace_seconds was set to one day and then I changed it to zero since RF
= 1. I understand that expired data will still generate tombstones and that
STCS is not the best. More recent versions of the code use DTCS, and we'll
be switching over to TWCS shortly. The suggestions raised are excellent
ones, but I tend to think of them as optimizations that might not address
my issue which I think may be 1) a problem with my data model, 2) problem
with the queries used, or 3) some misunderstanding of how Cassandra performs
range scans.

I am doing append-only writes. There is no out of order data. There are no
deletes, just TTLs. Data is stored on disk in descending order, and queries
access recent data and never query past the TTL of seven days. Given this I
would not expect to be reading tombstones, certainly not the large numbers
that I am seeing.

On Sat, Jan 28, 2017 at 12:15 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> Since you didn't specify a compaction strategy I'm guessing you're using
> STCS. Your TTL'ed data is becoming a tombstone. TWCS is a better strategy
> for this type of workload.
> On Sat, Jan 28, 2017 at 8:30 AM John Sanda <john.sa...@gmail.com> wrote:
>
>> I have a time series data model that is basically:
>>
>> CREATE TABLE metrics (
>> id text,
>> time timeuuid,
>> value double,
>> PRIMARY KEY (id, time)
>> ) WITH CLUSTERING ORDER BY (time DESC);
>>
>> I do append-only writes, no deletes, and use a TTL of seven days. Data
>> points are written every second. The UI queries data for the past hour,
>> two hours, day, or week. The UI refreshes and executes queries every 30
>> seconds. In one test environment I am seeing lots of tombstone threshold
>> warnings and Cassandra has even OOME'd. Since I am storing data in
>> descending order and always query for recent data, I do not understand why
>> I am running into this problem.
>>
>> I know that it is recommended to do some date partitioning in part to
>> ensure partitions do not grow too large. I already have some changes in
>> place to partition by day. Before I make those changes I want to
>> understand why I am scanning so many tombstones so that I can be more
>> confident that the date partitioning changes will help.
>>
>> Thanks
>>
>> - John
>>
>


-- 

- John


Re: Time series data model and tombstones

2017-01-28 Thread Jonathan Haddad
Since you didn't specify a compaction strategy I'm guessing you're using
STCS. Your TTL'ed data is becoming a tombstone. TWCS is a better strategy
for this type of workload.
On Sat, Jan 28, 2017 at 8:30 AM John Sanda <john.sa...@gmail.com> wrote:

> I have a time series data model that is basically:
>
> CREATE TABLE metrics (
> id text,
> time timeuuid,
> value double,
> PRIMARY KEY (id, time)
> ) WITH CLUSTERING ORDER BY (time DESC);
>
> I do append-only writes, no deletes, and use a TTL of seven days. Data
> points are written every second. The UI queries data for the past hour,
> two hours, day, or week. The UI refreshes and executes queries every 30
> seconds. In one test environment I am seeing lots of tombstone threshold
> warnings and Cassandra has even OOME'd. Since I am storing data in
> descending order and always query for recent data, I do not understand why
> I am running into this problem.
>
> I know that it is recommended to do some date partitioning in part to
> ensure partitions do not grow too large. I already have some changes in
> place to partition by day. Before I make those changes I want to
> understand why I am scanning so many tombstones so that I can be more
> confident that the date partitioning changes will help.
>
> Thanks
>
> - John
>


Re: Time series data model and tombstones

2017-01-28 Thread DuyHai Doan
When the data expires (after the TTL of 7 days), it is transformed into
tombstones at the next compaction, and those tombstones will still stay there
during gc_grace_seconds. After that, they will be completely removed at the
next compaction, if there is any ...

So doing some maths, supposing that you have left gc_grace_seconds at its
default value of 10 days, you'll have tombstones for 10 days' worth of
data before they eventually get removed...

What is your compaction strategy ? I strongly suggest

1) Setting the TTL directly as a table property (ALTER TABLE) instead of
setting it at query level (INSERT INTO ... USING TTL). When the TTL is set at
table level, Cassandra can perform some optimizations and drop some SSTables
entirely, without even bothering to compact them

2) Use TimeWindowCompactionStrategy and tune it properly to accommodate your
workload
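
A sketch of suggestion 1) for the metrics table from this thread, using the 7-day
TTL mentioned earlier (604800 seconds); gc_grace_seconds is shown only to make the
maths above explicit and should be chosen to fit the cluster's repair schedule:

-- 604800 s = 7 days (the application's TTL); 864000 s = the 10-day
-- gc_grace_seconds default used in the calculation above
ALTER TABLE metrics
  WITH default_time_to_live = 604800
  AND gc_grace_seconds = 864000;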



On Sat, Jan 28, 2017 at 5:30 PM, John Sanda <john.sa...@gmail.com> wrote:

> I have a time series data model that is basically:
>
> CREATE TABLE metrics (
> id text,
> time timeuuid,
> value double,
> PRIMARY KEY (id, time)
> ) WITH CLUSTERING ORDER BY (time DESC);
>
> I do append-only writes, no deletes, and use a TTL of seven days. Data
> points are written every second. The UI queries data for the past hour,
> two hours, day, or week. The UI refreshes and executes queries every 30
> seconds. In one test environment I am seeing lots of tombstone threshold
> warnings and Cassandra has even OOME'd. Since I am storing data in
> descending order and always query for recent data, I do not understand why
> I am running into this problem.
>
> I know that it is recommended to do some date partitioning in part to
> ensure partitions do not grow too large. I already have some changes in
> place to partition by day. Before I make those changes I want to
> understand why I am scanning so many tombstones so that I can be more
> confident that the date partitioning changes will help.
>
> Thanks
>
> - John
>


Time series data model and tombstones

2017-01-28 Thread John Sanda
I have a time series data model that is basically:

CREATE TABLE metrics (
id text,
time timeuuid,
value double,
PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY (time DESC);

I do append-only writes, no deletes, and use a TTL of seven days. Data
points are written every second. The UI queries data for the past hour,
two hours, day, or week. The UI refreshes and executes queries every 30
seconds. In one test environment I am seeing lots of tombstone threshold
warnings and Cassandra has even OOME'd. Since I am storing data in
descending order and always query for recent data, I do not understand why
I am running into this problem.

I know that it is recommended to do some date partitioning in part to
ensure partitions do not grow too large. I already have some changes in
place to partition by day. Before I make those changes I want to
understand why I am scanning so many tombstones so that I can be more
confident that the date partitioning changes will help.

Thanks

- John
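
The UI's actual queries are not shown in this thread, but a "past hour" read against
this schema would presumably be a slice of the following shape (the id value and the
hour bounds are placeholders):

SELECT time, value
FROM metrics
WHERE id = 'some-metric-id'                          -- placeholder id
  AND time > maxTimeuuid('2017-01-28 10:00+0000')    -- illustrative window
  AND time < minTimeuuid('2017-01-28 11:00+0000');

Even though such a slice only asks for live data, the replies above discuss why it
can still end up scanning tombstones in the same partition.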


Re: time series data model

2016-10-24 Thread kurt Greaves
On 20 October 2016 at 09:29, wxn...@zjqunshuo.com 
wrote:

> I do need to align the time windows to a day bucket to prevent one row
> from becoming too big, and event_time is a timestamp since the unix epoch. If I
> use bigint as the type of event_time, can I do queries as you mentioned?


Yes.

Kurt Greaves
k...@instaclustr.com
www.instaclustr.com
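
For example, against the bucketed table from the original post, "latest events after
x" stays a straightforward slice even with event_time as a bigint (device, date, and
cutoff taken from the sample data quoted in this thread):

SELECT * FROM cargts.eventdata
WHERE deviceid = 186628
  AND date = 20160928
  AND event_time > 1474992002000;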


Re: time series data model

2016-10-20 Thread wxn...@zjqunshuo.com
Hi Kurt,
I do need to align the time windows to a day bucket to prevent one row from
becoming too big, and event_time is a timestamp since the unix epoch. If I use
bigint as the type of event_time, can I do queries as you mentioned?

-Simon Wu
 
From: kurt Greaves
Date: 2016-10-20 16:18
To: user
Subject: Re: time series data model
If event_time is timestamps since unix epoch you 1. may want to use the 
in-built timestamps type, and 2. order by event_time DESC. 2 applies if you 
want to do queries such as "select * from eventdata where ... and event_time > 
x" (i.e; get latest events).

Other than that your model seems workable, I assume you're using DTCS/TWCS, and 
aligning the time windows to your day bucket. (If not you should do that)

Kurt Greaves
k...@instaclustr.com
www.instaclustr.com

On 20 October 2016 at 07:29, wxn...@zjqunshuo.com <wxn...@zjqunshuo.com> wrote:
Hi All,
I'm trying to migrate my time series data which is GPS trace from mysql to C*. 
I want a wide row to hold one day data. I designed the data model as below. 
Please help to see if there is any problem. Any suggestion is appreciated.

Table Model:
CREATE TABLE cargts.eventdata (
deviceid int,
date int,
event_time bigint,
position text,
PRIMARY KEY ((deviceid, date), event_time)
)

A slice of data:
cqlsh:cargts> SELECT * FROM eventdata WHERE deviceid =186628 and date = 
20160928 LIMIT 10;

 deviceid | date | event_time| position
--+--+---+-
   186628 | 20160928 | 1474992002000 |  
{"latitude":30.343443936386247,"longitude":120.08751351828943,"speed":41,"heading":48}
   186628 | 20160928 | 1474992012000 |   
{"latitude":30.34409508979662,"longitude":120.08840022183352,"speed":45,"heading":53}
   186628 | 20160928 | 1474992022000 |   
{"latitude":30.34461639856887,"longitude":120.08946100336443,"speed":28,"heading":65}
   186628 | 20160928 | 1474992032000 |   
{"latitude":30.34469478717028,"longitude":120.08973154015409,"speed":11,"heading":67}
   186628 | 20160928 | 1474992042000 |   
{"latitude":30.34494998929474,"longitude":120.09027263811151,"speed":19,"heading":47}
   186628 | 20160928 | 1474992052000 | 
{"latitude":30.346057349126617,"longitude":120.08967091817931,"speed":41,"heading":323}
   186628 | 20160928 | 1474992062000 |
{"latitude":30.346997145708,"longitude":120.08883508853253,"speed":52,"heading":323}
   186628 | 20160928 | 1474992072000 | 
{"latitude":30.348131044340988,"longitude":120.08774702315581,"speed":65,"heading":321}
   186628 | 20160928 | 1474992082000 | 
{"latitude":30.349438164412838,"longitude":120.08652612959328,"speed":68,"heading":322}

-Simon Wu



Re: time series data model

2016-10-20 Thread wxn...@zjqunshuo.com
Thank you Kurt. I thought the one column identified by the composite
key (deviceId+date+event_time) could hold only one value, so I packaged all the
info into one JSON. Maybe I'm wrong. I rewrote the table as below.

CREATE TABLE cargts.eventdata (
deviceid int,
date int,
event_time bigint,
heading int,
lat decimal,
lon decimal,
speed int,
PRIMARY KEY ((deviceid, date), event_time)
)
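
For reference, a single GPS point from the old JSON format maps onto the new columns
roughly as follows (values taken from the sample row below; a sketch, not part of
the original message):

INSERT INTO cargts.eventdata (deviceid, date, event_time, heading, lat, lon, speed)
VALUES (186628, 20160928, 1474992002005, 48, 30.343443, 120.087514, 41);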

cqlsh:cargts> select * from eventdata;

 deviceid | date | event_time| heading | lat   | lon| speed
--+--+---+-+---++---
   186628 | 20160928 | 1474992002005 |  48 | 30.343443 | 120.087514 |41

-Simon Wu

From: kurt Greaves
Date: 2016-10-20 16:23
To: user
Subject: Re: time series data model
Ah didn't pick up on that but looks like he's storing JSON within position. Is 
there any strong reason for this or as Vladimir mentioned can you store the 
fields under "position" in separate columns?

Kurt Greaves
k...@instaclustr.com
www.instaclustr.com

On 20 October 2016 at 08:17, Vladimir Yudovin <vla...@winguzone.com> wrote:
Hi Simon,

Why position is text and not float? Text takes much more place.
Also speed and headings can be calculated basing on latest positions, so you 
can also save them. If you really need it in data base you can save them as 
floats, or compose single float value like speed.heading: 41.173 (or opposite, 
heading.speed) and save column storage overhead.


Best regards, Vladimir Yudovin, 
Winguzone - Hosted Cloud Cassandra
Launch your cluster in minutes.


 On Thu, 20 Oct 2016 03:29:16 -0400, <wxn...@zjqunshuo.com> wrote:

Hi All,
I'm trying to migrate my time series data which is GPS trace from mysql to C*. 
I want a wide row to hold one day data. I designed the data model as below. 
Please help to see if there is any problem. Any suggestion is appreciated.

Table Model:
CREATE TABLE cargts.eventdata (
deviceid int,
date int,
event_time bigint,
position text,
PRIMARY KEY ((deviceid, date), event_time)
)

A slice of data:
cqlsh:cargts> SELECT * FROM eventdata WHERE deviceid =186628 and date = 
20160928 LIMIT 10;

 deviceid | date | event_time| position
--+--+---+-
   186628 | 20160928 | 1474992002000 |  
{"latitude":30.343443936386247,"longitude":120.08751351828943,"speed":41,"heading":48}
   186628 | 20160928 | 1474992012000 |   
{"latitude":30.34409508979662,"longitude":120.08840022183352,"speed":45,"heading":53}
   186628 | 20160928 | 1474992022000 |   
{"latitude":30.34461639856887,"longitude":120.08946100336443,"speed":28,"heading":65}
   186628 | 20160928 | 1474992032000 |   
{"latitude":30.34469478717028,"longitude":120.08973154015409,"speed":11,"heading":67}
   186628 | 20160928 | 1474992042000 |   
{"latitude":30.34494998929474,"longitude":120.09027263811151,"speed":19,"heading":47}
   186628 | 20160928 | 1474992052000 | 
{"latitude":30.346057349126617,"longitude":120.08967091817931,"speed":41,"heading":323}
   186628 | 20160928 | 1474992062000 |
{"latitude":30.346997145708,"longitude":120.08883508853253,"speed":52,"heading":323}
   186628 | 20160928 | 1474992072000 | 
{"latitude":30.348131044340988,"longitude":120.08774702315581,"speed":65,"heading":321}
   186628 | 20160928 | 1474992082000 | 
{"latitude":30.349438164412838,"longitude":120.08652612959328,"speed":68,"heading":322}

-Simon Wu




Re: time series data model

2016-10-20 Thread kurt Greaves
If event_time is a timestamp since the unix epoch you 1. may want to use the
built-in timestamp type, and 2. order by event_time DESC. 2 applies if you
want to do queries such as "select * from eventdata where ... and
event_time > x" (i.e. get the latest events).

Other than that your model seems workable, I assume you're using DTCS/TWCS,
and aligning the time windows to your day bucket. (If not you should do
that)
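
Applied to the table from the original post, that advice might look like the sketch
below (position is kept as-is here; splitting it into separate columns is discussed
elsewhere in the thread):

CREATE TABLE cargts.eventdata (
    deviceid int,
    date int,
    event_time timestamp,   -- built-in timestamp type instead of bigint millis
    position text,
    PRIMARY KEY ((deviceid, date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);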

Kurt Greaves
k...@instaclustr.com
www.instaclustr.com

On 20 October 2016 at 07:29, wxn...@zjqunshuo.com 
wrote:

> Hi All,
> I'm trying to migrate my time series data which is GPS trace from mysql to
> C*. I want a wide row to hold one day data. I designed the data model as
> below. Please help to see if there is any problem. Any suggestion is
> appreciated.
>
> Table Model:
> CREATE TABLE cargts.eventdata (
> deviceid int,
> date int,
> event_time bigint,
> position text,
> PRIMARY KEY ((deviceid, date), event_time)
> )
>
> A slice of data:
> cqlsh:cargts> SELECT * FROM eventdata WHERE deviceid =
> 186628 and date = 20160928 LIMIT 10;
>
>  deviceid | date | event_time| position
> --+--+---+--
> ---
>186628 | 20160928 | 1474992002000 |  {"latitude":
> 30.343443936386247,"longitude":120.08751351828943,"speed":41,"heading":48}
>186628 | 20160928 | 1474992012000 |   {"latitude":
> 30.34409508979662,"longitude":120.08840022183352,"speed":45,"heading":53}
>186628 | 20160928 | 1474992022000 |   {"latitude":
> 30.34461639856887,"longitude":120.08946100336443,"speed":28,"heading":65}
>186628 | 20160928 | 1474992032000 |   {"latitude":
> 30.34469478717028,"longitude":120.08973154015409,"speed":11,"heading":67}
>186628 | 20160928 | 1474992042000 |   {"latitude":
> 30.34494998929474,"longitude":120.09027263811151,"speed":19,"heading":47}
>186628 | 20160928 | 1474992052000 | {"latitude":
> 30.346057349126617,"longitude":120.08967091817931,"speed":
> 41,"heading":323}
>186628 | 20160928 | 1474992062000 |{"latitude"
> :30.346997145708,"longitude":120.08883508853253,"speed":52,"heading":323}
>186628 | 20160928 | 1474992072000 | {"latitude":
> 30.348131044340988,"longitude":120.08774702315581,"speed":
> 65,"heading":321}
>186628 | 20160928 | 1474992082000 | {"latitude":
> 30.349438164412838,"longitude":120.08652612959328,"speed":
> 68,"heading":322}
>
> -Simon Wu
>


Re: time series data model

2016-10-20 Thread kurt Greaves
Ah didn't pick up on that but looks like he's storing JSON within position.
Is there any strong reason for this or as Vladimir mentioned can you store
the fields under "position" in separate columns?

Kurt Greaves
k...@instaclustr.com
www.instaclustr.com

On 20 October 2016 at 08:17, Vladimir Yudovin  wrote:

> Hi Simon,
>
> Why *position *is text and not float? Text takes much more place.
> Also speed and headings can be calculated basing on latest positions, so
> you can also save them. If you really need it in data base you can save
> them as floats, or compose single float value like speed.heading: 41.173
> (or opposite, heading.speed) and save column storage overhead.
>
>
> Best regards, Vladimir Yudovin,
>
> Winguzone - Hosted Cloud Cassandra
> Launch your cluster in minutes.
>
>
> On Thu, 20 Oct 2016 03:29:16 -0400, <wxn...@zjqunshuo.com> wrote:
>
> Hi All,
> I'm trying to migrate my time series data which is GPS trace from mysql to
> C*. I want a wide row to hold one day data. I designed the data model as
> below. Please help to see if there is any problem. Any suggestion is
> appreciated.
>
> Table Model:
> CREATE TABLE cargts.eventdata (
> deviceid int,
> date int,
> event_time bigint,
> position text,
> PRIMARY KEY ((deviceid, date), event_time)
> )
>
> A slice of data:
> cqlsh:cargts> SELECT * FROM eventdata WHERE deviceid =
> 186628 and date = 20160928 LIMIT 10;
>
>  deviceid | date | event_time| position
> --+--+---+--
> ---
>186628 | 20160928 | 1474992002000 |  {"latitude":
> 30.343443936386247,"longitude":120.08751351828943,"speed":41,"heading":48}
>186628 | 20160928 | 1474992012000 |   {"latitude":
> 30.34409508979662,"longitude":120.08840022183352,"speed":45,"heading":53}
>186628 | 20160928 | 1474992022000 |   {"latitude":
> 30.34461639856887,"longitude":120.08946100336443,"speed":28,"heading":65}
>186628 | 20160928 | 1474992032000 |   {"latitude":
> 30.34469478717028,"longitude":120.08973154015409,"speed":11,"heading":67}
>186628 | 20160928 | 1474992042000 |   {"latitude":
> 30.34494998929474,"longitude":120.09027263811151,"speed":19,"heading":47}
>186628 | 20160928 | 1474992052000 | {"latitude":
> 30.346057349126617,"longitude":120.08967091817931,"speed":
> 41,"heading":323}
>186628 | 20160928 | 1474992062000 |{"latitude"
> :30.346997145708,"longitude":120.08883508853253,"speed":52,"heading":323}
>186628 | 20160928 | 1474992072000 | {"latitude":
> 30.348131044340988,"longitude":120.08774702315581,"speed":
> 65,"heading":321}
>186628 | 20160928 | 1474992082000 | {"latitude":
> 30.349438164412838,"longitude":120.08652612959328,"speed":
> 68,"heading":322}
>
> -Simon Wu
>
>
>


Re: time series data model

2016-10-20 Thread Vladimir Yudovin
Hi Simon,

Why is position text and not float? Text takes much more space.
Also, speed and heading can be calculated based on the latest positions, so you
can avoid storing them. If you really need them in the database you can save them as
floats, or compose a single float value like speed.heading: 41.173 (or the opposite,
heading.speed) and save the column storage overhead.



Best regards, Vladimir Yudovin, 

Winguzone - Hosted Cloud Cassandra
Launch your cluster in minutes.





 On Thu, 20 Oct 2016 03:29:16 -0400, wxn...@zjqunshuo.com wrote:




Hi All,

I'm trying to migrate my time series data which is GPS trace from mysql to C*. 
I want a wide row to hold one day data. I designed the data model as below. 
Please help to see if there is any problem. Any suggestion is appreciated.



Table Model:

CREATE TABLE cargts.eventdata (
deviceid int,
date int,
event_time bigint,
position text,
PRIMARY KEY ((deviceid, date), event_time)
)


A slice of data:

cqlsh:cargts> SELECT * FROM eventdata WHERE deviceid =186628 and date = 
20160928 LIMIT 10;

 deviceid | date | event_time| position
--+--+---+-
   186628 | 20160928 | 1474992002000 |  
{"latitude":30.343443936386247,"longitude":120.08751351828943,"speed":41,"heading":48}
   186628 | 20160928 | 1474992012000 |   
{"latitude":30.34409508979662,"longitude":120.08840022183352,"speed":45,"heading":53}
   186628 | 20160928 | 1474992022000 |   
{"latitude":30.34461639856887,"longitude":120.08946100336443,"speed":28,"heading":65}
   186628 | 20160928 | 1474992032000 |   
{"latitude":30.34469478717028,"longitude":120.08973154015409,"speed":11,"heading":67}
   186628 | 20160928 | 1474992042000 |   
{"latitude":30.34494998929474,"longitude":120.09027263811151,"speed":19,"heading":47}
   186628 | 20160928 | 1474992052000 | 
{"latitude":30.346057349126617,"longitude":120.08967091817931,"speed":41,"heading":323}
   186628 | 20160928 | 1474992062000 |
{"latitude":30.346997145708,"longitude":120.08883508853253,"speed":52,"heading":323}
   186628 | 20160928 | 1474992072000 | 
{"latitude":30.348131044340988,"longitude":120.08774702315581,"speed":65,"heading":321}
   186628 | 20160928 | 1474992082000 | 
{"latitude":30.349438164412838,"longitude":120.08652612959328,"speed":68,"heading":322}


-Simon Wu








time series data model

2016-10-20 Thread wxn...@zjqunshuo.com
Hi All,
I'm trying to migrate my time series data which is GPS trace from mysql to C*. 
I want a wide row to hold one day data. I designed the data model as below. 
Please help to see if there is any problem. Any suggestion is appreciated.

Table Model:
CREATE TABLE cargts.eventdata (
deviceid int,
date int,
event_time bigint,
position text,
PRIMARY KEY ((deviceid, date), event_time)
)

A slice of data:
cqlsh:cargts> SELECT * FROM eventdata WHERE deviceid =186628 and date = 
20160928 LIMIT 10;

 deviceid | date | event_time| position
--+--+---+-
   186628 | 20160928 | 1474992002000 |  
{"latitude":30.343443936386247,"longitude":120.08751351828943,"speed":41,"heading":48}
   186628 | 20160928 | 1474992012000 |   
{"latitude":30.34409508979662,"longitude":120.08840022183352,"speed":45,"heading":53}
   186628 | 20160928 | 1474992022000 |   
{"latitude":30.34461639856887,"longitude":120.08946100336443,"speed":28,"heading":65}
   186628 | 20160928 | 1474992032000 |   
{"latitude":30.34469478717028,"longitude":120.08973154015409,"speed":11,"heading":67}
   186628 | 20160928 | 1474992042000 |   
{"latitude":30.34494998929474,"longitude":120.09027263811151,"speed":19,"heading":47}
   186628 | 20160928 | 1474992052000 | 
{"latitude":30.346057349126617,"longitude":120.08967091817931,"speed":41,"heading":323}
   186628 | 20160928 | 1474992062000 |
{"latitude":30.346997145708,"longitude":120.08883508853253,"speed":52,"heading":323}
   186628 | 20160928 | 1474992072000 | 
{"latitude":30.348131044340988,"longitude":120.08774702315581,"speed":65,"heading":321}
   186628 | 20160928 | 1474992082000 | 
{"latitude":30.349438164412838,"longitude":120.08652612959328,"speed":68,"heading":322}

-Simon Wu
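
For reference, writing a point into this one-day partition from Node.js could look roughly like the sketch below, assuming the DataStax cassandra-driver package; the contact point, data center name and sample values are placeholders, not taken from the thread.

var cassandra = require('cassandra-driver');

// Placeholder connection settings; adjust for your own cluster.
var client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'cargts'
});

var now = new Date();
// Partition key components: the device id plus the UTC day encoded as yyyymmdd.
var day = now.getUTCFullYear() * 10000 + (now.getUTCMonth() + 1) * 100 + now.getUTCDate();
var position = JSON.stringify({ latitude: 30.34, longitude: 120.09, speed: 41, heading: 48 });

var query = 'INSERT INTO eventdata (deviceid, date, event_time, position) VALUES (?, ?, ?, ?)';
var params = [186628, day, cassandra.types.Long.fromNumber(now.getTime()), position];

client.execute(query, params, { prepare: true })
  .then(function () { return client.shutdown(); })
  .catch(function (err) { console.error(err); return client.shutdown(); });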


Re: timezone time series data model

2012-05-02 Thread samal
This will work. I have tried both; each gave a unique one-day bucket.

I just realized that if I sync all clients to one zone then the date will remain
the same for all.

A single-zone date will give a materialized view of the row.

On Mon, Apr 30, 2012 at 11:43 PM, samal samalgo...@gmail.com wrote:

 hhmm. I will try both. thanks


 On Mon, Apr 30, 2012 at 11:29 PM, Tyler Hobbs ty...@datastax.com wrote:

 Err, sorry, I should have said ts - (ts % 86400).  Integer division does
 something similar.


 On Mon, Apr 30, 2012 at 12:39 PM, samal samalgo...@gmail.com wrote:

 thanks I didn't noticed.
 run script for 5  minutes = divide seems to produce result ,modulo is
 still changing. If divide is ok will do the trick.
 I will run this script on Singapore, East coast server, and New delhi
 server whole night today.

 ==
 unix  =   1335806983422
 unix /1000=   1335806983.422
 Divid i/86400 =   15460.728969907408
 Divid i/86400 INT =   15460
 Modulo i%86400=   62983
 ==
 ==
 unix  =   1335806985421
 unix /1000=   1335806985.421
 Divid i/86400 =   15460.72899306
 Divid i/86400 INT =   15460
 Modulo i%86400=   62985
 ==
 ==
 unix  =   1335806987422
 unix /1000=   1335806987.422
 Divid i/86400 =   15460.729016203704
 Divid i/86400 INT =   15460
 Modulo i%86400=   62987
 ==
 ==
 unix  =   1335806989422
 unix /1000=   1335806989.422
 Divid i/86400 =   15460.729039351852
 Divid i/86400 INT =   15460
 Modulo i%86400=   62989
 ==
 ==
 unix  =   1335806991421
 unix /1000=   1335806991.421
 Divid i/86400 =   15460.7290625
 Divid i/86400 INT =   15460
 Modulo i%86400=   62991
 ==
 ==
 unix  =   1335806993422
 unix /1000=   1335806993.422
 Divid i/86400 =   15460.729085648149
 Divid i/86400 INT =   15460
 Modulo i%86400=   62993
 ==
 ==
 unix  =   1335806995422
 unix /1000=   1335806995.422
 Divid i/86400 =   15460.729108796297
 Divid i/86400 INT =   15460
 Modulo i%86400=   62995
 ==
 ==
 unix  =   1335806997421
 unix /1000=   1335806997.421
 Divid i/86400 =   15460.72913195
 Divid i/86400 INT =   15460
 Modulo i%86400=   62997
 ==
 ==
 unix  =   1335806999422
 unix /1000=   1335806999.422
 Divid i/86400 =   15460.729155092593
 Divid i/86400 INT =   15460
 Modulo i%86400=   62999
 ==


 On Mon, Apr 30, 2012 at 10:44 PM, Tyler Hobbs ty...@datastax.comwrote:

 getTime() returns the number of milliseconds since the epoch, not the
 number of seconds: http://www.w3schools.com/jsref/jsref_gettime.asp

 If you divide that number by 1000, it should work.


 On Mon, Apr 30, 2012 at 11:28 AM, samal samalgo...@gmail.com wrote:

 I did it with node.js but it is changing after some interval.

 code
 setInterval(function(){
   var d =new Date().getTime();
   console.log(== );
   console.log(unix =  ,d);
   i=parseInt(d)
   console.log(Divid i/86400=  ,i/86400);
   console.log(Modulo i%86400= ,i%86400);
   console.log(== );
 },2000);

 /code
 Am I doing wrong?


 On Mon, Apr 30, 2012 at 9:54 PM, Tyler Hobbs ty...@datastax.comwrote:

 Correct, that's exactly what I'm saying.


 On Mon, Apr 30, 2012 at 10:37 AM, samal samalgo...@gmail.com wrote:

 thanks tyler for reply.

 are you saying  user1uuid_*{ts%86400}* would lead to unique day
 bucket which will be timezone {NZ to US} independent? I will try.


 On Mon, Apr 30, 2012 at 8:25 PM, Tyler Hobbs ty...@datastax.comwrote:

 Don't use dates or datestamps as the buckets for your row keys, use
 a unix timestamp modulo whatever size you want your bucket to be 
 instead.
 Timestamps don't involve time zones or any of that nonsense.

 So, instead of having keys like user1uuid_30042012, the second
 half would be replaced the current unix timestamp mod 86400 (the 
 number of
 seconds in a day).


 On Mon, Apr 30, 2012 at 1:46 AM, samal samalgo...@gmail.comwrote:

 Hello List,

 I need suggestion/ recommendation on time series data.

 I have requirement where users belongs to different timezone and
 they can subscribe to global group.
 When users at specific timezone send update to group it is
 available to every user in different timezone.

 I am using GroupSubscribedUsers CF where all update to group are
 push to Each User time line, and key is timelined by 
 useruuid_date(one
 day update of all groups) and columns are group updates.

 GroupSubscribedUsers ={
 user1uuid_30042012:{//this user belongs to same timezone
  timeuuid1:JSON[group1update1]
  

timezone time series data model

2012-04-30 Thread samal
Hello List,

I need a suggestion/recommendation on time series data.

I have a requirement where users belong to different timezones and they can
subscribe to a global group.
When a user in a specific timezone sends an update to the group, it is available to
every user in the other timezones.

I am using a GroupSubscribedUsers CF where all updates to a group are pushed to
each user's timeline, and the key is timelined by useruuid_date (one day's updates
from all groups) and the columns are the group updates.

GroupSubscribedUsers ={
user1uuid_30042012:{//this user belongs to same timezone
 timeuuid1:JSON[group1update1]
 timeuuid2:JSON[group2update2]
 timeuuid3:JSON[group1update2]
timeuuid4:JSON[group4update1]
   },
  user2uuid_30042012:{//this user belongs to different timezone where date
has changed already  to 1may but  30 april is getting update
 timeuuid1:JSON[group1update1]
 timeuuid2:JSON[group2update2]
 timeuuid3:JSON[group1update2]
timeuuid4:JSON[group4update1]
timeuuid5:JSON[groupNupdate1]
   },

}

I have noticed this approach is good for a single time zone; when different
timezones come into the picture it breaks.

I am thinking of something like: when a user pushes an update to a group -> get the users who are
subscribed to the group -> check each user's timezone -> push the time series in the user's time
zone. So for one user the update will be on 30 April whereas others may have it on
29 April or 1 May, and using timestamps I can find out how many hours ago the update came.

Is there any better approach?


Thanks,

Samal


Re: timezone time series data model

2012-04-30 Thread Tyler Hobbs
Don't use dates or datestamps as the buckets for your row keys, use a unix
timestamp modulo whatever size you want your bucket to be instead.
Timestamps don't involve time zones or any of that nonsense.

So, instead of having keys like user1uuid_30042012, the second half would
be replaced with the current unix timestamp mod 86400 (the number of seconds in
a day).

On Mon, Apr 30, 2012 at 1:46 AM, samal samalgo...@gmail.com wrote:

 Hello List,

 I need suggestion/ recommendation on time series data.

 I have requirement where users belongs to different timezone and they can
 subscribe to global group.
 When users at specific timezone send update to group it is available to
 every user in different timezone.

 I am using GroupSubscribedUsers CF where all update to group are push to
 Each User time line, and key is timelined by useruuid_date(one day update
 of all groups) and columns are group updates.

 GroupSubscribedUsers ={
 user1uuid_30042012:{//this user belongs to same timezone
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
},
   user2uuid_30042012:{//this user belongs to different timezone where
 date has changed already  to 1may but  30 april is getting update
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
 timeuuid5:JSON[groupNupdate1]
},

 }

 I have noticed  this approach is good for single time zone when different
 timezone come into picture it breaks.

 I am thinking of like when user pushed update to group -get user who is
 subscribed to group-check user timezone-push time series in user time
 zone. So for one user update will be on 30april where as other may have on
 29april and 1may, using timestamps i can find out hours ago update came.

 Is there any better approach?


 Thanks,

 Samal





-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: timezone time series data model

2012-04-30 Thread samal
Thanks, Tyler, for the reply.

Are you saying user1uuid_*{ts%86400}* would lead to a unique day bucket
which will be timezone-independent {NZ to US}? I will try.

On Mon, Apr 30, 2012 at 8:25 PM, Tyler Hobbs ty...@datastax.com wrote:

 Don't use dates or datestamps as the buckets for your row keys, use a unix
 timestamp modulo whatever size you want your bucket to be instead.
 Timestamps don't involve time zones or any of that nonsense.

 So, instead of having keys like user1uuid_30042012, the second half
 would be replaced the current unix timestamp mod 86400 (the number of
 seconds in a day).


 On Mon, Apr 30, 2012 at 1:46 AM, samal samalgo...@gmail.com wrote:

 Hello List,

 I need suggestion/ recommendation on time series data.

 I have requirement where users belongs to different timezone and they can
 subscribe to global group.
 When users at specific timezone send update to group it is available to
 every user in different timezone.

 I am using GroupSubscribedUsers CF where all update to group are push to
 Each User time line, and key is timelined by useruuid_date(one day update
 of all groups) and columns are group updates.

 GroupSubscribedUsers ={
 user1uuid_30042012:{//this user belongs to same timezone
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
},
   user2uuid_30042012:{//this user belongs to different timezone where
 date has changed already  to 1may but  30 april is getting update
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
 timeuuid5:JSON[groupNupdate1]
},

 }

 I have noticed  this approach is good for single time zone when different
 timezone come into picture it breaks.

 I am thinking of like when user pushed update to group -get user who is
 subscribed to group-check user timezone-push time series in user time
 zone. So for one user update will be on 30april where as other may have on
 29april and 1may, using timestamps i can find out hours ago update came.

 Is there any better approach?


 Thanks,

 Samal





 --
 Tyler Hobbs
 DataStax http://datastax.com/




Re: timezone time series data model

2012-04-30 Thread Tyler Hobbs
Correct, that's exactly what I'm saying.

On Mon, Apr 30, 2012 at 10:37 AM, samal samalgo...@gmail.com wrote:

 thanks tyler for reply.

 are you saying  user1uuid_*{ts%86400}* would lead to unique day bucket
 which will be timezone {NZ to US} independent? I will try.


 On Mon, Apr 30, 2012 at 8:25 PM, Tyler Hobbs ty...@datastax.com wrote:

 Don't use dates or datestamps as the buckets for your row keys, use a
 unix timestamp modulo whatever size you want your bucket to be instead.
 Timestamps don't involve time zones or any of that nonsense.

 So, instead of having keys like user1uuid_30042012, the second half
 would be replaced the current unix timestamp mod 86400 (the number of
 seconds in a day).


 On Mon, Apr 30, 2012 at 1:46 AM, samal samalgo...@gmail.com wrote:

 Hello List,

 I need suggestion/ recommendation on time series data.

 I have requirement where users belongs to different timezone and they
 can subscribe to global group.
 When users at specific timezone send update to group it is available to
 every user in different timezone.

 I am using GroupSubscribedUsers CF where all update to group are push to
 Each User time line, and key is timelined by useruuid_date(one day update
 of all groups) and columns are group updates.

 GroupSubscribedUsers ={
 user1uuid_30042012:{//this user belongs to same timezone
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
},
   user2uuid_30042012:{//this user belongs to different timezone where
 date has changed already  to 1may but  30 april is getting update
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
 timeuuid5:JSON[groupNupdate1]
},

 }

 I have noticed  this approach is good for single time zone when
 different timezone come into picture it breaks.

 I am thinking of like when user pushed update to group -get user who is
 subscribed to group-check user timezone-push time series in user time
 zone. So for one user update will be on 30april where as other may have on
 29april and 1may, using timestamps i can find out hours ago update came.

 Is there any better approach?


 Thanks,

 Samal





 --
 Tyler Hobbs
 DataStax http://datastax.com/





-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: timezone time series data model

2012-04-30 Thread samal
I did it with node.js but it is changing after some interval.

setInterval(function(){
  // getTime() gives the current time; log the raw value, the division and the modulo.
  var d = new Date().getTime();
  console.log("==");
  console.log("unix          = ", d);
  var i = parseInt(d, 10);
  console.log("Divid i/86400 = ", i / 86400);
  console.log("Modulo i%86400= ", i % 86400);
  console.log("==");
}, 2000);

Am I doing something wrong?

On Mon, Apr 30, 2012 at 9:54 PM, Tyler Hobbs ty...@datastax.com wrote:

 Correct, that's exactly what I'm saying.


 On Mon, Apr 30, 2012 at 10:37 AM, samal samalgo...@gmail.com wrote:

 thanks tyler for reply.

 are you saying  user1uuid_*{ts%86400}* would lead to unique day bucket
 which will be timezone {NZ to US} independent? I will try.


 On Mon, Apr 30, 2012 at 8:25 PM, Tyler Hobbs ty...@datastax.com wrote:

 Don't use dates or datestamps as the buckets for your row keys, use a
 unix timestamp modulo whatever size you want your bucket to be instead.
 Timestamps don't involve time zones or any of that nonsense.

 So, instead of having keys like user1uuid_30042012, the second half
 would be replaced the current unix timestamp mod 86400 (the number of
 seconds in a day).


 On Mon, Apr 30, 2012 at 1:46 AM, samal samalgo...@gmail.com wrote:

 Hello List,

 I need suggestion/ recommendation on time series data.

 I have requirement where users belongs to different timezone and they
 can subscribe to global group.
 When users at specific timezone send update to group it is available to
 every user in different timezone.

 I am using GroupSubscribedUsers CF where all update to group are push
 to Each User time line, and key is timelined by useruuid_date(one day
 update of all groups) and columns are group updates.

 GroupSubscribedUsers ={
 user1uuid_30042012:{//this user belongs to same timezone
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
},
   user2uuid_30042012:{//this user belongs to different timezone where
 date has changed already  to 1may but  30 april is getting update
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
 timeuuid5:JSON[groupNupdate1]
},

 }

 I have noticed  this approach is good for single time zone when
 different timezone come into picture it breaks.

 I am thinking of like when user pushed update to group -get user who
 is subscribed to group-check user timezone-push time series in user time
 zone. So for one user update will be on 30april where as other may have on
 29april and 1may, using timestamps i can find out hours ago update came.

 Is there any better approach?


 Thanks,

 Samal





 --
 Tyler Hobbs
 DataStax http://datastax.com/





 --
 Tyler Hobbs
 DataStax http://datastax.com/




Re: timezone time series data model

2012-04-30 Thread Tyler Hobbs
getTime() returns the number of milliseconds since the epoch, not the
number of seconds: http://www.w3schools.com/jsref/jsref_gettime.asp

If you divide that number by 1000, it should work.
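
Applied to the earlier snippet, the fix would look roughly like this (a sketch, not samal's actual follow-up script):

setInterval(function () {
  var ms = new Date().getTime();     // milliseconds since the epoch
  var ts = Math.floor(ms / 1000);    // seconds since the epoch
  console.log("unix seconds  = ", ts);
  console.log("i / 86400     = ", Math.floor(ts / 86400)); // stays the same for a whole UTC day
  console.log("i % 86400     = ", ts % 86400);             // still ticks up every second
}, 2000);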

On Mon, Apr 30, 2012 at 11:28 AM, samal samalgo...@gmail.com wrote:

 I did it with node.js but it is changing after some interval.

 code
 setInterval(function(){
   var d =new Date().getTime();
   console.log(== );
   console.log(unix =  ,d);
   i=parseInt(d)
   console.log(Divid i/86400=  ,i/86400);
   console.log(Modulo i%86400= ,i%86400);
   console.log(== );
 },2000);

 /code
 Am I doing wrong?


 On Mon, Apr 30, 2012 at 9:54 PM, Tyler Hobbs ty...@datastax.com wrote:

 Correct, that's exactly what I'm saying.


 On Mon, Apr 30, 2012 at 10:37 AM, samal samalgo...@gmail.com wrote:

 thanks tyler for reply.

 are you saying  user1uuid_*{ts%86400}* would lead to unique day bucket
 which will be timezone {NZ to US} independent? I will try.


 On Mon, Apr 30, 2012 at 8:25 PM, Tyler Hobbs ty...@datastax.com wrote:

 Don't use dates or datestamps as the buckets for your row keys, use a
 unix timestamp modulo whatever size you want your bucket to be instead.
 Timestamps don't involve time zones or any of that nonsense.

 So, instead of having keys like user1uuid_30042012, the second half
 would be replaced the current unix timestamp mod 86400 (the number of
 seconds in a day).


 On Mon, Apr 30, 2012 at 1:46 AM, samal samalgo...@gmail.com wrote:

 Hello List,

 I need suggestion/ recommendation on time series data.

 I have requirement where users belongs to different timezone and they
 can subscribe to global group.
 When users at specific timezone send update to group it is available
 to every user in different timezone.

 I am using GroupSubscribedUsers CF where all update to group are push
 to Each User time line, and key is timelined by useruuid_date(one day
 update of all groups) and columns are group updates.

 GroupSubscribedUsers ={
 user1uuid_30042012:{//this user belongs to same timezone
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
},
   user2uuid_30042012:{//this user belongs to different timezone where
 date has changed already  to 1may but  30 april is getting update
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
 timeuuid5:JSON[groupNupdate1]
},

 }

 I have noticed  this approach is good for single time zone when
 different timezone come into picture it breaks.

 I am thinking of like when user pushed update to group -get user who
 is subscribed to group-check user timezone-push time series in user time
 zone. So for one user update will be on 30april where as other may have on
 29april and 1may, using timestamps i can find out hours ago update came.

 Is there any better approach?


 Thanks,

 Samal





 --
 Tyler Hobbs
 DataStax http://datastax.com/





 --
 Tyler Hobbs
 DataStax http://datastax.com/





-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: timezone time series data model

2012-04-30 Thread samal
Thanks, I hadn't noticed.
I ran the script for 5 minutes: integer division seems to produce a stable result, while the modulo is
still changing. If divide is OK it will do the trick.
I will run this script on the Singapore, East coast, and New Delhi
servers all night today.

==
unix  =   1335806983422
unix /1000=   1335806983.422
Divid i/86400 =   15460.728969907408
Divid i/86400 INT =   15460
Modulo i%86400=   62983
==
==
unix  =   1335806985421
unix /1000=   1335806985.421
Divid i/86400 =   15460.72899306
Divid i/86400 INT =   15460
Modulo i%86400=   62985
==
==
unix  =   1335806987422
unix /1000=   1335806987.422
Divid i/86400 =   15460.729016203704
Divid i/86400 INT =   15460
Modulo i%86400=   62987
==
==
unix  =   1335806989422
unix /1000=   1335806989.422
Divid i/86400 =   15460.729039351852
Divid i/86400 INT =   15460
Modulo i%86400=   62989
==
==
unix  =   1335806991421
unix /1000=   1335806991.421
Divid i/86400 =   15460.7290625
Divid i/86400 INT =   15460
Modulo i%86400=   62991
==
==
unix  =   1335806993422
unix /1000=   1335806993.422
Divid i/86400 =   15460.729085648149
Divid i/86400 INT =   15460
Modulo i%86400=   62993
==
==
unix  =   1335806995422
unix /1000=   1335806995.422
Divid i/86400 =   15460.729108796297
Divid i/86400 INT =   15460
Modulo i%86400=   62995
==
==
unix  =   1335806997421
unix /1000=   1335806997.421
Divid i/86400 =   15460.72913195
Divid i/86400 INT =   15460
Modulo i%86400=   62997
==
==
unix  =   1335806999422
unix /1000=   1335806999.422
Divid i/86400 =   15460.729155092593
Divid i/86400 INT =   15460
Modulo i%86400=   62999
==


On Mon, Apr 30, 2012 at 10:44 PM, Tyler Hobbs ty...@datastax.com wrote:

 getTime() returns the number of milliseconds since the epoch, not the
 number of seconds: http://www.w3schools.com/jsref/jsref_gettime.asp

 If you divide that number by 1000, it should work.


 On Mon, Apr 30, 2012 at 11:28 AM, samal samalgo...@gmail.com wrote:

 I did it with node.js but it is changing after some interval.

 code
 setInterval(function(){
   var d =new Date().getTime();
   console.log(== );
   console.log(unix =  ,d);
   i=parseInt(d)
   console.log(Divid i/86400=  ,i/86400);
   console.log(Modulo i%86400= ,i%86400);
   console.log(== );
 },2000);

 /code
 Am I doing wrong?


 On Mon, Apr 30, 2012 at 9:54 PM, Tyler Hobbs ty...@datastax.com wrote:

 Correct, that's exactly what I'm saying.


 On Mon, Apr 30, 2012 at 10:37 AM, samal samalgo...@gmail.com wrote:

 thanks tyler for reply.

 are you saying  user1uuid_*{ts%86400}* would lead to unique day bucket
 which will be timezone {NZ to US} independent? I will try.


 On Mon, Apr 30, 2012 at 8:25 PM, Tyler Hobbs ty...@datastax.comwrote:

 Don't use dates or datestamps as the buckets for your row keys, use a
 unix timestamp modulo whatever size you want your bucket to be instead.
 Timestamps don't involve time zones or any of that nonsense.

 So, instead of having keys like user1uuid_30042012, the second half
 would be replaced the current unix timestamp mod 86400 (the number of
 seconds in a day).


 On Mon, Apr 30, 2012 at 1:46 AM, samal samalgo...@gmail.com wrote:

 Hello List,

 I need suggestion/ recommendation on time series data.

 I have requirement where users belongs to different timezone and they
 can subscribe to global group.
 When users at specific timezone send update to group it is available
 to every user in different timezone.

 I am using GroupSubscribedUsers CF where all update to group are push
 to Each User time line, and key is timelined by useruuid_date(one day
 update of all groups) and columns are group updates.

 GroupSubscribedUsers ={
 user1uuid_30042012:{//this user belongs to same timezone
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
},
   user2uuid_30042012:{//this user belongs to different timezonewhere 
 date has changed already  to 1may but  30 april is getting update
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
 timeuuid5:JSON[groupNupdate1]
},

 }

 I have noticed  this approach is good for single time zone when
 different timezone come into picture it breaks.

 I am thinking of like when user 

Re: timezone time series data model

2012-04-30 Thread Tyler Hobbs
Err, sorry, I should have said ts - (ts % 86400).  Integer division does
something similar.
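
For example (a small sketch; user1uuid is a placeholder key prefix):

var ts = Math.floor(Date.now() / 1000);   // unix timestamp in seconds
var bucketStart = ts - (ts % 86400);      // midnight UTC of the current day, in seconds
var dayIndex = Math.floor(ts / 86400);    // equivalently, whole days since the epoch
// Both values stay constant for the whole UTC day, regardless of the client's timezone.
var rowKey = 'user1uuid_' + bucketStart;  // e.g. user1uuid_1335744000 on 30 Apr 2012
console.log(rowKey, dayIndex);            // dayIndex would be 15460 on that day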

On Mon, Apr 30, 2012 at 12:39 PM, samal samalgo...@gmail.com wrote:

 thanks I didn't noticed.
 run script for 5  minutes = divide seems to produce result ,modulo is
 still changing. If divide is ok will do the trick.
 I will run this script on Singapore, East coast server, and New delhi
 server whole night today.

 ==
 unix  =   1335806983422
 unix /1000=   1335806983.422
 Divid i/86400 =   15460.728969907408
 Divid i/86400 INT =   15460
 Modulo i%86400=   62983
 ==
 ==
 unix  =   1335806985421
 unix /1000=   1335806985.421
 Divid i/86400 =   15460.72899306
 Divid i/86400 INT =   15460
 Modulo i%86400=   62985
 ==
 ==
 unix  =   1335806987422
 unix /1000=   1335806987.422
 Divid i/86400 =   15460.729016203704
 Divid i/86400 INT =   15460
 Modulo i%86400=   62987
 ==
 ==
 unix  =   1335806989422
 unix /1000=   1335806989.422
 Divid i/86400 =   15460.729039351852
 Divid i/86400 INT =   15460
 Modulo i%86400=   62989
 ==
 ==
 unix  =   1335806991421
 unix /1000=   1335806991.421
 Divid i/86400 =   15460.7290625
 Divid i/86400 INT =   15460
 Modulo i%86400=   62991
 ==
 ==
 unix  =   1335806993422
 unix /1000=   1335806993.422
 Divid i/86400 =   15460.729085648149
 Divid i/86400 INT =   15460
 Modulo i%86400=   62993
 ==
 ==
 unix  =   1335806995422
 unix /1000=   1335806995.422
 Divid i/86400 =   15460.729108796297
 Divid i/86400 INT =   15460
 Modulo i%86400=   62995
 ==
 ==
 unix  =   1335806997421
 unix /1000=   1335806997.421
 Divid i/86400 =   15460.72913195
 Divid i/86400 INT =   15460
 Modulo i%86400=   62997
 ==
 ==
 unix  =   1335806999422
 unix /1000=   1335806999.422
 Divid i/86400 =   15460.729155092593
 Divid i/86400 INT =   15460
 Modulo i%86400=   62999
 ==


 On Mon, Apr 30, 2012 at 10:44 PM, Tyler Hobbs ty...@datastax.com wrote:

 getTime() returns the number of milliseconds since the epoch, not the
 number of seconds: http://www.w3schools.com/jsref/jsref_gettime.asp

 If you divide that number by 1000, it should work.


 On Mon, Apr 30, 2012 at 11:28 AM, samal samalgo...@gmail.com wrote:

 I did it with node.js but it is changing after some interval.

 code
 setInterval(function(){
   var d =new Date().getTime();
   console.log(== );
   console.log(unix =  ,d);
   i=parseInt(d)
   console.log(Divid i/86400=  ,i/86400);
   console.log(Modulo i%86400= ,i%86400);
   console.log(== );
 },2000);

 /code
 Am I doing wrong?


 On Mon, Apr 30, 2012 at 9:54 PM, Tyler Hobbs ty...@datastax.com wrote:

 Correct, that's exactly what I'm saying.


 On Mon, Apr 30, 2012 at 10:37 AM, samal samalgo...@gmail.com wrote:

 thanks tyler for reply.

 are you saying  user1uuid_*{ts%86400}* would lead to unique day
 bucket which will be timezone {NZ to US} independent? I will try.


 On Mon, Apr 30, 2012 at 8:25 PM, Tyler Hobbs ty...@datastax.comwrote:

 Don't use dates or datestamps as the buckets for your row keys, use a
 unix timestamp modulo whatever size you want your bucket to be instead.
 Timestamps don't involve time zones or any of that nonsense.

 So, instead of having keys like user1uuid_30042012, the second half
 would be replaced the current unix timestamp mod 86400 (the number of
 seconds in a day).


 On Mon, Apr 30, 2012 at 1:46 AM, samal samalgo...@gmail.com wrote:

 Hello List,

 I need suggestion/ recommendation on time series data.

 I have requirement where users belongs to different timezone and
 they can subscribe to global group.
 When users at specific timezone send update to group it is available
 to every user in different timezone.

 I am using GroupSubscribedUsers CF where all update to group are
 push to Each User time line, and key is timelined by useruuid_date(one
 day update of all groups) and columns are group updates.

 GroupSubscribedUsers ={
 user1uuid_30042012:{//this user belongs to same timezone
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
},
   user2uuid_30042012:{//this user belongs to different timezonewhere 
 date has changed already  to 1may but  30 april is getting update
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 

Re: timezone time series data model

2012-04-30 Thread samal
hhmm. I will try both. thanks

On Mon, Apr 30, 2012 at 11:29 PM, Tyler Hobbs ty...@datastax.com wrote:

 Err, sorry, I should have said ts - (ts % 86400).  Integer division does
 something similar.


 On Mon, Apr 30, 2012 at 12:39 PM, samal samalgo...@gmail.com wrote:

 thanks I didn't noticed.
 run script for 5  minutes = divide seems to produce result ,modulo is
 still changing. If divide is ok will do the trick.
 I will run this script on Singapore, East coast server, and New delhi
 server whole night today.

 ==
 unix  =   1335806983422
 unix /1000=   1335806983.422
 Divid i/86400 =   15460.728969907408
 Divid i/86400 INT =   15460
 Modulo i%86400=   62983
 ==
 ==
 unix  =   1335806985421
 unix /1000=   1335806985.421
 Divid i/86400 =   15460.72899306
 Divid i/86400 INT =   15460
 Modulo i%86400=   62985
 ==
 ==
 unix  =   1335806987422
 unix /1000=   1335806987.422
 Divid i/86400 =   15460.729016203704
 Divid i/86400 INT =   15460
 Modulo i%86400=   62987
 ==
 ==
 unix  =   1335806989422
 unix /1000=   1335806989.422
 Divid i/86400 =   15460.729039351852
 Divid i/86400 INT =   15460
 Modulo i%86400=   62989
 ==
 ==
 unix  =   1335806991421
 unix /1000=   1335806991.421
 Divid i/86400 =   15460.7290625
 Divid i/86400 INT =   15460
 Modulo i%86400=   62991
 ==
 ==
 unix  =   1335806993422
 unix /1000=   1335806993.422
 Divid i/86400 =   15460.729085648149
 Divid i/86400 INT =   15460
 Modulo i%86400=   62993
 ==
 ==
 unix  =   1335806995422
 unix /1000=   1335806995.422
 Divid i/86400 =   15460.729108796297
 Divid i/86400 INT =   15460
 Modulo i%86400=   62995
 ==
 ==
 unix  =   1335806997421
 unix /1000=   1335806997.421
 Divid i/86400 =   15460.72913195
 Divid i/86400 INT =   15460
 Modulo i%86400=   62997
 ==
 ==
 unix  =   1335806999422
 unix /1000=   1335806999.422
 Divid i/86400 =   15460.729155092593
 Divid i/86400 INT =   15460
 Modulo i%86400=   62999
 ==


 On Mon, Apr 30, 2012 at 10:44 PM, Tyler Hobbs ty...@datastax.com wrote:

 getTime() returns the number of milliseconds since the epoch, not the
 number of seconds: http://www.w3schools.com/jsref/jsref_gettime.asp

 If you divide that number by 1000, it should work.


 On Mon, Apr 30, 2012 at 11:28 AM, samal samalgo...@gmail.com wrote:

 I did it with node.js but it is changing after some interval.

 code
 setInterval(function(){
   var d =new Date().getTime();
   console.log(== );
   console.log(unix =  ,d);
   i=parseInt(d)
   console.log(Divid i/86400=  ,i/86400);
   console.log(Modulo i%86400= ,i%86400);
   console.log(== );
 },2000);

 /code
 Am I doing wrong?


 On Mon, Apr 30, 2012 at 9:54 PM, Tyler Hobbs ty...@datastax.comwrote:

 Correct, that's exactly what I'm saying.


 On Mon, Apr 30, 2012 at 10:37 AM, samal samalgo...@gmail.com wrote:

 thanks tyler for reply.

 are you saying  user1uuid_*{ts%86400}* would lead to unique day
 bucket which will be timezone {NZ to US} independent? I will try.


 On Mon, Apr 30, 2012 at 8:25 PM, Tyler Hobbs ty...@datastax.comwrote:

 Don't use dates or datestamps as the buckets for your row keys, use
 a unix timestamp modulo whatever size you want your bucket to be 
 instead.
 Timestamps don't involve time zones or any of that nonsense.

 So, instead of having keys like user1uuid_30042012, the second
 half would be replaced the current unix timestamp mod 86400 (the number 
 of
 seconds in a day).


 On Mon, Apr 30, 2012 at 1:46 AM, samal samalgo...@gmail.com wrote:

 Hello List,

 I need suggestion/ recommendation on time series data.

 I have requirement where users belongs to different timezone and
 they can subscribe to global group.
 When users at specific timezone send update to group it is
 available to every user in different timezone.

 I am using GroupSubscribedUsers CF where all update to group are
 push to Each User time line, and key is timelined by 
 useruuid_date(one
 day update of all groups) and columns are group updates.

 GroupSubscribedUsers ={
 user1uuid_30042012:{//this user belongs to same timezone
  timeuuid1:JSON[group1update1]
  timeuuid2:JSON[group2update2]
  timeuuid3:JSON[group1update2]
 timeuuid4:JSON[group4update1]
},
   user2uuid_30042012:{//this user belongs to different timezonewhere 
 date has changed already  to 1may but  30 april is getting update
  

Re: Time-series data model

2010-04-15 Thread Jean-Pierre Bergamin

On 14.04.2010 15:22, Ted Zlatanov wrote:

On Wed, 14 Apr 2010 15:02:29 +0200 Jean-Pierre Bergamin ja...@ractive.ch
wrote:

JB  The metrics are stored together with a timestamp. The queries we want to
JB  perform are:
JB   * The last value of a specific metric of a device
JB   * The values of a specific metric of a device between two timestamps t1 
and
JB  t2

Make your key devicename-metricname-MMDD-HHMM (with whatever time
sharding makes sense to you; I use UTC by-hours and by-day in my
environment).  Then your supercolumn is the collection time as a
LongType and your columns inside the supercolumn can express the metric
in detail (collector agent, detailed breakdown, etc.).
   
Just for my understanding: what is time sharding? I couldn't find an 
explanation anywhere. Do you mean that the time-series data is rolled 
up into 5 minute, 1 hour, 1 day etc. slices?


So this would be defined as:
<ColumnFamily Name="measurements" ColumnType="Super"
CompareWith="UTF8Type" CompareSubcolumnsWith="LongType" />


So when I want to read all values of one metric between two timestamps 
t0 and t1, I'd have to read the rows that match a key range 
(device1:metric1:t0 - device1:metric1:t1) and then all the supercolumns 
for each of those keys?



Regards
James


Re: Time-series data model

2010-04-15 Thread Dan Di Spaltro
This is actually fairly similar to how we store metrics at Cloudkick.
The post below has a much more in-depth explanation of some of that:

https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

So we store each natural point in the NumericArchive table.

<ColumnFamily CompareWith="LongType"
  Name="NumericArchive" />

<ColumnFamily CompareWith="LongType" Name="Rollup5m"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup20m"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup30m"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup60m"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup4h"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup12h"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup1d"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />

our keys look like:
serviceuuid.metric-name

Anyways, this has been working out very well for us.
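
As a rough sketch of how points would be aligned to those rollup buckets (illustrative only, not Cloudkick's actual code; the key below just follows the serviceuuid.metric-name convention mentioned above):

// Rollup resolutions in seconds, matching the column family names above.
var resolutions = {
  Rollup5m: 300, Rollup20m: 1200, Rollup30m: 1800, Rollup60m: 3600,
  Rollup4h: 14400, Rollup12h: 43200, Rollup1d: 86400
};

// Align a point's timestamp (in seconds) to the start of its rollup bucket.
function bucketStart(tsSeconds, resolution) {
  return tsSeconds - (tsSeconds % resolution);
}

var key = 'service-1234.cpu_usage';   // placeholder serviceuuid.metric-name key
var ts = Math.floor(Date.now() / 1000);
Object.keys(resolutions).forEach(function (name) {
  console.log(key, name, bucketStart(ts, resolutions[name]));
});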

2010/4/15 Ted Zlatanov t...@lifelogs.com:
 On Thu, 15 Apr 2010 11:27:47 +0200 Jean-Pierre Bergamin ja...@ractive.ch 
 wrote:

 JB On 14.04.2010 15:22, Ted Zlatanov wrote:
 On Wed, 14 Apr 2010 15:02:29 +0200 Jean-Pierre Bergaminja...@ractive.ch 
  wrote:

 JB The metrics are stored together with a timestamp. The queries we want to
 JB perform are:
 JB * The last value of a specific metric of a device
 JB * The values of a specific metric of a device between two timestamps t1 
 and
 JB t2

 Make your key devicename-metricname-MMDD-HHMM (with whatever time
 sharding makes sense to you; I use UTC by-hours and by-day in my
 environment).  Then your supercolumn is the collection time as a
 LongType and your columns inside the supercolumn can express the metric
 in detail (collector agent, detailed breakdown, etc.).

 JB Just for my understanding. What is time sharding? I couldn't find an
 JB explanation somewhere. Do you mean that the time-series data is rolled
 JB up in 5 minues, 1 hour, 1 day etc. slices?

 Yes.  The usual meaning of shard in RDBMS world is to segment your
 database by some criteria, e.g. US vs. Europe in Amazon AWS because
 their data centers are laid out so.  I was taking a linguistic shortcut
 to mean break down your rows by some convenient criteria.  You can
 actually set up your Partitioner in Cassandra to literally shard your
 keyspace rows based on the key, but I just meant slice in my note.

 JB So this would be defined as:
 JB ColumnFamily Name=measurements ColumnType=Super
 JB CompareWith=UTF8Type  CompareSubcolumnsWith=LongType /

 JB So when i want to read all values of one metric between two timestamps
 JB t0 and t1, I'd have to read the supercolumns that match a key range
 JB (device1:metric1:t0 - device1:metric1:t1) and then all the
 JB supercolumns for this key?

 Yes.  This is a single multiget if you can construct the key range
 explicitly.  Cassandra loads a lot of this in memory already and filters
 it after the fact, that's why it pays to slice your keys and to stitch
 them together on the client side if you have to go across a time
 boundary.  You'll also get better key load balancing with deeper slicing
 if you use the randomizing partitioner.

 In the result set, you'll get each matching supercolumn with all the
 columns inside it.  You may have to page through supercolumns.

 Ted





-- 
Dan Di Spaltro


Time-series data model

2010-04-14 Thread Jean-Pierre Bergamin
Hello everyone

We are currently evaluating a new DB system (replacing MySQL) to store
massive amounts of time-series data. The data are various metrics from
various network and IT devices and systems. Metrics could be, e.g., CPU usage
of the server xy in percent, memory usage of server xy in MB, ping
response time of server foo in milliseconds, network traffic of router
bar in MB/s, and so on. Different metrics can be collected for different
devices at different intervals.

The metrics are stored together with a timestamp. The queries we want to
perform are:
 * The last value of a specific metric of a device
 * The values of a specific metric of a device between two timestamps t1 and
t2

I stumbled across this blog post which describes a very similar setup with
Cassandra:
https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
This post gave me confidence that what we want is definitively doable with
Cassandra.

But since I'm just digging into columns and super-columns and their
families, I still have some problems understanding everything.

Our data model could look in json'isch notation like this:
{
  my_server_1: {
    cpu_usage: {
      {ts: 1271248215, value: 87 },
      {ts: 1271248220, value: 34 },
      {ts: 1271248225, value: 23 },
      {ts: 1271248230, value: 49 }
    }
    ping_response: {
      {ts: 1271248201, value: 0.345 },
      {ts: 1271248211, value: 0.423 },
      {ts: 1271248221, value: 0.311 },
      {ts: 1271248232, value: 0.582 }
    }
  }

  my_server_2: {
    cpu_usage: {
      {ts: 1271248215, value: 23 },
      ...
    }
    disk_usage: {
      {ts: 1271243451, value: 123445 },
      ...
    }
  }

  my_router_1: {
    bytes_in: {
      {ts: 1271243451, value: 2452346 },
      ...
    }
    bytes_out: {
      {ts: 1271243451, value: 13468 },
      ...
    }
    errors: {
      {ts: 1271243451, value: 24 },
      ...
    }
  }
}

What I don't get is how to create the two-level hierarchy [device][metric].

Am I right that the devices would be kept in a super column family? The
ordering of those is not important.

But the metrics per device are also a super column, where the columns would
be the metric values ({ts: 1271243451, value: 24 }), aren't they?

So I'd need a super column in a super column... Hm.
My brain is definitely RDBMS-damaged and I don't see through columns and
super-columns yet. :-)

How could this be modeled in Cassandra?


Thank you very much
James




Re: Time-series data model

2010-04-14 Thread Zhiguo Zhang
First of all, I am a newbie to NoSQL. I'll try to write down my opinions as
a reference:

If I were you, I would use 2 column families:

1. CF, key is devices
2. CF, key is timeuuid

What do you think about that?

Mike


On Wed, Apr 14, 2010 at 3:02 PM, Jean-Pierre Bergamin ja...@ractive.chwrote:

 Hello everyone

 We are currently evaluating a new DB system (replacing MySQL) to store
 massive amounts of time-series data. The data are various metrics from
 various network and IT devices and systems. Metrics i.e. could be CPU usage
 of the server xy in percent, memory usage of server xy in MB, ping
 response time of server foo in milliseconds, network traffic of router
 bar in MB/s and so on. Different metrics can be collected for different
 devices in different intervals.

 The metrics are stored together with a timestamp. The queries we want to
 perform are:
  * The last value of a specific metric of a device
  * The values of a specific metric of a device between two timestamps t1
 and
 t2

 I stumbled across this blog post which describes a very similar setup with
 Cassandra:
 https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
 This post gave me confidence that what we want is definitively doable with
 Cassandra.

 But since I'm just digging into columns and super-columns and their
 families, I still have some problems understanding everything.

 Our data model could look in json'isch notation like this:
 {
 my_server_1: {
cpu_usage: {
{ts: 1271248215, value: 87 },
{ts: 1271248220, value: 34 },
{ts: 1271248225, value: 23 },
{ts: 1271248230, value: 49 }
}
ping_response: {
{ts: 1271248201, value: 0.345 },
{ts: 1271248211, value: 0.423 },
{ts: 1271248221, value: 0.311 },
{ts: 1271248232, value: 0.582 }
}
 }

 my_server_2: {
cpu_usage: {
{ts: 1271248215, value: 23 },
...
}
disk_usage: {
{ts: 1271243451, value: 123445 },
...
}
 }

 my_router_1: {
bytes_in: {
{ts: 1271243451, value: 2452346 },
...
}
bytes_out: {
{ts: 1271243451, value: 13468 },
...
}
errors: {
{ts: 1271243451, value: 24 },
...
}
 }
 }

 What I don't get is how to created the two level hierarchy
 [device][metric].

 Am I right that the devices would be kept in a super column family? The
 ordering of those is not important.

 But the metrics per device are also a super column, where the columns would
 be the metric values ({ts: 1271243451, value: 24 }), isn't it?

 So I'd need a super column in a super column... Hm.
 My brain is definitively RDBMS-damaged and I don't see through columns and
 super-columns yet. :-)

 How could this be modeled in Cassandra?


 Thank you very much
 James





Re: Time-series data model

2010-04-14 Thread Ted Zlatanov
On Wed, 14 Apr 2010 15:02:29 +0200 Jean-Pierre Bergamin ja...@ractive.ch 
wrote: 

JB The metrics are stored together with a timestamp. The queries we want to
JB perform are:
JB  * The last value of a specific metric of a device
JB  * The values of a specific metric of a device between two timestamps t1 and
JB t2

Make your key devicename-metricname-MMDD-HHMM (with whatever time
sharding makes sense to you; I use UTC by-hours and by-day in my
environment).  Then your supercolumn is the collection time as a
LongType and your columns inside the supercolumn can express the metric
in detail (collector agent, detailed breakdown, etc.).
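
A small Node.js sketch of building such a row key in UTC (the exact YYYYMMDD-HH layout here is only one reasonable reading of the scheme above):

function pad(n) { return n < 10 ? '0' + n : '' + n; }

// Build a by-hour row key such as devicename-metricname-YYYYMMDD-HH, always in UTC.
function rowKey(device, metric, date) {
  var yyyymmdd = date.getUTCFullYear() + pad(date.getUTCMonth() + 1) + pad(date.getUTCDate());
  return device + '-' + metric + '-' + yyyymmdd + '-' + pad(date.getUTCHours());
}

console.log(rowKey('my_server_1', 'cpu_usage', new Date(1271248215 * 1000)));
// -> my_server_1-cpu_usage-20100414-12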

If you want your clients to discover the available metrics, you may need
to keep an external index.  But from your spec that doesn't seem necessary.

Ted