To add to what Nate suggested, we have an entire blog post on scaling time
series data models:

http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html

Jon


On Tue, Apr 17, 2018 at 7:39 PM Nate McCall <n...@thelastpickle.com> wrote:

> I disagree. Create date as a raw integer is an excellent surrogate for
> controlling time series "buckets" as it gives you complete control over the
> granularity. You can even have multiple granularities in the same table -
> remember that partition key "misses" in Cassandra are pretty lightweight as
> they won't make it past the bloom filter on the read path.
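
To make the integer-bucket idea above concrete, here is a minimal Python sketch (not from the thread). It assumes day buckets shaped like the 20180416 values in the original query; the monthly bucket and all function names are hypothetical, added only for illustration.

```python
from datetime import date, timedelta
from typing import List


def day_bucket(d: date) -> int:
    """Encode a date as an integer such as 20180416 (daily granularity)."""
    return d.year * 10000 + d.month * 100 + d.day


def month_bucket(d: date) -> int:
    """Encode a date as an integer such as 201804 (hypothetical coarser granularity)."""
    return d.year * 100 + d.month


def day_buckets(start: date, end: date) -> List[int]:
    """Return every day bucket from start to end inclusive, newest first."""
    days = (end - start).days
    return [day_bucket(end - timedelta(days=i)) for i in range(days + 1)]


print(day_buckets(date(2018, 4, 12), date(2018, 4, 16)))
# [20180416, 20180415, 20180414, 20180413, 20180412]
```

Because the bucket is a plain integer computed by the client, switching granularity only means changing the encoding function, which is the control over granularity described above.
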
>
> On Wed, Apr 18, 2018 at 10:00 AM, Javier Pareja <pareja.jav...@gmail.com>
> wrote:
>
>> Hi David,
>>
>> Could you describe why you chose to include the create date in the
>> partition key? If the vin is enough "partitioning", meaning that the size
>> (number of rows x size of row) of each partition is less than 100MB, then
>> remove the date and just use the create_time, because the date is already
>> included in that column anyway.
>>
>> For example if columns "a" and "b" (from your table) are of max 256 UTF8
>> characters, then you can have approx 100MB / (2*256*2Bytes) = 100,000 rows
>> per partition. You can actually have many more but you don't want to go
>> much higher for performance reasons.
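
The estimate above can be reproduced with a few lines of Python. This is only a rough sketch using the figures from the message (100MB target, two columns of at most 256 characters, 2 bytes per character); it ignores the other columns and any per-row storage overhead, and the names are made up for illustration.

```python
# Figures taken from the message above; treat them as rough assumptions.
TARGET_PARTITION_BYTES = 100 * 1000 * 1000  # 100MB partition size target
COLUMNS = 2                                 # columns "a" and "b"
MAX_CHARS = 256                             # maximum characters per column
BYTES_PER_CHAR = 2                          # rough average; UTF-8 uses 1 to 4 bytes


def max_rows_per_partition(target_bytes, columns, max_chars, bytes_per_char):
    """Estimate how many worst-case rows fit under the partition size target."""
    row_bytes = columns * max_chars * bytes_per_char
    return target_bytes // row_bytes


rows = max_rows_per_partition(
    TARGET_PARTITION_BYTES, COLUMNS, MAX_CHARS, BYTES_PER_CHAR
)
print(rows)  # 97656, i.e. roughly 100,000 rows
```
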
>>
>> If this is not enough you could use create_month instead of create_date,
>> for example, to reduce the partition size while not being too granular.
>>
>>
>> On Tue, 17 Apr 2018, 22:17 Nate McCall, <n...@thelastpickle.com> wrote:
>>
>>> Your table design will work fine as you have appropriately bucketed by
>>> an integer-based 'create_date' field.
>>>
>>> Your goal for this refactor should be to remove the "IN" clause from
>>> your code. This will move the rollup of multiple partition keys being
>>> retrieved into the client instead of relying on the coordinator assembling
>>> the results. You have to do more work and add some complexity, but the
>>> trade off will be much higher performance as you are removing the single
>>> coordinator as the bottleneck.
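
As a rough illustration of moving the rollup into the client, here is a minimal Python sketch (not from the thread). `fetch_partition` is a hypothetical stand-in for a single-partition query; with a real driver it would be replaced by an asynchronous execute call such as the executeAsync method mentioned further down the thread. The thread pool is just one way to run the calls concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_partition(vin, create_date):
    """Hypothetical stand-in for one single-partition query, e.g.
    SELECT * FROM test WHERE vin = ? AND create_date = ?
    """
    return [{"vin": vin, "create_date": create_date}]


def fetch_range(vin, create_dates, fetch=fetch_partition, max_workers=8):
    """Issue one query per (vin, create_date) partition and merge rows client-side."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs the calls concurrently and yields results in input order.
        per_partition = pool.map(lambda d: fetch(vin, d), create_dates)
        return [row for rows in per_partition for row in rows]


rows = fetch_range("ZD41578123DSAFWE12313", [20180416, 20180415, 20180414])
print([r["create_date"] for r in rows])
# [20180416, 20180415, 20180414]
```

Each call targets exactly one partition, so a token-aware driver can route it straight to a replica instead of funnelling everything through one coordinator.
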
>>>
>>> On Tue, Apr 17, 2018 at 10:05 PM, Xiangfei Ni <xiangfei...@cm-dt.com>
>>> wrote:
>>>
>>>> Hi Nate,
>>>>
>>>>     Thanks for your reply!
>>>>
>>>>     Is there another way to design this table to meet this requirement?
>>>>
>>>>
>>>>
>>>> Best Regards,
>>>>
>>>>
>>>>
>>>> 倪项菲 / David Ni
>>>>
>>>> 中移德电网络科技有限公司
>>>>
>>>> Virtue Intelligent Network Ltd, co.
>>>>
>>>> Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
>>>>
>>>> Mob: +86 13797007811|Tel: + 86 27 5024 2516
>>>>
>>>>
>>>>
>>>> *From:* Nate McCall <n...@thelastpickle.com>
>>>> *Sent:* April 17, 2018 7:12
>>>> *To:* Cassandra Users <user@cassandra.apache.org>
>>>> *Subject:* Re: Time serial column family design
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Select * from test where vin = 'ZD41578123DSAFWE12313' and create_date
>>>> in (20180416, 20180415, 20180414, 20180413, 20180412, ...);
>>>>
>>>> But this makes the CQL query very long, and I don't know whether there
>>>> is a limit on the length of a CQL statement.
>>>>
>>>> Please give me some advice. Thanks in advance.
>>>>
>>>>
>>>>
>>>> Using the SELECT ... IN syntax means that:
>>>>
>>>> - the driver will not be able to route the queries to the nodes which
>>>> have the partition
>>>>
>>>> - a single coordinator must scatter-gather the query and results
>>>>
>>>>
>>>>
>>>> Break this up into a series of single statements using the executeAsync
>>>> method and gather the results via something like Futures in Guava or
>>>> similar.
>>>>
>>>
>>>
>>>
>>> --
>>> -----------------
>>> Nate McCall
>>> Wellington, NZ
>>> @zznate
>>>
>>> CTO
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>
>
