Re: 答复: Time serial column family design

Nate McCall Tue, 17 Apr 2018 19:40:35 -0700

I disagree. Create date as a raw integer is an excellent surrogate for
controlling time series "buckets" as it gives you complete control over the
granularity. You can even have multiple granularities in the same table -
remember that partition key "misses" in Cassandra are pretty lightweight as
they won't make it past the bloom filter on the read path.


On Wed, Apr 18, 2018 at 10:00 AM, Javier Pareja <pareja.jav...@gmail.com>
wrote:

> Hi David,
>
> Could you describe why you chose to include the create date in the
> partition key? If the vin in enough "partitioning", meaning that the size
> (number of rows x size of row) of each partition is less than 100MB, then
> remove the date and just use the create_time, because the date is already
> included in that column anyways.
>
> For example if columns "a" and "b" (from your table) are of max 256 UTF8
> characters, then you can have approx 100MB / (2*256*2Bytes) = 100,000 rows
> per partition. You can actually have many more but you don't want to go
> much higher for performance reasons.
>
> If this is not enough you could use create_month instead of create_date,
> for example, to reduce the partition size while not being too granular.
>
>
> On Tue, 17 Apr 2018, 22:17 Nate McCall, <n...@thelastpickle.com> wrote:
>
>> Your table design will work fine as you have appropriately bucketed by an
>> integer-based 'create_date' field.
>>
>> Your goal for this refactor should be to remove the "IN" clause from your
>> code. This will move the rollup of multiple partition keys being retrieved
>> into the client instead of relying on the coordinator assembling the
>> results. You have to do more work and add some complexity, but the trade
>> off will be much higher performance as you are removing the single
>> coordinator as the bottleneck.
>>
>> On Tue, Apr 17, 2018 at 10:05 PM, Xiangfei Ni <xiangfei...@cm-dt.com>
>> wrote:
>>
>>> Hi Nate,
>>>
>>>     Thanks for your reply!
>>>
>>>     Is there other way to design this table to meet this requirement?
>>>
>>>
>>>
>>> Best Regards,
>>>
>>>
>>>
>>> 倪项菲*/ **David Ni*
>>>
>>> 中移德电网络科技有限公司
>>>
>>> Virtue Intelligent Network Ltd, co.
>>>
>>> Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
>>>
>>> Mob: +86 13797007811|Tel: + 86 27 5024 2516
>>>
>>>
>>>
>>> *发件人:* Nate McCall <n...@thelastpickle.com>
>>> *发送时间:* 2018年4月17日 7:12
>>> *收件人:* Cassandra Users <user@cassandra.apache.org>
>>> *主题:* Re: Time serial column family design
>>>
>>>
>>>
>>>
>>>
>>> Select * from test where vin =“ZD41578123DSAFWE12313” and create_date in
>>> (20180416, 20180415, 20180414, 20180413, 20180412………………………………….);
>>>
>>> But this cause the cql query is very long,and I don’t know whether there
>>> is limitation for the length of the cql.
>>>
>>> Please give me some advice,thanks in advance.
>>>
>>>
>>>
>>> Using the SELECT ... IN syntax  means that:
>>>
>>> - the driver will not be able to route the queries to the nodes which
>>> have the partition
>>>
>>> - a single coordinator must scatter-gather the query and results
>>>
>>>
>>>
>>> Break this up into a series of single statements using the executeAsync
>>> method and gather the results via something like Futures in Guava or
>>> similar.
>>>
>>
>>
>>
>> --
>> -----------------
>> Nate McCall
>> Wellington, NZ
>> @zznate
>>
>> CTO
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>


-- 
-----------------
Nate McCall
Wellington, NZ
@zznate

CTO
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: 答复: Time serial column family design

Reply via email to