To add to what Nate suggested, we have an entire blog post on scaling time series data models:
http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html

Jon

On Tue, Apr 17, 2018 at 7:39 PM Nate McCall <n...@thelastpickle.com> wrote:

> I disagree. Create date as a raw integer is an excellent surrogate for
> controlling time series "buckets", as it gives you complete control over
> the granularity. You can even have multiple granularities in the same
> table. Remember that partition key "misses" in Cassandra are pretty
> lightweight, as they won't make it past the bloom filter on the read path.
>
> On Wed, Apr 18, 2018 at 10:00 AM, Javier Pareja <pareja.jav...@gmail.com>
> wrote:
>
>> Hi David,
>>
>> Could you describe why you chose to include the create date in the
>> partition key? If the vin is enough "partitioning", meaning that the size
>> (number of rows x size of row) of each partition is less than 100MB, then
>> remove the date and just use the create_time, because the date is already
>> included in that column anyway.
>>
>> For example, if columns "a" and "b" (from your table) hold at most 256
>> UTF8 characters each, then you can have approx. 100MB / (2*256*2 bytes) =
>> 100,000 rows per partition. You can actually have many more, but you
>> don't want to go much higher for performance reasons.
>>
>> If this is not enough, you could use create_month instead of create_date,
>> for example, to reduce the partition size while not being too granular.
>>
>> On Tue, 17 Apr 2018, 22:17 Nate McCall, <n...@thelastpickle.com> wrote:
>>
>>> Your table design will work fine, as you have appropriately bucketed by
>>> an integer-based 'create_date' field.
>>>
>>> Your goal for this refactor should be to remove the "IN" clause from
>>> your code. This will move the rollup of multiple partition keys being
>>> retrieved into the client instead of relying on the coordinator
>>> assembling the results. You have to do more work and add some
>>> complexity, but the trade-off will be much higher performance, as you
>>> are removing the single coordinator as the bottleneck.
>>>
>>> On Tue, Apr 17, 2018 at 10:05 PM, Xiangfei Ni <xiangfei...@cm-dt.com>
>>> wrote:
>>>
>>>> Hi Nate,
>>>>
>>>> Thanks for your reply!
>>>>
>>>> Is there another way to design this table to meet this requirement?
>>>>
>>>> Best Regards,
>>>>
>>>> 倪项菲 / David Ni
>>>> 中移德电网络科技有限公司
>>>> Virtue Intelligent Network Ltd, co.
>>>> Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
>>>> Mob: +86 13797007811 | Tel: +86 27 5024 2516
>>>>
>>>> From: Nate McCall <n...@thelastpickle.com>
>>>> Sent: April 17, 2018 7:12
>>>> To: Cassandra Users <user@cassandra.apache.org>
>>>> Subject: Re: Time serial column family design
>>>>
>>>> Select * from test where vin = 'ZD41578123DSAFWE12313' and create_date
>>>> in (20180416, 20180415, 20180414, 20180413, 20180412, ...);
>>>>
>>>> But this makes the CQL query very long, and I don't know whether
>>>> there is a limit on the length of a CQL statement.
>>>>
>>>> Please give me some advice. Thanks in advance.
>>>>
>>>> Using the SELECT ... IN syntax means that:
>>>>
>>>> - the driver will not be able to route the queries to the nodes which
>>>>   have the partition
>>>>
>>>> - a single coordinator must scatter-gather the query and results
>>>>
>>>> Break this up into a series of single statements using the executeAsync
>>>> method and gather the results via something like Futures in Guava or
>>>> similar.
>>>
>>> --
>>> -----------------
>>> Nate McCall
>>> Wellington, NZ
>>> @zznate
>>>
>>> CTO
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
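
For anyone finding this thread later, here is a rough sketch of the client-side fan-out Nate describes in the quoted thread: one single-partition query per create_date bucket, all started asynchronously, with the results gathered in the client instead of by one coordinator. This is only an illustration under assumptions, not code from the thread. The class name BucketFanOut, the helpers dayBuckets and fetchAll, and the queryOne callback are all hypothetical; it uses the JDK's CompletableFuture rather than Guava so it runs without the driver on the classpath. In a real application, queryOne would wrap the driver's executeAsync call on a prepared "vin = ? and create_date = ?" statement, adapted to a CompletableFuture.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;

// Hypothetical helper, not part of any driver API.
final class BucketFanOut {

    // Integer day buckets in yyyyMMdd form (e.g. 20180416),
    // newest first, both ends inclusive.
    static List<Integer> dayBuckets(LocalDate from, LocalDate to) {
        List<Integer> buckets = new ArrayList<>();
        for (LocalDate d = to; !d.isBefore(from); d = d.minusDays(1)) {
            buckets.add(Integer.parseInt(d.format(DateTimeFormatter.BASIC_ISO_DATE)));
        }
        return buckets;
    }

    // Runs one query per (vin, bucket) pair and concatenates the rows in
    // bucket order. queryOne is an assumed callback standing in for a
    // single-partition asynchronous query.
    static <R> List<R> fetchAll(
            String vin,
            List<Integer> buckets,
            BiFunction<String, Integer, CompletableFuture<List<R>>> queryOne) {
        // Start every single-partition query before waiting on any of them.
        List<CompletableFuture<List<R>>> pending = new ArrayList<>();
        for (Integer bucket : buckets) {
            pending.add(queryOne.apply(vin, bucket));
        }
        // Gather in bucket order; join() rethrows a failed query
        // wrapped in a CompletionException.
        List<R> rows = new ArrayList<>();
        for (CompletableFuture<List<R>> future : pending) {
            rows.addAll(future.join());
        }
        return rows;
    }
}
```

Called as BucketFanOut.fetchAll(vin, BucketFanOut.dayBuckets(start, end), queryOne), this replaces the long IN list with one small statement per day, so each request can be routed by partition key. For very long date ranges you would probably want to cap how many queries are in flight at once; that is left out of the sketch.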