I guess there's no easy solution for this. The bucketing technique you're
applying, with the publish_pre extra field forming a composite partition key,
is probably your best bet, but you're right to be concerned that all your
workload will hit the same node during any given hour.

I'd then suggest adding a higher-cardinality extra field, for example the
seconds value of the timestamp. That will spread the load across 60 partitions
per datacenter and, of course, when querying you'll have to try all 60
partitions. It won't scale forever, as the partitions will grow big at some
point, but if your data is small enough it may work.
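To make the idea concrete, here is a minimal Python sketch of that key layout
(the helper names and the hour-bucket format are my own illustration, not
anything from your schema):

```python
from datetime import datetime

def bucket_keys(dc: str, publish: datetime):
    # Composite partition key: (datacenter, hour bucket, seconds bucket 0-59).
    # The seconds component spreads each hour's writes over 60 partitions.
    hour_bucket = publish.strftime("%Y%m%d%H")  # illustrative bucket format
    return (dc, hour_bucket, publish.second)

def read_buckets(dc: str, hour_bucket: str):
    # On read there is no way to know which seconds bucket a row landed in,
    # so all 60 partitions for the hour must be queried.
    return [(dc, hour_bucket, s) for s in range(60)]
```

A row published at 02:58:17 would land in partition (dc, "...02", 17), and a
read for that hour fans out to the 60 partitions returned by read_buckets.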

Another suggestion would be to use the number of seconds since the epoch as
the extra partitioning key. Instead of querying a range, you'd issue a
single-partition query for each second within your range. This solution
implies heavier read workloads but will definitely scale, as the size of the
partitions shouldn't be an issue.
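Again just a sketch (helper name is mine) of how a time-range read fans out
into one single-partition query per second under this scheme:

```python
from datetime import datetime, timedelta, timezone

def second_partitions(dc: str, start: datetime, end: datetime):
    # Partition key is (datacenter, epoch second): one partition per second.
    # A range read [start, end] becomes one single-partition query per second.
    first = int(start.timestamp())
    last = int(end.timestamp())
    return [(dc, s) for s in range(first, last + 1)]
```

A five-second window therefore costs five partition reads, regardless of how
many rows each second holds; partitions stay tiny at any scale.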

I don't think these suggestions will match your requirements off the shelf,
but hopefully they'll give you an idea of how to trade off partition size,
number of partitions and read strategy to find the sweet spot for your use
case.

Regards

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 1 September 2016 at 02:58, Stone Fang <cnstonef...@gmail.com> wrote:

> access pattern is
>
> SELECT * FROM datacenter WHERE datacentername = '' AND publish > $start AND
> publish < $end
>
> On Wed, Aug 31, 2016 at 8:37 PM, Carlos Alonso <i...@mrcalonso.com> wrote:
>
>> Maybe a good question could be:
>>
>> Which is your access pattern to this data?
>>
>> Carlos Alonso | Software Engineer | @calonso
>> <https://twitter.com/calonso>
>>
>> On 31 August 2016 at 11:47, Stone Fang <cnstonef...@gmail.com> wrote:
>>
>>> Hi all,
>>> have some questions on how to define clustering key.
>>>
>>> have a table like this
>>>
>>> CREATE TABLE datacenter (
>>>   datacentername varchar,
>>>   publish timestamp,
>>>   value varchar,
>>>   PRIMARY KEY (datacentername, publish)
>>> );
>>>
>>>
>>> *issues:*
>>> there are only two datacenters, so the data would only have two
>>> partitions and be stored on two nodes. I want to spread the data evenly
>>> around the cluster.
>>>
>>> take this post for reference
>>> http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
>>>
>>> CREATE TABLE datacenter (
>>>   datacentername varchar,
>>>   publish_pre text,
>>>   publish timestamp,
>>>   value varchar,
>>>   PRIMARY KEY ((datacentername, publish_pre), publish)
>>> );
>>>
>>> publish_pre is an hour value from 1~12. *But the workload is high; I don't
>>> want all the workload inserted into one node during an hour.*
>>>
>>> I have no idea how to define the partition key so that data is spread
>>> evenly around the cluster and the partitions are not split by time, which
>>> means data should not all be inserted into one node within a certain time
>>> window.
>>>
>>> thanks
>>> stone
>>>
>>
>>
>