Since 500 is not that big a number, I think you can run one M/R job per
"from" identifier, with each job emitting only the records whose "time" is
greater than the specified value. The input for each job would be
{index, <<"bucket">>, <<"from_bin">>, <<"from_field_value">>}
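
A rough sketch of that first option, written in Python and building the JSON
body for Riak's HTTP /mapred endpoint instead of the Erlang tuple. The bucket
name "items", the helper names, and the idea that each object is stored as
JSON with a numeric "time" field are assumptions on my part, not something
taken from your setup.

```python
import json

# JavaScript map phase: keep only objects whose "time" exceeds the cutoff
# passed in as the phase argument. Assumes each value is stored as JSON.
MAP_NEWER_THAN = """
function(value, keyData, arg) {
  var doc = JSON.parse(value.values[0].data);
  return doc.time > arg ? [doc] : [];
}
"""


def build_from_job(from_id, min_time, bucket="items", index="from_bin"):
    """Build one M/R job: 2i exact match on "from", then filter on time."""
    return {
        # Secondary-index input: every key whose index value equals from_id.
        "inputs": {"bucket": bucket, "index": index, "key": str(from_id)},
        "query": [
            {
                "map": {
                    "language": "javascript",
                    "source": MAP_NEWER_THAN,
                    "arg": min_time,
                    "keep": True,
                }
            }
        ],
    }


def build_jobs(from_ids, min_time):
    """One job per "from" identifier (about 500 in the case discussed)."""
    return [build_from_job(from_id, min_time) for from_id in from_ids]


if __name__ == "__main__":
    jobs = build_jobs(range(1, 501), 1343200000)
    print(len(jobs))
    print(json.dumps(jobs[0]["inputs"]))
```

Each job body would then be sent to /mapred as JSON; the HTTP part is left
out here on purpose.
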
If you decide to split the data into separate buckets based on the "from"
field, the bucket name becomes the "from" value and the index range covers
the times you want, so the input would be
{index, <<"from_field_value">>, <<"time_bin">>, <<"time_low">>, <<"time_high">>}
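
The bucket-per-"from" variant could be sketched like this, again as a JSON
body for /mapred. The helper name and the default upper bound are
hypothetical. Since a _bin index compares values as strings, this assumes all
timestamps are stored with the same number of digits; an integer index would
avoid that concern.

```python
def build_time_range_job(from_id, time_low, time_high="9999999999",
                         index="time_bin"):
    """Build one M/R job: 2i range query on time inside a per-"from" bucket."""
    return {
        "inputs": {
            "bucket": str(from_id),   # assumed: one bucket per "from" value
            "index": index,
            "start": str(time_low),   # lower bound of the range
            "end": str(time_high),    # assumed placeholder upper bound
        },
        # Built-in Riak map function that returns each object parsed as JSON.
        "query": [
            {"map": {"language": "javascript",
                     "name": "Riak.mapValuesJson",
                     "keep": True}}
        ],
    }


if __name__ == "__main__":
    job = build_time_range_job(42, 1343200000)
    print(job["inputs"])
```

Here the index does the time filtering, so the map phase only has to return
the objects.
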
--
Yousuf
On Wed, Jul 25, 2012 at 6:35 PM, Andrew Kondratovich <
[email protected]> wrote:
> Hello, Yousuf.
>
> Thanks for your reply.
>
> We have several million items, with about 10 000 unique 'from' values
> (about 1000 items for each). Usually, we need to get items for about
> 500 'from' identifiers with a 'time' limit (about 5% of the items
> match).
>
> On Wed, Jul 25, 2012 at 1:02 PM, Yousuf Fauzan <[email protected]> wrote:
>
>> Hi Andrew,
>>
>> First of all, the correct answer to your question is the proverbial "it
>> depends". Having said that, here is what I could do in your case
>>
>> 1. If there are enough data points with the same "from" field, I will
>> make it a bucket and then index on time.
>> 2. If the above is not true, I will index on "from" and "time" field.
>> a. If the number of records where "time" is greater than the one you
>> require is small, I will run a map/reduce with the initial input as those
>> records.
>> b. If number of records having a particular "from" is small, I will
>> do the above with the initial input as records having that "from" field.
>> This could be a problem as Riak only supports range and exact queries so if
>> you want to query multiple identifiers, you will have to run multiple
>> queries.
>> In both the above cases, I will use secondary indexes to get the
>> initial records.
>> Note that we are using M/R as Riak does not support querying by
>> multiple indexes.
>>
>> What I would also suggest is to partition your data into different
>> buckets. You will need to understand the queries that you will be
>> supporting and partition it accordingly.
>>
>> --
>> Yousuf
>>
>> On Wed, Jul 25, 2012 at 2:50 PM, Andrew Kondratovich <
>> [email protected]> wrote:
>>
>>> Good afternoon.
>>>
>>> I am considering several storage solutions for my project, and now I
>>> look at Riak.
>>> We work with the following pattern of data:
>>> {
>>> time: unixtime
>>> from: int
>>> data: binary
>>> ...
>>> }
>>>
>>> The amount of data is several million items for now, but it's
>>> growing. It is necessary to handle the following requests: for a list of
>>> identifiers (about 500 items), return all records where id = from and time
>>> is greater than a certain value.
>>>
>>> How should we store such data and handle such requests effectively with
>>> Riak?
>>>
>>> Thanks.
>>>
>>> --
>>> Andrew Kondratovich
>>>
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> [email protected]
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>>
>>
>
>
> --
> Andrew Kondratovich
>
>