1) The identifiers are not random. They are collected in groups, but the number
of these groups is large and the same identifier can occur in different groups.
2) It's a fixed amount of time. It can be changed, but usually it's one
day. A request can also be "get items from the day before", not only for today.

On Thu, Jul 26, 2012 at 10:48 AM, Erik Søe Sørensen <[email protected]> wrote:

> Never mind the number of requests (well, almost) - what you certainly want
> to keep down is the number of disk seeks.
>
> To that end...:
> 1) those 500 identifiers present in a request - are they totally
> unrelated, or do they occur in some pattern - e.g. the same together always?
> 2) the cut-off time - may it be anything, or is it something like a fixed
> amount back in time?
>
>
> ----- Reply message -----
> From: "Andrew Kondratovich" <[email protected]>
> Date: Wed, Jul 25, 2012 18:23
> Subject: How to store data
> To: "Andres Jaan Tack" <[email protected]>
> Cc: "[email protected]" <[email protected]>
>
>
> Yep.. half a thousand requests to Riak isn't cool =( I'm looking for some
> strategy for storing the data so that I can fetch all items with 1 request.
>
> I could use an index MR on time and filter the results in the map phase. I
> could use special keys containing the 'from' data and use key filters (with
> time filtering in the map phase)... I wish I could use several 2i in one MR,
> or combine 2i with key filters, or perform MR on buckets... I wish... =)
>
> On Wed, Jul 25, 2012 at 5:35 PM, Andres Jaan Tack <
> [email protected]<mailto:[email protected]>> wrote:
> Is that a realistic strategy for low latency requirements? Imagine this
> were some web service, and people generate this query at some reasonable
> frequency.
>
> (not that I know what Andrew is looking for, exactly)
>
>
> 2012/7/25 Yousuf Fauzan <[email protected]<mailto:
> [email protected]>>
> Since 500 is not that big a number, I think you can run that many M/Rs
> with each emitting only records having "time" greater than specified. Input
> would be {index, <<"bucket">>, <<"from_bin">>, <<"from_field_value">>}
>
> If you decide to split the data into separate buckets based on "from"
> field, input would be {index, <<"from_field_value">>, <<"time_bin">>,
> <<"time_low">>, <<"time_high">>}
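The two index inputs described above can also be written as JSON jobs for the HTTP MapReduce interface. This is a minimal sketch, not a tested client: it assumes Riak's JSON form for index inputs (`bucket`/`index`/`key` for exact matches, `bucket`/`index`/`start`/`end` for ranges), the built-in `Riak.mapValuesJson` map function, and that values are stored as JSON objects with a numeric `time` field. The bucket name `items`, the identifier `42`, and the timestamps are made up for illustration.

```python
import json

# JavaScript map function (assumption: each stored value is a JSON object with
# a numeric "time" field). It keeps only records newer than the cutoff in `arg`.
MAP_NEWER_THAN = """
function(value, keyData, arg) {
  var item = JSON.parse(value.values[0].data);
  return item.time > arg ? [item] : [];
}
"""

def exact_index_inputs(bucket, index, key):
    """Inputs selecting all keys whose index field equals `key`."""
    return {"bucket": bucket, "index": index, "key": key}

def range_index_inputs(bucket, index, start, end):
    """Inputs selecting all keys whose index field lies in [start, end]."""
    return {"bucket": bucket, "index": index, "start": start, "end": end}

# Case 1: one bucket, index on "from", time filtered in the map phase.
job_by_from = {
    "inputs": exact_index_inputs("items", "from_bin", "42"),
    "query": [{"map": {"language": "javascript",
                       "source": MAP_NEWER_THAN,
                       "arg": 1343174400}}],
}

# Case 2: one bucket per "from" value, range query on the time index,
# so the map phase only has to return the stored values.
job_by_time = {
    "inputs": range_index_inputs("42", "time_bin", "1343174400", "1343260800"),
    "query": [{"map": {"language": "javascript",
                       "name": "Riak.mapValuesJson"}}],
}

payload = json.dumps(job_by_from)  # body that would be POSTed to /mapred
```

Either way this is still one job per 'from' identifier, so it does not by itself remove the 500 requests.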
>
>
> --
> Yousuf
>
> On Wed, Jul 25, 2012 at 6:35 PM, Andrew Kondratovich <
> [email protected]<mailto:[email protected]>>
> wrote:
> Hello,  Yousuf.
>
> Thanks for your reply.
>
> We have several million items, with about 10,000 unique 'from'
> values (about 1000 items for each). Usually, we need to get items for about
> 500 'from' identifiers with a 'time' limit (about 5% of the items
> match).
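
For a rough sense of scale, the figures above can be multiplied out. This is back-of-the-envelope arithmetic only, and it assumes the "about 5%" refers to the candidate items of the 500 requested identifiers, which the message does not state explicitly.

```python
unique_from = 10_000       # distinct 'from' values
items_per_from = 1_000     # items per 'from' value
from_per_request = 500     # identifiers in a typical request
match_percent = 5          # assumption: share of candidates inside the time limit

total_items = unique_from * items_per_from       # items in the whole data set
candidates = from_per_request * items_per_from   # items touched by one request
matches = candidates * match_percent // 100      # items actually returned

print(total_items, candidates, matches)
```

So a request that filters in the map phase reads on the order of half a million items to return a few tens of thousands.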
>
> On Wed, Jul 25, 2012 at 1:02 PM, Yousuf Fauzan <[email protected]
> <mailto:[email protected]>> wrote:
> Hi Andrew,
>
> First of all, the correct answer to your question is the proverbial "it
> depends". Having said that, here is what I would do in your case:
>
> 1. If there are enough data points with the same "from" field, I will make
> it a bucket and then index on time.
> 2. If the above is not true, I will index on the "from" and "time" fields.
>     a. If the number of records where "time" is greater than the one you
> require is small, I will run a map/reduce with the initial input as those
> records.
>     b. If the number of records having a particular "from" is small, I will do
> the above with the initial input as records having that "from" field. This
> could be a problem as Riak only supports range and exact queries so if you
> want to query multiple identifiers, you will have to run multiple queries.
>     In both the above cases, I will use secondary indexes to get the
> initial records.
>     Note that we are using M/R as Riak does not support querying by
> multiple indexes.
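
The logic of option 2b can be sketched in plain Python, with a dictionary standing in for the secondary index on "from". Nothing here talks to Riak; `build_from_index` and `fetch_recent` are hypothetical helpers that only show the shape of the work: one exact-match lookup per identifier, then a time filter playing the role of the map phase.

```python
from collections import defaultdict

def build_from_index(items):
    """Group items by their 'from' value, like an exact-match index would."""
    index = defaultdict(list)
    for item in items:
        index[item["from"]].append(item)
    return index

def fetch_recent(index, from_ids, cutoff):
    """Return items for the given 'from' ids whose 'time' is after cutoff."""
    results = []
    for from_id in from_ids:               # one index lookup per identifier
        for item in index.get(from_id, []):
            if item["time"] > cutoff:      # the filter done in the map phase
                results.append(item)
    return results

items = [
    {"time": 100, "from": 1, "data": b"a"},
    {"time": 200, "from": 1, "data": b"b"},
    {"time": 300, "from": 2, "data": b"c"},
    {"time": 400, "from": 3, "data": b"d"},
]
index = build_from_index(items)
recent = fetch_recent(index, [1, 2], cutoff=150)
```

The outer loop is the part that turns 500 identifiers into 500 queries.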
>
> What I would also suggest is to partition your data into different
> buckets. You will need to understand the queries that you will be
> supporting and partition it accordingly.
>
> --
> Yousuf
>
> On Wed, Jul 25, 2012 at 2:50 PM, Andrew Kondratovich <
> [email protected]<mailto:[email protected]>>
> wrote:
> Good afternoon.
>
> I am considering several storage solutions for my project, and I am now
> looking at Riak.
> We work with the following pattern of data:
> {
>   time: unixtime
>   from: int
>   data: binary
>   ...
> }
>



-- 
Andrew Kondratovich
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
