Never mind the number of requests (well, almost) - what you certainly want to 
keep down is the number of disk seeks.

To that end:
1) those 500 identifiers in a request - are they totally unrelated, or do 
they follow some pattern, e.g. the same ones always occurring together?
2) the cut-off time - can it be anything, or is it something like a fixed 
amount back in time?


----- Reply message -----
From: "Andrew Kondratovich" <[email protected]>
Date: Wed, Jul 25, 2012 18:23
Subject: How to store data
To: "Andres Jaan Tack" <[email protected]>
Cc: "[email protected]" <[email protected]>


Yep, half a thousand requests to Riak isn't cool =( I'm looking for some 
strategy for storing the data so that I can fetch all items with one request.

I could run an index MR on time and filter the results in the map phase. I 
could use special keys containing the 'from' data and use key filters (with 
time filtering in the map phase)... I wish I could use several 2i in one MR, or 
combine 2i with key filters, or perform MR on buckets... I wish... =)
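
The key-filter idea above could look roughly like the following Python sketch, which only builds the JSON body for Riak's HTTP MapReduce endpoint (`/mapred`) and sends nothing. The key layout `<from>_<suffix>`, the bucket name `items`, the helper name `key_filter_job`, and the assumption that each object is stored as JSON with a numeric `time` field are all invented for this example.

```python
import json

# Hypothetical key layout for this sketch: "<from>_<unique suffix>", e.g. "42_a1b2".
def key_filter_job(bucket, from_ids, cutoff):
    """Build one MapReduce job covering many 'from' identifiers."""
    # Key filter: take the first "_"-separated token of each key and keep the
    # key only if that token is one of the requested identifiers.
    key_filters = [
        ["tokenize", "_", 1],
        ["set_member"] + [str(f) for f in from_ids],
    ]
    # Map phase: filter on time, since the key carries no time information.
    # Assumes the stored value is a JSON document with a numeric "time" field.
    map_source = (
        "function(value, keyData, arg) {"
        " var item = JSON.parse(value.values[0].data);"
        " return item.time > arg.cutoff ? [item] : []; }"
    )
    return {
        "inputs": {"bucket": bucket, "key_filters": key_filters},
        "query": [{"map": {"language": "javascript",
                           "source": map_source,
                           "arg": {"cutoff": cutoff}}}],
    }

job = key_filter_job("items", [17, 42, 99], 1343200000)
body = json.dumps(job)  # would be POSTed to /mapred as application/json
```

Key filters are evaluated while listing the keys of the bucket, so this trades 500 requests for one pass over every key; whether that is a win for a given data set would need measuring.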

On Wed, Jul 25, 2012 at 5:35 PM, Andres Jaan Tack 
<[email protected]> wrote:
Is that a realistic strategy for low latency requirements? Imagine this were 
some web service, and people generate this query at some reasonable frequency.

(not that I know what Andrew is looking for, exactly)


2012/7/25 Yousuf Fauzan <[email protected]>
Since 500 is not that big a number, I think you can run that many M/Rs, each 
emitting only the records whose "time" is greater than the one specified. The 
input would be 
{index, <<"bucket">>, <<"from_bin">>, <<"from_field_value">>}

If you decide to split the data into separate buckets based on the "from" 
field, the input would be 
{index, <<"from_field_value">>, <<"time_bin">>, <<"time_low">>, <<"time_high">>}
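
For readers using the HTTP interface rather than Erlang, the two inputs above map onto the JSON body accepted by `/mapred` roughly as in this Python sketch. The helper names, the bucket name `items`, and the assumption that objects are stored as JSON with a numeric `time` field are made up for illustration; the index names `from_bin` and `time_bin` follow the message. No request is sent.

```python
import json

# JavaScript map function that keeps only objects newer than the cutoff.
# Assumes each object is a JSON document with a numeric "time" field.
MAP_TIME_FILTER = (
    "function(value, keyData, arg) {"
    " var item = JSON.parse(value.values[0].data);"
    " return item.time > arg.cutoff ? [item] : []; }"
)

def job_by_from(bucket, from_value, cutoff):
    """Exact-match 2i input on 'from'; time is filtered in the map phase."""
    return {
        "inputs": {"bucket": bucket, "index": "from_bin", "key": from_value},
        "query": [{"map": {"language": "javascript",
                           "source": MAP_TIME_FILTER,
                           "arg": {"cutoff": cutoff}}}],
    }

def job_by_time_range(from_bucket, time_low, time_high):
    """One bucket per 'from' value; range 2i input on time."""
    # A _bin index compares strings, so the time values would need a fixed
    # width (or an integer index) for the range to be numerically correct.
    return {
        "inputs": {"bucket": from_bucket, "index": "time_bin",
                   "start": time_low, "end": time_high},
        "query": [{"map": {"language": "javascript",
                           "name": "Riak.mapValuesJson"}}],
    }

exact = job_by_from("items", "42", 1343200000)
ranged = job_by_time_range("42", "1343200000", "1343300000")
bodies = [json.dumps(exact), json.dumps(ranged)]  # one POST body per job
```

With the first layout this still means one job per "from" value, i.e. about 500 jobs for the query described in this thread.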


--
Yousuf

On Wed, Jul 25, 2012 at 6:35 PM, Andrew Kondratovich 
<[email protected]> wrote:
Hello,  Yousuf.

Thanks for your reply.

We have several million items, with about 10,000 unique 'from' values (about 
1,000 items for each). Usually we need to get the items for about 500 'from' 
identifiers with a 'time' limit (about 5% of the items match).
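
A quick back-of-the-envelope calculation from these figures, as a Python sketch. It assumes the 5% applies to the items belonging to the requested 'from' identifiers, which is one reading of the numbers above; the variable names are made up.

```python
unique_from = 10_000      # distinct 'from' values
items_per_from = 1_000    # approximate items per 'from' value
requested_from = 500      # identifiers in a typical query
match_percent = 5         # share of items passing the 'time' limit (assumed)

total_items = unique_from * items_per_from       # items stored overall
candidates = requested_from * items_per_from     # items owned by requested ids
matches = candidates * match_percent // 100      # items actually returned

print(total_items, candidates, matches)
```

So a typical query touches on the order of half a million candidate items to return a few tens of thousands, which is why filtering on time as early as possible matters.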

On Wed, Jul 25, 2012 at 1:02 PM, Yousuf Fauzan 
<[email protected]> wrote:
Hi Andrew,

First of all, the correct answer to your question is the proverbial "it 
depends". Having said that, here is what I would do in your case:

1. If there are enough data points with the same "from" field, I would make it 
a bucket and then index on time.
2. If the above is not true, I would index on both the "from" and "time" fields.
    a. If the number of records where "time" is greater than the one you 
require is small, I would run a map/reduce with those records as the initial 
input.
    b. If the number of records having a particular "from" is small, I would do 
the above with the records having that "from" field as the initial input. This 
could be a problem, as Riak only supports range and exact-match queries, so if 
you want to query multiple identifiers you will have to run multiple queries.
    In both of the above cases, I would use secondary indexes to get the 
initial records.
    Note that we are using M/R because Riak does not support querying by 
multiple indexes.

What I would also suggest is to partition your data into different buckets. You 
will need to understand the queries that you will be supporting and partition 
it accordingly.

--
Yousuf

On Wed, Jul 25, 2012 at 2:50 PM, Andrew Kondratovich 
<[email protected]> wrote:
Good afternoon.

I am considering several storage solutions for my project, and I am now looking 
at Riak.
We work with data of the following shape:
{
  time: unixtime
  from: int
  data: binary
  ...
}
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
