Never mind the number of requests (well, almost) - what you certainly want to keep down is the number of disk seeks.
To that end:

1) Those 500 identifiers present in a request: are they totally unrelated, or do they occur in some pattern, e.g. always the same ones together?
2) The cut-off time: may it be anything, or is it something like a fixed amount back in time?

----- Reply message -----
From: "Andrew Kondratovich" <[email protected]>
Date: Wed, Jul 25, 2012 18:23
Subject: How to store data
To: "Andres Jaan Tack" <[email protected]>
Cc: "[email protected]" <[email protected]>

Yeap.. half a thousand requests to riak isn't cool =(

I'm looking for some strategy of storing data so that I could fetch all items with 1 request. I could use an index MR on time and filter results in the map phase. I could use special keys containing the 'from' data and use key filters (with time filtering in the map phase)...

I wish I could use several 2i in one MR or combine 2i with key filters, or perform MR on buckets... I wish... =)

On Wed, Jul 25, 2012 at 5:35 PM, Andres Jaan Tack <[email protected]> wrote:

Is that a realistic strategy for low latency requirements? Imagine this were some web service, and people generate this query at some reasonable frequency. (not that I know what Andrew is looking for, exactly)

2012/7/25 Yousuf Fauzan <[email protected]>

Since 500 is not that big a number, I think you can run that many M/Rs with each emitting only records having "time" greater than specified. Input would be
{index, <<"bucket">>, <<"from_bin">>, <<"from_field_value">>}

If you decide to split the data into separate buckets based on the "from" field, input would be
{index, <<"from_field_value">>, <<"time_bin">>, <<"time_low">>, <<"time_high">>}

--
Yousuf

On Wed, Jul 25, 2012 at 6:35 PM, Andrew Kondratovich <[email protected]> wrote:

Hello, Yousuf. Thanks for your reply.

We have several million items. There are about 10 000 unique 'from' fields (about 1000 items for each).
Usually, we need to get items for about 500 'from' identifiers with a 'time' limit (about 5% of the items match).

On Wed, Jul 25, 2012 at 1:02 PM, Yousuf Fauzan <[email protected]> wrote:

Hi Andrew,

First of all, the correct answer to your question is the proverbial "it depends". Having said that, here is what I would do in your case:

1. If there are enough data points with the same "from" field, I will make it a bucket and then index on time.
2. If the above is not true, I will index on the "from" and "time" fields.
   a. If the number of records where "time" is greater than the one you require is small, I will run a map/reduce with those records as the initial input.
   b. If the number of records having a particular "from" is small, I will do the above with the records having that "from" field as the initial input. This could be a problem, as Riak only supports range and exact queries, so if you want to query multiple identifiers, you will have to run multiple queries.

In both of the above cases, I will use secondary indexes to get the initial records. Note that we are using M/R because Riak does not support querying by multiple indexes.

What I would also suggest is to partition your data into different buckets. You will need to understand the queries that you will be supporting and partition the data accordingly.

--
Yousuf

On Wed, Jul 25, 2012 at 2:50 PM, Andrew Kondratovich <[email protected]> wrote:

Good afternoon.

I am considering several storage solutions for my project, and now I am looking at Riak. We work with the following pattern of data:

{
  time: unixtime
  from: int
  data: binary
  ...
}

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
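
For readers following the thread, the "one M/R per 'from' value, filtered on time in the map phase" idea that Yousuf describes can be sketched roughly as below. This is only an illustration under assumptions that are not in the thread: it builds JSON job bodies in the shape accepted by Riak's HTTP MapReduce interface (as far as I know it), it assumes each item is stored as a JSON document with a numeric "time" field, and the function names and the "items" bucket are hypothetical. The index name "from_bin" is taken from Yousuf's example; an integer index would be named "from_int" instead.

```python
import json

# JavaScript map function (assumption: objects are stored as JSON).
# It keeps only items whose "time" is greater than the cut-off passed as arg.
MAP_SOURCE = (
    "function(v, keyData, arg) {"
    " var item = JSON.parse(v.values[0].data);"
    " return item.time > arg ? [item] : [];"
    "}"
)


def index_mapreduce_job(bucket, from_value, min_time):
    """Build one job body: exact-match 2i lookup on 'from', time filter in map."""
    return {
        "inputs": {
            "bucket": bucket,
            "index": "from_bin",      # index name as used in the thread
            "key": str(from_value),   # exact-match secondary index query
        },
        "query": [
            {
                "map": {
                    "language": "javascript",
                    "source": MAP_SOURCE,
                    "arg": min_time,  # cut-off time handed to the map function
                    "keep": True,
                }
            }
        ],
    }


def jobs_for_request(bucket, from_values, min_time):
    """One job per 'from' identifier, since 2i takes a single key or one range."""
    return [index_mapreduce_job(bucket, f, min_time) for f in from_values]


if __name__ == "__main__":
    # Hypothetical bucket name and identifiers, for illustration only.
    jobs = jobs_for_request("items", [17, 42], 1343200000)
    print(json.dumps(jobs[0], indent=2))
```

Each body would then be sent as its own request to the cluster's MapReduce endpoint, so a query over 500 identifiers still means 500 jobs. That is exactly the cost Andrew objects to, and it is why the questions about identifier patterns and the cut-off time matter for choosing a different key or bucket layout.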
