1) The identifiers are not random. They are collected in groups, but the number of these groups is large and the same identifier can occur in different groups.
2) It's a fixed amount of time. It can be changed, but usually it's one day. However, a request can also be "get items from the day before", not only for today.
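A minimal sketch of what these two answers could mean for a key layout, under stated assumptions: items are grouped under a hypothetical composite key made of the 'from' value and a fixed-window (day) number, and a plain Python dict stands in for the real store. None of the names below (`bucket_key`, `build_store`, `fetch`) come from Riak or from this thread. The sketch only bounds how many keys are read per identifier; it does not reduce the number of identifiers in a request.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400  # assumed window size; the thread says it is usually one day


def day_of(unixtime, window=SECONDS_PER_DAY):
    """Return the index of the fixed window that contains unixtime."""
    return unixtime // window


def bucket_key(from_id, unixtime, window=SECONDS_PER_DAY):
    """Hypothetical composite key: one key per ('from', window) pair."""
    return f"{from_id}:{day_of(unixtime, window)}"


def build_store(items, window=SECONDS_PER_DAY):
    """Group items in a plain dict that stands in for the real store."""
    store = defaultdict(list)
    for item in items:
        store[bucket_key(item["from"], item["time"], window)].append(item)
    return store


def fetch(store, from_ids, cutoff, now, window=SECONDS_PER_DAY):
    """Collect items for from_ids with cutoff <= time <= now.

    Only the windows between cutoff and now are read, so a one-day
    cut-off touches two keys per identifier.
    """
    result = []
    for from_id in from_ids:
        for day in range(day_of(cutoff, window), day_of(now, window) + 1):
            # .get() avoids creating empty entries in the defaultdict
            for item in store.get(f"{from_id}:{day}", []):
                if cutoff <= item["time"] <= now:
                    result.append(item)
    return result
```

A request for "the day before" is then the same call with `cutoff` and `now` both shifted back by one window.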
On Thu, Jul 26, 2012 at 10:48 AM, Erik Søe Sørensen <[email protected]> wrote:
> Never mind the number of requests (well, almost) - what you certainly want
> to keep down is the number of disk seeks.
>
> To that end...:
> 1) those 500 identifiers present in a request - are they totally
> unrelated, or do they occur in some pattern - e.g. the same together always?
> 2) the cut-off time - may it be anything, or is it something like a fixed
> amount back in time?
>
>
> ----- Reply message -----
> From: "Andrew Kondratovich" <[email protected]>
> Date: Wed, Jul 25, 2012 18:23
> Subject: How to store data
> To: "Andres Jaan Tack" <[email protected]>
> Cc: "[email protected]" <[email protected]>
>
>
> Yeap.. half a thousand requests to riak isn't cool =( I'm looking for some
> strategy of storing data so that I could fetch all items with 1 request.
>
> I could use index MR at time and filter results at map phase. I could use
> special keys with from data and use key filters (with time filtering at map
> phase)... I wish I could use several 2i at MR or combine 2i with
> keyfilters, or perform MR on buckets... I wish... =)
>
> On Wed, Jul 25, 2012 at 5:35 PM, Andres Jaan Tack <[email protected]> wrote:
> Is that a realistic strategy for low latency requirements? Imagine this
> were some web service, and people generate this query at some reasonable
> frequency.
>
> (not that I know what Andrew is looking for, exactly)
>
>
> 2012/7/25 Yousuf Fauzan <[email protected]>
> Since 500 is not that big a number, I think you can run that many M/Rs
> with each emitting only records having "time" greater than specified. Input
> would be {index, <<"bucket">>, <<"from_bin">>, <<"from_field_value">>}
>
> If you decide to split the data into separate buckets based on "from"
> field, input would be {index, <<"from_field_value">>, <<"time_bin">>,
> <<"time_low">>, <<"time_high">>}
>
>
> --
> Yousuf
>
> On Wed, Jul 25, 2012 at 6:35 PM, Andrew Kondratovich <[email protected]> wrote:
> Hello, Yousuf.
>
> Thanks for your reply.
>
> We have several millions of items. It's about 10 000 of unique 'from'
> fields (about 1000 items for each). Usually, we need to get items for about
> 500 'from' identifiers with 'time' limit (about 5% of items is
> corresponding).
>
> On Wed, Jul 25, 2012 at 1:02 PM, Yousuf Fauzan <[email protected]> wrote:
> Hi Andrew,
>
> First of all, the correct answer to your question is the proverbial "it
> depends". Having said that, here is what I could do in your case
>
> 1. If there are enough data points with the same "from" field, I will make
>    it a bucket and then index on time.
> 2. If the above is not true, I will index on "from" and "time" field.
>    a. If number of records where "time" is greater than the one you
>       require is small, I will run a map/reduce with the initial input as
>       those records.
>    b. If number of records having a particular "from" is small, I will do
>       the above with the initial input as records having that "from" field.
>       This could be a problem as Riak only supports range and exact queries
>       so if you want to query multiple identifiers, you will have to run
>       multiple queries.
>    In both the above cases, I will use secondary indexes to get the
>    initial records.
>    Note that we are using M/R as Riak does not support querying by
>    multiple indexes.
>
> What I would also suggest is to partition your data into different
> buckets. You will need to understand the queries that you will be
> supporting and partition it accordingly.
>
> --
> Yousuf
>
> On Wed, Jul 25, 2012 at 2:50 PM, Andrew Kondratovich <[email protected]> wrote:
> Good afternoon.
>
> I am considering several storage solutions for my project, and now I look
> at Riak.
> We work with the following pattern of data:
> {
>   time: unixtime
>   from: int
>   data: binary
>   ...
> }
>

--
Andrew Kondratovich
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
