On 26-07-2012 10:23, Andrew Kondratovich wrote:
1) Identifiers is not random. They are collected in groups, but number of these groups is large and the same identifier can occur in different groups. 2) It's a fixed amount of time - it can be changed, but usually it's one day. But request can be "get items from day before", not only for today.

OK... Still assuming that latency, and hence the number of seeks, and hence the number of keys touched, is a key concern (and forgive me for going on at length, but I regard this as an exercise):

"2)" suggests that grouping events by day may be an idea. Then, to get all record for a certain "from", you'll need to look (usually) two keys.

If "time" in the events is the time the event is written, then this structuring may in itself be good enough: the 10,000 keys corresponding to 'today', and even 'yesterday' as well, might fit into the disk cache. Depending on what else is going on on the server in question, the number of items needed to actually be read from the disk to respond to a request may be low enough that performance is good enough.
That is probably assuming too much, though.


As for "1)", there's still a lot of "it depends".
First a bit of "why it depends":
- one disk seek costs approximately as much as reading 300KB from the disk -- according to "Numbers Everyone Should Know" (http://brenocon.com/dean_perf.html). - data sitting next to each other on the disk won't require (much) seeking. For certain backends, such as eleveldb, data with adjacent keys will be adjacent on disk (with high probability; for eleveldb they'll have to be roughly same age as well). - data accessed often can be accessed cheap, because it'll probably be in the disk cache.
- data access patterns are often skewed.

So,
- If the identifier groups are made of subgroups, such that each identifier only occurs in one subgroup, and identifier groups (typically) contain just a few subgroups (significantly fewer than 500), then storing each subgroup together may improve performance. This of course also apply even if an identifier group contains just part of each subgroup - instead of all of it; the remainder of the subgroup can be ignored. Whether this pays off depends on the size of the data in the entry, and the "subgroupability". For example, if instead of reading 500 keys each pertaining one identifier and holding 10KB each, we can form 50 subgroups of each 20 identifiers (using only, on average, half of these subgroups) and read those instead (= 200KB per subgroup), then we've traded 450 (potential) disk seeks for 5MB disk data transfer, which is good. (Doing the math, we may have cut time from 5.05s to 0.6s, not considering parallelism.)

- If the identifiers of a group are lexicographically adjacent, or consists of few subgroups which are each lexicographically adjacent, and a backend is used which stores the objects sorted by key, then the number of seeks may not be very high to begin with. In that case, we're lucky and can just use the (identifier/date) pair as keys. This of course also applies if, instead of having such identifiers to start with, we can translate the identifiers into other keys with that property. (One possible use case which matches this: a game or similar in which a player can see a 20x25 area centered on the location of the player; each area square = an identifier. If the squares are named "YYY,XXX", then the identifiers of a request consists of 20 subgroups ("YYY,Xmin"..."YYY,Xmax") which are lexicographically, and hence disk-wise, adjacent - and therefore, only 20 disk seeks are needed (or 2x20, to get yesterday's data as well), which is manageable.

- If the access pattern is very skewed, so that e.g. of the 500 identifiers in each request, the 450 are in fact more or less fixed and only the 50 remaining vary, then of course the 450 lookups are quite cheap, and the math changes accordingly.

Thus, the topology of the request<->identifier graph matters - both regarding which identifiers are mostly requested together, and the "degree" of each identifier.

- If the identifier/group graphs is known and the degree of each identifier is low - such as if each identifier belongs to exactly two groups, e.g. if an identifier corresponds to an edge in a graph, and the groups correspond to nodes (they might be roads and cities, or messages and actors, or whatever), - then you might want to consider storing the data in multiple places: have a key for each group (well, group/date pair) and store the data in both/all groups which contains the identifier.

- Finally (it's all about the data and access patterns...), there's the question of how many of the 500 identifiers on average have data from the past day. If it happens to be the case that for most of the identifiers, there are no recent events, then the corresponding (identifier,date) key would be absent. For certain backend - such as bitcask - this is detected without any disk access at all, because bitcask keeps the entire key directory in RAM. Then again, unless the database is pruned with suitable frequency, this same property could mean that the key set would grow too fast for bitcask to be suitable for your needs. But anyway, in a suitable scenario, this cheap existence check would keep the disk seek count low enough that simply using the (identifier,date)-pair would give reasonable performance.

There are probably other special cases of interest, but I'd better stop here - this should suffice to illustrate the scope of "it depends" :-) And all of this is of course still assuming that disk seeks are in fact of the essence; SSDs change that game, for instance. (As does a 56kbit connection between the app server and the Riak cluster.)

/Erik

On Thu, Jul 26, 2012 at 10:48 AM, Erik Søe Sørensen <[email protected] <mailto:[email protected]>> wrote:

    Never mind the number of requests (well, almost) - what you
    certainly want to keep down is the number of disk seeks.

    To that end...:
    1) those 500 identifiers present in a request - are they totally
    unrelated, or do they occur in some pattern - e.g. the same
    together always?
    2) the cut-off time - may it be anything, or is it something like
    a fixed amount back in time?


    ----- Reply message -----
    Fra: "Andrew Kondratovich" <[email protected]
    <mailto:[email protected]>>
    Dato: ons., jul. 25, 2012 18:23
    Emne: How to store data
    Til: "Andres Jaan Tack" <[email protected]
    <mailto:[email protected]>>
    Cc: "[email protected]
    <mailto:[email protected]>" <[email protected]
    <mailto:[email protected]>>


    Yeap.. half a thousand requests to riak isn't cool =( I'm looking
    some strategy of storing data so that i could fetch all items by 1
    request.

    I could use index MR at time and filter results at map phase. I
    could use special keys with from data and use key filters (with
    time filtering at map phase)... I wish I could use several 2i at
    MR or combine 2i with keyfilters, or perform MR on buckets... I
    wish... =)

    On Wed, Jul 25, 2012 at 5:35 PM, Andres Jaan Tack
    <[email protected]
    <mailto:[email protected]><mailto:[email protected] 
<mailto:[email protected]>>>
    wrote:
    Is that a realistic strategy for low latency requirements? Imagine
    this were some web service, and people generate this query at some
    reasonable frequency.

    (not that I know what Andrew is looking for, exactly)


    2012/7/25 Yousuf Fauzan <[email protected]
    <mailto:[email protected]><mailto:[email protected]
    <mailto:[email protected]>>>
    Since 500 is not that big a number, I think you can run that many
    M/Rs with each emitting only records having "time" greater than
    specified. Input would be {index, <<"bucket">>, <<"from_bin">>,
    <<"from_field_value">>}

    If you decide to split the data into separate buckets based on
    "from" field, input would be {index, <<"from_field_value">>,
    <<"time_bin">>, <<"time_low">>, <<"time_high">>}


    --
    Yousuf

    On Wed, Jul 25, 2012 at 6:35 PM, Andrew Kondratovich
    <[email protected]
    <mailto:[email protected]><mailto:[email protected]
    <mailto:[email protected]>>> wrote:
    Hello,  Yousuf.

    Thanks for your reply.

    We have several millions of items. It's about 10 000 of unique
    'from' fields (about 1000 items for each). Usually, we need to get
    items for about 500 'from' identifiers with 'time' limit (about 5%
    of items is corresponding).

    On Wed, Jul 25, 2012 at 1:02 PM, Yousuf Fauzan
    <[email protected]
    <mailto:[email protected]><mailto:[email protected]
    <mailto:[email protected]>>> wrote:
    Hi Andrew,

    First of all, the correct answer to your question is the
    proverbial "it depends". Having said that, here is what I could do
    in your case

    1. If there are enough data points with the same "from" field, I
    will make it a bucket and then index on time.
    2. If the above is not true, I will index on "from" and "time" field.
        a. If number of records where "time" is greater than the one
    your require is small, I will run a map/reduce with the initial
    input as those records.
        b. If number of records having a particular "from" is small, I
    will do the above with the initial input as records having that
    "from" field. This could be a problem as Riak only supports range
    and exact queries so if you want to query multiple identifiers,
    you will have to run multiple queries.
        In both the above cases, I will use secondary indexes to get
    the initial records.
        Note that we are using M/R as Riak does not support querying
    by multiple indexes.

    What I would also suggest is to partition your data into different
    buckets. You will need to understand the queries that you will be
    supporting and partition it accordingly.

    --
    Yousuf

    On Wed, Jul 25, 2012 at 2:50 PM, Andrew Kondratovich
    <[email protected]
    <mailto:[email protected]><mailto:[email protected]
    <mailto:[email protected]>>> wrote:
    Good afternoon.

    I am considering several storage solutions for my project, and now
    I look at Riak.
    We work with the following pattern of data:
    {
      time: unixtime
      from: int
      data: binary
      ...
    }




--
Andrew Kondratovich



--
Mobile: + 45 26 36 17 55 | Skype: eriksoesorensen | Twitter: @eriksoe
Trifork A/S | Margrethepladsen 4 | DK-8000 Aarhus C | www.trifork.com <http://www.trifork.com/>
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Reply via email to