Re: How to store data

Erik Søe Sørensen Thu, 26 Jul 2012 15:20:03 -0700

On 26-07-2012 10:23, Andrew Kondratovich wrote:

1) Identifiers is not random. They are collected in groups, but numberof these groups is large and the same identifier can occur indifferent groups.2) It's a fixed amount of time - it can be changed, but usually it'sone day. But request can be "get items from day before", not only fortoday.

OK... Still assuming that latency, and hence the number of seeks, andhence the number of keys touched, is a key concern (and forgive me forgoing on at length, but I regard this as an exercise):

"2)" suggests that grouping events by day may be an idea. Then, to getall record for a certain "from", you'll need to look (usually) two keys.

If "time" in the events is the time the event is written, then thisstructuring may in itself be good enough: the 10,000 keys correspondingto 'today', and even 'yesterday' as well, might fit into the disk cache.Depending on what else is going on on the server in question, the numberof items needed to actually be read from the disk to respond to arequest may be low enough that performance is good enough.

That is probably assuming too much, though.


As for "1)", there's still a lot of "it depends".
First a bit of "why it depends":

- one disk seek costs approximately as much as reading 300KB from thedisk -- according to "Numbers Everyone Should Know"(http://brenocon.com/dean_perf.html).- data sitting next to each other on the disk won't require (much)seeking. For certain backends, such as eleveldb, data with adjacentkeys will be adjacent on disk (with high probability; for eleveldbthey'll have to be roughly same age as well).- data accessed often can be accessed cheap, because it'll probably bein the disk cache.

- data access patterns are often skewed.

So,

- If the identifier groups are made of subgroups, such that eachidentifier only occurs in one subgroup, and identifier groups(typically) contain just a few subgroups (significantly fewer than 500),then storing each subgroup together may improve performance.This of course also apply even if an identifier group contains just partof each subgroup - instead of all of it; the remainder of the subgroupcan be ignored.Whether this pays off depends on the size of the data in the entry, andthe "subgroupability".For example, if instead of reading 500 keys each pertaining oneidentifier and holding 10KB each, we can form 50 subgroups of each 20identifiers (using only, on average, half of these subgroups) and readthose instead (= 200KB per subgroup), then we've traded 450 (potential)disk seeks for 5MB disk data transfer, which is good. (Doing the math,we may have cut time from 5.05s to 0.6s, not considering parallelism.)

- If the identifiers of a group are lexicographically adjacent, orconsists of few subgroups which are each lexicographically adjacent, anda backend is used which stores the objects sorted by key, then thenumber of seeks may not be very high to begin with. In that case, we'relucky and can just use the (identifier/date) pair as keys.This of course also applies if, instead of having such identifiers tostart with, we can translate the identifiers into other keys with thatproperty.(One possible use case which matches this: a game or similar in which aplayer can see a 20x25 area centered on the location of the player; eacharea square = an identifier. If the squares are named "YYY,XXX", thenthe identifiers of a request consists of 20 subgroups("YYY,Xmin"..."YYY,Xmax") which are lexicographically, and hencedisk-wise, adjacent - and therefore, only 20 disk seeks are needed (or2x20, to get yesterday's data as well), which is manageable.

- If the access pattern is very skewed, so that e.g. of the 500identifiers in each request, the 450 are in fact more or less fixed andonly the 50 remaining vary, then of course the 450 lookups are quitecheap, and the math changes accordingly.

Thus, the topology of the request<->identifier graph matters - bothregarding which identifiers are mostly requested together, and the"degree" of each identifier.

- If the identifier/group graphs is known and the degree of eachidentifier is low - such as if each identifier belongs to exactly twogroups, e.g. if an identifier corresponds to an edge in a graph, and thegroups correspond to nodes (they might be roads and cities, or messagesand actors, or whatever), - then you might want to consider storing thedata in multiple places: have a key for each group (well, group/datepair) and store the data in both/all groups which contains the identifier.

- Finally (it's all about the data and access patterns...), there's thequestion of how many of the 500 identifiers on average have data fromthe past day. If it happens to be the case that for most of theidentifiers, there are no recent events, then the corresponding(identifier,date) key would be absent. For certain backend - such asbitcask - this is detected without any disk access at all, becausebitcask keeps the entire key directory in RAM. Then again, unless thedatabase is pruned with suitable frequency, this same property couldmean that the key set would grow too fast for bitcask to be suitable foryour needs.But anyway, in a suitable scenario, this cheap existence check wouldkeep the disk seek count low enough that simply using the(identifier,date)-pair would give reasonable performance.

There are probably other special cases of interest, but I'd better stophere - this should suffice to illustrate the scope of "it depends" :-)And all of this is of course still assuming that disk seeks are in factof the essence; SSDs change that game, for instance. (As does a 56kbitconnection between the app server and the Riak cluster.)


/Erik

On Thu, Jul 26, 2012 at 10:48 AM, Erik Søe Sørensen <[email protected]<mailto:[email protected]>> wrote:


    Never mind the number of requests (well, almost) - what you
    certainly want to keep down is the number of disk seeks.

    To that end...:
    1) those 500 identifiers present in a request - are they totally
    unrelated, or do they occur in some pattern - e.g. the same
    together always?
    2) the cut-off time - may it be anything, or is it something like
    a fixed amount back in time?


    ----- Reply message -----
    Fra: "Andrew Kondratovich" <[email protected]
    <mailto:[email protected]>>
    Dato: ons., jul. 25, 2012 18:23
    Emne: How to store data
    Til: "Andres Jaan Tack" <[email protected]
    <mailto:[email protected]>>
    Cc: "[email protected]
    <mailto:[email protected]>" <[email protected]
    <mailto:[email protected]>>


    Yeap.. half a thousand requests to riak isn't cool =( I'm looking
    some strategy of storing data so that i could fetch all items by 1
    request.

    I could use index MR at time and filter results at map phase. I
    could use special keys with from data and use key filters (with
    time filtering at map phase)... I wish I could use several 2i at
    MR or combine 2i with keyfilters, or perform MR on buckets... I
    wish... =)

    On Wed, Jul 25, 2012 at 5:35 PM, Andres Jaan Tack
    <[email protected]
    <mailto:[email protected]><mailto:[email protected] 
<mailto:[email protected]>>>
    wrote:
    Is that a realistic strategy for low latency requirements? Imagine
    this were some web service, and people generate this query at some
    reasonable frequency.

    (not that I know what Andrew is looking for, exactly)


    2012/7/25 Yousuf Fauzan <[email protected]
    <mailto:[email protected]><mailto:[email protected]
    <mailto:[email protected]>>>
    Since 500 is not that big a number, I think you can run that many
    M/Rs with each emitting only records having "time" greater than
    specified. Input would be {index, <<"bucket">>, <<"from_bin">>,
    <<"from_field_value">>}

    If you decide to split the data into separate buckets based on
    "from" field, input would be {index, <<"from_field_value">>,
    <<"time_bin">>, <<"time_low">>, <<"time_high">>}


    --
    Yousuf

    On Wed, Jul 25, 2012 at 6:35 PM, Andrew Kondratovich
    <[email protected]
    <mailto:[email protected]><mailto:[email protected]
    <mailto:[email protected]>>> wrote:
    Hello,  Yousuf.

    Thanks for your reply.

    We have several millions of items. It's about 10 000 of unique
    'from' fields (about 1000 items for each). Usually, we need to get
    items for about 500 'from' identifiers with 'time' limit (about 5%
    of items is corresponding).

    On Wed, Jul 25, 2012 at 1:02 PM, Yousuf Fauzan
    <[email protected]
    <mailto:[email protected]><mailto:[email protected]
    <mailto:[email protected]>>> wrote:
    Hi Andrew,

    First of all, the correct answer to your question is the
    proverbial "it depends". Having said that, here is what I could do
    in your case

    1. If there are enough data points with the same "from" field, I
    will make it a bucket and then index on time.
    2. If the above is not true, I will index on "from" and "time" field.
        a. If number of records where "time" is greater than the one
    your require is small, I will run a map/reduce with the initial
    input as those records.
        b. If number of records having a particular "from" is small, I
    will do the above with the initial input as records having that
    "from" field. This could be a problem as Riak only supports range
    and exact queries so if you want to query multiple identifiers,
    you will have to run multiple queries.
        In both the above cases, I will use secondary indexes to get
    the initial records.
        Note that we are using M/R as Riak does not support querying
    by multiple indexes.

    What I would also suggest is to partition your data into different
    buckets. You will need to understand the queries that you will be
    supporting and partition it accordingly.

    --
    Yousuf

    On Wed, Jul 25, 2012 at 2:50 PM, Andrew Kondratovich
    <[email protected]
    <mailto:[email protected]><mailto:[email protected]
    <mailto:[email protected]>>> wrote:
    Good afternoon.

    I am considering several storage solutions for my project, and now
    I look at Riak.
    We work with the following pattern of data:
    {
      time: unixtime
      from: int
      data: binary
      ...
    }




--
Andrew Kondratovich



--
Mobile: + 45 26 36 17 55 | Skype: eriksoesorensen | Twitter: @eriksoe

Trifork A/S | Margrethepladsen 4 | DK-8000 Aarhus C |www.trifork.com <http://www.trifork.com/>

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: How to store data

Reply via email to