On 26-07-2012 10:23, Andrew Kondratovich wrote:
1) Identifiers is not random. They are collected in groups, but number
of these groups is large and the same identifier can occur in
different groups.
2) It's a fixed amount of time - it can be changed, but usually it's
one day. But request can be "get items from day before", not only for
today.
OK... Still assuming that latency, and hence the number of seeks, and
hence the number of keys touched, is a key concern (and forgive me for
going on at length, but I regard this as an exercise):
"2)" suggests that grouping events by day may be an idea. Then, to get
all record for a certain "from", you'll need to look (usually) two keys.
If "time" in the events is the time the event is written, then this
structuring may in itself be good enough: the 10,000 keys corresponding
to 'today', and even 'yesterday' as well, might fit into the disk cache.
Depending on what else is going on on the server in question, the number
of items needed to actually be read from the disk to respond to a
request may be low enough that performance is good enough.
That is probably assuming too much, though.
As for "1)", there's still a lot of "it depends".
First a bit of "why it depends":
- one disk seek costs approximately as much as reading 300KB from the
disk -- according to "Numbers Everyone Should Know"
(http://brenocon.com/dean_perf.html).
- data sitting next to each other on the disk won't require (much)
seeking. For certain backends, such as eleveldb, data with adjacent
keys will be adjacent on disk (with high probability; for eleveldb
they'll have to be roughly same age as well).
- data accessed often can be accessed cheap, because it'll probably be
in the disk cache.
- data access patterns are often skewed.
So,
- If the identifier groups are made of subgroups, such that each
identifier only occurs in one subgroup, and identifier groups
(typically) contain just a few subgroups (significantly fewer than 500),
then storing each subgroup together may improve performance.
This of course also apply even if an identifier group contains just part
of each subgroup - instead of all of it; the remainder of the subgroup
can be ignored.
Whether this pays off depends on the size of the data in the entry, and
the "subgroupability".
For example, if instead of reading 500 keys each pertaining one
identifier and holding 10KB each, we can form 50 subgroups of each 20
identifiers (using only, on average, half of these subgroups) and read
those instead (= 200KB per subgroup), then we've traded 450 (potential)
disk seeks for 5MB disk data transfer, which is good. (Doing the math,
we may have cut time from 5.05s to 0.6s, not considering parallelism.)
- If the identifiers of a group are lexicographically adjacent, or
consists of few subgroups which are each lexicographically adjacent, and
a backend is used which stores the objects sorted by key, then the
number of seeks may not be very high to begin with. In that case, we're
lucky and can just use the (identifier/date) pair as keys.
This of course also applies if, instead of having such identifiers to
start with, we can translate the identifiers into other keys with that
property.
(One possible use case which matches this: a game or similar in which a
player can see a 20x25 area centered on the location of the player; each
area square = an identifier. If the squares are named "YYY,XXX", then
the identifiers of a request consists of 20 subgroups
("YYY,Xmin"..."YYY,Xmax") which are lexicographically, and hence
disk-wise, adjacent - and therefore, only 20 disk seeks are needed (or
2x20, to get yesterday's data as well), which is manageable.
- If the access pattern is very skewed, so that e.g. of the 500
identifiers in each request, the 450 are in fact more or less fixed and
only the 50 remaining vary, then of course the 450 lookups are quite
cheap, and the math changes accordingly.
Thus, the topology of the request<->identifier graph matters - both
regarding which identifiers are mostly requested together, and the
"degree" of each identifier.
- If the identifier/group graphs is known and the degree of each
identifier is low - such as if each identifier belongs to exactly two
groups, e.g. if an identifier corresponds to an edge in a graph, and the
groups correspond to nodes (they might be roads and cities, or messages
and actors, or whatever), - then you might want to consider storing the
data in multiple places: have a key for each group (well, group/date
pair) and store the data in both/all groups which contains the identifier.
- Finally (it's all about the data and access patterns...), there's the
question of how many of the 500 identifiers on average have data from
the past day. If it happens to be the case that for most of the
identifiers, there are no recent events, then the corresponding
(identifier,date) key would be absent. For certain backend - such as
bitcask - this is detected without any disk access at all, because
bitcask keeps the entire key directory in RAM. Then again, unless the
database is pruned with suitable frequency, this same property could
mean that the key set would grow too fast for bitcask to be suitable for
your needs.
But anyway, in a suitable scenario, this cheap existence check would
keep the disk seek count low enough that simply using the
(identifier,date)-pair would give reasonable performance.
There are probably other special cases of interest, but I'd better stop
here - this should suffice to illustrate the scope of "it depends" :-)
And all of this is of course still assuming that disk seeks are in fact
of the essence; SSDs change that game, for instance. (As does a 56kbit
connection between the app server and the Riak cluster.)
/Erik
On Thu, Jul 26, 2012 at 10:48 AM, Erik Søe Sørensen <[email protected]
<mailto:[email protected]>> wrote:
Never mind the number of requests (well, almost) - what you
certainly want to keep down is the number of disk seeks.
To that end...:
1) those 500 identifiers present in a request - are they totally
unrelated, or do they occur in some pattern - e.g. the same
together always?
2) the cut-off time - may it be anything, or is it something like
a fixed amount back in time?
----- Reply message -----
Fra: "Andrew Kondratovich" <[email protected]
<mailto:[email protected]>>
Dato: ons., jul. 25, 2012 18:23
Emne: How to store data
Til: "Andres Jaan Tack" <[email protected]
<mailto:[email protected]>>
Cc: "[email protected]
<mailto:[email protected]>" <[email protected]
<mailto:[email protected]>>
Yeap.. half a thousand requests to riak isn't cool =( I'm looking
some strategy of storing data so that i could fetch all items by 1
request.
I could use index MR at time and filter results at map phase. I
could use special keys with from data and use key filters (with
time filtering at map phase)... I wish I could use several 2i at
MR or combine 2i with keyfilters, or perform MR on buckets... I
wish... =)
On Wed, Jul 25, 2012 at 5:35 PM, Andres Jaan Tack
<[email protected]
<mailto:[email protected]><mailto:[email protected]
<mailto:[email protected]>>>
wrote:
Is that a realistic strategy for low latency requirements? Imagine
this were some web service, and people generate this query at some
reasonable frequency.
(not that I know what Andrew is looking for, exactly)
2012/7/25 Yousuf Fauzan <[email protected]
<mailto:[email protected]><mailto:[email protected]
<mailto:[email protected]>>>
Since 500 is not that big a number, I think you can run that many
M/Rs with each emitting only records having "time" greater than
specified. Input would be {index, <<"bucket">>, <<"from_bin">>,
<<"from_field_value">>}
If you decide to split the data into separate buckets based on
"from" field, input would be {index, <<"from_field_value">>,
<<"time_bin">>, <<"time_low">>, <<"time_high">>}
--
Yousuf
On Wed, Jul 25, 2012 at 6:35 PM, Andrew Kondratovich
<[email protected]
<mailto:[email protected]><mailto:[email protected]
<mailto:[email protected]>>> wrote:
Hello, Yousuf.
Thanks for your reply.
We have several millions of items. It's about 10 000 of unique
'from' fields (about 1000 items for each). Usually, we need to get
items for about 500 'from' identifiers with 'time' limit (about 5%
of items is corresponding).
On Wed, Jul 25, 2012 at 1:02 PM, Yousuf Fauzan
<[email protected]
<mailto:[email protected]><mailto:[email protected]
<mailto:[email protected]>>> wrote:
Hi Andrew,
First of all, the correct answer to your question is the
proverbial "it depends". Having said that, here is what I could do
in your case
1. If there are enough data points with the same "from" field, I
will make it a bucket and then index on time.
2. If the above is not true, I will index on "from" and "time" field.
a. If number of records where "time" is greater than the one
your require is small, I will run a map/reduce with the initial
input as those records.
b. If number of records having a particular "from" is small, I
will do the above with the initial input as records having that
"from" field. This could be a problem as Riak only supports range
and exact queries so if you want to query multiple identifiers,
you will have to run multiple queries.
In both the above cases, I will use secondary indexes to get
the initial records.
Note that we are using M/R as Riak does not support querying
by multiple indexes.
What I would also suggest is to partition your data into different
buckets. You will need to understand the queries that you will be
supporting and partition it accordingly.
--
Yousuf
On Wed, Jul 25, 2012 at 2:50 PM, Andrew Kondratovich
<[email protected]
<mailto:[email protected]><mailto:[email protected]
<mailto:[email protected]>>> wrote:
Good afternoon.
I am considering several storage solutions for my project, and now
I look at Riak.
We work with the following pattern of data:
{
time: unixtime
from: int
data: binary
...
}
--
Andrew Kondratovich
--
Mobile: + 45 26 36 17 55 | Skype: eriksoesorensen | Twitter: @eriksoe
Trifork A/S | Margrethepladsen 4 | DK-8000 Aarhus C |
www.trifork.com <http://www.trifork.com/>
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com