I'm experimenting with a test dataset to gauge whether Riak is
suitable for a particular app. My real dataset has millions of
records, but I'm testing with just a thousand items, and
unfortunately, I am getting horrible performance -- so horrible it
can't possibly be right. What am I doing wrong?
My environment:
* Riak 0.14 with default config
* Sean Cribbs's Ruby client
* Mac OS X Snow Leopard
* Ruby 1.9.2
* Erlang R14B01 from MacPorts
I am testing with a single node on my MacBook, which should be plenty
for just a thousand key/value pairs. The tests run against an
initially empty database, from a single Ruby app. Each test has been
run at least 10 times consecutively to discard outliers and ensure
warm caches.
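The runs were timed with a small harness along these lines (a sketch only; the actual measurement code isn't shown in this post, and `best_time` is a hypothetical helper):

```ruby
require 'benchmark'

# Run a block several times and report the best wall-clock time,
# so one-off outliers are dropped and caches are warm.
def best_time(runs = 10)
  times = runs.times.map { Benchmark.realtime { yield } }
  times.min
end
```

Used as, e.g., `best_time { bucket.keys.to_a }` for the key-listing test.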
Here are some numbers:
* 9.6 seconds to store 1,000 items. They are loaded from a text file
as JSON data. Parsing/processing overhead is about 0.8 s, the rest is
Riak. In JSON format, the items total 570 KB. The resultant Bitcask
data directory is 3.9 MB.
* 0.3 seconds to list all keys in the bucket [1].
* 1.8 seconds to list all keys and then fetch each object [2].
* 1.5 seconds to run a very simple map/reduce query [3].
Here's something else that's strange. I repeated the steps above on a
new, empty bucket, again with just 1,000 items, but only after loading
1.5 million items into a separate bucket. The numbers are now
dramatically worse:
* 4.5 seconds to list all keys.
* 6.5 seconds to list + fetch.
* 5.1 seconds to run map/reduce query.
Why are operations on the small bucket suddenly worse in the presence
of a separate, large bucket? Surely the key spaces are completely
separate? Even listing keys or querying on an *empty* bucket is taking
several seconds in this scenario.
So: are these timings expected for such a tiny dataset, and if not,
what could I be doing wrong? I'm new to Riak, and the map/reduce query
may not be optimally expressed, so perhaps that could be improved.
Even so, storage and key-querying performance seem off by perhaps an
order of magnitude.
I have confirmed the performance issue on an Amazon EC2 instance
running Ubuntu Maverick, where performance was in fact considerably
worse.
[1] Just looping over bucket.keys.
[2] Basically: bucket.keys { |keys| keys.each { |key| bucket.get(key) } }
[3] Here's the query code. Each stored item is a JSON hash; the map
phase emits a {path: 1} entry per item, and the reduce phase sums the
counts for each path.
mr = Riak::MapReduce.new(client)
mr.add("test")
mr.map <<-end, :keep => false
  function(v) {
    var paths = [];
    var entry = Riak.mapValuesJson(v)[0];
    var out = {};
    out[entry.path] = 1;
    paths.push(out);
    return paths;
  }
end
mr.reduce <<-end.strip, :keep => true
  function(values) {
    var result = {};
    for (var i = 0; i < values.length; i++) {
      var table = values[i];
      for (var k in table) {
        var count = table[k];
        if (result[k]) {
          result[k] += count;
        } else {
          result[k] = count;
        }
      }
    }
    return [result];
  }
end
results = mr.run
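For what it's worth, the intended aggregation is equivalent to this plain-Ruby version (a hypothetical helper, useful for sanity-checking the map/reduce output against a local pass over the parsed items):

```ruby
# Count occurrences of each "path" value across a list of item hashes,
# mirroring the map (emit {path => 1}) and reduce (sum counts) phases.
def count_paths(items)
  items.each_with_object(Hash.new(0)) do |item, counts|
    counts[item["path"]] += 1
  end
end
```

Running this over the loaded items should agree with the hash the map/reduce query returns.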
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com