Inline...

@siculars on twitter
http://siculars.posterous.com

Sent from my iPhone

On Sep 27, 2010, at 11:45, Jason McInerney <[email protected]> wrote:

Throughout the year I've seen several conversations about Riak
discussing ideas on how to store, retrieve, and manipulate large
(250MB+) data sets, and I'm wondering if anyone has implemented a good
system yet with Riak.
My situation:
1) data files come in as 2 columns: time-from-start (usually a fixed
interval in milliseconds) and value, with meta data
2) files can have hundreds of millions of rows
3) retrieval will be as all data, raw subsets, subsets smoothed over
time, or subsets according to meta data

I've had some success storing as a bucket per file, with keys as a
milliseconds from start, and retrieval is awesome.  M/R works well
getting subsets.

Problems are:
a) the speed of getting the data into Riak -- I fork off 100 - 1000
threads and do PUTs on each row (basically chunk & fork), but this is
really slow.  So is a single process doing one row at a time.

Are you using the protobuf or the HTTP interface? I would use the former. There are lots and lots of resources on how to best chunk large text files in your favorite language, and there may well be a protobuf lib for Riak in that language.
- Whichever interface you are using, make sure your lib is not returning the data in its reply or doing any extra Riak work like waiting on n successful replies.
- In your connection, supply the client id. Otherwise Riak auto-generates one for you (an optimization, but extra processing nonetheless).
- Round-robin your IPs if you have a cluster of Riak nodes.
- If you are testing on one node, set n_val to 1 in your app.config file, or you're just churning disk.
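The chunk-and-fork idea above can be sketched with a bounded thread pool instead of hundreds of short-lived threads. This is only a sketch: the `put` callable is a hypothetical stand-in for a real protobuf PUT (e.g. wiring it to a Riak client's store call with the reply body suppressed, as suggested above), and the worker count is a guess you would tune.

```python
# Sketch: chunk-and-fork loading with a bounded worker pool.
# `put(key, value)` is a placeholder for a real Riak protobuf PUT;
# a fixed pool of tens of workers, each reusing one connection,
# tends to beat forking 100-1000 threads per file.
from concurrent.futures import ThreadPoolExecutor

def load_rows(rows, put, workers=32):
    """Stream (ms_offset, value) rows into the store via `put`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for ms, value in rows:
            # Key is the milliseconds-from-start offset, as in the
            # original bucket-per-file layout.
            pool.submit(put, str(ms), value)

# Runnable stand-in for the real protobuf PUT:
stored = {}
def fake_put(key, value):
    stored[key] = value

load_rows([(0, 1.0), (10, 1.5), (20, 2.0)], fake_put, workers=4)
```

With a real client you would swap `fake_put` for a closure over an open protobuf connection; the pool boundary is where you'd also round-robin across node IPs.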

b) memory (RAM) usage after a few files are in remains very high
(30-40%), so I worry that this may not perform well with thousands of
files

If you are using the bitcask backend (the default), memory is governed by the total number of keys in the cask: add more keys, eat more RAM. There is a rough metric floating around (off the top of my head I think it's 40 bytes + key length, per key). The innostore backend may be better memory-wise, although it does consume file descriptors on a per-bucket basis.

c) I simply don't know if this is the best way to do this sort of
work.  Other DBs are an option, but I prefer Riak's features.

Take a look at basho_bench to get a feel for the ops/sec of your own setup. You may be hitting some max ops/sec due to hardware constraints. There are lots of knobs to tweak there.
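For reference, a basho_bench run is driven by a file of Erlang config terms along these lines (a sketch based on the shipped protobuf driver example; the exact generators and weights are assumptions you would adapt to your write-heavy workload):

```erlang
%% Sketch of a basho_bench config exercising the protobuf interface.
{mode, max}.                                  % go as fast as possible
{duration, 5}.                                % minutes
{concurrent, 10}.                             % worker processes
{driver, basho_bench_driver_riakc_pb}.
{riakc_pb_ips, [{127,0,0,1}]}.
{key_generator, {int_to_bin, {uniform_int, 1000000}}}.
{value_generator, {fixed_bin, 100}}.          % 100-byte values
{operations, [{get, 1}, {update, 1}]}.        % weight PUT-heavy for ingest tests
```

The generated graphs will show you the ceiling your hardware imposes before you conclude the chunk-and-fork loader itself is the bottleneck.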


Any and all advice is welcome!

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
