Inline...

@siculars on twitter
http://siculars.posterous.com

Sent from my iPhone

On Sep 27, 2010, at 11:45, Jason McInerney <[email protected]> wrote:

Throughout the year I've seen several conversations about Riak
discussing ideas on how to store, retrieve, and manipulate large
(250MB+) data sets, and I'm wondering if anyone has implemented a good
system yet with Riak.
My situation:
1) data files come in as 2 columns: time-from-start (usually a fixed
interval in milliseconds) and value, with meta data
2) files can have hundreds of millions of rows
3) retrieval will be as all data, raw subsets, subsets smoothed over
time, or subsets according to meta data

I've had some success storing as a bucket per file, with keys as a
milliseconds from start, and retrieval is awesome.  M/R works well
getting subsets.

Problems are:
a) the speed of getting the data into Riak -- I fork off 100 - 1000
threads and do PUTs on each row (basically chunk & fork), but this is
really slow.  So is a single process doing one row at a time.

Are you using the protobuf or the HTTP interface? I would use the former. There are lots and lots of resources on how to best chunk large text files in your favorite language, and there may well be a protobuf lib for Riak in that language.
- Whichever interface you are using, make sure your lib is not returning the data in its reply or doing any extra Riak work like waiting on n successful replies.
- In your connection, supply the client id. Otherwise Riak auto-generates one for you (an optimization, but extra processing nonetheless).
- Round-robin your IPs if you have a cluster of Riak nodes.
- If you are testing on one node, set n_val to 1 in your app.config file, or you're just churning disk.
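The chunk-and-fork idea above can be sketched with a bounded thread pool instead of hundreds of short-lived threads. This is only a sketch: the `put` callable is a hypothetical stand-in for a real protobuf PUT (e.g. wiring it to a Riak client's store call with the reply body suppressed, as suggested above), and the worker count is a guess you would tune.

```python
# Sketch: chunk-and-fork loading with a bounded worker pool.
# `put(key, value)` is a placeholder for a real Riak protobuf PUT;
# a fixed pool of tens of workers, each reusing one connection,
# tends to beat forking 100-1000 threads per file.
from concurrent.futures import ThreadPoolExecutor

def load_rows(rows, put, workers=32):
    """Stream (ms_offset, value) rows into the store via `put`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for ms, value in rows:
            # Key is the milliseconds-from-start offset, as in the
            # original bucket-per-file layout.
            pool.submit(put, str(ms), value)

# Runnable stand-in for the real protobuf PUT:
stored = {}
def fake_put(key, value):
    stored[key] = value

load_rows([(0, 1.0), (10, 1.5), (20, 2.0)], fake_put, workers=4)
```

With a real client you would swap `fake_put` for a closure over an open protobuf connection; the pool boundary is where you'd also round-robin across node IPs.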

b) memory (RAM) usage after a few files are in remains very high
(30-40%), so I worry that this may not perform well with thousands of
files

If you are using the bitcask backend (the default), memory is governed by the total number of keys in the cask: add more keys, eat more RAM. There is a rough metric floating around (off the top of my head I think it's 40 bytes + key length, per key). The innostore backend may be better memory-wise, although it does consume file descriptors on a per-bucket basis.

c) I simply don't know if this is the best way to do this sort of
work.  Other DBs are an option, but I prefer Riak's features.

Take a look at basho_bench to get a feel for the ops/sec of your own setup. You may be hitting some max ops/sec due to hardware constraints. There are lots of knobs to tweak there.
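For reference, a basho_bench run is driven by a file of Erlang config terms along these lines (a sketch based on the shipped protobuf driver example; the exact generators and weights are assumptions you would adapt to your write-heavy workload):

```erlang
%% Sketch of a basho_bench config exercising the protobuf interface.
{mode, max}.                                  % go as fast as possible
{duration, 5}.                                % minutes
{concurrent, 10}.                             % worker processes
{driver, basho_bench_driver_riakc_pb}.
{riakc_pb_ips, [{127,0,0,1}]}.
{key_generator, {int_to_bin, {uniform_int, 1000000}}}.
{value_generator, {fixed_bin, 100}}.          % 100-byte values
{operations, [{get, 1}, {update, 1}]}.        % weight PUT-heavy for ingest tests
```

The generated graphs will show you the ceiling your hardware imposes before you conclude the chunk-and-fork loader itself is the bottleneck.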


Any and all advice is welcome!

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
