model for a time index on realtime data

Colin Mon, 12 Apr 2010 08:56:47 -0700

Hi,

I am trying to figure out the best model in Riak for my application. I
have read & reread the docs, wiki and list threads but haven't put any
possible solution to the test yet. I'd like your feedback to help me
avoid dead-end solutions if possible!


The basic idea is that I want to aggregate a large amount of realtime
data and I want to easily retrieve this data using some time
constraints (for example, all data for the past hour, day, etc). When
I say a large amount of data I mean on the order of hundreds of items
per second, and each item should be stored individually. For this I
will have a realtime items bucket.

Now, to create my "time" index I have a few ideas:

- create a "time capsules" bucket, with a new capsule created at some
given interval. My application would use a periodic data flusher to
accumulate and flush the data for each time capsule period, every
second for example. These time capsules would only contain links to
all items for their interval. I was actually thinking of two-way links
so that for any realtime item I could also retrieve its corresponding
time capsule. Each time capsule would also be doubly linked to its
neighbours to allow walking forward or backward in time. A new time
capsules bucket could be created every day, for example.

The problem I foresee with this structure is the possible number of
links in each time capsule. In this example, I am thinking of one
capsule per second, with hundreds of realtime items resulting in
hundreds of links in each capsule. From what I have read, it might not
be the best idea to deal with that many links, and it will not scale
if the rate grows to thousands of items per second.

- another idea would be to store my realtime item keys directly
within each capsule instead of using links, but still use links from
realtime items to their capsule and between capsules. This structure
would allow me to gather all realtime items for a given timeframe by
fetching the key list within each capsule.

- yet another solution would be to use an external storage engine
that supports sorted sets, like Redis, and store references to my
realtime items there. The problem with this is that I will not be
able to easily launch map-reduce jobs to crunch data within a specific
timeframe.
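
To make the second idea a bit more concrete, here is a rough Python
sketch of what I have in mind. It is only an illustration: a plain
dict stands in for the capsules bucket, the one-second interval and
all function names are my own placeholders, and nothing here talks to
Riak yet.

```python
import math
from typing import Dict, List, Tuple

CAPSULE_SECONDS = 1  # assumed capsule interval (placeholder value)


def capsule_key(ts: float, interval: int = CAPSULE_SECONDS) -> str:
    """Key of the capsule covering ts: the start of its interval."""
    return str(int(ts // interval) * interval)


def flush(capsules: Dict[str, List[str]],
          buffered: List[Tuple[float, str]]) -> None:
    """Append buffered (timestamp, item_key) pairs to their capsules."""
    for ts, item_key in buffered:
        capsules.setdefault(capsule_key(ts), []).append(item_key)


def item_keys_between(capsules: Dict[str, List[str]],
                      start: float, end: float,
                      interval: int = CAPSULE_SECONDS) -> List[str]:
    """Item keys of every capsule whose interval overlaps [start, end)."""
    keys: List[str] = []
    first = int(start // interval) * interval
    for t in range(first, math.ceil(end), interval):
        keys.extend(capsules.get(str(t), []))
    return keys
```

Because the capsule key is derived from the timestamp, the capsules
for a timeframe can be fetched by computing their keys directly,
without walking the links between them.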

Finally, the end goal is for me to be able to run map-reduce jobs on
my realtime data within a given timeframe to create other external
indexes on my data, for fulltext search for example.
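
As far as I understand the docs, a map-reduce job can take an explicit
list of bucket/key pairs as its inputs, so once I have the item keys
for a timeframe I could build the job roughly like this. This is an
untested sketch: the bucket name "realtime_items" is a placeholder,
and the built-in Riak.mapValuesJson map function is just an example
that assumes my items are stored as JSON.

```python
import json
from typing import Dict, List


def build_mapred_job(item_keys: List[str],
                     bucket: str = "realtime_items") -> Dict:
    """Build a map-reduce job description over explicit bucket/key pairs."""
    return {
        # One [bucket, key] pair per realtime item in the timeframe.
        "inputs": [[bucket, key] for key in item_keys],
        # A single map phase using a built-in JavaScript function.
        "query": [
            {"map": {"language": "javascript",
                     "name": "Riak.mapValuesJson"}}
        ],
    }


if __name__ == "__main__":
    # The JSON body would then be POSTed to the /mapred HTTP resource.
    print(json.dumps(build_mapred_job(["item-1", "item-2"]), indent=2))
```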

Any comments, hints, pointers will be appreciated!

Thanks,
Colin

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
