Hi,

I am trying to figure out the best data model in Riak for my application. I have read and reread the docs, wiki and list threads but haven't put any possible solution to the test yet. I'd like your feedback to help me avoid dead-end solutions if possible!

The basic idea is that I want to aggregate a large amount of realtime data and easily retrieve it using time constraints (for example, all data for the past hour, day, etc.). By a large amount of data I mean on the order of hundreds of items per second, and each item should be stored individually. For this I will have a realtime items bucket.

Now, to create my "time" index I have a few ideas:

- Create "time capsule" buckets, with a new capsule created at some given interval. My application would use a periodic data flusher to accumulate the data for the current capsule period and flush it, every second for example. These time capsules would only contain links to all items for their interval. I was actually thinking of two-way links, so that for any realtime item I could also retrieve its corresponding time capsule. Each time capsule would also be doubly linked to its neighbours to allow walking forward or backward in time. A new time capsule bucket could be created every day, for example. The problem I foresee with this structure is the number of links in each time capsule. In this example I am thinking of one capsule per second with hundreds of realtime items, resulting in hundreds of links in each capsule. From what I have read, it might not be the best idea to deal with that many links, and it will not scale if the rate grows to thousands of items per second.

- Another idea would be to store my realtime item keys directly within each capsule instead of using links, but still use links from realtime items to their capsule and between capsules. This structure would allow me to gather all realtime items for a given timeframe by fetching the key list within each capsule.

- Yet another solution would be to use another, external storage engine such as Redis, which supports sorted sets, and store references to my realtime items there. The problem with this is that I would not be able to easily launch map-reduce jobs to crunch data within a specific timeframe.
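
To make the second idea a bit more concrete, here is a minimal sketch of how the capsule keying could work. It is plain Python with no Riak client involved; the bucket naming scheme (capsules-YYYYMMDD), the epoch-second keys, the "prev"/"next" fields and all function names are assumptions made up for illustration, not anything Riak prescribes.

```python
from datetime import datetime, timedelta, timezone

CAPSULE_SECONDS = 1  # one capsule per second, as in the example above


def capsule_bucket(ts):
    """Hypothetical naming scheme: one capsule bucket per UTC day."""
    return ts.astimezone(timezone.utc).strftime("capsules-%Y%m%d")


def capsule_key(ts):
    """Capsule key = start of the interval containing ts, in epoch seconds."""
    epoch = int(ts.timestamp())
    return str(epoch - epoch % CAPSULE_SECONDS)


def new_capsule(ts, item_keys):
    """Build a capsule value holding item keys directly instead of links.

    "prev" and "next" are the neighbouring capsule keys, which allows
    walking backward or forward in time.
    """
    start = int(capsule_key(ts))
    return {
        "start": start,
        "prev": str(start - CAPSULE_SECONDS),
        "next": str(start + CAPSULE_SECONDS),
        "items": list(item_keys),
    }


def capsules_between(start, end):
    """Return (bucket, key) pairs for every capsule covering [start, end)."""
    t = int(capsule_key(start))
    last = int(end.timestamp())
    pairs = []
    while t < last:
        ts = datetime.fromtimestamp(t, tz=timezone.utc)
        pairs.append((capsule_bucket(ts), str(t)))
        t += CAPSULE_SECONDS
    return pairs


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    hour = capsules_between(now - timedelta(hours=1), now)
    print(len(hour), "capsules to fetch for the past hour")
```

Because capsule keys are derived from the timestamp alone, the capsules for a timeframe can be enumerated without listing any bucket. The item keys collected from those capsules could then presumably be passed as explicit bucket/key inputs to a map-reduce job.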

Finally, the end goal is for me to be able to run map-reduce jobs on my realtime data within a given timeframe in order to create other, external indexes on my data, for fulltext search for example.

Any comments, hints or pointers will be appreciated!

Thanks,
Colin

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
