Hi,

We're an early-stage analytics company evaluating Riak as an option for our analytics DB. Long story short: we don't have the volume to necessitate, or the time to support, Hive / HBase / Hadoop, and our current analytics database (RavenDB) is starting to buckle under the load we're throwing at it.
I went through this presentation about Kiip (a company with struggles similar to ours) scaling on Riak, and wanted to ask some questions as prospective heavy MapReduce users: https://speakerdeck.com/mitchellh/day-at-kiip

So here's what I wanted to ask:

1. Is there an easy mechanism for running scheduled MapReduce jobs within Riak? If there isn't, how would you set this up: a cron job, or Hadoop running on a separate box?

2. I've seen some creative strategies for using post-commit hooks in Riak to write incremental aggregate values (i.e. for each record added to the User bucket, increment the TotalUsers property by one on a DailyUsers record in the DailyUsers bucket). How well does this work in practice? Are there any atomicity or consistency issues across Riak's vnodes that could cause problems?

3. Is there a way to give Riak a hint about where it should store data? I.e., if I have related objects across multiple buckets, can I store them on the same vnode so we don't have to span the network to run a M/R job?

4. What are some best practices for running large (millions of objects) MapReduce jobs?

Best,

--
*Aaron Stannard* • Founder • MarkedUp • markedup.com • [email protected] • 424.256.8675 • github.com/Aaronontheweb • @aaronontheweb
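P.S. To make question 2 concrete, here's a minimal sketch of the lost-update race I'm worried about. This is plain Python against an in-memory dict, not Riak client code, and the key name is made up; it just models two post-commit hooks that each do a read-increment-write against the same aggregate record:

```python
# Simulated key/value store standing in for a Riak bucket (illustration only).
store = {"DailyUsers/2012-11-01": 0}

def read(key):
    return store[key]

def write(key, value):
    store[key] = value

# Two hooks fire "concurrently": both read the counter before either writes.
a = read("DailyUsers/2012-11-01")      # hook A reads 0
b = read("DailyUsers/2012-11-01")      # hook B also reads 0
write("DailyUsers/2012-11-01", a + 1)  # hook A writes 1
write("DailyUsers/2012-11-01", b + 1)  # hook B writes 1 -- A's increment is lost

print(store["DailyUsers/2012-11-01"])  # prints 1, not the expected 2
```

If Riak's vnodes can interleave hook executions like this, I'd expect either sibling resolution or some conflict-handling strategy to be necessary for these aggregate counters.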
_______________________________________________ riak-users mailing list [email protected] http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
