Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Mahout Benchmarks 
(https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Benchmarks)


h1. Introduction

TODO. Note that all figures below are rough, one-off measurements; your mileage may vary (YMMV).

h1. Recommenders

h2. A Rule of Thumb

Roughly 100M preferences is the data set size at which a non-distributed recommender 
outgrows a normal-sized machine (32-bit, <= 4GB RAM). Your mileage will vary 
significantly with the nature of the data.

h2. Distributed recommender vs. Wikipedia links (May 27 2010)

From the mailing list:

I just finished running a set of recommendations based on the Wikipedia link 
graph, for book purposes (yeah, it's unconventional). I ran it on my laptop, but 
the results ought to be crudely representative of how it would run on a real cluster.

The input is 1058MB as a text file and contains 130M article-article 
associations, from 5.7M articles to 3.8M distinct articles ("users" and 
"items", respectively). I estimate cost based on Amazon's North
American small Linux-based instance pricing of $0.085/hour. I ran on a 
dual-core laptop with plenty of RAM, allowing 1GB per worker, so the 
per-instance-hour figures should be roughly valid.

In this run, I compute recommendations for all 5.7M "users"; you can of course 
run for any subset of users instead.

||Phase||CPU time||Cost||Output||
|1. Item ID to item index mapping|29 minutes|$0.05|60MB|
|2. Create user vectors|88 minutes|$0.13|1159MB|
|3. Count co-occurrence|77 hours|$6.54|23.6GB|
|4. Partial multiply prep|10.5 hours|$0.90|24.6GB|
|5. Aggregate and recommend|about 600 hours|about $51.00|about 10GB|

(Phase 5 figures are estimates rather than measurements; I didn't let it run at home for days!)


Note that phases 1 and 3 may be run less frequently; they need not be run every 
time. The cost is dominated by the last phase, which does most of the work. 
I've ignored storage costs throughout.

This implies a cost of about $0.01 (or about 8 instance-minutes) per 1,000 user 
recommendations. That's not bad if, say, you want to update recommendations for 
your site's 100,000 daily active users for a dollar.

There are several levers one could pull internally to sacrifice accuracy for 
speed, but they are currently set to pretty normal values, so this is just one 
possible operating point.

Now, that's not terrible, but it is about 8x more computing than a 
non-distributed implementation would need *if* you could fit the whole data set 
into a very large instance's memory, which is still possible at this scale but 
needs a pretty big instance. That's a very apples-to-oranges comparison, of 
course: different algorithms, entirely different environments. Still, this is 
about the amount of overhead I'd expect from distributing, and it's interesting 
to note how non-trivial it is.


h1. Clustering

h1. Classification

h1. Frequent Pattern Mining


