h1. Introduction
TODO: YMMV
h1. Recommenders
h2. A Rule of Thumb
Roughly 100M preferences is the data set size at which non-distributed recommenders
outgrow a normal-sized machine (32-bit, <= 4GB RAM). Your mileage will vary
significantly with the nature of the data.
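For concreteness, a non-distributed recommender here means the in-memory "Taste" API, along the lines of the minimal sketch below. The file name, similarity choice, and neighborhood size are illustrative only; the point is that the entire DataModel is held in RAM, which is why data set size versus heap is the limiting factor.

{code:java}
// Minimal sketch of a non-distributed (in-memory) Mahout recommender.
// "prefs.csv", the Pearson similarity and the neighborhood size of 25 are
// illustrative choices, not part of the benchmarks on this page.
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class QuickRecommender {
  public static void main(String[] args) throws Exception {
    // Preferences file: one "userID,itemID,value" line per preference.
    // The whole model is loaded into memory.
    DataModel model = new FileDataModel(new File("prefs.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(25, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top-10 recommendations for user 42.
    List<RecommendedItem> recs = recommender.recommend(42L, 10);
    for (RecommendedItem item : recs) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}
{code}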
h2. Distributed recommender vs. Wikipedia links (May 27 2010)
From the mailing list:
I just finished running a set of recommendations based on the Wikipedia link
graph, for book purposes (yeah, it's unconventional). I ran on my laptop, but
it ought to be crudely representative of how it runs in a real cluster.
The input is 1058MB as a text file and contains 130M article-article
associations, from 5.7M articles to 3.8M distinct articles ("users" and
"items", respectively). I estimate cost based on Amazon's North American small
Linux-based instance pricing of $0.085/hour. I ran on a dual-core laptop with
plenty of RAM, allowing 1GB per worker, so the figures should be roughly
representative. In this run I compute recommendations for all 5.7M "users";
you could of course run for any subset of users.
||Phase||CPU time||Cost||Output||
|1. Item ID to item index mapping|29 minutes|$0.05|60MB|
|2. Create user vectors|88 minutes|$0.13|1159MB|
|3. Count co-occurrence|77 hours|$6.54|23.6GB|
|4. Partial multiply prep|10.5 hours|$0.90|24.6GB|
|5. Aggregate and recommend|about 600 hours|about $51.00|about 10GB|
(Phase 5 figures are estimates rather than letting it run at home for days!)
Note that phases 1 and 3 may be run less frequently, and need not be run every
time. But the cost is dominated by the last step, which is most of the work.
I've ignored storage costs.
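For reference, the pipeline timed above is the distributed item-based RecommenderJob; a hedged sketch of launching it from Java follows. The input path, output path, and option values are illustrative, and the available options vary somewhat between Mahout versions, so check the job's usage output.

{code:java}
// Hedged sketch: launching the distributed item-based recommender pipeline
// whose phases are timed above. Paths and option values are illustrative.
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class WikipediaLinkRecs {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new RecommenderJob(), new String[] {
        "--input", "/input/wikipedia-links.txt",   // userID,itemID[,pref] per line
        "--output", "/output/recommendations",
        "--numRecommendations", "10"
    });
  }
}
{code}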
This implies a cost of about $0.01 (or about 8 instance-minutes) per 1,000 user
recommendations: the five phases total roughly $58.60, spread over 5.7M users.
That's not bad if, say, you want to update recs for your site's 100,000 daily
active users for about a dollar.
There are several levers one could pull internally to sacrifice accuracy for
speed, but they are currently set to fairly normal values, so this is just one
point in the trade-off space.
Now that's not terrible, but it is about 8x more computing than would be needed
by a non-distributed implementation *if* you could fit the whole data set into
a very large instance's memory, which is still possible at this scale but needs
a pretty big instance. That's a very apples-to-oranges comparison of course;
different algorithms, entirely different environments. This is about the amount
of overhead I'd expect from distributing -- interesting to note how non-trivial
it is.
h2. Non-distributed recommender vs. KDD Cup data set (March 2011)
(From the [email protected] mailing list)
I've been test-driving a simple application of Mahout recommenders (the
non-distributed kind) on Amazon EC2 on the new Yahoo KDD Cup data set
(kddcup.yahoo.com).
In the spirit of open source, I'm committing the extra code to mahout-examples
that can be used to run a Recommender on the input and output results in the
required format, and I'd like to publish the rough timings too. Find all the
source in org.apache.mahout.cf.taste.example.kddcup.
h3. Track 1
* m2.2xlarge instance, 34.2GB RAM / 4 cores
* Steady state memory consumption: ~19GB
* Computation time: 30 hours (wall-clock time)
* CPU time per user: ~0.43 sec
* Cost on EC2: $34.20 (!)
(Helpful hint on cost I realized after the fact: you can almost surely get spot
instances for cheaper. The maximum price this sort of instance has gone for as
a spot instance is about $0.60/hour, vs "retail price" of $1.14/hour.)
This resulted in an RMSE of 29.5618 (the rating scale is 0-100), which is only
good enough for 29th place at the moment. Not terrible for "out of the box"
performance -- it's just using an item-based recommender with uncentered cosine
similarity, as sketched below -- but not really good in absolute terms. A
winning solution is going to need to factor in time and apply more
sophisticated techniques. The best RMSE so far is about 23.
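The "out of the box" setup above corresponds roughly to the wiring below. The actual KDD Cup code in org.apache.mahout.cf.taste.example.kddcup builds its own DataModel for the competition's file format; here a generic DataModel stands in for it.

{code:java}
// Sketch of the Track 1 setup described above: an item-based recommender
// with uncentered cosine similarity. The KDD-Cup-specific DataModel wiring
// is omitted; any DataModel can be passed in.
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class Track1Sketch {
  public static Recommender build(DataModel model) throws Exception {
    ItemSimilarity similarity = new UncenteredCosineSimilarity(model);
    // RMSE evaluation then just calls estimatePreference(userID, itemID) on the
    // returned recommender for each test pair (ratings are 0-100 here).
    return new GenericItemBasedRecommender(model, similarity);
  }
}
{code}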
h3. Track 2
* c1.xlarge instance: 7GB RAM / 8 cores
* Steady state memory consumption: ~3.8GB
* Computation time: 4.1 hours (wall-clock time)
* CPU time per user: ~1.1 sec
* Cost on EC2: $3.20
For this I bothered to write a simplistic item-item similarity metric to take
into account the additional info that is available: track, artist, album,
genre. The result was comparatively better: 17.92% error rate, good enough for
4th place at the moment.
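The metadata-based similarity is not spelled out in the post; the sketch below is a hypothetical illustration of that kind of metric. The TrackData holder, field names, and weights are invented for illustration and are not the code committed to mahout-examples; in Mahout this logic would be exposed through the org.apache.mahout.cf.taste.similarity.ItemSimilarity interface, whose exact method set varies a bit between releases.

{code:java}
// Hypothetical sketch of a simplistic item-item similarity based on the
// track/artist/album/genre metadata available in Track 2. All names and
// weights here are illustrative guesses, not the actual committed code.
public final class TrackMetadataSimilarity {

  /** Simple holder for the per-track metadata. */
  public static final class TrackData {
    final long artistID;
    final long albumID;
    final long[] genreIDs;
    public TrackData(long artistID, long albumID, long[] genreIDs) {
      this.artistID = artistID;
      this.albumID = albumID;
      this.genreIDs = genreIDs;
    }
  }

  /** Returns a similarity in [0,1]: same album > same artist > shared genres. */
  public static double similarity(TrackData a, TrackData b) {
    if (a.albumID == b.albumID) {
      return 1.0;   // same album: strongly related
    }
    if (a.artistID == b.artistID) {
      return 0.7;   // same artist: fairly related (weight is a guess)
    }
    int shared = 0;
    for (long g1 : a.genreIDs) {
      for (long g2 : b.genreIDs) {
        if (g1 == g2) {
          shared++;
        }
      }
    }
    // Shared genres give only a weak, capped signal.
    return Math.min(0.3, 0.1 * shared);
  }
}
{code}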
Of course, the next task is to put this through the actual distributed
processing -- that's really the appropriate solution.
This shows you can still tackle fairly impressive scale with a non-distributed
solution. These results suggest that the largest instances available from EC2
would accommodate almost 1 billion ratings in memory. However, at that scale,
computing a user's full recommendations would easily be measured in seconds,
not milliseconds.
h1. Clustering
See [MAHOUT-588|https://issues.apache.org/jira/browse/MAHOUT-588]
h1. Classification
h1. Frequent Pattern Mining