Sample code to apply SVD to the KDD data
----------------------------------------
Key: MAHOUT-657
URL: https://issues.apache.org/jira/browse/MAHOUT-657
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
Fix For: 0.5
I was prompted by some comments on Twitter to make our SVD-based recommendation
code work on the KDD data. Here are the results so far:
The patch contains a tweaked version of ExpectationMaximizationSVDFactorizer
(org.apache.mahout.cf.taste.example.kddcup.track1.svd.ParallelArraysSGDFactorizer)
in the examples module that can load and process the KDD dataset with a
constant memory usage of approximately 7 GB (by using primitive arrays for
everything).
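To illustrate the "primitive arrays for everything" idea, here is a minimal sketch of ratings stored in three parallel arrays instead of one object per rating. The class and method names are hypothetical and are not taken from the patch; the real ParallelArraysSGDFactorizer may lay out its data differently.

```java
/** Hypothetical illustration only; not the class from the patch. */
final class ParallelRatingArrays {
    // Entry n of each array describes the n-th rating.
    private final int[] userIndexes;
    private final int[] itemIndexes;
    private final float[] values;
    private int size;

    ParallelRatingArrays(int capacity) {
        userIndexes = new int[capacity];
        itemIndexes = new int[capacity];
        values = new float[capacity];
    }

    /** Appends one rating; fails if the preallocated capacity is used up. */
    void add(int userIndex, int itemIndex, float value) {
        if (size == values.length) {
            throw new IllegalStateException("capacity exceeded");
        }
        userIndexes[size] = userIndex;
        itemIndexes[size] = itemIndex;
        values[size] = value;
        size++;
    }

    int size() { return size; }
    int userAt(int n) { return userIndexes[n]; }
    int itemAt(int n) { return itemIndexes[n]; }
    float valueAt(int n) { return values[n]; }

    /** Bytes held by the three arrays, ignoring the fixed array headers. */
    long payloadBytes() {
        return (long) values.length * (2L * Integer.BYTES + Float.BYTES);
    }
}
```

With this layout each rating costs 12 bytes of payload and no per-object overhead, which is why memory usage stays constant once the arrays are allocated.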
Unfortunately, it's still very slow: a factorization using 40 features and 25
iterations took 10 hours on my desktop PC. As far as I understand the math
behind it, the algorithm is not parallelizable, but maybe someone can improve
my implementation or make it compute several factorizations at once.
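For readers unfamiliar with the math, here is a sketch of one generic stochastic gradient descent step for matrix factorization. It is an assumption about the general technique, not the update rule from the patch; the class name, the flat row-major feature arrays, and the learningRate and regularization parameters are all hypothetical. It shows that each step reads features written by earlier steps, which is the sequential dependency that makes a naive parallel version difficult.

```java
import java.util.Arrays;

/** Hypothetical sketch of a plain SGD step; not the code from the patch. */
final class SgdStepSketch {
    private final int numFeatures;
    // Row-major feature matrices stored as flat primitive arrays.
    private final double[] userFeatures;
    private final double[] itemFeatures;

    SgdStepSketch(int numUsers, int numItems, int numFeatures, double initialValue) {
        this.numFeatures = numFeatures;
        userFeatures = new double[numUsers * numFeatures];
        itemFeatures = new double[numItems * numFeatures];
        Arrays.fill(userFeatures, initialValue);
        Arrays.fill(itemFeatures, initialValue);
    }

    /** Predicted rating: dot product of the user and item feature vectors. */
    double predict(int user, int item) {
        int u = user * numFeatures;
        int i = item * numFeatures;
        double sum = 0.0;
        for (int f = 0; f < numFeatures; f++) {
            sum += userFeatures[u + f] * itemFeatures[i + f];
        }
        return sum;
    }

    /** Moves both feature vectors a small step against the squared-error gradient. */
    void update(int user, int item, double rating,
                double learningRate, double regularization) {
        double err = rating - predict(user, item);
        int u = user * numFeatures;
        int i = item * numFeatures;
        for (int f = 0; f < numFeatures; f++) {
            double p = userFeatures[u + f]; // old values, read before writing
            double q = itemFeatures[i + f];
            userFeatures[u + f] = p + learningRate * (err * q - regularization * p);
            itemFeatures[i + f] = q + learningRate * (err * p - regularization * q);
        }
    }
}
```

One iteration in this scheme is a pass over all known ratings calling update once per rating, so the runtime grows with ratings × features × iterations.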
I took a wild guess at the parameters and got an RMSE of 23.35 on the
validation set and an RMSE of 26.1287 on the secret test ratings (that's rank
63 at the time of this writing).
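For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings. A minimal sketch, using a hypothetical helper class that is not part of the patch:

```java
/** Hypothetical helper; shows the standard RMSE formula. */
final class RmseSketch {
    /** Root mean squared error between two equally long, non-empty arrays. */
    static double rmse(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and equally long");
        }
        double sumOfSquares = 0.0;
        for (int n = 0; n < predicted.length; n++) {
            double diff = predicted[n] - actual[n];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares / predicted.length);
    }
}
```

Lower is better, and the value is in the same units as the ratings themselves.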
Would love to see people play with this code and improve it!
To use this, have a look at the parameters in
*org.apache.mahout.cf.taste.example.kddcup.track1.svd.Track1SVDRunner*, change
them as you see fit, and run that class with two arguments: the path to the KDD
data directory and the path to the file where the results should be stored. In
my tests I used *-Xms6700M -Xmx6700M* to give the JVM enough memory for 40
features.
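If you change the number of features, it may help to check the heap limit before allocating. The sketch below is a hypothetical helper, not part of the patch: it estimates the bytes needed for a feature matrix and compares that with the limit the JVM reports via Runtime.getRuntime().maxMemory(), which reflects the -Xmx setting. The row counts and bytes per value are whatever you pass in; none are taken from the KDD data.

```java
/** Hypothetical helper for a rough heap check; not part of the patch. */
final class HeapCheckSketch {
    /** Maximum heap the JVM will attempt to use, as set by -Xmx. */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Bytes needed for rows x numFeatures values; throws ArithmeticException on overflow. */
    static long featureBytes(long rows, int numFeatures, int bytesPerValue) {
        return Math.multiplyExact(Math.multiplyExact(rows, (long) numFeatures),
                                  (long) bytesPerValue);
    }

    /** True if the requested size does not exceed the given heap limit. */
    static boolean fitsInHeap(long requiredBytes, long maxHeapBytes) {
        return requiredBytes <= maxHeapBytes;
    }
}
```

This is only a lower bound on what the run needs, since the ratings themselves and other working data also live on the heap.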
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira