[ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006033#comment-13006033 ]

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

Attached a new version of the patch. I'd like to commit this one in the next 
few days, if there are no objections (and no errors found). This patch removes 
some parts of the code that were highly memory-intensive and hopefully enables 
tests with a higher number of features. It introduces a set of tools that might 
enable a first real-world usage of this algorithm:

* DatasetSplitter: split a rating dataset into training and probe parts
* ParallelALSFactorizationJob: parallel ALS-WR factorization of a rating matrix
* PredictionJob: predict preferences using the factorization of a rating matrix
* InMemoryFactorizationEvaluator: compute RMSE of a rating matrix factorization 
against probes in memory
* ParallelFactorizationEvaluator: compute RMSE of a rating matrix factorization 
against probes

There are still open points, in particular how to find a good regularization 
parameter automatically and efficiently, and how to create an automated 
recommender pipeline similar to that of RecommenderJob on top of these tools. 
But I think these issues can be tackled in the future.
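One way the open λ question might be attacked, short of anything automatic, is a plain grid search: retrain and evaluate once per candidate value and keep the best. The sketch below is illustrative Python only; evaluate_lambda is a hypothetical hook that would run parallelALS and an evaluator for the given λ and return the probe-set RMSE, and is stubbed here with a toy function so the sketch stands alone.

```python
# Naive grid search over the regularization parameter lambda.
# evaluate_lambda() is a hypothetical stand-in for a full train/evaluate
# cycle (parallelALS + RMSE on the probe set); the toy quadratic below
# just makes the sketch self-contained and deterministic.
def evaluate_lambda(lam):
    return (lam - 0.065) ** 2 + 0.85  # pretend 0.065 is the sweet spot

candidates = [0.01, 0.035, 0.065, 0.1, 0.25]
best = min(candidates, key=evaluate_lambda)
print(best)  # -> 0.065 with the toy stub
```

Each evaluate_lambda call is a full factorization run, so in practice the grid would stay coarse and small.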

Here's how to play with the code:

{noformat}
# convert the MovieLens 1M dataset to Mahout's common format for ratings
cat /path/to/ratings.dat | sed -e 's/::/,/g' | cut -d, -f1,2,3 > /path/to/ratings.csv

# create a 90% training set and a 10% probe set
bin/mahout splitDataset --input /path/to/ratings.csv --output /tmp/dataset \
  --trainingPercentage 0.9 --probePercentage 0.1

# run distributed ALS-WR to factorize the rating matrix based on the training set
bin/mahout parallelALS --input /tmp/dataset/trainingSet/ --output /tmp/als/out \
  --tempDir /tmp/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065

# compute predictions against the probe set and measure the error
bin/mahout evaluateFactorizationParallel --output /tmp/als/rmse \
  --pairs /tmp/dataset/probeSet/ --userFeatures /tmp/als/out/U/ \
  --itemFeatures /tmp/als/out/M/

# print the error
cat /tmp/als/rmse/rmse.txt
0.8531723318490103

# alternatively, use the factorization to predict unknown ratings
bin/mahout predictFromFactorization --output /tmp/als/predict \
  --pairs /tmp/dataset/probeSet/ --userFeatures /tmp/als/out/U/ \
  --itemFeatures /tmp/als/out/M/ --tempDir /tmp/als/predictTmp

# look at the predictions
cat /tmp/als/predict/part-r-*
1,150,4.0842405867880975
1,1029,4.163510579205656
1,745,3.7759166479388777
1,2294,3.495085673991081
1,938,3.6820865362790594
2,2067,3.8303249557251644
2,1090,3.954322089979675
2,1196,3.912089186677311
2,498,2.820740198815573
2,593,4.090550572202017
...
{noformat}
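As a quick sanity check outside Hadoop, the probe RMSE can be recomputed directly from the CSV output. The following Python sketch uses toy in-memory strings in the same user,item,rating format; a real run would read /tmp/dataset/probeSet/ and /tmp/als/predict/part-r-* instead (the values below are made up for illustration):

```python
# Recompute RMSE from probe ratings and predictions in user,item,rating
# CSV format (toy data; real runs would read the files produced above).
import csv, io, math

probe_csv = "1,150,4.0\n1,1029,4.5\n"
predictions_csv = "1,150,4.08\n1,1029,4.16\n"

def read_ratings(text):
    # map (user, item) -> rating
    return {(u, i): float(r) for u, i, r in csv.reader(io.StringIO(text))}

probe = read_ratings(probe_csv)
predicted = read_ratings(predictions_csv)

squared_errors = [(probe[k] - predicted[k]) ** 2
                  for k in probe if k in predicted]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(round(rmse, 4))  # prints 0.247 for the toy data
```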

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, 
> MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch, 
> MAHOUT-542-6.patch, logs.zip
>
>
> As Mahout is currently lacking a distributed collaborative filtering 
> algorithm that uses matrix factorization, I spent some time reading through a 
> couple of the Netflix papers and stumbled upon the "Large-scale Parallel 
> Collaborative Filtering for the Netflix Prize" available at 
> http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf.
> It describes a parallel algorithm that uses "Alternating-Least-Squares with 
> Weighted-λ-Regularization" to factorize the preference-matrix and gives some 
> insights on how the authors distributed the computation using Matlab.
> It seemed to me that this approach could also easily be parallelized using 
> Map/Reduce, so I sat down and created a prototype version. I'm not really 
> sure I got the mathematical details correct (they need some optimization 
> anyway), but I want to put up my prototype implementation here per Yonik's 
> law of patches.
> Maybe someone has the time and motivation to work a little on this with me. 
> It would be great if someone could validate the approach taken (I'm willing 
> to help as the code might not be intuitive to read) and could try to 
> factorize some test data and give feedback then.
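For readers who want to see the alternation described in the paper without wading through the MapReduce code, here is an illustrative, dependency-free Python sketch of ALS-WR restricted to a single latent feature, where each regularized least-squares solve collapses to a scalar update (the toy ratings and values are made up, not from the patch):

```python
# Rank-1 ALS-WR on a toy rating matrix (illustrative data only).
# With one latent feature, the per-user / per-item regularized
# least-squares solves reduce to scalar updates.
import random

ratings = {  # (user, item) -> rating
    (0, 0): 5.0, (0, 1): 3.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 2.0, (2, 2): 1.0,
}
num_users, num_items, lam = 3, 3, 0.065

def solve_users(M):
    # ALS-WR user update: u = sum(r * m_i) / (sum(m_i^2) + lambda * n_u),
    # where n_u is the user's rating count (the "weighted-lambda" part).
    new_U = []
    for u in range(num_users):
        num = den = n = 0.0
        for (uu, i), r in ratings.items():
            if uu == u:
                num += r * M[i]
                den += M[i] ** 2
                n += 1
        new_U.append(num / (den + lam * n) if n else 0.0)
    return new_U

def solve_items(U):
    # Symmetric item update, lambda weighted by the item's rating count.
    new_M = []
    for i in range(num_items):
        num = den = n = 0.0
        for (u, ii), r in ratings.items():
            if ii == i:
                num += r * U[u]
                den += U[u] ** 2
                n += 1
        new_M.append(num / (den + lam * n) if n else 0.0)
    return new_M

random.seed(42)
U = [random.random() for _ in range(num_users)]
M = [random.random() for _ in range(num_items)]

for _ in range(10):  # alternate the two solves until roughly converged
    U = solve_users(M)
    M = solve_items(U)

rmse = (sum((r - U[u] * M[i]) ** 2
            for (u, i), r in ratings.items()) / len(ratings)) ** 0.5
print(round(rmse, 2))
```

The real implementation solves a k-by-k linear system per user and per item instead of these scalar divisions, but the alternating structure and the rating-count-weighted λ term are the same.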

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
