[ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-542:
--------------------------------------

    Attachment: MAHOUT-542-3.patch

Thanks for the input so far, Ted and Dimitriy.

Here is an updated patch that does not yet address the issue of automatically 
learning lambda, but provides some simple tools to manually evaluate the 
prediction quality of the factorization. 
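
As a reminder, lambda is the weight of the regularization term in the ALS-WR 
cost function from the paper referenced in the issue description (this is my 
transcription, the notation in the patch may differ slightly):

{noformat}
f(U, M) = \sum_{(i,j) \in I} (r_{ij} - u_i^T m_j)^2
          + \lambda \Big( \sum_i n_{u_i} \|u_i\|^2 + \sum_j n_{m_j} \|m_j\|^2 \Big)
{noformat}

where I is the set of observed ratings, u_i and m_j are the feature vectors of 
user i and item j, and n_{u_i} / n_{m_j} are the numbers of ratings given by 
user i and received by item j (that is the "weighted" part of the 
regularization).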

I ran some local tests against the Movielens 1M dataset on my notebook:

{noformat}
# downloaded and converted the MovieLens 1M dataset to Mahout's common format for ratings
cat /path/to/ratings.dat | sed -e s/::/,/g | cut -d, -f1,2,3 > /path/to/ratings.csv

# create a 90% training set and a 10% probe set
bin/mahout splitDataset --input /path/to/ratings.csv --output /tmp/dataset \
  --trainingPercentage 0.9 --probePercentage 0.1

# run distributed ALS-WR to factorize the rating matrix based on the training set
bin/mahout parallelALS --input /tmp/dataset/trainingSet/ --output /tmp/als/out \
  --tempDir /tmp/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065

# measure the error of the predictions against the probe set
bin/mahout evaluateALS --probes /tmp/dataset/probeSet/ \
  --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/
{noformat}

A test run gave an RMSE of 0.8564062387241173 and an MAE of 0.6791075767551951. 
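
Just to make explicit what evaluateALS reports: these should be read as the 
usual definitions of RMSE and MAE over the probe set, with the predicted 
rating taken as the dot product of the corresponding user and item feature 
vectors:

{noformat}
\hat{r}_{ui} = u_u^T m_i

RMSE = \sqrt{ \frac{1}{|P|} \sum_{(u,i) \in P} (r_{ui} - \hat{r}_{ui})^2 }

MAE  = \frac{1}{|P|} \sum_{(u,i) \in P} | r_{ui} - \hat{r}_{ui} |
{noformat}

where P is the set of (user, item, rating) triples in the probe set.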

Unfortunately, I don't currently have a cluster available to test this, so I 
couldn't use the Netflix dataset. 

I still don't see how to automatically learn lambda without running lots of 
subsequent M/R jobs.
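
The only thing that comes to mind right now is the brute-force variant: keep 
the probe set fixed and sweep lambda over a grid, re-running the factorization 
and the evaluation once per candidate value. Roughly like this (the grid of 
lambda values and the per-value output directories are just made up for 
illustration):

{noformat}
# hypothetical brute-force sweep: one full ALS factorization + evaluation per candidate lambda
for LAMBDA in 0.01 0.03 0.065 0.1 0.3; do
  bin/mahout parallelALS --input /tmp/dataset/trainingSet/ --output /tmp/als/out-$LAMBDA \
    --tempDir /tmp/als/tmp-$LAMBDA --numFeatures 20 --numIterations 10 --lambda $LAMBDA
  bin/mahout evaluateALS --probes /tmp/dataset/probeSet/ \
    --userFeatures /tmp/als/out-$LAMBDA/U/ --itemFeatures /tmp/als/out-$LAMBDA/M/
done
{noformat}

That is exactly the pile of subsequent M/R jobs I'd like to avoid, so better 
ideas are very welcome.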

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, MAHOUT-542-3.patch
>
>
> As Mahout is currently lacking a distributed collaborative filtering 
> algorithm that uses matrix factorization, I spent some time reading through a 
> couple of the Netflix papers and stumbled upon "Large-scale Parallel 
> Collaborative Filtering for the Netflix Prize", available at 
> http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf.
> It describes a parallel algorithm that uses "Alternating-Least-Squares with 
> Weighted-λ-Regularization" to factorize the preference matrix and gives some 
> insight into how the authors distributed the computation using Matlab.
> It seemed to me that this approach could also easily be parallelized using 
> Map/Reduce, so I sat down and created a prototype version. I'm not really 
> sure I got the mathematical details correct (they need some optimization 
> anyway), but I want to put up my prototype implementation here per Yonik's 
> law of patches.
> Maybe someone has the time and motivation to work on this with me a little. 
> It would be great if someone could validate the approach taken (I'm willing 
> to help, as the code might not be intuitive to read), try to factorize some 
> test data, and then give feedback.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
