[ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991835#comment-12991835 ]

Sebastian Schelter commented on MAHOUT-542:
-------------------------------------------

I also think it's a memory problem. When recomputing a user's feature 
vector, the algorithm needs to look at the feature vectors of all movies 
that user has rated, and when recomputing a movie's feature vector, it has 
to look at the feature vectors of all users who have rated that movie. I'm 
not very familiar with the Netflix dataset, but I think there are probably 
some very popular movies that have been rated by lots of users, and there 
might also be some "power" users who have rated lots of movies. So memory 
consumption might get very high in some steps.
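
To make the memory behavior concrete, here is a minimal sketch of the 
per-user ALS-WR update in plain Java (my own illustration, not code from 
the attached patch; the class and method names are made up). Both the 
k x k matrix A and the right-hand side b are sums over the feature vectors 
of every movie the user has rated, so a "power" user pulls all of those 
vectors into memory at once:

import java.util.Map;

/**
 * Sketch only: illustrates the per-user ALS-WR update from the paper,
 * not the implementation in the attached patch.
 */
public class AlsUserUpdateSketch {

  /** Recompute one user's feature vector as u_u = A^-1 b. */
  static double[] recomputeUserFeatures(Map<Integer, Double> ratings,
      Map<Integer, double[]> itemFeatures, int k, double lambda) {
    double[][] a = new double[k][k]; // A = sum_i m_i m_i^T + lambda * n_u * I
    double[] b = new double[k];      // b = sum_i r_ui * m_i
    for (Map.Entry<Integer, Double> rating : ratings.entrySet()) {
      double[] mi = itemFeatures.get(rating.getKey()); // a rated movie's vector
      double rui = rating.getValue();
      for (int row = 0; row < k; row++) {
        b[row] += rui * mi[row];
        for (int col = 0; col < k; col++) {
          a[row][col] += mi[row] * mi[col];
        }
      }
    }
    for (int d = 0; d < k; d++) {
      a[d][d] += lambda * ratings.size(); // weighted-lambda regularization
    }
    return solve(a, b);
  }

  /** Gaussian elimination with partial pivoting; fine for small k. */
  static double[] solve(double[][] a, double[] b) {
    int n = b.length;
    for (int p = 0; p < n; p++) {
      int max = p;
      for (int r = p + 1; r < n; r++) {
        if (Math.abs(a[r][p]) > Math.abs(a[max][p])) {
          max = r;
        }
      }
      double[] rowSwap = a[p]; a[p] = a[max]; a[max] = rowSwap;
      double bSwap = b[p]; b[p] = b[max]; b[max] = bSwap;
      for (int r = p + 1; r < n; r++) {
        double factor = a[r][p] / a[p][p];
        b[r] -= factor * b[p];
        for (int c = p; c < n; c++) {
          a[r][c] -= factor * a[p][c];
        }
      }
    }
    double[] x = new double[n];
    for (int r = n - 1; r >= 0; r--) {
      double sum = 0.0;
      for (int c = r + 1; c < n; c++) {
        sum += a[r][c] * x[c];
      }
      x[r] = (b[r] - sum) / a[r][r];
    }
    return x;
  }
}

The movie-side update is symmetric, which is where a very popular movie 
hurts: its sums run over the feature vectors of every user who rated it.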

What kind of EC2 instances did you run this on? Did you use small 
instances? In my experience those are not very helpful; maybe you could 
retry this with large or c1.medium instances.

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, 
> MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch
>
>
> As Mahout currently lacks a distributed collaborative filtering 
> algorithm that uses matrix factorization, I spent some time reading through 
> a couple of the Netflix papers and stumbled upon "Large-scale Parallel 
> Collaborative Filtering for the Netflix Prize", available at 
> http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf.
> It describes a parallel algorithm that uses "Alternating-Least-Squares with 
> Weighted-λ-Regularization" to factorize the preference matrix and gives some 
> insights on how the authors distributed the computation using Matlab.
> It seemed to me that this approach could also easily be parallelized using 
> Map/Reduce, so I sat down and created a prototype version. I'm not really 
> sure I got the mathematical details correct (they need some optimization 
> anyway), but I want to put up my prototype implementation here per Yonik's 
> law of patches.
> Maybe someone has the time and motivation to work on this with me a little. 
> It would be great if someone could validate the approach taken (I'm willing 
> to help, as the code might not be intuitive to read), try factorizing some 
> test data, and then give feedback.
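
For reference, the objective the paper cited above minimizes (my own 
transcription, not part of the patch) is

  \min_{U,M} \sum_{(u,i) \in R} (r_{ui} - u_u^T m_i)^2
      + \lambda \Big( \sum_u n_u \|u_u\|^2 + \sum_i n_i \|m_i\|^2 \Big)

where n_u and n_i are the number of ratings given by user u and received 
by movie i. Alternating between solving for U with M fixed and for M with 
U fixed gives the least-squares updates the implementation has to perform.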

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

