[ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991718#comment-12991718 ]

Danny Bickson commented on MAHOUT-542:
--------------------------------------

Hi,
Everything works now with the new patch (542-5). With the MovieLens 1M data 
it runs fine; I have tested with one, two, and four slaves.
With the Netflix data, I get the following exception:

2011-02-04 19:42:45,613 INFO org.apache.hadoop.mapred.TaskInProgress: Error 
from attempt_201102041322_0007_r_000000_0: Error: GC overhead limit exceeded
2011-02-04 19:42:45,614 INFO org.apache.hadoop.mapred.JobTracker: Adding task 
(cleanup)'attempt_201102041322_0007_r_000000_0' to tip 
task_201102041322_0007_r_000000, for tracker 
'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 19:42:48,617 INFO org.apache.hadoop.mapred.JobTracker: Adding task 
'attempt_201102041322_0007_r_000000_1' to tip task_201102041322_0007_r_000000, 
for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 19:42:48,618 INFO org.apache.hadoop.mapred.JobTracker: Removed 
completed task 'attempt_201102041322_0007_r_000000_0' from 
'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 21:10:48,014 INFO org.apache.hadoop.mapred.TaskInProgress: Error 
from attempt_201102041322_0007_r_000000_1: Error: GC overhead limit exceeded
2011-02-04 21:10:48,030 INFO org.apache.hadoop.mapred.JobTracker: Adding task 
(cleanup)'attempt_201102041322_0007_r_000000_1' to tip 
task_201102041322_0007_r_000000, for tracker 
'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 21:10:54,036 INFO org.apache.hadoop.mapred.JobTracker: Adding task 
'attempt_201102041322_0007_r_000000_2' to tip task_201102041322_0007_r_000000, 
for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 21:10:54,036 INFO org.apache.hadoop.mapred.JobTracker: Removed 
completed task 'attempt_201102041322_0007_r_000000_1' from 
'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 22:36:46,339 INFO org.apache.hadoop.mapred.TaskInProgress: Error 
from attempt_201102041322_0007_r_000000_2: Error: GC overhead limit exceeded
2011-02-04 22:36:46,339 INFO org.apache.hadoop.mapred.JobTracker: Adding task 
(cleanup)'attempt_201102041322_0007_r_000000_2' to tip 
task_201102041322_0007_r_000000, for tracker 
'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 22:36:49,342 INFO org.apache.hadoop.mapred.JobTracker: Adding task 
'attempt_201102041322_0007_r_000000_3' to tip task_201102041322_0007_r_000000, 
for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

2011-02-04 22:36:49,355 INFO org.apache.hadoop.mapred.JobTracker: Removed 
completed task 'attempt_201102041322_0007_r_000000_2' from 
'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'


Any ideas about how to fix this?

Thanks!!

Danny Bickson
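
For what it's worth, "GC overhead limit exceeded" in the reducer usually means the per-task JVM heap is too small for Netflix-scale factor data. A minimal sketch of a workaround, assuming Hadoop 0.20-era property names (the actual job class and arguments depend on the patch being tested, so they are left as placeholders):

```shell
# Sketch only: raise the per-task JVM heap so the reducer can hold the
# user/item factor vectors in memory, and spread the work over more
# reducers. 'mapred.child.java.opts' and 'mapred.reduce.tasks' are
# standard Hadoop 0.20 properties; size -Xmx to the slave instances' RAM.
hadoop jar mahout-job.jar <ALS job class from the patch> \
  -Dmapred.child.java.opts=-Xmx4g \
  -Dmapred.reduce.tasks=16 \
  <input/output arguments>
```

Alternatively, the same properties can be set cluster-wide in mapred-site.xml on the task trackers.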

> MapReduce implementation of ALS-WR
> ----------------------------------
>
>                 Key: MAHOUT-542
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-542
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch, 
> MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch
>
>
> As Mahout is currently lacking a distributed collaborative filtering 
> algorithm that uses matrix factorization, I spent some time reading through a 
> couple of the Netflix papers and stumbled upon the "Large-scale Parallel 
> Collaborative Filtering for the Netflix Prize" available at 
> http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf.
> It describes a parallel algorithm that uses "Alternating-Least-Squares with 
> Weighted-λ-Regularization" to factorize the preference-matrix and gives some 
> insights on how the authors distributed the computation using Matlab.
> It seemed to me that this approach could also easily be parallelized using 
> Map/Reduce, so I sat down and created a prototype version. I'm not really 
> sure I got the mathematical details correct (they need some optimization 
> anyway), but I wanna put up my prototype implementation here per Yonik's law 
> of patches.
> Maybe someone has the time and motivation to work a little on this with me. 
> It would be great if someone could validate the approach taken (I'm willing 
> to help as the code might not be intuitive to read) and could try to 
> factorize some test data and give feedback then.
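
For reference, the weighted-λ-regularized objective from the Zhou et al. paper cited above (symbols as in that paper, stated here for readers' convenience) is:

```latex
% ALS-WR objective (Zhou et al., 2008):
%   U = user feature matrix with columns u_i, M = item feature matrix
%   with columns m_j, I = set of observed (user, item) pairs,
%   n_{u_i}, n_{m_j} = number of ratings by user i / for item j.
\min_{U,M} \sum_{(i,j)\in I} \left( r_{ij} - u_i^{\top} m_j \right)^2
  + \lambda \left( \sum_i n_{u_i} \lVert u_i \rVert^2
                 + \sum_j n_{m_j} \lVert m_j \rVert^2 \right)
```

Alternating between solving for U with M fixed and for M with U fixed turns each step into independent regularized least-squares problems, which is what makes the per-user/per-item work embarrassingly parallel.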

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira