[ https://issues.apache.org/jira/browse/MAHOUT-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118267#comment-13118267 ]
Alvin AuYoung commented on MAHOUT-542:
--------------------------------------
Hi Sebastian,
First of all, many thanks for contributing this ALS implementation. It's very
useful. Like others on this list, I'm trying to run some experiments on it
using the Netflix data, but I'm seeing an error I'm having trouble diagnosing.
After the first four jobs complete, the reduce copiers fail in the fifth job
(Mapper-SolvingReducer). I'm running on hadoop-0.20.2 with Mahout checked out
from trunk, so I believe any patches you've mentioned should already be
incorporated.
Here is a description of the job I'm running:

MAHOUT-JOB: /home/auyoung/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/09/30 01:20:56 INFO common.AbstractJob: Command line arguments:
{--endPhase=2147483647, --input=training_all_triplets_norm, --lambda=0.065,
--numFeatures=25, --numIterations=5, --output=als.out, --startPhase=0,
--tempDir=temp}
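
For reference, here is an equivalent programmatic launch of the same job. This
is a minimal sketch, assuming ParallelALSFactorizationJob extends Mahout's
AbstractJob (which implements Hadoop's Tool interface); the class name RunAls
is invented, and the arguments are the ones from the logged run above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob;

    public class RunAls {
      public static void main(String[] args) throws Exception {
        // Same arguments as the logged run; AbstractJob parses them itself.
        String[] jobArgs = {
            "--input", "training_all_triplets_norm",
            "--output", "als.out",
            "--lambda", "0.065",
            "--numFeatures", "25",
            "--numIterations", "5",
            "--tempDir", "temp"
        };
        ToolRunner.run(new Configuration(), new ParallelALSFactorizationJob(), jobArgs);
      }
    }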
Do you have any idea what might be wrong? I'm running it on a physical cluster
of 20 slaves, each with 2 map and 2 reduce slots; each JVM has > 8 GB of
memory, HADOOP_HEAPSIZE is > 2 GB, and io.sort.mb is at its maximum allowable
value of 2047. There is also plenty of disk space remaining. Here is a
transcript of one of the several failures in the
ParallelALSFactorizationJob-Mapper-SolvingReducer:
2011-09-30 02:05:37,115 INFO org.apache.hadoop.mapred.Merger: Merging 16 sorted segments
2011-09-30 02:05:37,115 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 16 segments left of total size: 1039493457 bytes
2011-09-30 02:05:37,116 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201109300120_0005_r_000000_0 Merge of the inmemory files threw an exception: java.io.IOException: Intermediate merge failed
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2576)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
Caused by: java.lang.RuntimeException: java.io.EOFException
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
    at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:373)
    at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:136)
    at org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:103)
    at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:335)
    at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
    at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2560)
    ... 1 more
Caused by: java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:250)
    at org.apache.mahout.math.Varint.readUnsignedVarInt(Varint.java:159)
    at org.apache.mahout.math.Varint.readSignedVarInt(Varint.java:140)
    at org.apache.mahout.cf.taste.hadoop.als.IndexedVarIntWritable.readFields(IndexedVarIntWritable.java:64)
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:97)
    ... 8 more
2011-09-30 02:05:37,116 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201109300120_0005_r_000000_0 Merging of the local FS files threw an exception: java.io.IOException: java.lang.RuntimeException: java.io.EOFException
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
    at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:373)
    at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:139)
    at org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:103)
    at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:335)
    at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
    at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2454)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:250)
    at org.apache.mahout.math.Varint.readUnsignedVarInt(Varint.java:159)
    at org.apache.mahout.math.Varint.readSignedVarInt(Varint.java:140)
    at org.apache.mahout.cf.taste.hadoop.als.IndexedVarIntWritable.readFields(IndexedVarIntWritable.java:64)
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:100)
    ... 7 more
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2458)
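
The EOFException is thrown from Varint decoding while
WritableComparator.compare deserializes keys during the merge, which points at
a truncated or corrupted serialized key. Here is a minimal sketch of that
failure mode (the decoder below is an illustrative base-128 varint reader
written for this example, not Mahout's actual Varint source):

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;

    public class VarintEofDemo {

      // Decode a base-128 varint; the high bit of each byte flags a continuation.
      static int readUnsignedVarInt(DataInputStream in) throws IOException {
        int value = 0;
        int shift = 0;
        while (true) {
          byte b = in.readByte(); // throws EOFException if the stream ends mid-varint
          value |= (b & 0x7F) << shift;
          if ((b & 0x80) == 0) {
            return value;
          }
          shift += 7;
        }
      }

      public static void main(String[] args) throws IOException {
        // A single byte whose continuation bit promises more bytes than the
        // buffer holds, i.e. a truncated serialized key like the merge might see.
        byte[] truncated = {(byte) 0x80};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(truncated));
        readUnsignedVarInt(in); // -> java.io.EOFException, as in the traces above
      }
    }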
Thanks,
Alvin
> MapReduce implementation of ALS-WR
> ----------------------------------
>
> Key: MAHOUT-542
> URL: https://issues.apache.org/jira/browse/MAHOUT-542
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Affects Versions: 0.5
> Reporter: Sebastian Schelter
> Assignee: Sebastian Schelter
> Fix For: 0.5
>
> Attachments: MAHOUT-452.patch, MAHOUT-542-2.patch,
> MAHOUT-542-3.patch, MAHOUT-542-4.patch, MAHOUT-542-5.patch,
> MAHOUT-542-6.patch, logs.zip
>
>
> As Mahout is currently lacking a distributed collaborative filtering
> algorithm that uses matrix factorization, I spent some time reading through a
> couple of the Netflix papers and stumbled upon "Large-scale Parallel
> Collaborative Filtering for the Netflix Prize", available at
> http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf.
> It describes a parallel algorithm that uses "Alternating-Least-Squares with
> Weighted-λ-Regularization" to factorize the preference matrix and gives some
> insights on how the authors distributed the computation using Matlab.
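>
> For reference, the objective the paper minimizes (restated here in the
> paper's notation, where n_{u_i} and n_{m_j} count the ratings of user i and
> item j respectively) is
>
>   \min_{U,M} \sum_{(i,j) \in I} \bigl( r_{ij} - \mathbf{u}_i^{\top} \mathbf{m}_j \bigr)^2
>     + \lambda \Bigl( \sum_i n_{u_i} \|\mathbf{u}_i\|^2 + \sum_j n_{m_j} \|\mathbf{m}_j\|^2 \Bigr)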
> It seemed to me that this approach could also easily be parallelized using
> Map/Reduce, so I sat down and created a prototype version. I'm not really
> sure I got the mathematical details correct (they need some optimization
> anyway), but I want to put up my prototype implementation here per Yonik's
> law of patches.
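>
> To make the per-user step concrete: with the item-feature matrix M fixed,
> each user vector u_i solves (M_i M_i^T + λ·n_i·E) u_i = M_i r_i, where M_i
> holds the feature vectors of the items that user rated and n_i is the number
> of those ratings, so each user can be recomputed independently (e.g., in one
> reduce call). Here is a minimal illustrative sketch of that solve in plain
> Java (class and method names are invented for this example; it is not the
> patch itself, which would use a linear-algebra library):
>
>   public class AlsUserStep {
>
>     // Solve the k-by-k system A x = b via Gaussian elimination with pivoting.
>     static double[] solve(double[][] a, double[] b) {
>       int k = b.length;
>       for (int p = 0; p < k; p++) {
>         int max = p;
>         for (int r = p + 1; r < k; r++) {
>           if (Math.abs(a[r][p]) > Math.abs(a[max][p])) { max = r; }
>         }
>         double[] rowTmp = a[p]; a[p] = a[max]; a[max] = rowTmp;
>         double bTmp = b[p]; b[p] = b[max]; b[max] = bTmp;
>         for (int r = p + 1; r < k; r++) {
>           double factor = a[r][p] / a[p][p];
>           b[r] -= factor * b[p];
>           for (int c = p; c < k; c++) { a[r][c] -= factor * a[p][c]; }
>         }
>       }
>       double[] x = new double[k];
>       for (int r = k - 1; r >= 0; r--) {
>         double sum = 0.0;
>         for (int c = r + 1; c < k; c++) { sum += a[r][c] * x[c]; }
>         x[r] = (b[r] - sum) / a[r][r];
>       }
>       return x;
>     }
>
>     // Recompute one user's feature vector from the feature vectors of the
>     // items that user rated (itemFeatures[j] has length k) and the ratings.
>     static double[] recomputeUserVector(double[][] itemFeatures,
>                                         double[] ratings, double lambda) {
>       int n = ratings.length;          // n_i: number of ratings by this user
>       int k = itemFeatures[0].length;
>       double[][] a = new double[k][k]; // accumulates M_i M_i^T
>       double[] b = new double[k];      // accumulates M_i r_i
>       for (int j = 0; j < n; j++) {
>         for (int p = 0; p < k; p++) {
>           b[p] += ratings[j] * itemFeatures[j][p];
>           for (int q = 0; q < k; q++) {
>             a[p][q] += itemFeatures[j][p] * itemFeatures[j][q];
>           }
>         }
>       }
>       for (int p = 0; p < k; p++) {
>         a[p][p] += lambda * n;         // weighted-λ-regularization term
>       }
>       return solve(a, b);
>     }
>   }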
> Maybe someone has the time and motivation to work a little on this with me.
> It would be great if someone could validate the approach taken (I'm willing
> to help, as the code might not be intuitive to read), try to factorize some
> test data, and give feedback.