I added that extra line, but it still fails there. This is the stack trace, not sure if it is of any use:

2010-05-05 19:28:42,621 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=MAP, sessionId=
2010-05-05 19:28:42,747 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2010-05-05 19:28:42,753 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 1000
2010-05-05 19:28:42,846 FATAL org.apache.hadoop.mapred.TaskTracker: Error 
running child : java.lang.OutOfMemoryError: Java heap space
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:781)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:350)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)



On 05/05/2010 19:11, Sean Owen wrote:
I think it's UserVectorToCooccurrenceMapper, which keeps a local count
of how many times each item has been seen. On a small cluster with a
few mappers, each of which sees all items, you'd have a count for every
item. That's still not terrible, but it could take up a fair bit of memory.

One easy solution is to cap its size and throw out low-count entries sometimes.
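Roughly, capping and pruning could look something like the sketch below.
This is not the actual Mahout code: the class, the map type, and the
thresholds are just illustrative assumptions.

   // Sketch only: cap the in-memory count map and throw out low-count
   // entries once it grows past an arbitrary limit. If the real indexCounts
   // is a plain HashMap, each entry costs tens of bytes, so millions of
   // distinct items add up quickly.
   import java.util.HashMap;
   import java.util.Iterator;
   import java.util.Map;

   public class CappedItemCounter {

     private static final int MAX_ENTRIES = 1000000; // arbitrary cap

     private final Map<Integer,Integer> indexCounts =
         new HashMap<Integer,Integer>();

     public void increment(int itemIndex) {
       Integer count = indexCounts.get(itemIndex);
       indexCounts.put(itemIndex, count == null ? 1 : count + 1);
       if (indexCounts.size() > MAX_ENTRIES) {
         pruneLowCounts();
       }
     }

     // Drop entries seen only once; they are the cheapest to lose and tend
     // to be the bulk of a long-tailed item distribution.
     private void pruneLowCounts() {
       Iterator<Map.Entry<Integer,Integer>> it =
           indexCounts.entrySet().iterator();
       while (it.hasNext()) {
         if (it.next().getValue() <= 1) {
           it.remove();
         }
       }
     }
   }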

Just to confirm this is the issue, you could hack in this line:

   private void countSeen(Vector userVector) {
     if (indexCounts.size() > 1000000) return;
     ...

That's not a real solution, but an easy way you could test for us
whether that's the problem. If that's it, I can solve this in a more
robust way.

On Wed, May 5, 2010 at 7:03 PM, Tamas Jambor <jambo...@googlemail.com> wrote:
Hi,

I came across a new problem with the MapReduce implementation. I am trying
to optimize the cluster for this implementation, but the problem is that in
order to run
RecommenderJob-UserVectorToCooccurrenceMapper-UserVectorToCooccurrenceReducer,
I need to set -Xmx2048m; with a smaller value the job fails. How come it
needs so much memory? Maybe there is a memory leak here? Generally it is
suggested to set -Xmx512m.
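For context, below is a minimal sketch of how the child-task heap is
typically raised per job on Hadoop 0.20, assuming the job is configured
through the old JobConf API; the values are just the ones mentioned in
this thread.

   // Sketch only: raise the heap of the child map/reduce JVMs for one job.
   import org.apache.hadoop.mapred.JobConf;

   public class ChildHeapExample {
     public static void main(String[] args) {
       JobConf conf = new JobConf();
       // Options passed to every child task JVM (the 0.20 default is -Xmx200m).
       conf.set("mapred.child.java.opts", "-Xmx2048m");
       // The map-side sort buffer is allocated inside that heap, so the
       // io.sort.mb value shown in the log above has to fit within -Xmx.
       conf.setInt("io.sort.mb", 1000);
     }
   }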

The other problem with setting it so high is that I have to reduce the
number of map/reduce tasks per node, otherwise the next job brings the
whole cluster down.
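To make that trade-off concrete, here is a back-of-the-envelope sketch;
the slot counts and node RAM are assumptions for illustration, not figures
from this cluster.

   // Illustration only: why a large child heap forces fewer task slots per node.
   public class HeapBudgetExample {
     public static void main(String[] args) {
       int childHeapMb = 2048;    // -Xmx2048m per child task, as in this thread
       int mapSlots = 4;          // mapred.tasktracker.map.tasks.maximum (assumed)
       int reduceSlots = 2;       // mapred.tasktracker.reduce.tasks.maximum (assumed)
       int nodeRamMb = 8192;      // assumed RAM per node

       int worstCaseHeapMb = (mapSlots + reduceSlots) * childHeapMb; // 12288 MB
       System.out.println(worstCaseHeapMb + " MB of task heap vs. "
           + nodeRamMb + " MB of RAM per node");
       // 12288 MB > 8192 MB, so either the heap or the slot counts have to
       // come down before concurrent tasks start swapping or OOM-ing the node.
     }
   }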

Tamas

