I think it's UserVectorToCooccurrenceMapper, which keeps a local count of how many times each item has been seen. On a small cluster with only a few mappers, each mapper sees nearly all items, so each one ends up holding a count for every item. That's still not terrible, but it could take up a fair bit of memory.
One easy solution is to cap the map's size and occasionally throw out low-count entries. Just to confirm this is the issue, you could hack in this line:

private void countSeen(Vector userVector) {
  if (indexCounts.size() > 1000000) return;
  ...

That's not a real solution, just an easy way for you to test whether that's the problem. If that's it, I can solve this in a more robust way (a rough sketch of what I have in mind is below, after your message).

On Wed, May 5, 2010 at 7:03 PM, Tamas Jambor <jambo...@googlemail.com> wrote:
> Hi,
>
> I came across a new problem with the MapReduce implementation. I am trying
> to optimize the cluster for this implementation, but the problem is that in
> order to run
> RecommenderJob-UserVectorToCooccurrenceMapper-UserVectorToCooccurrenceReducer,
> I need to set -Xmx2048m; with a smaller value the job fails. How come it
> needs so much memory? Maybe there is a memory leak here? Generally it is
> suggested to set -Xmx512m.
>
> The other problem with setting it so high is that I have to reduce the
> number of map/reduce tasks per node, otherwise the next job brings the
> whole cluster down.
>
> Tamas
>
>
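PS: for what it's worth, here is a rough sketch of the capping-and-pruning idea in plain Java, separate from the actual mapper code. The class name, the cap, and the eviction threshold are just placeholders for illustration, not what would go into Mahout:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch only: count item occurrences, and when the map grows past a cap,
// evict entries whose count is still low to keep memory bounded.
public class CappedItemCounter {

  private static final int MAX_ENTRIES = 1000000; // assumed cap
  private static final int LOW_COUNT = 2;         // assumed eviction threshold

  private final Map<Integer,Integer> indexCounts = new HashMap<Integer,Integer>();

  public void countSeen(int itemIndex) {
    Integer current = indexCounts.get(itemIndex);
    indexCounts.put(itemIndex, current == null ? 1 : current + 1);
    if (indexCounts.size() > MAX_ENTRIES) {
      pruneLowCounts();
    }
  }

  private void pruneLowCounts() {
    // Drop items seen only a few times; losing their exact counts costs
    // little accuracy compared to letting the map grow without bound.
    Iterator<Map.Entry<Integer,Integer>> it = indexCounts.entrySet().iterator();
    while (it.hasNext()) {
      if (it.next().getValue() <= LOW_COUNT) {
        it.remove();
      }
    }
  }
}

The trade-off is that rarely-seen items may be counted slightly low, which matters little for deciding which items are most frequent, and it keeps the mapper's heap usage roughly proportional to the cap rather than to the total number of items.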