I think this must be the issue. But my guess is that it happens regardless of cluster size: I tried increasing the maximum map/reduce task capacity, and Hadoop does not seem to create more tasks for this job even when more free slots are available.

On 05/05/2010 19:11, Sean Owen wrote:
I think it's UserVectorToCooccurrenceMapper, which keeps a local count
of how many times each item has been seen. On a small cluster with a
few mappers, each of which sees all items, you'd end up with a count
for every item. That's still not terrible, but it could take up a fair
bit of memory.

One easy solution is to cap its size and periodically throw out low-count entries.

Just to confirm this is the issue, you could hack in this line:

   private void countSeen(Vector userVector) {
     if (indexCounts.size() > 1000000) {
       return;
     }
     ...

That's not a real solution, but an easy way to test whether that's the
problem. If it is, I can solve it in a more robust way.
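As a sketch of what "cap its size and throw out low-count entries" could look like, here is a minimal standalone counter that prunes rare entries once it grows past a limit. The class and method names (CappedCounter, prune) are illustrative assumptions, not Mahout's actual API, and the thresholds are arbitrary:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/**
 * Illustrative sketch (not Mahout code): a per-item counter that,
 * once it exceeds maxSize entries, evicts entries whose count is
 * still below minCountToKeep. Rarely-seen items lose their counts,
 * but memory stays bounded for frequently-seen ones.
 */
class CappedCounter {

  private final Map<Integer, Integer> counts = new HashMap<>();
  private final int maxSize;
  private final int minCountToKeep;

  CappedCounter(int maxSize, int minCountToKeep) {
    this.maxSize = maxSize;
    this.minCountToKeep = minCountToKeep;
  }

  void increment(int index) {
    counts.merge(index, 1, Integer::sum);
    if (counts.size() > maxSize) {
      prune();
    }
  }

  /** Drop entries whose count is below the keep threshold. */
  private void prune() {
    Iterator<Map.Entry<Integer, Integer>> it = counts.entrySet().iterator();
    while (it.hasNext()) {
      if (it.next().getValue() < minCountToKeep) {
        it.remove();
      }
    }
  }

  int get(int index) {
    return counts.getOrDefault(index, 0);
  }

  int size() {
    return counts.size();
  }
}
```

The trade-off is that an item's count can be reset if it is seen rarely, so this only approximates the true counts; for a co-occurrence heuristic that mainly needs to identify frequent items, that is usually acceptable.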
