I think this must be the issue. But my guess is that it happens regardless of
cluster size, because I tried changing the maximum map/reduce task
capacity, and it looks like Hadoop does not create more tasks for this
job even when more free slots are available.
On 05/05/2010 19:11, Sean Owen wrote:
I think it's UserVectorToCooccurrenceMapper, which keeps a local count
of how many times each item has been seen. On a small cluster with a
few mappers, each of which sees all items, you'd have a count for every item.
That's still not terrible, but it could take up a fair bit of memory.
One easy solution is to cap its size and throw out low-count entries sometimes.
Just to confirm this is the issue, you could hack in this line:
private void countSeen(Vector userVector) {
  if (indexCounts.size() > 1000000) {
    return;
  }
  ...
That's not a real solution, but an easy way for everyone to test whether
that's the problem. If it is, I can solve it in a more robust way.
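For what it's worth, the "cap its size and throw out low-count entries" idea could be sketched roughly like this. This is not Mahout code; the class and names (CappedCounts, maxEntries, minCount) are illustrative, and a real fix inside UserVectorToCooccurrenceMapper would look different:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch: a counter map that, once it grows past maxEntries, prunes
// entries whose count is below minCount. Rare items contribute little
// to cooccurrence, so dropping them is an acceptable approximation
// that bounds the mapper's memory use.
public class CappedCounts {

  private final int maxEntries;
  private final int minCount;
  private final Map<Integer, Integer> counts = new HashMap<Integer, Integer>();

  public CappedCounts(int maxEntries, int minCount) {
    this.maxEntries = maxEntries;
    this.minCount = minCount;
  }

  public void increment(int itemIndex) {
    Integer current = counts.get(itemIndex);
    counts.put(itemIndex, current == null ? 1 : current + 1);
    if (counts.size() > maxEntries) {
      prune();
    }
  }

  // Drop entries whose count is still below the threshold.
  private void prune() {
    Iterator<Map.Entry<Integer, Integer>> it = counts.entrySet().iterator();
    while (it.hasNext()) {
      if (it.next().getValue() < minCount) {
        it.remove();
      }
    }
  }

  public int count(int itemIndex) {
    Integer c = counts.get(itemIndex);
    return c == null ? 0 : c;
  }
}
```

The trade-off is that items pruned early lose their partial counts for good, so this is lossy; the hack above (just refusing to count past a cap) is lossy in a different way, which is why Sean calls for a more robust fix.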