On Wed, May 5, 2010 at 11:11 AM, Sean Owen <sro...@gmail.com> wrote:

> I think it's UserVectorToCooccurrenceMapper, which keeps a local count
> of how many times each item has been seen. On a small cluster with a
> few mappers, which see all items, you'd have a count for each item.
> That's still not terrible, but, could take up a fair bit of memory.
>

One big win would be to use Benson's collections instead of a normal map
(you may already have this).

IF this is actually the problem.
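On the collections point, here is a self-contained sketch of the idea: an open-addressing int-to-int counter backed by two flat int[] arrays, which is roughly what primitive collections buy you compared to a HashMap<Integer,Integer> that allocates a boxed entry object per item.  The class and method names below are made up for illustration and are not Mahout's actual API.

```java
import java.util.Arrays;

// Minimal open-addressing int->int counter: two flat arrays instead of
// one boxed Entry per item.  Keys equal to the sentinel value are not
// supported in this sketch.
public class IntIntCounter {
  private static final int EMPTY = Integer.MIN_VALUE; // sentinel for unused slots
  private int[] keys;
  private int[] counts;
  private int size;

  public IntIntCounter(int capacity) {
    // round capacity up to a power of two so probing can use masking
    int cap = Integer.highestOneBit(Math.max(4, capacity - 1)) << 1;
    keys = new int[cap];
    counts = new int[cap];
    Arrays.fill(keys, EMPTY);
  }

  public void increment(int key) {
    if (size * 2 >= keys.length) {
      grow();
    }
    int mask = keys.length - 1;
    int slot = (key * 0x9E3779B9) & mask; // scramble the key, then linear-probe
    while (keys[slot] != EMPTY && keys[slot] != key) {
      slot = (slot + 1) & mask;
    }
    if (keys[slot] == EMPTY) {
      keys[slot] = key;
      size++;
    }
    counts[slot]++;
  }

  public int count(int key) {
    int mask = keys.length - 1;
    int slot = (key * 0x9E3779B9) & mask;
    while (keys[slot] != EMPTY) {
      if (keys[slot] == key) {
        return counts[slot];
      }
      slot = (slot + 1) & mask;
    }
    return 0; // never seen
  }

  private void grow() {
    int[] oldKeys = keys;
    int[] oldCounts = counts;
    keys = new int[oldKeys.length * 2];
    counts = new int[oldKeys.length * 2];
    Arrays.fill(keys, EMPTY);
    size = 0;
    int mask = keys.length - 1;
    for (int i = 0; i < oldKeys.length; i++) {
      if (oldKeys[i] != EMPTY) {
        // re-insert under the new mask, carrying the count along
        int slot = (oldKeys[i] * 0x9E3779B9) & mask;
        while (keys[slot] != EMPTY) {
          slot = (slot + 1) & mask;
        }
        keys[slot] = oldKeys[i];
        counts[slot] = oldCounts[i];
        size++;
      }
    }
  }
}
```

Two int[] arrays cost roughly 8 bytes per slot; a HashMap<Integer,Integer> costs several dozen bytes per entry once you count the Entry object, the boxed key, and the boxed value, so for a table with millions of items the difference is substantial.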

Another problem is the expectation that a map-reduce should be happy with
500MB of memory.  I don't think that is a universally reasonable
expectation.  The real expectation for scalable solutions should be that as
the problem scales, the cluster size is allowed to scale, but the size of
individual cluster members does not increase.  It is reasonable to expect
that if you scale up the problem, but not the number of nodes in the
cluster, each node may need more and more memory.  It is also reasonable to
expect that the base size of each node may be fairly hefty for some problems
(as long as it doesn't increase if problemSize / clusterSize is held
constant).
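
The invariant above can be made concrete with a toy calculation.  The numbers and the method name here are made up purely for illustration; the point is only that per-node load is problemSize / clusterSize.

```java
// Toy illustration of the scaling expectation: per-node load is
// problemSize / clusterSize, so scaling both together keeps each node's
// burden flat, while scaling only the problem grows it.
public class ScalingSketch {
  static long perNodeLoad(long problemSize, int clusterSize) {
    return problemSize / clusterSize;
  }

  public static void main(String[] args) {
    // scale problem and cluster together: per-node load stays constant
    System.out.println(perNodeLoad(10_000_000L, 10));   // 1000000
    System.out.println(perNodeLoad(100_000_000L, 100)); // 1000000
    // scale only the problem: each node needs 10x the memory
    System.out.println(perNodeLoad(100_000_000L, 10));  // 10000000
  }
}
```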

It is common to have to deal with items from a large or even apparently
infinite set.  These could be words from text, user IDs, or item IDs.  The
number of unique items that you have seen grows as you see more data, but it
is a reasonable expectation that some items are much more common than
others.  Regardless, it is reasonable to expect that if you see about the
same amount of data, you will see about the same number of unique items.  In
a reasonably scaling system, the amount of data each node in the cluster
sees will be roughly constant as the input and cluster size scale together.
In a map-reduce architecture, you even get this property if the input size
and the number of mappers and reducers scale together.

One observation that makes me suspect that the problem is not the size of
the table that Sean has mentioned is the fact that Tamas has observed that
he doesn't get more tasks when he scales up the cluster.  That may be
because Hadoop is happy with the number of splits and has to be forced to
split more (by setting the max split size or the number of mappers).  The
problem might also be that this program, as designed, simply needs 3GB to
complete, and efforts to force it into 500MB will always fail.
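
For the forced-splitting option, something like the following sketch should work against the 0.20-era API.  The 64MB figure and the job name are placeholder values, and note that the "mapred.max.split.size" property was renamed in later Hadoop versions.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
  // Force more map tasks by capping the bytes per input split, so a large
  // input file yields more splits than Hadoop would create by default.
  static Job smallerSplits(Configuration conf) throws IOException {
    // 0.20-era property name; 64MB is an arbitrary example cap
    conf.setLong("mapred.max.split.size", 64L * 1024 * 1024);
    Job job = new Job(conf, "cooccurrence");
    // same effect via the input format helper
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    return job;
  }
}
```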
