On Wed, May 5, 2010 at 11:11 AM, Sean Owen <sro...@gmail.com> wrote:
> I think it's UserVectorToCooccurrenceMapper, which keeps a local count
> of how many times each item has been seen. On a small cluster with a
> few mappers, which see all items, you'd have a count for each item.
> That's still not terrible, but, could take up a fair bit of memory.
One big win would be to use Benson's collections instead of a normal map (you may already have this). IF this is actually the problem.

Another problem is the expectation that a map-reduce should be happy with 500MB of memory. I don't think that is a universally reasonable expectation. The real expectation for scalable solutions should be that as the problem scales, cluster size is allowed to scale, but the size of individual cluster members does not increase. It is reasonable to expect that if you scale up the problem but not the number of nodes in the cluster, each node may need more and more memory. It is also reasonable to expect that the base size of each node may be fairly hefty for some problems (as long as it doesn't increase when problemSize / clusterSize is held constant).

It is common to have to deal with items from a large or even apparently infinite set. This could be words from text, user IDs or item IDs. The number of unique items you have seen can grow as you see more data, but it is a reasonable expectation that some items are much more common than others. Regardless, it is reasonable to expect that if you see about the same amount of data, you will see about the same number of items. In a reasonably scaling system, the amount of data each node in the cluster sees will be roughly constant as the input and cluster size scale together. In a map-reduce architecture, you get this property if the input size and the number of mappers and reducers scale together.

One observation that makes me suspect the problem is not the size of the table Sean mentioned is that Tamas has observed that he doesn't get more tasks when he scales the cluster size. That may be because Hadoop is happy with the number of splits and has to be forced to split more (by setting the max split size or the number of mappers).
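For the record, forcing more splits looks roughly like the sketch below. The jar and driver names are placeholders, and the property names are the 0.20-era ones from memory, so double-check them against the Hadoop version in use:

```
# Cap each input split at 64MB so more map tasks are created, and hint
# at the desired number of maps (a hint, not a hard guarantee).
hadoop jar mahout-job.jar org.example.SomeDriver \
  -D mapred.max.split.size=67108864 \
  -D mapred.map.tasks=32 \
  --input /path/in --output /path/out
```

This assumes the driver runs through ToolRunner / GenericOptionsParser so that -D options are picked up; otherwise the job has to set the same properties on its own configuration.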
The problem might also be that this program, as designed, simply needs 3GB to complete, and that efforts to force it into 500MB will always fail.
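To put rough numbers on the memory question: a java.util.HashMap<Integer,Integer> costs on the order of 50-80 bytes per entry on a 64-bit JVM (boxed key, boxed value, Entry object, table slot), while a primitive-keyed table needs about 8. That is the kind of saving the primitive collections buy. Here is a toy sketch of the idea, an open-addressed int-to-int counter; it is illustrative only, not the actual Mahout code:

```java
import java.util.Arrays;

// Toy open-addressed int->int counter: two parallel int arrays instead of
// boxed keys/values and per-entry objects. Illustrative, not the Mahout code.
public class IntCounter {
  private static final int EMPTY = -1; // assumes item ids are non-negative
  private int[] keys;
  private int[] counts;
  private int size;

  public IntCounter(int capacity) {
    keys = new int[capacity];
    counts = new int[capacity];
    Arrays.fill(keys, EMPTY);
  }

  public void increment(int key) {
    if (2 * size >= keys.length) {
      grow(); // keep the load factor under 0.5 so probing stays short
    }
    int i = indexOf(key);
    if (keys[i] == EMPTY) {
      keys[i] = key;
      size++;
    }
    counts[i]++;
  }

  public int count(int key) {
    int i = indexOf(key);
    return keys[i] == EMPTY ? 0 : counts[i];
  }

  // Linear probing: walk from the hash slot until we hit the key or a hole.
  private int indexOf(int key) {
    int i = key % keys.length;
    while (keys[i] != EMPTY && keys[i] != key) {
      i = (i + 1) % keys.length;
    }
    return i;
  }

  private void grow() {
    int[] oldKeys = keys;
    int[] oldCounts = counts;
    keys = new int[2 * oldKeys.length];
    counts = new int[keys.length];
    Arrays.fill(keys, EMPTY);
    size = 0;
    for (int j = 0; j < oldKeys.length; j++) {
      if (oldKeys[j] != EMPTY) {
        int i = indexOf(oldKeys[j]);
        keys[i] = oldKeys[j];
        counts[i] = oldCounts[j];
        size++;
      }
    }
  }

  public static void main(String[] args) {
    IntCounter c = new IntCounter(16);
    for (int id : new int[] {7, 3, 7, 7, 42, 3}) {
      c.increment(id);
    }
    System.out.println(c.count(7));  // 3
    System.out.println(c.count(99)); // 0
  }
}
```

With tens of millions of item IDs, that order-of-magnitude difference per entry is exactly the gap between fitting in 500MB and needing 3GB.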