This sounds like a bug.
The memory requirements for Hadoop itself shouldn't change with the split size. At the very least, it should adapt correctly to whatever the memory limits are. Can you build a version of your program that works from random data so that you can file a bug? If you contact me off-line, I can help build a random data generator that matches your input reasonably well.

On 12/25/07 2:52 PM, "Jason Venner" <[EMAIL PROTECTED]> wrote:

> My mapper in this case is the identity mapper, and the reducer gets
> about 10 values per key and makes a collect decision based on the data
> in the values.
> The reducer is very close to a no-op, and uses very little memory
> beyond the values.
>
> I believe the problem is in the amount of buffering in the output files.
>
> The quandary we have is that the jobs run very poorly with the standard
> input split size, since the mean time to finish a split is very small,
> versus gigantic memory requirements for large split sizes.
>
> Time to play with parameters again ... since the answer doesn't appear
> to be in working memory for the list.
>
>
> Ted Dunning wrote:
>> What are your mappers doing that they run out of memory? Or is it your
>> reducers?
>>
>> Often, you can write this sort of program so that you don't have higher
>> memory requirements for larger splits.
>>
>>
>> On 12/25/07 1:52 PM, "Jason Venner" <[EMAIL PROTECTED]> wrote:
>>
>>> We have tried reducing the number of splits by increasing the block
>>> size to 10x and 5x 64 MB, but then we constantly have out-of-memory
>>> errors and timeouts. At this point each JVM is getting 768 MB and I
>>> can't readily allocate more without dipping into swap.
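For concreteness, here is a minimal sketch of the kind of job Jason describes, written against the old org.apache.hadoop.mapred API. The Text key/value types and the keep() predicate are illustrative assumptions, not his actual code; the identity mapper itself would simply be org.apache.hadoop.mapred.lib.IdentityMapper.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reducer that inspects the ~10 values per key and decides whether to emit.
// Only the iterator's current value is held in memory, so the reducer itself
// needs very little heap regardless of the split size.
public class CollectDecisionReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      Text value = values.next();
      if (keep(value)) {            // hypothetical decision predicate
        output.collect(key, value);
      }
    }
  }

  private boolean keep(Text value) {
    return value.getLength() > 0;   // placeholder logic for illustration
  }
}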
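The parameters under discussion can also be tuned per job through JobConf rather than by rewriting the input with a larger block size. The sketch below uses the pre-0.21 property names (mapred.child.java.opts, io.sort.mb, mapred.min.split.size); the values are purely illustrative, not recommendations.

import org.apache.hadoop.mapred.JobConf;

public class SplitTuning {
  public static void configure(JobConf conf) {
    // Heap for each task JVM (the 768 MB figure mentioned in the thread).
    conf.set("mapred.child.java.opts", "-Xmx768m");

    // Map-side sort/spill buffer: a common source of per-task memory
    // pressure on the output path, independent of the split size.
    conf.setInt("io.sort.mb", 100);

    // Raise the minimum split size instead of rewriting the input files
    // with a 5x or 10x dfs.block.size.
    conf.setLong("mapred.min.split.size", 5L * 64 * 1024 * 1024);
  }
}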