I am also interested in the test demonstrating OOM for large split sizes (if this is true, then it is indeed a bug). Sort-and-spill to disk should happen as soon as io.sort.mb worth of key/value data has been collected. I am assuming that you didn't increase the value of io.sort.mb when you increased the split size.
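For what it's worth, here is a minimal sketch of what I mean, against the old org.apache.hadoop.mapred API (treat the split-size property name and the values as assumptions to check against your hadoop-default.xml; the 5x 64 MB figure just mirrors Jason's experiment):

    import org.apache.hadoop.mapred.JobConf;

    public class SortBufferSanityCheck {
        public static void main(String[] args) {
            JobConf conf = new JobConf();

            // io.sort.mb caps the in-memory buffer of collected key/value
            // pairs on the map side; once it fills, the framework sorts
            // and spills to disk, independent of the input split size.
            conf.setInt("io.sort.mb", 100);  // 100 (MB) is the shipped default

            // Growing the split without growing io.sort.mb should therefore
            // not raise map-side memory use (5 x 64 MB, as in Jason's test).
            conf.setLong("mapred.min.split.size", 5L * 64 * 1024 * 1024);

            System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
            System.out.println("mapred.min.split.size = "
                    + conf.get("mapred.min.split.size"));
        }
    }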
Thanks,
Devaraj

> -----Original Message-----
> From: Ted Dunning [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, December 26, 2007 4:31 AM
> To: hadoop-user@lucene.apache.org
> Subject: Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1
>
> This sounds like a bug.
>
> The memory requirements for Hadoop itself shouldn't change with the
> split size. At the very least, it should adapt correctly to whatever
> the memory limits are.
>
> Can you build a version of your program that works from random data so
> that you can file a bug? If you contact me off-line, I can help build a
> random data generator that matches your input reasonably well.
>
> On 12/25/07 2:52 PM, "Jason Venner" <[EMAIL PROTECTED]> wrote:
>
> > My mapper in this case is the identity mapper, and the reducer gets
> > about 10 values per key and makes a collect decision based on the
> > data in the values. The reducer is very close to a no-op and uses
> > very little memory beyond the values. [A sketch of a reducer in this
> > shape follows the quoted thread below.]
> >
> > I believe the problem is the amount of buffering in the output files.
> >
> > The quandary we have is that the jobs run very poorly with the
> > standard input split size, since the mean time to finish a split is
> > very small, versus gigantic memory requirements for large split
> > sizes.
> >
> > Time to play with parameters again ... since the answer doesn't
> > appear to be in working memory for the list.
> >
> > Ted Dunning wrote:
> >> What are your mappers doing that they run out of memory? Or is it
> >> your reducers?
> >>
> >> Often, you can write this sort of program so that you don't have
> >> higher memory requirements for larger splits.
> >>
> >> On 12/25/07 1:52 PM, "Jason Venner" <[EMAIL PROTECTED]> wrote:
> >>
> >>> We have tried reducing the number of splits by increasing the block
> >>> size to 10x and 5x 64 MB, but then we constantly have out-of-memory
> >>> errors and timeouts. At this point each JVM is getting 768 MB, and
> >>> I can't readily allocate more without dipping into swap.
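A hypothetical sketch of a reducer in the shape Jason describes, against the old org.apache.hadoop.mapred API (the Text types and the decision predicate are made up): streaming the ~10 values per key one at a time and collecting selectively keeps reduce-side memory flat regardless of the split size.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class CollectDecisionReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
            // Values arrive as a stream; only the current one is held in
            // memory, so reduce-side memory does not grow with split size.
            while (values.hasNext()) {
                Text value = values.next();
                if (shouldCollect(value)) {
                    output.collect(key, value);
                }
            }
        }

        // Hypothetical stand-in for the real per-value collect decision.
        private boolean shouldCollect(Text value) {
            return value.getLength() > 0;
        }
    }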