That info certainly gives me the ability to eliminate splitting, but fine control when splitting is in play remains a mystery. Is it possible to control it at the level of Input(nFiles, mSplits) -> MapperInvocations(yMaps, xRecordsPerMap)?
Consider the following examples (and please verify my conclusions):

- 1 file per map, 1 record per file, isSplitable(true or false): yields 1 record per mapper
- 1 file total, n records, isSplitable(true): yields a variable number of records across a variable number of mappers
- 1 file total, n records, isSplitable(false): yields n records into 1 mapper

What I am immediately looking for is a way to do:

- 1 file total, n records, isSplitable(true): yields 1 record into n mappers

But ultimately I need to fully control the file/record distribution.

Lance
IBM Software Group - Strategy
Performance Architect, High-Performance On Demand Solutions (HiPODS)
650-678-8425 cell

Arun C Murthy <[EMAIL PROTECTED]> wrote on 10/17/2007 01:05 AM
To: hadoop-user@lucene.apache.org
Subject: Re: InputFiles, Splits, Maps, Tasks

Lance,

On Tue, Oct 16, 2007 at 11:27:54PM -0700, Lance Amundsen wrote:
> I am struggling to control the behavior of the framework. The first
> problem is simple: I want to run many simultaneous mapper tasks on each
> node. I've scoured the forums, done the obvious, and I still typically
> get only 2 tasks per node at execution time. If it is a big job,
> sometimes I see 3. Note that the administrator reports 40 tasks/node in
> the config, but the most I've ever seen running is 3 (and this with a
> single input file of 10,000 records, magically yielding 443 maps).
>
> And magically is the next issue. I want fine-grained control over the
> input file, input record count, to maps relationship. For my immediate
> problem, I want to use a single input file with a number of records
> yielding exactly that number of maps (all kicked off simultaneously,
> BTW). Since I did not get this behavior with the standard
> InputFileFormat, I created my own input format class and record reader,
> and am now getting the "1 file with n recs to n maps" relationship...
> but the problem is that I am not even sure why.
>
> Any guidance appreciated.

I'm in the process of documenting these better (http://issues.apache.org/jira/browse/HADOOP-2046); meanwhile, here are some pointers:

http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces
http://wiki.apache.org/lucene-hadoop/FAQ#10

Hope this helps...

Arun
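For the "1 file, n records, n mappers" case being discussed, the usual approach is to make the input format emit one split per record, since the framework launches one map task per split. Hadoop's actual classes (FileInputFormat, FileSplit) are not reproduced here; this is only a self-contained sketch of the offset arithmetic such a getSplits() override would perform, assuming fixed-length records. The Split class and the names recordLen/fileLen are illustrative stand-ins, not Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

public class OneRecordSplits {

    // Illustrative stand-in for a (start, length) file split.
    static final class Split {
        final long start, length;
        Split(long start, long length) { this.start = start; this.length = length; }
    }

    // One split per fixed-length record: a file of n records produces
    // n splits, so the framework would schedule n map tasks.
    static List<Split> splitsPerRecord(long fileLen, long recordLen) {
        List<Split> splits = new ArrayList<>();
        for (long off = 0; off < fileLen; off += recordLen) {
            // The final record may be short if the file is not an exact multiple.
            splits.add(new Split(off, Math.min(recordLen, fileLen - off)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Split> s = splitsPerRecord(1000, 100);
        System.out.println(s.size());        // 10 splits for 10 records
        System.out.println(s.get(9).start);  // last split starts at offset 900
    }
}
```

In a real InputFormat, each Split would become a FileSplit over the input file, and the matching record reader would read exactly one record starting at its split's offset. Variable-length records need an index or a record-boundary scan instead of this fixed-stride loop.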