Lance,

On Wed, Oct 17, 2007 at 10:37:58AM -0700, Lance Amundsen wrote:
>That info certainly gives me the ability to eliminate splitting, but
>fine-grained control when splitting is in play remains a mystery.  Is it
>possible to control at the level of Input(nFiles, mSplits) ->
>MapperInvocations(yMaps, xRecordsPerMap)?
>

>Consider the following examples (and pls. verify my conclusions):
>
>1 file per map, 1 record per file, isSplitable(true or false): yields 1
>record per mapper
>1 file total, n records, isSplitable(true): yields a variable number of
>records across a variable number of mappers
>1 file total, n records, isSplitable(false): yields n records into 1
>mapper
>

Correct, all of them.
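
(For the record, the isSplitable(false) case is just an override in your
input format. A minimal sketch against the 2007-era org.apache.hadoop.mapred
API; the class name is mine:)

    // Sketch: mark inputs unsplittable so each file becomes exactly one
    // split, and hence exactly one map.
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class WholeFileTextInputFormat extends TextInputFormat {
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;  // whole file -> single split -> single map
      }
    }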

>What I am immediately looking for is a way to do:
>
>1 file total, n records, isSplitable(true): Yields 1 record into n mappers
>
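
One way to get that behavior is to override getSplits() and emit one
FileSplit per record. Below is a rough, untested sketch against the
2007-era org.apache.hadoop.mapred API; the class name is mine, and it
assumes single-byte, '\n'-terminated text so that character counts equal
byte offsets:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class OneRecordPerMapInputFormat extends TextInputFormat {
      public InputSplit[] getSplits(JobConf job, int numSplits)
          throws IOException {
        ArrayList<InputSplit> splits = new ArrayList<InputSplit>();
        for (Path file : job.getInputPaths()) {
          FileSystem fs = file.getFileSystem(job);
          BufferedReader in =
              new BufferedReader(new InputStreamReader(fs.open(file)));
          long offset = 0;
          String line;
          while ((line = in.readLine()) != null) {
            long length = line.length() + 1;  // assumes a 1-byte '\n'
            splits.add(new FileSplit(file, offset, length, job));
            offset += length;
          }
          in.close();
        }
        return splits.toArray(new InputSplit[splits.size()]);
      }
    }

Since each split starts exactly on a record boundary, a record reader that
simply trusts the split bounds (like the one you wrote) will hand each map
exactly one record.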

>>http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces

There is a slightly obscure reference to it there; in any case, I've just
uploaded a documentation patch at HADOOP-2046 which should help.

Specifically the updated documentation for JobConf.setNumMapTasks (for now, in 
the patch only) reads:

-*-*-

*setNumMapTasks*

public void setNumMapTasks(int n)

Set the number of map tasks for this job.

Note: This is only a hint to the framework. The actual number of spawned map 
tasks depends on the number of InputSplits generated by the job's 
InputFormat.getSplits(JobConf, int). A custom InputFormat is typically used to 
accurately control the number of map tasks for the job.
How many maps?

The number of maps is usually driven by the total size of the inputs, i.e.
the total number of HDFS blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps
per node, although it has been set as high as 300 or so for very CPU-light
map tasks. Task setup takes a while, so it is best if the maps take at
least a minute to execute.

The default InputFormat behavior is to split the total number of bytes into the 
right number of fragments. However, the HDFS block size of the input files is 
treated as an upper bound for input splits.
A lower bound on the split size can be set via mapred.min.split.size.

Thus, if you expect 10TB of input data and have 128MB HDFS blocks, you'll end 
up with 82,000 maps, unless setNumMapTasks(int) is used to set it even higher.

Parameters:
n - the number of map tasks for this job.

-*-*-
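
To make those knobs concrete, a driver fragment might look like this (a
sketch; the 500 and 256MB values are arbitrary, for illustration only):

    JobConf conf = new JobConf();  // org.apache.hadoop.mapred.JobConf
    conf.setNumMapTasks(500);      // only a hint, per the above
    // Raise the lower bound on split size to 256MB for fewer, larger maps:
    conf.set("mapred.min.split.size", String.valueOf(256L * 1024 * 1024));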

hth,
Arun

>But ultimately I need to fully control the file/record distributions.
>
>
>
>Lance
>
>IBM Software Group - Strategy
>Performance Architect
>High-Performance On Demand Solutions (HiPODS)
>
>650-678-8425 cell
>
>
>
>
>Arun C Murthy <[EMAIL PROTECTED]>
>To: hadoop-user@lucene.apache.org
>Date: 10/17/2007 01:05 AM
>Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
>Please respond to: [EMAIL PROTECTED]
>
>Lance,
>
>On Tue, Oct 16, 2007 at 11:27:54PM -0700, Lance Amundsen wrote:
>>
>>I am struggling to control the behavior of the framework.  The first
>>problem is simple: I want to run many simultaneous mapper tasks on each
>>node.  I've scoured the forums, done the obvious, and I still typically
>>get only 2 tasks per node at execution time.  If it is a big job,
>>sometimes I see 3.  Note that the administrator reports 40 Tasks/Node in
>>the config, but the most I've ever seen running is 3 (and this with a
>>single input file of 10,000 records, magically yielding 443 maps).
>>
>>And "magically" is the next issue.  I want fine-grained control over the
>>input-file/input-records/maps relationship.  For my immediate problem, I
>>want to use a single input file with a number of records yielding the
>>exact same number of maps (all kicked off simultaneously, BTW).  Since I
>>did not get this behavior with the standard InputFileFormat, I created my
>>own input format class and record reader, and am now getting the "1 file
>>with n recs to n maps" relationship... but the problem is that I am not
>>even sure why....
>>
>
>I'm in the process of documenting these better (
>http://issues.apache.org/jira/browse/HADOOP-2046), meanwhile here are some
>pointers:
>http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces
>and
>http://wiki.apache.org/lucene-hadoop/FAQ#10
>
>Hope this helps...
>
>Arun
>
>>Any guidance appreciated.
>>
>>
>
>
