That info certainly lets me eliminate splitting, but fine control when
splitting is in play remains a mystery. Is it possible to control things
at the level of Input(nFiles, mSplits) -> MapperInvocations(yMaps,
xRecordsPerMap)?
Consider the following examples (and please verify my conclusions):
1 file per map, 1 record per file, isSplitable(true or false): yields 1
record per mapper
1 file total, n records, isSplitable(true): yields a variable number of
records per mapper across a variable number m of mappers
1 file total, n records, isSplitable(false): yields n records into 1
mapper
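
To illustrate why the splittable case gives a variable number of maps, here is a rough Python model of how a file-based input format might pick split boundaries. It assumes the split size is max(minSize, min(goalSize, blockSize)) with goalSize = fileSize / requestedMaps, which is my understanding of the stock behavior but should be checked against the release in use. The function names are hypothetical and this is not Hadoop code.

```python
def compute_split_size(goal_size, min_size, block_size):
    # Assumed rule: never below min_size, never above the block size.
    return max(min_size, min(goal_size, block_size))


def plan_splits(file_size, requested_maps, block_size, min_size=1):
    """Return a list of (offset, length) byte ranges covering the file."""
    goal_size = file_size // max(requested_maps, 1)
    # Guard against a zero split size so the loop below always advances.
    split_size = max(1, compute_split_size(goal_size, min_size, block_size))
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


if __name__ == "__main__":
    # A 1000-byte file, 4 requested maps, 64 MB blocks.
    for split in plan_splits(1000, 4, 64 * 1024 * 1024):
        print(split)
```

Under these assumptions the splits are byte ranges, so the number of records landing in each map depends on record length rather than on a record count, and the requested number of maps is only a hint.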
What I am immediately looking for is a way to do:
1 file total, n records, isSplitable(true): yields 1 record into each of
n mappers
But ultimately I need to fully control the file/record distribution.
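
One way to picture the "1 record into n mappers" case is as a split planner that emits exactly one byte range per record. The Python sketch below assumes newline-terminated records and a file small enough to scan up front. In Hadoop the equivalent logic would presumably live in a custom input format's getSplits, but the helper names here are hypothetical and the code is only a model.

```python
def one_split_per_record(data: bytes):
    """Return one (offset, length) split for each line-delimited record."""
    splits = []
    offset = 0
    for line in data.splitlines(keepends=True):
        splits.append((offset, len(line)))
        offset += len(line)
    return splits


def read_split(data: bytes, split):
    """Return the record bytes covered by a split, without the line ending."""
    offset, length = split
    return data[offset:offset + length].rstrip(b"\r\n")


if __name__ == "__main__":
    sample = b"alpha\nbeta\ngamma\n"
    for s in one_split_per_record(sample):
        print(s, read_split(sample, s))
```

The cost of this approach is a full scan of the input at job-submission time to find the record boundaries, which may or may not be acceptable for large files.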
Lance
IBM Software Group - Strategy
Performance Architect
High-Performance On Demand Solutions (HiPODS)
650-678-8425 cell
From: Arun C Murthy <[EMAIL PROTECTED]>
To: [email protected]
Date: 10/17/2007 01:05 AM
Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
Please respond to: [EMAIL PROTECTED]
Lance,
On Tue, Oct 16, 2007 at 11:27:54PM -0700, Lance Amundsen wrote:
>
>I am struggling to control the behavior of the framework. The first
>problem is simple: I want to run many simultaneous mapper tasks on each
>node. I've scoured the forums, done the obvious, and I still typically
>get only 2 tasks per node at execution time. If it is a big job,
>sometimes I see 3. Note that the administrator reports 40 Tasks/Node in
>the config, but the most I've ever seen running is 3 (and this with a
>single input file of 10,000 records, magically yielding 443 maps).
>
>And magically is the next issue. I want fine-grained control over the
>relationship between input files, input record counts, and maps. For my
>immediate problem, I want to use a single input file with a number of
>records yielding the exact same number of maps (all kicked off
>simultaneously, BTW). Since I did not get this behavior with the
>standard InputFileFormat, I created my own input format class and record
>reader, and am now getting the "1 file with n recs to n maps"
>relationship.... but the problem is that I am not even sure why....
>
I'm in the process of documenting these better
(http://issues.apache.org/jira/browse/HADOOP-2046); meanwhile, here are
some pointers:
http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces
and
http://wiki.apache.org/lucene-hadoop/FAQ#10
Hope this helps...
Arun
>Any guidance appreciated.
>
>