There are lots of references to decreasing the DFS block size in order to
increase the ratio of maps to records.  What is the easiest way to do this?
Is it possible with the standard SequenceFile class?

Lance

IBM Software Group - Strategy
Performance Architect
High-Performance On Demand Solutions (HiPODS)

650-678-8425 cell
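
The usual recipe behind those references is to lower dfs.block.size on the
client that writes the file: the block size is fixed per file at creation
time, and the stock FileInputFormat cuts roughly one split (and so one map)
per block.  A minimal sketch follows; the path, the 1 MB figure, and the
class name are illustrative only, not a recommendation.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Write a SequenceFile with a much-smaller-than-default DFS block size,
    // so the same number of records spreads across more splits and
    // therefore more map tasks.
    public class SmallBlockSeqFileWriter {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // dfs.block.size is read client-side at file-creation time.  1 MB
        // is purely illustrative; it must remain a multiple of
        // io.bytes.per.checksum (512 by default).
        conf.setLong("dfs.block.size", 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/small-block.seq");   // hypothetical path
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, LongWritable.class, Text.class);
        try {
          for (long i = 0; i < 10000; i++) {
            writer.append(new LongWritable(i), new Text("record " + i));
          }
        } finally {
          writer.close();
        }
      }
    }

Depending on the release, the numSplits hint from JobConf.setNumMapTasks()
also feeds into the split-size calculation, so it may be worth trying before
rewriting data with a non-default block size.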




                                                                           
On 10/17/07 12:49 PM, Ted Dunning <[EMAIL PROTECTED]> wrote (Re: InputFiles,
Splits, Maps, Tasks Questions 1.3 Base):

In practice, most jobs involve many more records than there are available
mappers (even for large clusters).

This means every mapper handles many records, so the mapper startup cost is
amortized across a large amount of work.

It would still be nice to have a smaller startup cost, but the limiting
factor is likely to be the job tracker shipping all of the jar files to the
task trackers, not the map construction time itself.

If you really care about map instantiation time, you could start by making
the map run in the same VM.  That doesn't sound like a good trade-off to me,
which in turn tells me that I don't care about startup costs all that much.
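
(For a rough sense of what same-VM execution would buy, one readily
available baseline is the local runner, which executes every task inside the
submitting JVM with no task tracker and no per-task VM launch at all.  A
sketch, with everything except the two relevant settings omitted:

    import java.io.IOException;

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Measurement baseline only: run the whole job inside the submitting
    // JVM via the local runner, so there is no per-task VM launch at all.
    public class LocalRunnerBaseline {
      public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(LocalRunnerBaseline.class);
        job.set("mapred.job.tracker", "local");   // in-process LocalJobRunner
        job.set("fs.default.name", "file:///");   // keep the data local too
        // ... mapper class, input/output paths, etc. omitted ...
        JobClient.runJob(job);
      }
    }

That isolates instantiation cost, but it also gives up the cluster entirely,
which is more or less the trade-off being described here.)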

It is not all that surprising if small jobs are not something that can be
sped up.  The fact that parallelism is generally easier to attain for large
problems has been noticed for some time.


On 10/17/07 11:47 AM, "Lance Amundsen" <[EMAIL PROTECTED]> wrote:

> For right now, I am testing boundary conditions related to startup costs.
> I want to build a mapper interface that performs relatively flatly WRT
> numbers of mappers.  My goal is to dramatically improve startup costs for
> one mapper, and then make sure that that startup cost does not increase
> dramatically as nodes, maps, and records are increased.
>
> Example: let's say I have 10K one-second jobs and I want the whole thing to
> run in 2 seconds.  I currently see no way for Hadoop to achieve this, but I
> also see how to get there, and this level of granularity would be one of
> the requirements..... I believe.
>
> Lance
>
> IBM Software Group - Strategy
> Performance Architect
> High-Performance On Demand Solutions (HiPODS)
>
> 650-678-8425 cell
>
>
>
>
>
> On 10/17/07 11:34 AM, Ted Dunning <[EMAIL PROTECTED]> wrote (Re: InputFiles,
> Splits, Maps, Tasks Questions 1.3 Base):
>
> On 10/17/07 10:37 AM, "Lance Amundsen" <[EMAIL PROTECTED]> wrote:
>
>> 1 file per map, 1 record per file, isSplitable(true or false):  yields 1
>> record per mapper
>
> Yes.
>
>> 1 file total, n records, isSplitable(true):  Yields variable n records per
>> variable m mappers
>
> Yes.
>
>> 1 file total, n records, isSplitable(false):  Yields n records into 1
>> mapper
>
> Yes.
>
>> What I am immediately looking for is a way to do:
>>
>> 1 file total, n records, isSplitable(true): Yields 1 record into n mappers
>>
>> But ultimately I need to fully control the file/record distributions.
>
> Why in the world do you need this level of control?  Isn't that the point of
> frameworks like Hadoop? (to avoid the need for this)
>
>
>
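
The "1 record into n mappers" case asked for above is not something the
stock input formats provide directly, but the hook is there: getSplits() can
be overridden to cut a split at every record boundary.  A sketch against the
old org.apache.hadoop.mapred API, with the class name hypothetical and the
details untested:

    import java.io.IOException;
    import java.util.ArrayList;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.util.ReflectionUtils;

    // One FileSplit per record: a single file with n records fans out to
    // n map tasks.  getSplits() scans each input file once and records the
    // byte offset at which every record starts (assumes the file is not
    // block-compressed, so getPosition() advances record by record).
    public class OneRecordPerSplitInputFormat extends SequenceFileInputFormat {

      public InputSplit[] getSplits(JobConf job, int ignoredNumSplits)
          throws IOException {
        ArrayList<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus status : listStatus(job)) {
          Path file = status.getPath();
          FileSystem fs = file.getFileSystem(job);
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, job);
          try {
            Writable key = (Writable)
                ReflectionUtils.newInstance(reader.getKeyClass(), job);
            long start = reader.getPosition();   // first record starts here
            while (reader.next(key)) {           // step record by record
              long end = reader.getPosition();
              splits.add(new FileSplit(file, start, end - start,
                                       (String[]) null));
              start = end;
            }
          } finally {
            reader.close();
          }
        }
        return splits.toArray(new InputSplit[splits.size()]);
      }
    }

One caveat: the stock SequenceFileRecordReader only lines a split boundary up
with a record when a sync mark sits in front of it, so per-record splits
behave as intended only if the file was written with a sync after every
record (SequenceFile.Writer.sync()) or if a matching custom RecordReader is
supplied.  Constructor signatures also shifted between releases around this
time, so treat the above as a starting point rather than a drop-in class.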


