For right now, I am testing boundary conditions related to startup costs.
I want to build a mapper interface whose performance stays relatively flat
with respect to the number of mappers.  My goal is to dramatically reduce
the startup cost for one mapper, and then make sure that that cost does not
increase dramatically as nodes, maps, and records are scaled up.
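
To get a clean baseline, the probe I have in mind is nothing fancier than a
do-nothing map-only job, timed end to end so that wall time is dominated by
framework startup and scheduling.  A rough, untested sketch against the
org.apache.hadoop.mapred API (the input/output paths are placeholders, and
depending on the Hadoop version the path helpers may live on JobConf rather
than on FileInputFormat/FileOutputFormat):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class StartupCostProbe {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(StartupCostProbe.class);
    conf.setJobName("startup-cost-probe");

    // Identity map, zero reduces: the job does essentially nothing,
    // so elapsed time is dominated by startup/scheduling overhead.
    conf.setMapperClass(IdentityMapper.class);
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // placeholder input
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // placeholder output

    long start = System.currentTimeMillis();
    JobClient.runJob(conf);   // blocks until the job completes
    System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
  }
}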

Example: let's say I have 10K one-second jobs and I want the whole thing to
run in 2 seconds.  I currently see no way for Hadoop to achieve this, but I
also see how to get there, and this level of granularity would be one of
the requirements... I believe.
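
For what it's worth, the mechanism I am picturing for that granularity is
an InputFormat that hands back one split per record, so n records fan out
to n mappers.  A rough, untested sketch, assuming fixed-length
newline-terminated records (the conf key and the 128-byte default are made
up for illustration):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class OneRecordPerMapInputFormat extends TextInputFormat {

  public InputSplit[] getSplits(JobConf job, int ignored) throws IOException {
    // Made-up conf key: every record is exactly this many bytes,
    // including the trailing newline.
    long recLen = job.getLong("sketch.record.length", 128);

    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (Path p : getInputPaths(job)) {
      FileSystem fs = p.getFileSystem(job);
      long fileLen = fs.getFileStatus(p).getLen();
      // One split per record; the framework then runs one map per split.
      for (long off = 0; off < fileLen; off += recLen) {
        splits.add(new FileSplit(p, off, Math.min(recLen, fileLen - off),
                                 new String[0]));
      }
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }
}

Since each split starts exactly on a record boundary, the stock line
reader should hand each mapper exactly its one record; variable-length
records would need an index file or a boundary scan instead.  You would
wire it in with conf.setInputFormat(OneRecordPerMapInputFormat.class).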

Lance

IBM Software Group - Strategy
Performance Architect
High-Performance On Demand Solutions (HiPODS)

650-678-8425 cell




                                                                           
Ted Dunning <[EMAIL PROTECTED]> wrote on 10/17/2007 11:34 AM:
To: <hadoop-user@lucene.apache.org>
Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
Please respond to: [EMAIL PROTECTED]

On 10/17/07 10:37 AM, "Lance Amundsen" <[EMAIL PROTECTED]> wrote:

> 1 file per map, 1 record per file, isSplitable(true or false):  yields 1
> record per mapper

Yes.

> 1 file total, n records, isSplitable(true):  Yields variable n records
> per variable m mappers

Yes.

> 1 file total, n records, isSplitable(false):  Yields n records into 1
> mapper

Yes.

> What I am immediately looking for is a way to do:
>
> 1 file total, n records, isSplitable(true): Yields 1 record into n
> mappers
>
> But ultimately I need to fully control the file/record
> distributions.

Why in the world do you need this level of control?  Isn't that the point
of frameworks like Hadoop? (to avoid the need for this)


