You said arbitrary... maybe I missed something. Can I construct a
getSplits() method that chunks up the file however I want? I assumed I
needed to return a split map that corresponded to key/value boundaries,
which I am having trouble doing since the input format for the value can
change (I am now using ObjectWritable, for example... GenericWritable makes
this impossible, I think). Then there is the file header to take into
account... but if I am making this too complicated, let me know.
Setting the number of maps as a method of forcing record/file/mapper
relationships appears dicey at best. This is the sort of thing I am seeing
on my 9-node setup:

1 file, 1,000 records, 1,000 maps requested yields 43 actual maps
1 file, 10,000 records, 10,000 maps requested yields 430 actual maps

In all of these cases I can only get 2 tasks/node running at the same
time... once in a while 3 run... even though I have configured a higher
number to be allowed.
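The gap between requested and actual maps follows from how FileInputFormat's getSplits() clamps the split size: the requested map count only sets a goal size, which is then bounded below by the minimum split size. A simplified sketch of that arithmetic (the concrete file and minimum-split-size values below are illustrative assumptions, not your cluster's actual settings; the real method also has a slop factor this omits):

```java
// Simplified sketch of the split-size arithmetic in Hadoop's old
// FileInputFormat.getSplits(). The requested number of maps is only a
// hint: it sets goalSize, which is then clamped by minSize and blockSize.
public class SplitMath {
    public static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static long numActualSplits(long totalSize, int requestedMaps,
                                       long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(requestedMaps, 1);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        // ceiling division: any remaining tail of bytes becomes a split
        return (totalSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        // Illustrative only: a 64 MB file with an effective minimum split
        // size of 1.5 MB yields 43 splits no matter how many maps you ask for.
        System.out.println(numActualSplits(64L << 20, 1000, 3L << 19, 64L << 20));
        // prints 43
    }
}
```

The practical consequence is that asking for more maps can never push the split size below the configured minimum, which is why the requested count is silently scaled down.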
I want 1 map per record, from one file, for any number of records, and I
want it guaranteed. Later I may want 10 records, or 100, but right now I
want to force a one-record-per-mapper relationship, and I do not want to
pay the file creation overhead of, say, 1000 files just to get 1000 maps.
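Hadoop won't guarantee that relationship on its own, but a custom getSplits() can, if you keep a side index of record start offsets (written when the file is created) and emit one split per record, or per N records. A minimal sketch of just the split computation, with a hypothetical RecordSplit value type standing in for Hadoop's FileSplit:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derive one split per record (or per N records)
// from an index of record start offsets kept alongside the data file.
public class PerRecordSplitter {
    // Stand-in for Hadoop's FileSplit: a (start, length) byte range.
    public static final class RecordSplit {
        public final long start, length;
        RecordSplit(long start, long length) { this.start = start; this.length = length; }
    }

    // offsets[i] is the byte offset of record i; fileLength bounds the last
    // record. With recordsPerSplit = 1 this yields exactly one map per record.
    public static List<RecordSplit> getSplits(long[] offsets, long fileLength,
                                              int recordsPerSplit) {
        List<RecordSplit> splits = new ArrayList<>();
        for (int i = 0; i < offsets.length; i += recordsPerSplit) {
            int next = Math.min(i + recordsPerSplit, offsets.length);
            long end = (next < offsets.length) ? offsets[next] : fileLength;
            splits.add(new RecordSplit(offsets[i], end - offsets[i]));
        }
        return splits;
    }
}
```

The file header drops out naturally here: the first indexed offset simply starts after it, so no split ever covers it. Changing recordsPerSplit later gives you the 10- or 100-record groupings without touching the data file.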
BTW, I have started working on my own InputFormat and InputFileClasses as
well.... and these questions are helping me with context, but tactically I
am just trying to understand the Hadoop mapper startup overhead with the
goals of a) reducing it, and b) making the overhead stay constant (flat)
out to n mappers on m nodes.
Lance
IBM Software Group - Strategy
Performance Architect
High-Performance On Demand Solutions (HiPODS)
650-678-8425 cell
Doug Cutting <[EMAIL PROTECTED]>
To: [email protected]
Date: 10/18/2007 04:04 PM
Subject: Re: InputFiles, Splits, Maps, Tasks
Please respond to: [EMAIL PROTECTED]
Lance Amundsen wrote:
> Thx, I'll give that a try. Seems to me a method to tell hadoop to split a
> file every "n" key/value pairs would be logical. Or maybe a
> createSplitBoundary when appending key/value records?
Splits should not require examining the data: that's not scalable. So
they're instead on arbitrary byte boundaries.
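Arbitrary byte boundaries stay correct because of a reader-side convention (this is how TextInputFormat handles newline-delimited text): each reader skips forward to the first record start at or after its split's start offset, and reads past its split's end to finish the last record it began. A small sketch of the resulting invariant, assuming newline-delimited records: the function counts records whose first byte falls inside a byte range, and however you place the boundary, every record is counted exactly once across the splits.

```java
// Sketch of why byte-boundary splitting loses and duplicates nothing:
// a record "belongs" to the split containing its first byte, so each
// record is processed by exactly one map regardless of boundary placement.
// Newline-delimited records are assumed for simplicity.
public class ByteBoundaryDemo {
    // Count records whose start byte lies in [start, end); a record starts
    // at offset 0 or immediately after a '\n'.
    public static int recordsInSplit(byte[] data, long start, long end) {
        int count = 0;
        for (long i = start; i < end; i++) {
            boolean atRecordStart = (i == 0) || data[(int) (i - 1)] == '\n';
            if (atRecordStart) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbbb\ncc\n".getBytes();   // 3 records
        long mid = 5;                                // boundary inside "bbbb"
        // The two splits together still see exactly 3 records.
        System.out.println(recordsInSplit(data, 0, mid)
                + recordsInSplit(data, mid, data.length));
        // prints 3
    }
}
```

This is what makes split computation independent of the data: the InputFormat only does arithmetic on byte offsets, and the RecordReader pays the small cost of resynchronizing at read time.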
> I just want a way, and not a real complex way, of directing the # of maps
> and the breakdown of records going to them. Creating a separate file per
> record group is too slow for my purposes.
Just set the number of map tasks. That should mostly do what you want
in this case. If you want finer-grained control, implement your own
InputFormat.
Doug