OK, that is encouraging. I'll take another pass at it. I succeeded yesterday
with an in-memory-only InputFormat, but only after I commented out some of
the split-referencing code in MapTask.java, like the following:
  if (instantiatedSplit instanceof FileSplit) {
    FileSplit fileSplit = (FileSplit) instantiatedSplit;
    job.set("map.input.file", fileSplit.getPath().toString());
    job.setLong("map.input.start", fileSplit.getStart());
    job.setLong("map.input.length", fileSplit.getLength());
  }

But maybe I simply need to override more methods in more of the embedded
classes. You can see why I was wondering about the file system dependencies.

Doug Cutting <[EMAIL PROTECTED]> wrote on 10/24/2007 09:02 AM
To: hadoop-user@lucene.apache.org
Subject: Re: InputFiles, Splits, Maps, Tasks

Lance Amundsen wrote:
> I am starting to wonder if it might indeed be impossible to get map jobs
> running w/o writing to the file system.... as in, not w/o some major
> changes to the job and task tracker code.
>
> I was thinking about creating an InputFormat that does no file I/O and is
> instead queue based. As mappers start up, their getRecordReader calls get
> re-directed to a remote queue to pull one or more records off of. But I am
> starting to wonder if the file system dependencies in the code are such
> that I could never completely avoid using files. Specifically, even if I
> completely re-write an InputFormat, the framework is still going to try to
> do Filesystem stuff on everything I return (the extensive internal use of
> splits is baffling me some).

Nothing internally should depend on an InputSplit representing a file.

You do need to be able to generate the complete set of splits when the job
is launched. So if you wanted maps to poll a queue for each task, then you'd
need to know how long the queue is when the job is launched, so that you
could generate the right number of polling splits.

Doug
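
For what it's worth, here is a minimal, untested sketch of the kind of
file-free format Doug describes, written against the old
org.apache.hadoop.mapred interfaces (the generic form; the non-generic
interfaces of that era have the same shape). The split carries a queue
endpoint and a record count fixed at launch time instead of a path. The
queue.endpoint and queue.length config keys and the pollQueue() helper are
hypothetical stand-ins for whatever remote queue client would really be
used, not Hadoop APIs.

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.InputFormat;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;

  public class QueueInputFormat implements InputFormat<LongWritable, Text> {

    // A split that references a queue rather than a file: no path, no offsets.
    public static class QueueSplit implements InputSplit {
      String endpoint;   // where this map should poll (hypothetical)
      long numRecords;   // how many records this map should pull

      public QueueSplit() {}  // no-arg constructor needed for deserialization
      public QueueSplit(String endpoint, long numRecords) {
        this.endpoint = endpoint;
        this.numRecords = numRecords;
      }

      public long getLength() { return numRecords; }  // scheduling hint only
      public String[] getLocations() { return new String[0]; }  // no locality

      public void write(DataOutput out) throws IOException {
        Text.writeString(out, endpoint);
        out.writeLong(numRecords);
      }
      public void readFields(DataInput in) throws IOException {
        endpoint = Text.readString(in);
        numRecords = in.readLong();
      }
    }

    // Doug's constraint lives here: the queue length must be knowable at
    // job-launch time so the complete set of splits is generated up front.
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
      String endpoint = job.get("queue.endpoint");      // hypothetical key
      long queueLen = job.getLong("queue.length", 0L);  // known at launch
      long per = (queueLen + numSplits - 1) / numSplits;  // ceiling division
      InputSplit[] splits = new InputSplit[numSplits];
      for (int i = 0; i < numSplits; i++) {
        long remaining = Math.max(0L, queueLen - i * per);
        splits[i] = new QueueSplit(endpoint, Math.min(per, remaining));
      }
      return splits;
    }

    public RecordReader<LongWritable, Text> getRecordReader(
        InputSplit split, JobConf job, Reporter reporter) throws IOException {
      final QueueSplit qs = (QueueSplit) split;
      return new RecordReader<LongWritable, Text>() {
        private long consumed = 0;

        public boolean next(LongWritable key, Text value) throws IOException {
          if (consumed >= qs.numRecords) return false;
          key.set(consumed++);
          value.set(pollQueue(qs.endpoint));  // remote poll, no file I/O
          return true;
        }
        public LongWritable createKey() { return new LongWritable(); }
        public Text createValue() { return new Text(); }
        public long getPos() { return consumed; }
        public float getProgress() {
          return qs.numRecords == 0 ? 1.0f : consumed / (float) qs.numRecords;
        }
        public void close() {}
      };
    }

    // Stand-in for whatever remote queue client would really be used.
    private static String pollQueue(String endpoint) { return "record"; }
  }

Note that the map.input.file code quoted above is already guarded by an
instanceof FileSplit check, so in principle a split like this should pass
through it untouched, without any commenting-out.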