I am starting to wonder if it might indeed be impossible to get map jobs
running w/o writing to the file system... as in, not w/o some major
changes to the JobTracker and TaskTracker code.

I was thinking about creating an InputFormat that does no file I/O and is
instead queue based.  As mappers start up, their getRecordReader() calls get
redirected to a remote queue to pull one or more records off of.  But I am
starting to wonder if the file system dependencies in the code are such
that I could never completely avoid using files.  Specifically, even if I
completely re-write an InputFormat, the framework is still going to try to
do filesystem stuff on everything I return (the extensive internal use of
splits is baffling me some).
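
For the record, here is roughly what I have in mind: a bare-bones sketch
against the old org.apache.hadoop.mapred API (the exact interfaces vary a bit
across the 0.1x releases; older ones use raw Writable signatures instead of
generics and also want a validateInput() method).  RemoteQueueClient is just
a made-up placeholder for whatever queue transport would actually be used,
and "queue.input.address" is an invented property name.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class QueueInputFormat implements InputFormat<LongWritable, Text> {

  // One placeholder "split" per requested map; it carries no file info at all.
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    InputSplit[] splits = new InputSplit[numSplits];
    for (int i = 0; i < numSplits; i++) {
      splits[i] = new QueueSplit(i);
    }
    return splits;
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new QueueRecordReader(job);
  }

  // The framework still serializes the split and asks for length/locations,
  // so those have to return something harmless even though there is no file.
  public static class QueueSplit implements InputSplit {
    private int id;
    public QueueSplit() { }
    public QueueSplit(int id) { this.id = id; }
    public long getLength() { return 0; }
    public String[] getLocations() { return new String[0]; }
    public void write(DataOutput out) throws IOException { out.writeInt(id); }
    public void readFields(DataInput in) throws IOException { id = in.readInt(); }
  }

  // Pulls records from a remote queue instead of reading a file.
  public static class QueueRecordReader implements RecordReader<LongWritable, Text> {
    private final RemoteQueueClient queue;
    private long count = 0;

    QueueRecordReader(JobConf job) {
      queue = new RemoteQueueClient(job.get("queue.input.address"));
    }
    public boolean next(LongWritable key, Text value) throws IOException {
      String record = queue.poll();      // null once the queue is drained
      if (record == null) return false;
      key.set(count++);
      value.set(record);
      return true;
    }
    public LongWritable createKey() { return new LongWritable(); }
    public Text createValue() { return new Text(); }
    public long getPos() { return count; }
    public float getProgress() { return 0.0f; }
    public void close() throws IOException { queue.close(); }
  }

  // Stand-in for whatever queue transport would actually be used.
  public static class RemoteQueueClient {
    public RemoteQueueClient(String address) { /* connect to 'address' here */ }
    public String poll() { return null; }   // stub: real client would fetch a record
    public void close() { }
  }
}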

Looking for some enlightening thoughts.




                                                                           
----- Original Message -----
From: Lance Amundsen/Rochester/[EMAIL PROTECTED]
Sent: 10/22/2007 09:02 PM
To: hadoop-user@lucene.apache.org
Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base

Just had a thought... I may not be seeing those additional tasks because
the startup time is washing things out.  In other words, when the
tasktracker is starting job 3, say, job 1 is already finishing.  I'll try a
pause in the mapper and see what happens.
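
Something like this, maybe (old mapred API again; "map.test.pause.ms" is just
a name I'm making up for the test knob):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Pass-through mapper that sleeps before emitting, so task startup time
// stops dominating and the concurrent task slots become visible.
public class PausingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  private long pauseMs;

  public void configure(JobConf job) {
    // "map.test.pause.ms" is a made-up property name for this experiment
    pauseMs = job.getLong("map.test.pause.ms", 10000);
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    try {
      Thread.sleep(pauseMs);      // sleeps per record, so keep the inputs tiny
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    reporter.progress();          // keep the task from being marked as hung
    output.collect(key, value);   // otherwise just pass the record through
  }
}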




----- Original Message -----
From: Lance Amundsen/Rochester/[EMAIL PROTECTED]
Sent: 10/22/2007 07:42 PM
To: "hadoop-user" <hadoop-user@lucene.apache.org>
Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base

It has had no effect for me, however... not sure why.  The admin UI reports
10 tasks per node as possible, but I am not seeing it.


----- Original Message -----
From: Ted Dunning [EMAIL PROTECTED]
Sent: 10/22/2007 08:29 PM
To: <hadoop-user@lucene.apache.org>
Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base




You probably have determined by now that there is a parameter that
determines how many concurrent maps there are.

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>3</value>
  <description>The maximum number of tasks that will be run
        simultaneously by a task tracker.
  </description>
</property>
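
One quick sanity check when a change seems to have no effect is to load the
configuration the same way the daemons do and print what it resolves to.
Rough sketch below; as far as I know the tasktracker reads this property at
startup, so it has to be in the hadoop-site.xml on each worker node and the
tasktrackers restarted afterwards.

import org.apache.hadoop.mapred.JobConf;

public class CheckMaxTasks {
  public static void main(String[] args) {
    // new JobConf() loads hadoop-default.xml / hadoop-site.xml from the classpath
    JobConf conf = new JobConf();
    int max = conf.getInt("mapred.tasktracker.tasks.maximum", -1);  // -1 = not set anywhere
    System.out.println("mapred.tasktracker.tasks.maximum = " + max);
  }
}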

Btw... I am still curious about your approach.  Isn't it normally better to
measure marginal costs such as this startup cost by linear regression as you
change parameters?  It seems that otherwise, you will likely be misled by
what happens at the boundaries, when what you really want is what happens in
the normal operating region.
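
For example (with made-up numbers): run the same job at a few different map
counts, record the elapsed times, and do a least-squares fit.  Roughly
speaking, the slope is then the marginal per-map cost and the intercept the
fixed job overhead, which tells you more than any single boundary case.

public class StartupCostFit {
  public static void main(String[] args) {
    double[] maps    = {  1,   2,   4,   8,  16 };   // parameter being varied
    double[] seconds = { 24,  27,  33,  46,  71 };   // hypothetical run times

    int n = maps.length;
    double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
    for (int i = 0; i < n; i++) {
      sumX  += maps[i];
      sumY  += seconds[i];
      sumXY += maps[i] * seconds[i];
      sumXX += maps[i] * maps[i];
    }
    // ordinary least squares: seconds ~ intercept + slope * maps
    double slope     = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    double intercept = (sumY - slope * sumX) / n;

    System.out.println("per-map marginal cost ~ " + slope + " s");
    System.out.println("fixed overhead        ~ " + intercept + " s");
  }
}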




On 10/22/07 5:53 PM, "Lance Amundsen" <[EMAIL PROTECTED]> wrote:

> ...
>
> Next I want to increase the concurrent # of tasks being executed for each
> node... currently it seems like 2 or 3 is the upper limit (at least on the
> earlier binaries I was running).
>




