[ 
http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363616 ] 

Mike Cafarella commented on NUTCH-183:
--------------------------------------


  One more thing:

  You'll see in this patch that ReduceTasks contain a 2D array of map tasks.
They used to contain a 1D array, one map task for each map-split of the data.

  This 2D business is less than perfect, so let me explain...

  Each Reduce task needs to know when its map task predecessors have completed.
In the old days, this was easy.  There was one map id for each split of the
data, so for k splits, there were k map ids to know about.

  But in a world with speculative execution and early-start of reduce tasks,
several map tasks may be working on the same split of data.  So for each of
the k splits, there could be up to M tasks working on it.  Thus, the reduce
task knows about up to k * M map ids.  Once at least one of the M tasks for
every split has completed, the reduce task knows it can move on.
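
  To make the 2D check concrete, here is a minimal Java sketch of the idea,
not the code in the patch.  The class name, the method name, and the use of
plain String ids are all assumptions made for illustration.

```java
import java.util.Set;

// Hypothetical sketch; none of these names come from jobtracker.patch.
final class MapOutputReadiness2D {
    /**
     * mapIdsBySplit[i] holds the ids of every map task launched for split i
     * (up to M per split). The reduce task may move on once each split has
     * at least one completed map task.
     */
    static boolean allSplitsDone(String[][] mapIdsBySplit, Set<String> completedMapIds) {
        for (String[] attempts : mapIdsBySplit) {
            boolean splitDone = false;
            for (String mapId : attempts) {
                if (completedMapIds.contains(mapId)) {
                    splitDone = true;   // one finished task is enough for this split
                    break;
                }
            }
            if (!splitDone) {
                return false;           // this split has no finished map task yet
            }
        }
        return true;
    }
}
```

  The nested loop is the cost of tracking every map id in the reduce task:
it has to scan up to k * M ids just to answer "is every split done?".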

  But this is a little silly.  The JobTracker has a "TaskInProgress"
abstraction that represents the idea of a "split's worth of work".  A single
TIP contains M map task ids.  Instead of the reduce task looking all over for
map task ids, it should just deal with TIP ids.  That way, we're back to a 1D
array, and the reduce task code is easier to understand.
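
  For comparison, a minimal Java sketch of what the 1D version could look
like if the reduce task tracked TIP ids only.  This is an assumption about a
future design, not existing code; the class and method names are hypothetical.

```java
import java.util.Set;

// Hypothetical sketch; the reduce task only knows one id per split (the TIP id).
final class MapOutputReadiness1D {
    /**
     * tipIds has one entry per split. Which of the M map tasks finished a
     * given TIP is the JobTracker's business, not the reduce task's.
     */
    static boolean allSplitsDone(String[] tipIds, Set<String> completedTipIds) {
        for (String tipId : tipIds) {
            if (!completedTipIds.contains(tipId)) {
                return false;   // this split's TIP has not completed yet
            }
        }
        return true;
    }
}
```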

  Anyway, it will still work as is.  I'll improve the code in a future patch.
For the moment, I'll let this patch stand.  I just wanted to let people know...


> MapReduce has a series of problems concerning task-allocation to worker nodes
> -----------------------------------------------------------------------------
>
>          Key: NUTCH-183
>          URL: http://issues.apache.org/jira/browse/NUTCH-183
>      Project: Nutch
>         Type: Improvement
>  Environment: All
>     Reporter: Mike Cafarella
>  Attachments: jobtracker.patch
>
> The MapReduce JobTracker is not great at allocating tasks to TaskTracker
> worker nodes.
> Here are the problems:
> 1) There is no speculative execution of tasks
> 2) Reduce tasks must wait until all map tasks are completed before doing
>    any work
> 3) TaskTrackers don't distinguish between Map and Reduce jobs.  Also, the
>    number of tasks at a single node is limited to some constant.  That
>    means you can get weird deadlock problems upon machine failure.  The
>    reduces take up all the available execution slots, but they don't do
>    productive work, because they're waiting for a map task to complete.
>    Of course, that map task won't even be started until the reduce tasks
>    finish, so you can see the problem...
> 4) The JobTracker is so complicated that it's hard to fix any of these.
> The right solution is a rewrite of the JobTracker to be a lot more flexible
> in task handling.
> It has to be a lot simpler.  One way to make it simpler is to add an
> abstraction I'll call "TaskInProgress".  Jobs are broken into chunks called
> TasksInProgress.  All the TaskInProgress objects must be complete, somehow,
> before the Job is complete.
> A single TaskInProgress can be executed by one or more Tasks.  TaskTrackers
> are assigned Tasks.
> If a Task fails, we report it back to the JobTracker, where the
> TaskInProgress lives.  The TIP can then decide whether to launch additional
> Tasks or not.
> Speculative execution is handled within the TIP.  It simply launches
> multiple Tasks in parallel.  The TaskTrackers have no idea that these Tasks
> are actually doing the same chunk of work.  The TIP is complete when any
> one of its Tasks is complete.
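
The TaskInProgress idea from the issue description can be sketched roughly
as below.  This is a hedged illustration, not the code in jobtracker.patch:
the class name, the maxAttempts cap, the task id format, and the relaunch
policy in taskFailed() are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one "chunk of work" that several Tasks may execute.
final class TaskInProgressSketch {
    private final String tipId;
    private final int maxAttempts;            // assumed cap on Tasks per TIP
    private final List<String> runningTaskIds = new ArrayList<>();
    private int launched = 0;
    private boolean complete = false;

    TaskInProgressSketch(String tipId, int maxAttempts) {
        this.tipId = tipId;
        this.maxAttempts = maxAttempts;
    }

    /** Starts one more Task for this chunk; returns null if none should start. */
    String launchTask() {
        if (complete || launched >= maxAttempts) {
            return null;
        }
        String taskId = tipId + "_" + launched++;
        runningTaskIds.add(taskId);           // calling this twice gives speculative execution
        return taskId;
    }

    /** The TIP is complete as soon as any one of its Tasks completes. */
    void taskSucceeded(String taskId) {
        if (runningTaskIds.remove(taskId)) {
            complete = true;
        }
    }

    /** Records a failure; returns true if the TIP wants a replacement Task. */
    boolean taskFailed(String taskId) {
        runningTaskIds.remove(taskId);
        // Assumed policy: relaunch only when no sibling Task is still running.
        return !complete && runningTaskIds.isEmpty() && launched < maxAttempts;
    }

    boolean isComplete() {
        return complete;
    }
}
```

In this sketch a TaskTracker would only ever see the task id returned by
launchTask(); the fact that two ids belong to the same chunk of work stays
inside the TIP.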

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


