using map output fetch failures to blacklist nodes is problematic
-----------------------------------------------------------------

                 Key: MAPREDUCE-1800
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1800
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Joydeep Sen Sarma


If a mapper and a reducer cannot communicate, either party could be at 
fault. The current Hadoop protocol allows reducers to declare the node running 
the mapper to be at fault. When a sufficient number of reducers do so, the 
map node can be blacklisted. 
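
For illustration, a minimal sketch of this kind of one-sided heuristic (class 
name, field names, and the threshold are hypothetical, not the actual 
JobTracker code):

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative sketch of the fetch-failure heuristic described above.
 * A reducer that cannot fetch a map's output reports a failure against
 * the host serving that output; once enough distinct reducers have
 * complained, the host is blacklisted. All names and the threshold are
 * hypothetical, not the real Hadoop implementation.
 */
public class FetchFailureTracker {
    // Hypothetical threshold: distinct reducers that must report
    // failures against a host before it gets blacklisted.
    private static final int FETCH_FAILURE_THRESHOLD = 3;

    // mapper host -> reducer task ids that reported fetch failures
    private final Map<String, Set<String>> failuresByHost = new HashMap<>();
    private final Set<String> blacklistedHosts = new HashSet<>();

    /** Called when a reducer reports it could not fetch map output. */
    public synchronized void reportFetchFailure(String mapperHost,
                                                String reducerTaskId) {
        failuresByHost
            .computeIfAbsent(mapperHost, h -> new HashSet<>())
            .add(reducerTaskId);
        // Note the asymmetry: only the mapper side is ever penalized,
        // even though the reducer (or the network path between them)
        // may equally be at fault.
        if (failuresByHost.get(mapperHost).size() >= FETCH_FAILURE_THRESHOLD) {
            blacklistedHosts.add(mapperHost);
        }
    }

    public synchronized boolean isBlacklisted(String host) {
        return blacklistedHosts.contains(host);
    }
}
{code}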

In cases where networking problems cause substantial degradation in 
communication across sets of nodes, a large number of nodes can become 
blacklisted as a result of this protocol. The blacklisting is often wrong 
(reducers on the smaller side of a network partition can collectively cause 
nodes on the larger side of the partition to be blacklisted) and 
counterproductive (rerunning maps puts further load on the already maxed-out 
network links). For example, if a single rack of reducers is cut off from the 
rest of the cluster, each of those reducers fails to fetch from every map 
node, and their combined reports can blacklist most of the healthy cluster.

We should revisit how we can better identify nodes with genuine network 
problems (and what role, if any, map-output fetch failures have in this).
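
As a purely illustrative sketch of one possible direction (not a design this 
issue commits to): the tracker could require corroborating reports from 
reducers spread across the network topology before concluding the mapper 
side is at fault, so a partitioned-off group of reducers cannot blacklist 
healthy nodes on its own. The rack-based grouping and threshold below are 
assumptions for the sketch:

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Sketch only: blacklist a mapper host when reducers on several
 * distinct racks report fetch failures against it. A single
 * partitioned rack then cannot cross the threshold by itself.
 * The rack value would come from the cluster's topology mapping;
 * the threshold is hypothetical.
 */
public class CorroboratedFailureTracker {
    // Hypothetical: distinct racks that must independently complain.
    private static final int MIN_DISTINCT_RACKS = 2;

    // mapper host -> racks whose reducers reported fetch failures
    private final Map<String, Set<String>> reportingRacksByHost = new HashMap<>();
    private final Set<String> blacklistedHosts = new HashSet<>();

    public synchronized void reportFetchFailure(String mapperHost,
                                                String reducerRack) {
        reportingRacksByHost
            .computeIfAbsent(mapperHost, h -> new HashSet<>())
            .add(reducerRack);
        // Blacklist only when reducers in unrelated parts of the
        // network agree; failure reports concentrated on one side of
        // a partition stay below the threshold.
        if (reportingRacksByHost.get(mapperHost).size() >= MIN_DISTINCT_RACKS) {
            blacklistedHosts.add(mapperHost);
        }
    }

    public synchronized boolean isBlacklisted(String host) {
        return blacklistedHosts.contains(host);
    }
}
{code}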
