[ https://issues.apache.org/jira/browse/MAPREDUCE-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864270#action_12864270 ]
Amareshwari Sriramadasu commented on MAPREDUCE-1276:
----------------------------------------------------

I repeated the manual testing described in my earlier comment. I tried to simulate a read timeout for m_00001_0 by explicitly adding a sleep in TaskTracker.MapOutputServlet.sendMapFile(). The attempt fails with the error "Too many fetch failures", as expected. But most of the time I also see m_00002_0 failing with the following error:
{noformat}
Map output lost, rescheduling: error on sending map attempt_201005051443_0003_m_000002_0 to reduce 1
org.mortbay.jetty.EofException
	at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:787)
	at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:566)
	at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:946)
	at java.io.DataOutputStream.flush(DataOutputStream.java:106)
	at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.sendMapFile(TaskTracker.java:3646)
	at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3517)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
	at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:766)
	at ..
{noformat}
Jothi/Chris, do you think this is an acceptable failure? I think we should catch this case, not treat it as an inputException, and retry.
> Shuffle connection logic needs correction
> ------------------------------------------
>
>                 Key: MAPREDUCE-1276
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1276
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.21.0
>            Reporter: Jothi Padmanabhan
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: patch-1276-1.txt, patch-1276.txt
>
>
> While looking at the code with Amareshwari, we realized that
> {{Fetcher#copyFromHost}} marks the connection as successful when
> {{url.openConnection}} returns. This is wrong. The connection is made
> implicitly inside {{getInputStream}}; we need to split {{getInputStream}}
> into {{connect}} and {{getInputStream}} to handle the connection and read
> timeout strategies correctly.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
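
For illustration only, here is a minimal sketch of the split proposed in the issue description, using the standard java.net API. It is not the attached patch: the class ShuffleConnectSketch, the FetchFailure exception, the Phase enum, and the openMapOutput method are hypothetical names introduced for this example. The point is that url.openConnection() does no network I/O, so calling connect() explicitly lets a connection failure be told apart from a failure while waiting for the response.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical sketch; these names are not taken from the Hadoop source or the patch.
class ShuffleConnectSketch {

    /** Which step of the fetch failed. */
    enum Phase { CONNECT, READ }

    /** Wraps the underlying IOException and records the failing phase. */
    static class FetchFailure extends IOException {
        final Phase phase;

        FetchFailure(Phase phase, IOException cause) {
            super(phase + " failed: " + cause, cause);
            this.phase = phase;
        }
    }

    /**
     * Opens the stream in two explicit steps so that each step can have
     * its own timeout and its own failure handling.
     */
    static InputStream openMapOutput(URL url, int connectTimeoutMs, int readTimeoutMs)
            throws FetchFailure {
        HttpURLConnection conn;
        try {
            // openConnection() only creates the object; nothing is sent yet.
            conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(connectTimeoutMs);
            conn.setReadTimeout(readTimeoutMs);
            // Step 1: establish the connection explicitly.
            conn.connect();
        } catch (IOException e) {
            throw new FetchFailure(Phase.CONNECT, e);
        }
        try {
            // Step 2: send the request and wait for the response.
            return conn.getInputStream();
        } catch (IOException e) {
            conn.disconnect();
            throw new FetchFailure(Phase.READ, e);
        }
    }
}
```

A caller could then apply a different retry or penalty policy depending on FetchFailure.phase; which policy is appropriate for the shuffle is left open here.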