[ https://issues.apache.org/jira/browse/MAPREDUCE-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007365#comment-13007365 ]
Todd Lipcon commented on MAPREDUCE-2389:
----------------------------------------
The easiest way to reproduce this is to run a large sleep job on a small cluster.
I've been using sleep -mt 1 -rt 1 -m 10000 -r 10000 on 5-node clusters. Such a
job usually produces 100-200 of these failures.
Exception trace:
Map output lost, rescheduling:
getMapOutput(attempt_201103152313_0001_m_000591_0,437) failed :
java.io.IOException: Error Reading IndexFile
    at org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:113)
    at org.apache.hadoop.mapred.IndexCache.getIndexInformation(IndexCache.java:66)
    at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3488)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:74)
    at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:54)
    at org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:109)
    ... 23 more
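
For context on the "Caused by" frames above: DataInputStream.readLong calls
readFully, which throws EOFException whenever the underlying stream reports
end-of-stream before all 8 bytes have arrived. The sketch below is only a
minimal illustration of that behavior using an in-memory byte array. The class
name ShortIndexReadDemo and the helper readLongs are hypothetical names made up
for this example; it does not use the real SpillRecord code or index file
layout, and it does not reproduce the bug itself.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical demo class, not part of Hadoop.
public class ShortIndexReadDemo {

    /**
     * Tries to read `count` long values from `data`.
     * Returns the number of longs read before the stream ran out
     * (equal to `count` if every read succeeded).
     */
    static int readLongs(byte[] data, int count) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int read = 0;
            try {
                for (int i = 0; i < count; i++) {
                    in.readLong(); // needs 8 bytes; delegates to readFully
                    read++;
                }
            } catch (EOFException e) {
                // Same exception type as in the trace: the stream ended
                // before readFully could fill its 8-byte buffer.
                return read;
            }
            return read;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] complete = new byte[24];  // room for exactly three longs
        byte[] truncated = new byte[20]; // last long is 4 bytes short

        System.out.println(readLongs(complete, 3));  // prints 3
        System.out.println(readLongs(truncated, 3)); // prints 2
    }
}
```

In the demo the data really is short. In the failures above the file on disk
appears to be complete, yet the reader still sees an early end-of-stream.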
> Spurious EOFExceptions reading SpillRecord index files
> ------------------------------------------------------
>
> Key: MAPREDUCE-2389
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2389
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: tasktracker
> Affects Versions: 0.22.0
> Environment: Seen on RHEL 5.5, RHEL 6.0, local dirs on ext3, Java
> 6u20 and 6u24
> Reporter: Todd Lipcon
> Priority: Critical
> Attachments: stap-output.txt
>
>
> In large jobs, I see around 1 shuffle fetch out of every million fetches fail
> with an EOFException reading the SpillRecord index file. After lots of
> investigation, including systemtap, it looks like the read() syscall is
> actually returning a premature "0" result for no reason, so this is likely a
> kernel or filesystem bug which is exacerbated by some workload the TT does.