[
https://issues.apache.org/jira/browse/MAPREDUCE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arun C Murthy updated MAPREDUCE-2450:
-------------------------------------
Fix Version/s: (was: 0.24.0)
Assignee: Rajesh Balamohan
Target Version/s: 0.23.1
Status: Patch Available (was: Open)
Thanks to Rajesh for the original patch. This helps 0.23 a *lot*; I ran a number
of benchmarks.
> Calls from running tasks to TaskTracker methods sometimes fail and incur a
> 60s timeout
> --------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-2450
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2450
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 0.23.0
> Reporter: Matei Zaharia
> Assignee: Rajesh Balamohan
> Attachments: HADOOP-5380.Y.20.branch.patch, HADOOP-5380.patch,
> HADOOP_5380-Y.0.20.20x.patch, MAPREDUCE-2450.patch, mapreduce-2450.patch
>
>
> I'm seeing some map tasks in my jobs take 1 minute to commit after they
> finish the map computation. On the map side, the output looks like this:
> <code>
> 2009-03-02 21:30:54,384 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot
> initialize JVM Metrics with processName=MAP, sessionId= - already initialized
> 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask:
> numReduceTasks: 800
> 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb =
> 300
> 2009-03-02 21:30:55,493 INFO org.apache.hadoop.mapred.MapTask: data buffer =
> 239075328/298844160
> 2009-03-02 21:30:55,494 INFO org.apache.hadoop.mapred.MapTask: record buffer
> = 786432/983040
> 2009-03-02 21:31:00,381 INFO org.apache.hadoop.mapred.MapTask: Starting flush
> of map output
> 2009-03-02 21:31:07,892 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
> 2009-03-02 21:31:07,951 INFO org.apache.hadoop.mapred.TaskRunner:
> Task:attempt_200903022127_0001_m_003163_0 is done. And is in the process of
> commiting
> 2009-03-02 21:32:07,949 INFO org.apache.hadoop.mapred.TaskRunner:
> Communication exception: java.io.IOException: Call to /127.0.0.1:50311 failed
> on local exception: java.nio.channels.ClosedChannelException
> at org.apache.hadoop.ipc.Client.wrapException(Client.java:765)
> at org.apache.hadoop.ipc.Client.call(Client.java:733)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source)
> at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:525)
> at java.lang.Thread.run(Thread.java:619)
> Caused by: java.nio.channels.ClosedChannelException
> at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:167)
> at java.nio.channels.SelectableChannel.register(SelectableChannel.java:254)
> at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:331)
> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> at java.io.FilterInputStream.read(FilterInputStream.java:116)
> at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> at java.io.DataInputStream.readInt(DataInputStream.java:370)
> at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> 2009-03-02 21:32:07,953 INFO org.apache.hadoop.mapred.TaskRunner: Task
> 'attempt_200903022127_0001_m_003163_0' done.
> </code>
> In the TaskTracker log, it looks like this:
> <code>
> 2009-03-02 21:31:08,110 WARN org.apache.hadoop.ipc.Server: IPC Server
> Responder, call ping(attempt_200903022127_0001_m_003163_0) from
> 127.0.0.1:56884: output error
> 2009-03-02 21:31:08,111 INFO org.apache.hadoop.ipc.Server: IPC Server handler
> 10 on 50311 caught: java.nio.channels.ClosedChannelException
> at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195)
> at org.apache.hadoop.ipc.Server.access$1900(Server.java:77)
> at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981)
> </code>
> Note that the task actually did commit - it wasn't speculatively executed or
> anything. However, the job couldn't continue until this one task was done.
> Both parties seem to think the channel was closed. How does the channel get
> closed externally? If closing it from outside is unavoidable, maybe the right
> thing to do is to set a much lower timeout, because a one-minute delay can be
> pretty significant for a small job.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira