[
https://issues.apache.org/jira/browse/HADOOP-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022744#comment-13022744
]
Liyin Liang commented on HADOOP-5380:
-------------------------------------
We met the same problem with hadoop 0.19.1 when run terasort. Here is the task
log:
2011-04-19 01:10:33,715 INFO org.apache.hadoop.mapred.TaskRunner: Task
attempt_201104190103_0002_r_000442_0 is allowed to commit now
2011-04-19 01:10:33,723 INFO org.apache.hadoop.mapred.FileOutputCommitter:
Saved output of task 'attempt_201104190103_0002_r_000442_0' to
hdfs://r05b02043.yh.aliyun.com:9000/group/dc/willwu/terasort-out
2011-04-19 01:11:33,725 INFO org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.io.IOException: Call to /127.0.0.1:48950 failed on local
exception: java.nio.channels.ClosedByInterruptException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:736)
at org.apache.hadoop.ipc.Client.call(Client.java:704)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
at org.apache.hadoop.mapred.Task$1.run(Task.java:417)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.nio.channels.ClosedByInterruptException
at
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:341)
at
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:473)
at org.apache.hadoop.ipc.Client.call(Client.java:691)
... 4 more
2011-04-19 01:11:33,728 INFO org.apache.hadoop.mapred.TaskRunner: Task
'attempt_201104190103_0002_r_000442_0' done.
Hope this bug to be fixed ASAP.
> Calls from running tasks to TaskTracker methods sometimes fail and incur a
> 60s timeout
> --------------------------------------------------------------------------------------
>
> Key: HADOOP-5380
> URL: https://issues.apache.org/jira/browse/HADOOP-5380
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 0.20.1
> Reporter: Matei Zaharia
> Attachments: HADOOP-5380.Y.20.branch.patch,
> HADOOP_5380-Y.0.20.20x.patch
>
>
> I'm seeing some map tasks in my jobs take 1 minute to commit after they
> finish the map computation. On the map side, the output looks like this:
> <code>
> 2009-03-02 21:30:54,384 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot
> initialize JVM Metrics with processName=MAP, sessionId= - already initialized
> 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask:
> numReduceTasks: 800
> 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb =
> 300
> 2009-03-02 21:30:55,493 INFO org.apache.hadoop.mapred.MapTask: data buffer =
> 239075328/298844160
> 2009-03-02 21:30:55,494 INFO org.apache.hadoop.mapred.MapTask: record buffer
> = 786432/983040
> 2009-03-02 21:31:00,381 INFO org.apache.hadoop.mapred.MapTask: Starting flush
> of map output
> 2009-03-02 21:31:07,892 INFO org.apache.hadoop.mapred.MapTask: Finished spill > 0
> 2009-03-02 21:31:07,951 INFO org.apache.hadoop.mapred.TaskRunner:
> Task:attempt_200903022127_0001_m_003163_0 is done. And is in the process of
> commiting
> 2009-03-02 21:32:07,949 INFO org.apache.hadoop.mapred.TaskRunner:
> Communication exception: java.io.IOException: Call to /127.0.0.1:50311 failed
> on local exception: java.nio.channels.ClosedChannelException
> at org.apache.hadoop.ipc.Client.wrapException(Client.java:765)
> at org.apache.hadoop.ipc.Client.call(Client.java:733)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source)
> at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:525)
> at java.lang.Thread.run(Thread.java:619)
> Caused by: java.nio.channels.ClosedChannelException
> at
> java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:167)
> at
> java.nio.channels.SelectableChannel.register(SelectableChannel.java:254)
> at
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:331)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> at java.io.FilterInputStream.read(FilterInputStream.java:116)
> at
> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> at java.io.DataInputStream.readInt(DataInputStream.java:370)
> at
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> 2009-03-02 21:32:07,953 INFO org.apache.hadoop.mapred.TaskRunner: Task
> 'attempt_200903022127_0001_m_003163_0' done.
> </code>
> In the TaskTracker log, it looks like this:
> <code>
> 2009-03-02 21:31:08,110 WARN org.apache.hadoop.ipc.Server: IPC Server
> Responder, call ping(attempt_200903022127_0001_m_003163_0) from
> 127.0.0.1:56884: output error
> 2009-03-02 21:31:08,111 INFO org.apache.hadoop.ipc.Server: IPC Server handler
> 10 on 50311 caught: java.nio.channels.ClosedChannelException
> at
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) at
> org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195)
> at org.apache.hadoop.ipc.Server.access$1900(Server.java:77)
> at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981)
> </code>
> Note that the task actually seemed to commit - it didn't get speculatively
> executed or anything. However, the job wasn't able to continue until this one
> task was done. Both parties seem to think the channel was closed. How does
> the channel get closed externally? If closing it from outside is unavoidable,
> maybe the right thing to do is to set a much lower timeout, because 1 minute
> delay can be pretty significant for a small job.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira