Calls from running tasks to TaskTracker methods sometimes fail and incur a 60s 
timeout
--------------------------------------------------------------------------------------

                 Key: HADOOP-5380
                 URL: https://issues.apache.org/jira/browse/HADOOP-5380
             Project: Hadoop Core
          Issue Type: Bug
            Reporter: Matei Zaharia


I'm seeing some map tasks in my jobs take 1 minute to commit after they finish 
the map computation. On the map side, the output looks like this:

<code>
2009-03-02 21:30:54,384 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot 
initialize JVM Metrics with processName=MAP, sessionId= - already initialized
2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 
800
2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 300
2009-03-02 21:30:55,493 INFO org.apache.hadoop.mapred.MapTask: data buffer = 
239075328/298844160
2009-03-02 21:30:55,494 INFO org.apache.hadoop.mapred.MapTask: record buffer = 
786432/983040
2009-03-02 21:31:00,381 INFO org.apache.hadoop.mapred.MapTask: Starting flush 
of map output
2009-03-02 21:31:07,892 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2009-03-02 21:31:07,951 INFO org.apache.hadoop.mapred.TaskRunner: 
Task:attempt_200903022127_0001_m_003163_0 is done. And is in the process of 
commiting
2009-03-02 21:32:07,949 INFO org.apache.hadoop.mapred.TaskRunner: Communication 
exception: java.io.IOException: Call to /127.0.0.1:50311 failed on local 
exception: java.nio.channels.ClosedChannelException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:765)
        at org.apache.hadoop.ipc.Client.call(Client.java:733)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source)
        at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:525)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.nio.channels.ClosedChannelException
        at 
java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:167)
        at 
java.nio.channels.SelectableChannel.register(SelectableChannel.java:254)
        at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:331)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at 
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at 
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)

2009-03-02 21:32:07,953 INFO org.apache.hadoop.mapred.TaskRunner: Task 
'attempt_200903022127_0001_m_003163_0' done.
</code>

In the TaskTracker log, it looks like this:

<code>
2009-03-02 21:31:08,110 WARN org.apache.hadoop.ipc.Server: IPC Server 
Responder, call ping(attempt_200903022127_0001_m_003163_0) from 
127.0.0.1:56884: output error
2009-03-02 21:31:08,111 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
10 on 50311 caught: java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)    at 
org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195)
    at org.apache.hadoop.ipc.Server.access$1900(Server.java:77)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981)
</code>

Note that the task actually seemed to commit - it didn't get speculatively 
executed or anything. However, the job wasn't able to continue until this one 
task was done. Both parties seem to think the channel was closed. How does the 
channel get closed externally? If closing it from outside is unavoidable, maybe 
the right thing to do is to set a much lower timeout, because 1 minute delay 
can be pretty significant for a small job.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to