[
https://issues.apache.org/jira/browse/MAPREDUCE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arun C Murthy updated MAPREDUCE-2450:
-------------------------------------
Fix Version/s: (was: 0.24.0)
Assignee: Rajesh Balamohan
Target Version/s: 0.23.1
Status: Patch Available (was: Open)
Thanks to Rajesh for the original patch. This helps 0.23 a *lot*; I ran a number
of benchmarks.
> Calls from running tasks to TaskTracker methods sometimes fail and incur a
> 60s timeout
> --------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-2450
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2450
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 0.23.0
> Reporter: Matei Zaharia
> Assignee: Rajesh Balamohan
> Attachments: HADOOP-5380.Y.20.branch.patch, HADOOP-5380.patch,
> HADOOP_5380-Y.0.20.20x.patch, MAPREDUCE-2450.patch, mapreduce-2450.patch
>
>
> I'm seeing some map tasks in my jobs take 1 minute to commit after they
> finish the map computation. On the map side, the output looks like this:
> <code>
> 2009-03-02 21:30:54,384 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot
> initialize JVM Metrics with processName=MAP, sessionId= - already initialized
> 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask:
> numReduceTasks: 800
> 2009-03-02 21:30:54,437 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb =
> 300
> 2009-03-02 21:30:55,493 INFO org.apache.hadoop.mapred.MapTask: data buffer =
> 239075328/298844160
> 2009-03-02 21:30:55,494 INFO org.apache.hadoop.mapred.MapTask: record buffer
> = 786432/983040
> 2009-03-02 21:31:00,381 INFO org.apache.hadoop.mapred.MapTask: Starting flush
> of map output
> 2009-03-02 21:31:07,892 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
> 2009-03-02 21:31:07,951 INFO org.apache.hadoop.mapred.TaskRunner:
> Task:attempt_200903022127_0001_m_003163_0 is done. And is in the process of
> commiting
> 2009-03-02 21:32:07,949 INFO org.apache.hadoop.mapred.TaskRunner:
> Communication exception: java.io.IOException: Call to /127.0.0.1:50311 failed
> on local exception: java.nio.channels.ClosedChannelException
> at org.apache.hadoop.ipc.Client.wrapException(Client.java:765)
> at org.apache.hadoop.ipc.Client.call(Client.java:733)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source)
> at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:525)
> at java.lang.Thread.run(Thread.java:619)
> Caused by: java.nio.channels.ClosedChannelException
> at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:167)
> at java.nio.channels.SelectableChannel.register(SelectableChannel.java:254)
> at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:331)
> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> at java.io.FilterInputStream.read(FilterInputStream.java:116)
> at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> at java.io.DataInputStream.readInt(DataInputStream.java:370)
> at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> 2009-03-02 21:32:07,953 INFO org.apache.hadoop.mapred.TaskRunner: Task
> 'attempt_200903022127_0001_m_003163_0' done.
> </code>
> In the TaskTracker log, it looks like this:
> <code>
> 2009-03-02 21:31:08,110 WARN org.apache.hadoop.ipc.Server: IPC Server
> Responder, call ping(attempt_200903022127_0001_m_003163_0) from
> 127.0.0.1:56884: output error
> 2009-03-02 21:31:08,111 INFO org.apache.hadoop.ipc.Server: IPC Server handler
> 10 on 50311 caught: java.nio.channels.ClosedChannelException
> at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1195)
> at org.apache.hadoop.ipc.Server.access$1900(Server.java:77)
> at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:613)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:677)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:981)
> </code>
> Note that the task actually did commit - it wasn't speculatively executed or
> anything. However, the job couldn't continue until this one task was done.
> Both parties seem to think the channel was closed. How does the channel get
> closed externally? If closing it from outside is unavoidable, maybe the right
> thing to do is to set a much lower timeout, because a one-minute delay can be
> pretty significant for a small job.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira