[ https://issues.apache.org/jira/browse/MAPREDUCE-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208077#comment-14208077 ]

Junping Du commented on MAPREDUCE-6157:
---------------------------------------

Marked as duplicate of MAPREDUCE-6156.

> Connect failed in shuffle (due to NM down) could break current retry logic to
> tolerate NM restart.
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6157
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6157
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: MAPREDUCE-6157.patch
>
>
> The connection failure log during NM restart is as following:
> {noformat}
> 2014-11-12 03:31:20,728 WARN [fetcher#23] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to ip-172-31-37-212.ec2.internal:13562 with 4 map outputs
> java.net.ConnectException: Connection refused
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:579)
>         at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>         at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>         at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>         at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
>         at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>         at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>         at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
>         at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
>         at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
>         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.connect(Fetcher.java:685)
>         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:386)
>         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:292)
>         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
> 2014-11-12 03:31:20,743 INFO [fetcher#22] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=13562/mapOutput?job=job_1415762969065_0001&reduce=3&map=attempt_1415762969065_0001_m_000021_0,attempt_1415762969065_0001_m_000004_0,attempt_1415762969065_0001_m_000018_0,attempt_1415762969065_0001_m_000015_0,attempt_1415762969065_0001_m_000001_0,attempt_1415762969065_0001_m_000009_0,attempt_1415762969065_0001_m_000012_0,attempt_1415762969065_0001_m_000006_0 sent hash and received reply
> {noformat}
> We have some code that handles retrying the connection within a timeout (shown
> below). But if the connection is refused immediately, each failed attempt still
> deducts a full unit from the remaining timeout even though almost no real time
> has passed, so we make only a handful of quick attempts and fail long before
> the timeout window is over.
> {code}
> while (true) {
>   try {
>     connection.connect();
>     break;
>   } catch (IOException ioe) {
>     // update the total remaining connect-timeout
>     connectionTimeout -= unit;
>     // throw an exception if we have waited for timeout amount of time
>     // note that the updated value of timeout is used here
>     if (connectionTimeout == 0) {
>       throw ioe;
>     }
>     // reset the connect timeout for the last try
>     if (connectionTimeout < unit) {
>       unit = connectionTimeout;
>       // reset the connect timeout for the final connect
>       connection.setConnectTimeout(unit);
>     }
>   }
> }
> {code}
> We should fix this so that retries continue until the full timeout window has elapsed.
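
For reference, here is a minimal sketch of the general idea: track a wall-clock deadline and sleep between attempts, so an immediate "Connection refused" cannot exhaust the budget within a few milliseconds. This is not the attached MAPREDUCE-6157.patch and not Hadoop's actual Fetcher API; the class name, the retryInterval parameter, and the bare HttpURLConnection usage are illustrative assumptions only.

{code}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConnectWithRetrySketch {

  // Keep retrying until connectionTimeout ms of real time have elapsed,
  // waiting retryInterval ms between failed attempts.
  static HttpURLConnection connectWithRetry(URL url, int connectionTimeout,
      int retryInterval) throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + connectionTimeout;
    while (true) {
      HttpURLConnection connection = (HttpURLConnection) url.openConnection();
      // Cap each attempt's connect timeout at the time actually remaining.
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        throw new IOException("Could not connect to " + url + " within "
            + connectionTimeout + " ms");
      }
      connection.setConnectTimeout((int) Math.min(remaining, Integer.MAX_VALUE));
      try {
        connection.connect();
        return connection;
      } catch (IOException ioe) {
        // An immediate "Connection refused" burns almost no wall-clock time,
        // so wait before retrying instead of deducting a fixed unit.
        if (System.currentTimeMillis() + retryInterval >= deadline) {
          throw ioe;
        }
        Thread.sleep(retryInterval);
      }
    }
  }
}
{code}

With this shape the reducer keeps probing the shuffle port for the whole configured window, which is what gives a restarting NM time to come back.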



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
