[ https://issues.apache.org/jira/browse/MAPREDUCE-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208077#comment-14208077 ]
Junping Du commented on MAPREDUCE-6157:
---------------------------------------

Marked as duplicate of MAPREDUCE-6156.

> Connect failed in shuffle (due to NM down) could break current retry logic to
> tolerant NM restart.
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6157
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6157
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: MAPREDUCE-6157.patch
>
>
> The connection failure log during NM restart is as follows:
> {noformat}
> 2014-11-12 03:31:20,728 WARN [fetcher#23] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to ip-172-31-37-212.ec2.internal:13562 with 4 map outputs
> java.net.ConnectException: Connection refused
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:579)
>         at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>         at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>         at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>         at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
>         at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>         at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>         at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
>         at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
>         at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
>         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.connect(Fetcher.java:685)
>         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:386)
>         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:292)
>         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
> 2014-11-12 03:31:20,743 INFO [fetcher#22] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=13562/mapOutput?job=job_1415762969065_0001&reduce=3&map=attempt_1415762969065_0001_m_000021_0,attempt_1415762969065_0001_m_000004_0,attempt_1415762969065_0001_m_000018_0,attempt_1415762969065_0001_m_000015_0,attempt_1415762969065_0001_m_000001_0,attempt_1415762969065_0001_m_000009_0,attempt_1415762969065_0001_m_000012_0,attempt_1415762969065_0001_m_000006_0 sent hash and received reply
> {noformat}
> We have some code to handle the retry logic for the connection with a timeout (as below). But if the connection gets refused quickly, we retry only a very limited number of times and also fail quickly.
> {code}
>     while (true) {
>       try {
>         connection.connect();
>         break;
>       } catch (IOException ioe) {
>         // update the total remaining connect-timeout
>         connectionTimeout -= unit;
>         // throw an exception if we have waited for timeout amount of time
>         // note that the updated value of timeout is used here
>         if (connectionTimeout == 0) {
>           throw ioe;
>         }
>         // reset the connect timeout for the last try
>         if (connectionTimeout < unit) {
>           unit = connectionTimeout;
>           // reset the connect time out for the final connect
>           connection.setConnectTimeout(unit);
>         }
>       }
>     }
> {code}
> We should fix this so that the retry can continue until the timeout is reached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
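The quoted loop only retries once per connect-timeout unit that it assumes was spent waiting, so a fast "Connection refused" (no time actually elapsed) exhausts the retry budget almost immediately. A minimal sketch of the direction the issue suggests — retrying against a wall-clock deadline with a back-off sleep, so fast failures still consume real time before giving up. This is a hypothetical illustration, not the actual MAPREDUCE-6156/6157 patch; the names `connectWithRetry` and `ConnectAction` are invented for the example.

```java
import java.io.IOException;

public class RetrySketch {

    /** Action standing in for HttpURLConnection.connect() in this sketch. */
    interface ConnectAction {
        void connect() throws IOException;
    }

    /**
     * Retries action.connect() until it succeeds or timeoutMs of wall-clock
     * time has elapsed, sleeping retryIntervalMs between failed attempts.
     * Returns the number of attempts made; rethrows the last IOException
     * once the deadline is exhausted.
     */
    static int connectWithRetry(long timeoutMs, long retryIntervalMs,
                                ConnectAction action)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                action.connect();
                return attempts;
            } catch (IOException ioe) {
                // Check real elapsed time instead of decrementing a counter,
                // so a fast "Connection refused" does not drain the budget.
                if (System.currentTimeMillis() >= deadline) {
                    throw ioe;
                }
                Thread.sleep(retryIntervalMs); // back off before retrying
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate an NM that refuses connections for the first 3 attempts,
        // e.g. while it is restarting, then accepts.
        final int[] failuresLeft = {3};
        int attempts = connectWithRetry(1000, 10, () -> {
            if (failuresLeft[0]-- > 0) {
                throw new IOException("Connection refused");
            }
        });
        System.out.println("succeeded after " + attempts + " attempts");
    }
}
```

With the counter-based loop from the issue, the three immediate refusals above could already exceed the retry budget; here they merely consume ~30 ms of the 1000 ms deadline.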