[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-08 Thread hemanthboyina (Jira)


[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128243#comment-17128243 ]

hemanthboyina commented on HDFS-15390:
--

[~seanlook] you can click the More option under the Jira title; from the More menu, select Move.

> client fails forever when namenode ipaddr changed
> -
>
> Key: HDFS-15390
> URL: https://issues.apache.org/jira/browse/HDFS-15390
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.0, 2.9.2, 3.2.1
>Reporter: Sean Chow
>Priority: Major
> Attachments: HDFS-15390.01.patch
>
>
> For a machine replacement, I replaced my standby namenode with a new ipaddr and 
> kept the same hostname, and also updated the clients' hosts file so the name 
> resolves correctly.
> When I run a failover to transition to the new namenode (let's say nn2), the 
> client fails to read or write forever until it is restarted.
> That leaves the YARN nodemanagers in a sick state; even new tasks encounter 
> this exception, until all nodemanagers are restarted.
>  
> {code:java}
> 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
> 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to nn2-192-168-1-100/192.168.1.200:9000: Connection refused
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1440)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1401)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
>         at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> {code}
>  
> We can see the client logs {{Address change detected}}, but it still fails. I 
> found out that's because when {{updateAddress()}} returns true, 
> {{handleConnectionFailure()}} throws an exception that breaks the next retry 
> with the right ipaddr.
> Client.java: setupConnection()
> {code:java}
> } catch (ConnectTimeoutException toe) {
>   /* Check for an address change and update the local reference.
>    * Reset the failure counter if the address was changed
>    */
>   if (updateAddress()) {
>     timeoutFailures = ioFailures = 0;
>   }
>   handleConnectionTimeout(timeoutFailures++,
>       maxRetriesOnSocketTimeouts, toe);
> } catch (IOException ie) {
>   if (updateAddress()) {
>     timeoutFailures = ioFailures = 0;
>   }
>   // Because the namenode ip changed in updateAddress(), the old namenode
>   // ipaddress cannot be accessed now. handleConnectionFailure() will throw an
>   // exception, so the next retry never has a chance to use the right server
>   // updated in updateAddress().
>   handleConnectionFailure(ioFailures++, ie);
> }
> {code}
>  






[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-08 Thread Sean Chow (Jira)


[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128236#comment-17128236 ]

Sean Chow commented on HDFS-15390:
--

In fact, I've done some debugging, and the exception is thrown as below:
{code:java}
private void handleConnectionFailure(int curRetries, IOException ioe)
    throws IOException {
  closeConnection();

  final RetryAction action;
  try {
    action = connectionRetryPolicy.shouldRetry(ioe, curRetries, 0, true);
  } catch (Exception e) {
    throw e instanceof IOException ? (IOException) e : new IOException(e);
  }
  if (action.action == RetryAction.RetryDecision.FAIL) {
    if (action.reason != null) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Failed to connect to server: " + server + ": "
            + action.reason, ioe);
      }
    }
    // HERE is where the IOException is thrown
    throw ioe;
  }
...{code}
But the strange thing is that the {{Failed to connect to server}} debug log is not logged.

 

We use Hadoop version hadoop-2.6.0-cdh5.4.11, and I've tested with the trunk 
version as well: same issue.

Though it only affects the client side, I am moving this ticket to hadoop-common.

[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-07 Thread Xiaoqiao He (Jira)


[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127894#comment-17127894 ]

Xiaoqiao He commented on HDFS-15390:


[~seanlook], it is a great catch here. Some nits,
a. Would this ticket be more appropriate under the Common project?
b. The stack trace does not seem to be based on branch trunk. I am curious what 
exception `handleConnectionFailure` throws; would you like to offer some stack 
logs? Thanks.







[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-06 Thread Sean Chow (Jira)


[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127515#comment-17127515 ]

Sean Chow commented on HDFS-15390:
--

Hi [~ayushtkn], I've tried writing a unit test for this, but it's not easy :(

Emulating a namenode ipaddr change requires setting up a third namenode to 
connect to.

HDFS-4404 is a good example, but not for this issue.







[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-05 Thread Ayush Saxena (Jira)


[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126883#comment-17126883 ]

Ayush Saxena commented on HDFS-15390:
-

Can you extend a UT for the issue?







[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-05 Thread Sean Chow (Jira)


[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126837#comment-17126837 ]

Sean Chow commented on HDFS-15390:
--

Patch attached.

Now we can see the exception is ignored when the address is updated, and the 
file is written successfully.
{code:java}
20/06/05 20:54:51 WARN ipc.Client: Address change detected. Old: nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
20/06/05 20:54:51 DEBUG ipc.Client: Failed to connect to server: nn2-192-168-1-100/192.168.1.200:9000: try once and fail.
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
...
20/06/05 20:54:51 DEBUG hdfs.DFSOutputStream: enqueue full packet seqno: ...
20/06/05 20:54:51 DEBUG hdfs.DataStreamer: Queued packet 100076
20/06/05 20:54:51 WARN ipc.Client: Exception when handle ConnectionFailure: Connection refused
{code}
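
For reference, the log above suggests a change of roughly the following shape in the {{setupConnection()}} catch block: when {{updateAddress()}} reports a change, the failure from {{handleConnectionFailure()}} is caught and only logged, so the retry loop gets a chance to reconnect with the refreshed address. This is just my sketch of the idea, reusing the names from the snippet in the issue description; see the attached patch for the real change.
{code:java}
} catch (IOException ie) {
  if (updateAddress()) {
    timeoutFailures = ioFailures = 0;
    try {
      // Run the retry policy as before, but do not let its exception escape:
      // the next loop iteration should retry against the updated address.
      handleConnectionFailure(ioFailures++, ie);
    } catch (IOException e) {
      LOG.warn("Exception when handle ConnectionFailure: " + e.getMessage());
    }
  } else {
    handleConnectionFailure(ioFailures++, ie);
  }
}
{code}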


[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-05 Thread Sean Chow (Jira)


[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126638#comment-17126638 ]

Sean Chow commented on HDFS-15390:
--

There are two ways to fix this:
 # When updateAddress() returns true, do not handle the ConnectionFailure in that 
round (sketched below).
 # When an address change is detected, update the namenode proxies (only with 
{{ConfiguredFailoverProxyProvider}}).

Method one is easy, and within this connection's lifecycle the client will use the 
right {{server}} to connect. But when the client connection is closed and a new one 
is created, it will still try to getConnection with the retired ipaddr, because the 
namenode proxies are still the old ones.

Method two solves the root cause: every time the client fails over between 
namenodes, check whether the ipaddr has changed, and if so, re-initialize the 
namenode failover proxies.
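
As a rough sketch of method one only (illustrative; it reuses the names from the {{setupConnection()}} snippet in the description and is not a final patch):
{code:java}
} catch (IOException ie) {
  if (updateAddress()) {
    // The namenode hostname now resolves to a new ip: reset the failure
    // counters and skip the failure handling for this round, so the retry
    // loop reconnects using the updated address.
    timeoutFailures = ioFailures = 0;
  } else {
    handleConnectionFailure(ioFailures++, ie);
  }
}
{code}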



