haiyang1987 opened a new pull request, #6266:
URL: https://github.com/apache/hadoop/pull/6266

   ### Description of PR
   https://issues.apache.org/jira/browse/HDFS-17250
   
   **issue:**
   When the NameNode attempts to trigger a log roll and the cachedActiveProxy 
is a "shut down NameNode," it is unable to establish a network connection. This 
results in a timeout during the socket connection phase, which has a set 
timeout of 90 seconds. Since the asynchronous call for "Triggering log roll" 
has a waiting time of 60 seconds, it triggers a timeout and initiates a 
"cancel" operation, causing the executing thread to receive an "Interrupted" 
signal and throwing a "java.io.InterruptedIOException" exception.
   
   Currently, the logic not to handle interrupted signal, and the 
"getActiveNodeProxy" method hasn't reached the maximum retry limit, the overall 
execution process doesn't exit and it continues to attempt to
   call the "rollEditLog" on the next NameNode in the list. However when a 
socket connection is established, it throws a 
"java.nio.channels.ClosedByInterruptException" exception due to the thread 
being in an "Interrupted" state.
   this cycle repeats until it reaches the maximum retry limit (nnCount * 
maxRetries) will exits.
   
   However in the next cycle of "Triggering log roll," it continues to traverse 
the NameNode list and encounters the same issue and the cachedActiveProxy is 
still a "shut down NameNode."
   This eventually results in the NameNode being unable to successfully 
complete the "Triggering log roll" operation.
   
   To optimize this, we need to handle the thread being interrupted and exit 
the execution.
   
   
   **Detailed logs such as:**
   the Observer node "ob1" will execute "Triggering log roll" is as follows:
   nns list is [ob2(shut down node),nn1(active),nn2(standy)]
   
   - The Observer node "ob1" periodically executes "triggerActiveLogRoll" and 
asynchronously calls "getNameNodeProxy" to request "ob2" for the "rollEditLog" 
operation, since the "ob2" machine is shut down, it cannot establish a network 
connection, this results in a timeout during the socket connection phase (here 
set timeout is 90 seconds).
   
   ```
   2023-11-03 10:27:41,734 INFO ha.EditLogTailer 
(EditLogTailer.java:triggerActiveLogRoll(465)) [Edit log tailer] - Triggering 
log roll on remote NameNode
   
   2023-11-03 10:28:41,734 WARN ha.EditLogTailer 
(EditLogTailer.java:triggerActiveLogRoll(478)) [Edit log tailer] - Unable to 
finish rolling edits in 60000 ms
   ```
   
   As the asynchronous call for "Triggering log roll" has a waiting time of 60 
seconds, it triggers a timeout and initiates a "cancel" operation, causing the 
executing thread to receive an "Interrupted" signal and will throw 
"java.io.InterruptedIOException".
   ```
   2023-11-03 10:28:41,753 WARN ipc.Client 
(Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
while trying for connection
   2023-11-03 10:28:41,972 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) 
[pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
[nnId=ob2, ipcAddress=xxx, httpAddress=http://xxx], try next.
   java.io.InterruptedIOException: DestHost:destPort xxx , LocalHost:localPort 
xxx. Failed on local exception: java.io.InterruptedIOException: Interrupted 
while waiting for IO on channel 
java.nio.channels.SocketChannel[connection-pending remote=xxx:8040]. Total 
timeout mills is 90000, 30038 millis timeout left.
   at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown Source)
   at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:906)
   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
   at org.apache.hadoop.ipc.Client.call(Client.java:1511)
   at org.apache.hadoop.ipc.Client.call(Client.java:1402)
   at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
   at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
   at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
   at 
org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152)
   at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:454)
   at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:451)
   at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:615)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.InterruptedIOException: Interrupted while waiting for IO 
on channel java.nio.channels.SocketChannel[connection-pending remote=xxx:8040]. 
Total timeout mills is 90000, 30038 millis timeout left.
   at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351)
   at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
   at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662)
   at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783)
   at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:346)
   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653)
   
   ```
   
   - Since current logic not to handle interrupted signal and hasn't reached 
the maximum retry limit, the overall execution process doesn't exit and it 
continues to attempt to call the "rollEditLog" on the next NameNode in the 
list. However when a socket connection is established, it throws a 
"java.nio.channels.ClosedByInterruptException" exception due to the thread 
being in an "Interrupted" state. this cycle repeats until it reaches 9 retry 
limit (3*3) and then exits.
   
   ```
   2023-11-03 10:28:41,991 WARN ipc.Client 
(Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
while trying for connection
   2023-11-03 10:28:41,991 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) 
[pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
[nnId=nn1, ipcAddress=xxx:8040, httpAddress=http://xxx:50070], try next.
   java.nio.channels.ClosedByInterruptException
   at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:656)
   at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
   at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662)
   at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783)
   at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:346)
   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653)
   at org.apache.hadoop.ipc.Client.call(Client.java:1449)
   at org.apache.hadoop.ipc.Client.call(Client.java:1402)
   at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
   at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
   at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
   at 
org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152)
   at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:454)
   at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:451)
   at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:615)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
   
   2023-11-03 10:28:42,095 WARN ipc.Client 
(Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
while trying for connection
   2023-11-03 10:28:42,095 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) 
[pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
[nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
   java.nio.channels.ClosedByInterruptException
   
   2023-11-03 10:28:42,104 WARN ipc.Client 
(Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
while trying for connection
   2023-11-03 10:28:42,104 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) 
[pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
[nnId=ob2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
   java.nio.channels.ClosedByInterruptException
   
   2023-11-03 10:28:42,118 WARN ipc.Client 
(Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
while trying for connection
   2023-11-03 10:28:42,118 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) 
[pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
[nnId=nn1, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
   
   2023-11-03 10:28:42,119 WARN ipc.Client 
(Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
while trying for connection
   2023-11-03 10:28:42,119 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) 
[pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
[nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
   java.nio.channels.ClosedByInterruptException
   
   2023-11-03 10:28:42,148 WARN ipc.Client 
(Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
while trying for connection
   2023-11-03 10:28:42,148 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) 
[pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
[nnId=ob2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
   java.nio.channels.ClosedByInterruptException
   
   2023-11-03 10:28:42,148 WARN ipc.Client 
(Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
while trying for connection
   2023-11-03 10:28:42,148 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) 
[pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
[nnId=nn1, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
   java.nio.channels.ClosedByInterruptException
   
   2023-11-03 10:28:42,149 WARN ipc.Client 
(Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
while trying for connection
   2023-11-03 10:28:42,149 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) 
[pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
[nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
   java.nio.channels.ClosedByInterruptException
   ```
   
   - In the next cycle of "Triggering log roll," it continues to traverse the 
NameNode list and proceeds to "ob2," causing the same issue to recur. This 
eventually results in the NameNode being unable to successfully complete the 
"Triggering log roll" operation.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to