[GitHub] [hadoop] functioner commented on pull request #2737: HDFS-15869. Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang

GitBox Mon, 19 Apr 2021 09:09:47 -0700


functioner commented on pull request #2737:
URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822591028



   > In that case, please change the title of the Jira and the description to 
remove references to "hanging" problems.
   
   @amahussein I would still like to argue about this "hanging" issue.
   
   There have been reports of TCP network I/O operations that hang for more 
than 15 minutes without throwing any exception. 
[ZOOKEEPER-2201](https://issues.apache.org/jira/browse/ZOOKEEPER-2201) is a 
perfect example, and you can find the TCP-level explanation for this hanging 
issue in https://www.usenix.org/conference/srecon16/program/presentation/nadolny
   Similar hanging bugs have also been accepted by the ZooKeeper community, such as:
   - [ZOOKEEPER-3531](https://issues.apache.org/jira/browse/ZOOKEEPER-3531): 
very similar to ZK-2201; the patch is merged
   - [ZOOKEEPER-4074](https://issues.apache.org/jira/browse/ZOOKEEPER-4074): a 
similar network hanging bug I reported; already confirmed by the community; more 
discussion can be found in https://github.com/apache/zookeeper/pull/1582
   
   However, in our scenario 
([HDFS-15869](https://issues.apache.org/jira/browse/HDFS-15869)), a possible 
counterargument is that the `call.sendResponse()` invocation eventually invokes 
`channel.write(buffer)` (line 3611) on a channel in non-blocking mode, so it might 
not be affected by this potential issue.
   
https://github.com/apache/hadoop/blob/3c57512d104e3a92391c9a03ce4005a00267c07f/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java#L3607-L3616
   However, as we point out in 
[HDFS-15869](https://issues.apache.org/jira/browse/HDFS-15869), when the 
payload is huge, line 3611 does not invoke `channel.write(buffer)`; 
instead, it invokes `channelIO(null, channel, buffer)`, which brings us to:
   
https://github.com/apache/hadoop/blob/3c57512d104e3a92391c9a03ce4005a00267c07f/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java#L3646-L3672
   If the payload is split into two batches, the second batch has to wait 
until the first batch is sent out, and the first batch may encounter a high 
packet loss rate and thus slow I/O.
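
   To make the batching concrete, here is a minimal sketch of a chunked write loop of the kind described above. It is not the code from `Server.java`: the class `ChunkedWriteSketch`, the method `chunkedWrite`, and the 8 KB `CHUNK_LIMIT` are names and values assumed for illustration only. It only shows that each slice is handed to the channel after the previous slice has been accepted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

public class ChunkedWriteSketch {
    // Illustrative slice size; an assumption for this sketch.
    static final int CHUNK_LIMIT = 8 * 1024;

    /**
     * Writes buf to ch in slices of at most `limit` bytes.
     * Returns the number of bytes the channel accepted.
     */
    static int chunkedWrite(WritableByteChannel ch, ByteBuffer buf, int limit)
            throws IOException {
        int originalLimit = buf.limit();
        int initialRemaining = buf.remaining();
        while (buf.remaining() > 0) {
            int ioSize = Math.min(buf.remaining(), limit);
            int written;
            try {
                // Expose only the next slice to the channel.
                buf.limit(buf.position() + ioSize);
                written = ch.write(buf);
            } finally {
                // Restore the full view of the payload.
                buf.limit(originalLimit);
            }
            if (written < ioSize) {
                // The channel took less than a full slice; stop here and
                // leave the rest of the payload in the buffer.
                break;
            }
        }
        return initialRemaining - buf.remaining();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory sink stands in for the socket channel.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ByteBuffer payload = ByteBuffer.wrap(new byte[20_000]);
        int n = chunkedWrite(Channels.newChannel(sink), payload, CHUNK_LIMIT);
        System.out.println(n + " bytes written, " + payload.remaining() + " left");
    }
}
```

   In this sketch the later slices are only attempted once the earlier ones have gone through `ch.write`, which is the ordering dependency the argument above relies on.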
   
   Hence, I would say the hanging problem still exists.


