functioner commented on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822591028
> In that case, please change the title of the Jira and the description to remove references to "hanging" problems.

@amahussein I would still like to argue for this "hanging" issue. There have been reports of TCP network I/O hanging for more than 15 minutes without throwing any exception. [ZOOKEEPER-2201](https://issues.apache.org/jira/browse/ZOOKEEPER-2201) is a perfect example, and you can find the TCP-level explanation for this hanging issue in https://www.usenix.org/conference/srecon16/program/presentation/nadolny

Similar hanging bugs have also been accepted by the ZooKeeper community, such as:

- [ZOOKEEPER-3531](https://issues.apache.org/jira/browse/ZOOKEEPER-3531): very similar to ZK-2201; the patch is merged
- [ZOOKEEPER-4074](https://issues.apache.org/jira/browse/ZOOKEEPER-4074): a similar network hanging bug I reported; already confirmed by the community; more discussion can be found in https://github.com/apache/zookeeper/pull/1582

However, in our scenario ([HDFS-15869](https://issues.apache.org/jira/browse/HDFS-15869)), a possible counterargument is that the `call.sendResponse()` invocation eventually invokes `channel.write(buffer)` (line 3611), which operates in non-blocking mode, so it might not be affected by this potential issue.

https://github.com/apache/hadoop/blob/3c57512d104e3a92391c9a03ce4005a00267c07f/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java#L3607-L3616

However, as we point out in [HDFS-15869](https://issues.apache.org/jira/browse/HDFS-15869), when the payload is huge, line 3611 does not invoke `channel.write(buffer)`; instead, it invokes `channelIO(null, channel, buffer)`, which brings us to:

https://github.com/apache/hadoop/blob/3c57512d104e3a92391c9a03ce4005a00267c07f/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java#L3646-L3672

If the payload is split into two batches, the second batch has to wait for the first batch to be sent out, and that send may encounter a high packet loss rate and thus slow I/O. Hence, I would say the hanging problem still exists.
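
To make the two write paths concrete, here is a minimal sketch of the pattern described above: a direct `channel.write(buffer)` for small payloads and a chunk-by-chunk loop for large ones. It is not the code from `Server.java`; the class name `ChunkedWriteSketch`, the method `writeInChunks`, and the `chunkLimit` parameter are hypothetical names chosen for illustration, and the in-memory channel in `main` only stands in for a socket. It shows the sequencing only (chunk N+1 is not attempted until the write of chunk N returns); it does not reproduce a hang.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

class ChunkedWriteSketch {

    /** Small payloads take the direct path; larger ones are written chunk by chunk. */
    static int channelWrite(WritableByteChannel channel, ByteBuffer buffer, int chunkLimit)
            throws IOException {
        return (buffer.remaining() <= chunkLimit)
                ? channel.write(buffer)
                : writeInChunks(channel, buffer, chunkLimit);
    }

    /** Writes at most chunkLimit bytes per call; returns the total bytes written. */
    static int writeInChunks(WritableByteChannel channel, ByteBuffer buffer, int chunkLimit)
            throws IOException {
        int originalLimit = buffer.limit();
        int initialRemaining = buffer.remaining();
        while (buffer.remaining() > 0) {
            int ioSize = Math.min(buffer.remaining(), chunkLimit);
            int written;
            try {
                // Temporarily shrink the limit so only one chunk is visible to write().
                buffer.limit(buffer.position() + ioSize);
                written = channel.write(buffer);
            } finally {
                // Always restore the limit so the rest of the payload stays reachable.
                buffer.limit(originalLimit);
            }
            if (written < ioSize) {
                break; // Partial write: the channel cannot take more right now.
            }
        }
        return initialRemaining - buffer.remaining();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: an in-memory sink replaces the real socket channel.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        WritableByteChannel channel = Channels.newChannel(sink);
        ByteBuffer payload = ByteBuffer.wrap(new byte[20]);
        int n = channelWrite(channel, payload, 8); // 20 > 8, so the chunked path is used
        System.out.println(n + " bytes written, " + payload.remaining() + " remaining");
    }
}
```

With a 20-byte payload and a hypothetical limit of 8, the loop issues three writes (8, 8, and 4 bytes), each starting only after the previous one returns. Whether any single write can stall depends on the channel's blocking mode and the state of the connection, which is the point under discussion in this thread.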
