functioner commented on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822617097
> > In that case, please change the title of the Jira and the description to remove references to "hanging" problems.

@amahussein I would still like to argue about this "hanging" issue. Another aspect of the argument is the design of availability and fault tolerance. Distributed systems can actually tolerate such hanging issues in many scenarios, but sometimes a hang is treated as a bug, as in [ZOOKEEPER-2201](https://issues.apache.org/jira/browse/ZOOKEEPER-2201). So an important question is: when is it a bug, and when is it not (i.e., it's a feature)?

I've been doing research on fault injection for some time, and I have submitted multiple bug reports that were accepted by the open source community (e.g., [HADOOP-17552](https://issues.apache.org/jira/browse/HADOOP-17552)). My criteria for determining whether it is a bug are:

1. If we inject a fault in **module X** and it only affects **module X**, then it's not a bug.
2. If we inject a fault in **module X** and it affects not only **module X** but also **module Y**, which should not be related to **module X**, then it is probably a bug, because in the system design each module should be responsible for itself and report the problem (e.g., by logging) rather than affect another, unrelated module.

In our scenario ([HDFS-15869](https://issues.apache.org/jira/browse/HDFS-15869)), this possible hanging (_if you agree with my argument about network hanging_) can block the `FSEditLogAsync` thread, because `call.sendResponse()` is currently invoked by the `FSEditLogAsync` thread. So `call.sendResponse()` (network service) affects `FSEditLogAsync` (edit log sync service), and I would say it's a bug. The network service should be responsible for all of its own behavior and handle all possible network issues (e.g., IOException, disconnection, hanging). It should determine how to handle them, e.g., by logging the error, rather than affecting other services like `FSEditLogAsync`.
I'm not saying that we have to use a complete and slow RPC framework for this network service. But IMO, decoupling it from `FSEditLogAsync` by delegating to a thread pool is at least a better design.
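
To make the decoupling idea concrete, here is a minimal sketch rather than the actual patch. It assumes a hypothetical `Call` interface standing in for the RPC call object, a hypothetical `logSyncNotify` hand-off method, and an arbitrary pool size of 2; none of these are taken from the Hadoop source. The point it shows is only that the thread doing the hand-off returns immediately even when `sendResponse()` hangs, and that the responder side deals with its own failures by logging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ResponseOffloadSketch {

    /** Hypothetical stand-in for an RPC call whose response may block on the network. */
    interface Call {
        void sendResponse() throws Exception;
    }

    // Assumption: a small fixed pool; the real sizing and queueing policy are open questions.
    private final ExecutorService responder = Executors.newFixedThreadPool(2);

    /** Called from the sync thread: enqueue the send and return without waiting for it. */
    void logSyncNotify(Call call) {
        responder.execute(() -> {
            try {
                call.sendResponse();
            } catch (Exception e) {
                // The network side handles its own failures, here simply by logging.
                System.err.println("sendResponse failed: " + e);
            }
        });
    }

    /** Interrupts any responder thread that is still stuck in a send. */
    void shutdown() {
        responder.shutdownNow();
    }

    /** Hands off a call that never completes and reports how long the caller was blocked. */
    static long handOffHangingCallMillis() {
        ResponseOffloadSketch sketch = new ResponseOffloadSketch();
        CountDownLatch never = new CountDownLatch(1); // never counted down: simulates a hang
        long start = System.nanoTime();
        sketch.logSyncNotify(() -> never.await());
        long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        sketch.shutdown();
        return elapsed;
    }

    public static void main(String[] args) {
        System.out.println("caller blocked for ~" + handOffHangingCallMillis() + " ms");
    }
}
```

Two caveats with this sketch: a fixed pool can itself be exhausted if enough sends hang at once, and with more than one responder thread the responses are no longer sent in the order they were handed off. Whether either matters for HDFS-15869 would need to be checked against the real code.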
