functioner commented on pull request #2737:
URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822617097


   > > In that case, please change the title of the Jira and the description to 
remove references to "hanging" problems.
   > 
   > @amahussein I still would like to argue about this "hanging" issue.
   
   Another aspect of the argument is the design of availability and fault 
tolerance. Distributed systems can tolerate such hangs in many scenarios, but 
sometimes a hang is treated as a bug, as in 
[ZOOKEEPER-2201](https://issues.apache.org/jira/browse/ZOOKEEPER-2201).
   
   So an important question is: when is it a bug, and when is it not (i.e., 
it's a feature)?
   
   I've been doing research on fault injection for some time and I have 
submitted multiple bug reports accepted by the open source community (e.g., 
[HADOOP-17552](https://issues.apache.org/jira/browse/HADOOP-17552)). My 
criteria for determining whether it is a bug are:
   1. If we inject a fault in **module X** and it only affects **module X**, 
then it's not a bug.
   2. If we inject a fault in **module X** and it affects not only **module X** 
but also **module Y**, which should be unrelated to **module X**, then it is 
probably a bug, because in the system design each module should be responsible 
for itself and report the problem (e.g., by logging), rather than affecting 
an unrelated module.
   
   In our scenario 
([HDFS-15869](https://issues.apache.org/jira/browse/HDFS-15869)), this possible 
hang (_if you agree with my argument about network hanging_) can block the 
`FSEditLogAsync` thread, because `call.sendResponse()` is now invoked by the 
`FSEditLogAsync` thread.
   
   In other words, `call.sendResponse()` (network service) affects 
`FSEditLogAsync` (edit log sync service), so I would say it's a bug.
   
   The network service should be responsible for all its behaviors, and handle 
all the possible network issues (e.g., IOException, disconnection, hanging). It 
should determine how to handle them, e.g., by logging the error, rather than 
affecting other services like `FSEditLogAsync`.
   
   I'm not saying that we have to use a complete and slow RPC framework for 
this network service. But IMO, decoupling it from `FSEditLogAsync` by 
delegating to a thread pool is at least a better design.
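   
   To make the decoupling idea concrete, here is a minimal, self-contained 
sketch. It is not the actual Hadoop code or the patch in this PR: the `Call` 
interface, `DecoupledResponder`, and `respondAsync` are hypothetical names, and 
the hang is simulated with a latch. It only shows that the calling thread 
(standing in for the `FSEditLogAsync` thread) keeps making progress while one 
response is stuck, and that a failed response is logged by the responder side.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class DecoupledResponder {
    /** Hypothetical stand-in for an RPC call; not Hadoop's actual Call class. */
    interface Call {
        void sendResponse() throws Exception;
    }

    private final ExecutorService pool;
    private final AtomicInteger failures = new AtomicInteger();

    DecoupledResponder(int threads) {
        pool = Executors.newFixedThreadPool(threads);
    }

    /** Invoked by the sync thread: hands the response to the pool and returns at once. */
    void respondAsync(Call call) {
        pool.execute(() -> {
            try {
                call.sendResponse();
            } catch (Exception e) {
                // The network side handles its own failure by counting and logging it.
                failures.incrementAndGet();
                System.err.println("sendResponse failed: " + e);
            }
        });
    }

    int failureCount() {
        return failures.get();
    }

    void close() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }

    /** Returns true if a later response completed while an earlier one was hung. */
    static boolean demo() throws InterruptedException {
        DecoupledResponder responder = new DecoupledResponder(2);
        CountDownLatch hangStarted = new CountDownLatch(1);
        CountDownLatch releaseHang = new CountDownLatch(1);
        CountDownLatch normalSent = new CountDownLatch(1);

        // Simulated hang: this response blocks until releaseHang is counted down.
        responder.respondAsync(() -> { hangStarted.countDown(); releaseHang.await(); });
        hangStarted.await();

        // The calling ("sync") thread is not blocked and keeps submitting work.
        responder.respondAsync(() -> { throw new IOException("simulated disconnect"); });
        responder.respondAsync(normalSent::countDown);

        boolean progressed = normalSent.await(5, TimeUnit.SECONDS);
        int failed = responder.failureCount();

        releaseHang.countDown(); // end the simulated hang so the pool can shut down
        responder.close();
        return progressed && failed == 1;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("sync thread stayed live: " + demo());
    }
}
```

   Note that in this sketch a hung response still occupies a pool thread, so a 
bounded pool can eventually fill up if hangs accumulate; a real fix would 
likely also need a timeout or a bound on pending responses.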


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
