[GitHub] [hadoop] daryn-sharp commented on a change in pull request #2878: HDFS-15957. The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang

GitBox Fri, 16 Apr 2021 07:05:58 -0700


daryn-sharp commented on a change in pull request #2878:
URL: https://github.com/apache/hadoop/pull/2878#discussion_r614865039




##########
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
##########
@@ -378,13 +381,18 @@ public void logSyncWait() {
 
     @Override
     public void logSyncNotify(RuntimeException syncEx) {
-      try {
-        if (syncEx == null) {
-          call.sendResponse();
-        } else {
-          call.abortResponse(syncEx);
+      for (int retries = 0; retries <= RESPONSE_SEND_RETRIES; retries++) {

Review comment:
       > For example, for transient connection issues, retry would help ... 
it's not like retry won't help with connection issue, it's just a matter of 
whether our fix implements the retry correctly or not, i.e., whether we should 
re-create a connection object or not.
   
   There is no scenario in which an IOE is recoverable except in an artificial 
fault injection condition.  There is no  "transient" failure writing to a 
socket –  it's game over.  An IOE bubbling up to this point means the 
underlying low-level ipc code has already closed the connection and simply 
re-thrown the error to inform the caller.
   
   Re-creating a connection object is fundamentally impossible.  Ignoring that 
impossibility, even if a _speculated_ future bug left the connection open it's 
in an unknown/inconsistent state with possible partial data written so writing 
anything more is a corrupted or duplicate response for the client.
   
   > Therefore, I think it's still worth adding the retry logic here, although 
it might not be able to handle the two scenarios you describe here. Do you 
agree?
   
   I 100% disagree with a broken solution in search of a non-existent problem.  
_Any retry for any reason is inherently a bug_.
   
   Which brings us back to:
   
   > we are doing fault injection testing, and we inject an IOException in 
call.sendResponse(), and then we observe the symptom that the client gets 
stuck. In this scenario, retry can help.
   
   This is a perfect illustration of the danger of using fault injection to 
create an artificial bug, and then "solving" that artificial bug _with a real 
bug_.
   
   -1 on this patch.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [hadoop] daryn-sharp commented on a change in pull request #2878: HDFS-15957. The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang

Reply via email to