functioner commented on a change in pull request #2878:
URL: https://github.com/apache/hadoop/pull/2878#discussion_r616213541



##########
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
##########
@@ -378,13 +381,18 @@ public void logSyncWait() {
 
     @Override
     public void logSyncNotify(RuntimeException syncEx) {
-      try {
-        if (syncEx == null) {
-          call.sendResponse();
-        } else {
-          call.abortResponse(syncEx);
+      for (int retries = 0; retries <= RESPONSE_SEND_RETRIES; retries++) {

Review comment:
       @daryn-sharp Thanks for your explanation!
   Now I understand that:
   1. If the response cannot be sent it's either because the connection is 
already closed or there's a bug preventing the encoding of the response.
   2. If a speculated future bug left the connection open it's in an 
unknown/inconsistent state with possible partial data written so writing 
anything more is a corrupted or duplicate response for the client.
   
   And I have more questions:
   1. What kind of messages will be sent in this specific `call.sendResponse()`?
   2. Among these messages, are there critical ones? I mean, maybe some 
messages are important to the protocol and we don't want to lose them. If we 
delegate them to a dedicated RPC thread/service/framework, the 
disconnection/exception can be automatically handled so that those important 
messages can reliably reach the receiver's side.
   
   The `call.sendResponse()` is invoked after the edit log sync finishes, so 
the transaction has been done. However, it swallows the possible exception so 
it doesn't care whether it successfully replies (to datanode/client). Even if 
we have the implication of disconnection in this scenario, I was wondering if 
it's possible that a bug is hidden by this possibly ignored message?
   
   For example, If the receiver (datanode/client) is doing some end-to-end 
retry (e.g., can't get the reply and request the same transaction again), then 
this identical transaction may get rejected because it's already committed in 
the edit log. Probably we should at least add some warning/logging when 
`call.sendResponse()` fails to send the message.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to