M1eyu2018 opened a new issue, #4261:
URL: https://github.com/apache/bookkeeper/issues/4261

   **BUG REPORT**
   When a client adds entries synchronously to an opened ledger and a bookie 
crashes, the client may get stuck.
   
   ***Describe the bug***
   
   When a client adds entries synchronously to an opened ledger and a bookie 
crashes, the ensemble change for the crashed bookie may be triggered twice.
   The first ensemble change is caused by the third response, which fails with 
'Bookie handle was not available'.
   A moment later, the second ensemble change is caused by the third response, 
which this time fails with 'Bookie operation timeout'.
   Because the same crashed bookie is handled twice, the second ensemble change 
replaces no bookie, so unsetSuccessAndSendWriteRequest is not called. As a 
result, the success callback for the entry currently being added is never sent 
and the client gets stuck.
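
   The condition described above can be sketched with a small stand-alone model. 
This is not BookKeeper source code: the class EnsembleChangeModel and its 
methods are hypothetical names chosen for illustration, and the rule "the 
pending add is released only if at least one bookie was replaced" is an 
assumption taken from this report, not from the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simplified model of the reported behaviour; NOT BookKeeper code.
// All names in this class are hypothetical.
class EnsembleChangeModel {

    // Replaces every ensemble slot that currently holds failedBookie with
    // replacement and returns the indexes that were actually replaced.
    static Set<Integer> replaceFailedBookie(List<String> ensemble,
                                            String failedBookie,
                                            String replacement) {
        Set<Integer> replaced = new TreeSet<>();
        for (int i = 0; i < ensemble.size(); i++) {
            if (ensemble.get(i).equals(failedBookie)) {
                ensemble.set(i, replacement);
                replaced.add(i);
            }
        }
        return replaced;
    }

    // Assumption from the report: the pending add is re-sent and its
    // callback released only when at least one bookie was replaced.
    static boolean pendingAddReleased(Set<Integer> replaced) {
        return !replaced.isEmpty();
    }

    public static void main(String[] args) {
        List<String> ensemble = new ArrayList<>(List.of("A", "B", "C"));

        // First ensemble change: A is still in the ensemble, so it is replaced.
        Set<Integer> first = replaceFailedBookie(ensemble, "A", "D");
        System.out.println(ensemble + " released=" + pendingAddReleased(first));

        // Second ensemble change for the same bookie: A is already gone,
        // nothing is replaced, and the pending add is never released.
        Set<Integer> second = replaceFailedBookie(ensemble, "A", "E");
        System.out.println(ensemble + " released=" + pendingAddReleased(second));
    }
}
```

   In this model the first call releases the pending add and the second does 
not, which corresponds to the stuck client described in this report.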
   
   ***Example***
   In this example, a client adds 81920 entries to a ledger of 10M with a 3-3-2 
policy (ensemble size 3, write quorum 3, ack quorum 2), and the ensemble is 
(A,B,C).
   1、At the beginning, entry#0-#6773 are written normally.
   2、While entry#6774 is being added, bookie A crashes for some reason, such 
as a power outage or running 'kill -9' on the bookie A process id.
   3、However, two successful responses are still received, so this does not 
affect the ability to continue adding entry#6774-#11604.
   4、Before entry#11605 is added, the third responses for entry#6774-#11604 
come back one after another. As the failed response is 'Bookie handle was not 
available', the failed bookie A is put into delayedWriteFailedBookies.
   5、When entry#11605 is added, maybeHandleDelayedWriteBookieFailure is 
called. As delayedWriteFailedBookies is not empty, an ensemble change begins.
   6、After two successful responses for entry#11605 are received, 
sendAddSuccessCallbacks is called. However, pendingAddOp.submitCallback is not 
called until the ensemble change finishes.
   7、When the ensemble change finishes, bookie A has been replaced by bookie D. 
The success callback for entry#11605 is then sent and adding entries continues.
   
   So far, the logic is correct. The problem arises in the following steps.
   
   8、entry#11606-#42623 are written normally to (D,B,C) after the ensemble 
change.
   9、Before entry#42624 is added, the third responses for entry#6774-#11604 
that had not yet come back continue to come back one after another. This time, 
the failed response is 'Bookie operation timeout', and the failed bookie A is 
put into delayedWriteFailedBookies again.
   10、When entry#42624 is added, maybeHandleDelayedWriteBookieFailure is 
called. As delayedWriteFailedBookies is not empty, an ensemble change begins 
again.
   11、After three successful responses for entry#42624 from (D,B,C) are 
received, sendAddSuccessCallbacks is called. However, 
pendingAddOp.submitCallback is not called until the ensemble change finishes.
   12、This time, the failed bookie A needs to be replaced again, but the 
ensemble is already (D,B,C), so no bookie is replaced. The success callback for 
entry#42624 can't be sent because unsetSuccessAndSendWriteRequest is not called.
   13、Because the entries are added synchronously, the client gets stuck.
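
   Steps 4 to 13 can be replayed with a toy timeline. This is only a sketch 
under stated assumptions, not BookKeeper code: the class StuckAddTimeline, its 
methods, and the replacement bookie "E" are hypothetical, the map merely 
imitates delayedWriteFailedBookies, and the synchronous add uses a bounded wait 
so that the demo terminates instead of hanging.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Toy model of steps 4-13; hypothetical names, NOT BookKeeper code.
class StuckAddTimeline {
    final List<String> ensemble = new ArrayList<>(List.of("A", "B", "C"));
    // ensemble index -> bookie reported as failed (imitates delayedWriteFailedBookies)
    final Map<Integer, String> delayedFailed = new HashMap<>();

    // A late third response reports the bookie at the given index as failed.
    void reportDelayedFailure(int index, String bookie) {
        delayedFailed.put(index, bookie);
    }

    // Models one add. If failures are queued, an ensemble change runs first
    // and the callback is completed only if some bookie was replaced.
    CompletableFuture<Long> add(long entryId, String replacement) {
        CompletableFuture<Long> callback = new CompletableFuture<>();
        if (delayedFailed.isEmpty()) {
            callback.complete(entryId);
            return callback;
        }
        boolean replacedAny = false;
        for (Map.Entry<Integer, String> e : delayedFailed.entrySet()) {
            if (ensemble.get(e.getKey()).equals(e.getValue())) {
                ensemble.set(e.getKey(), replacement);
                replacedAny = true;
            }
        }
        delayedFailed.clear();
        if (replacedAny) {
            callback.complete(entryId);
        }
        // Otherwise nothing was replaced and the callback is never completed.
        return callback;
    }

    // Synchronous add with a bounded wait; returns false if it would hang.
    static boolean addSync(CompletableFuture<Long> callback, long waitMillis) {
        try {
            callback.get(waitMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        StuckAddTimeline t = new StuckAddTimeline();
        t.reportDelayedFailure(0, "A");   // step 4: 'Bookie handle was not available'
        System.out.println("entry 11605 completed: " + addSync(t.add(11605L, "D"), 200));
        t.reportDelayedFailure(0, "A");   // step 9: 'Bookie operation timeout'
        System.out.println("entry 42624 completed: " + addSync(t.add(42624L, "E"), 200));
        System.out.println("ensemble: " + t.ensemble);
    }
}
```

   In the model, the first add completes after A is replaced by D, while the 
second add never completes because A is no longer in the ensemble.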
   
   
   ***To Reproduce***
   
   1、Create a BookKeeper client.
   2、Open a ledger.
   3、Add entries synchronously.
   4、Run 'kill -9' on one bookie's process id while entries are being added.
   5、The client may get stuck forever.
   

