graysonzeng opened a new pull request, #4171:
URL: https://github.com/apache/bookkeeper/pull/4171

   Master Issue: 
   https://github.com/apache/pulsar/issues/21860
   
   ### Motivation
   
   When bookies go through a rolling restart, a race condition can put a
   Pulsar topic into the fenced state, and the topic never recovers.
   
   A heap dump of the broker shows that `pendingWriteOps` is 161, which is
   why the topic cannot recover from the fenced state.
   
   
![image](https://github.com/apache/bookkeeper/assets/21187269/7076d933-c45e-4676-ad51-69a9463c636b)
   
   
![image](https://github.com/apache/bookkeeper/assets/21187269/0b13907a-8b2f-4f04-ba6d-39a091d65311)
   
   The topic is unfenced only when `pendingWriteOps` drops to 0. See the
   unfence condition:
   
   ```java
       private void decrementPendingWriteOpsAndCheck() {
           long pending = pendingWriteOps.decrementAndGet();
          // unfenced  condition
           if (pending == 0 && isFenced && !isClosingOrDeleting) {
               synchronized (this) {
                   if (isFenced && !isClosingOrDeleting) {
                       messageDeduplication.resetHighestSequenceIdPushed();
                       log.info("[{}] Un-fencing topic...", topic);
                       // signal to managed ledger that we are ready to resume
                       // by creating a new ledger
                       ledger.readyToCreateNewLedger();
                       unfence();
                   }
   
               }
           }
       }
   ```
   
   After further investigation, we found the root cause:
   `sendAddSuccessCallbacks` can be called by multiple threads at the same
   time. One call comes from `unsetSuccessAndSendWriteRequest`, which runs on
   the BookKeeperClientWorker-OrderedExecutor thread, and the other comes from
   `writeComplete` on a pulsar-io thread. We should prevent
   `sendAddSuccessCallbacks` from being entered again before the call already
   in progress completes.
   
   ### Changes
   
   If `sendAddSuccessCallbacks` is already running when another thread tries
   to call it, the second call returns immediately.
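
The early-return idea can be sketched as follows. This is a minimal illustration, not the actual `LedgerHandle` change: the class `CallbackGuardSketch`, its `enqueue`, `sendCallbacks`, and `demo` methods, and the `sending` flag are all hypothetical names, and the use of an `AtomicBoolean` is an assumption about one way to implement such a guard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; names do not come from the BookKeeper code base.
final class CallbackGuardSketch {
    private final Queue<Runnable> pendingCallbacks = new ConcurrentLinkedQueue<>();
    // True while some caller is inside sendCallbacks().
    private final AtomicBoolean sending = new AtomicBoolean(false);

    void enqueue(Runnable callback) {
        pendingCallbacks.add(callback);
    }

    void sendCallbacks() {
        // If a call is already in progress, return directly instead of
        // draining the queue a second time.
        if (!sending.compareAndSet(false, true)) {
            return;
        }
        try {
            Runnable callback;
            while ((callback = pendingCallbacks.poll()) != null) {
                callback.run();
            }
        } finally {
            // Always release the guard, even if a callback throws.
            sending.set(false);
        }
    }

    // Deterministic demo: a callback re-enters sendCallbacks() while the
    // outer call is still draining, and the nested call is rejected.
    static List<String> demo() {
        List<String> order = new ArrayList<>();
        CallbackGuardSketch guard = new CallbackGuardSketch();
        guard.enqueue(() -> {
            order.add("A-start");
            guard.sendCallbacks(); // nested call hits the guard and returns
            order.add("A-end");
        });
        guard.enqueue(() -> order.add("B"));
        guard.sendCallbacks();
        return order;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [A-start, A-end, B]
    }
}
```

Without the guard, the nested call in the demo would run `B` between `A-start` and `A-end`. Note that a plain early return relies on the call already in progress also picking up any work added by the caller that returned; whether that holds depends on the surrounding code.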
   
   ### Rejected Alternatives
   Adding `synchronized` to `sendAddSuccessCallbacks` would hurt performance
   and may lead to deadlock.
   


