graysonzeng opened a new pull request, #4171: URL: https://github.com/apache/bookkeeper/pull/4171
Master Issue: https://github.com/apache/pulsar/issues/21860

### Motivation

When bookies go through a rolling restart, a race condition can put a Pulsar topic into the fenced state, and the topic never recovers. A heap dump of the broker shows `pendingWriteOps` at 161, which is why the topic cannot leave the fenced state.

The topic is unfenced only when `pendingWriteOps` drops to 0. See the unfence condition:

```java
private void decrementPendingWriteOpsAndCheck() {
    long pending = pendingWriteOps.decrementAndGet();
    // unfence condition
    if (pending == 0 && isFenced && !isClosingOrDeleting) {
        synchronized (this) {
            if (isFenced && !isClosingOrDeleting) {
                messageDeduplication.resetHighestSequenceIdPushed();
                log.info("[{}] Un-fencing topic...", topic);
                // signal to managed ledger that we are ready to resume by creating a new ledger
                ledger.readyToCreateNewLedger();

                unfence();
            }
        }
    }
}
```

After further investigation, we found the root cause: `sendAddSuccessCallbacks` can be called by multiple threads at the same time. One caller is `unsetSuccessAndSendWriteRequest`, running on the BookKeeperClientWorker-OrderedExecutor thread; the other is `writeComplete`, running on a pulsar-io thread. We should prevent `sendAddSuccessCallbacks` from being entered again before the current call completes.

### Changes

If `sendAddSuccessCallbacks` is already running when we try to call it, return immediately.

### Rejected Alternatives

Adding `synchronized` to `sendAddSuccessCallbacks` would hurt performance and may lead to deadlock.
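For illustration, the "return immediately if already running" guard described under Changes could look roughly like the sketch below. This is not the diff from this PR: `CallbackGuardSketch`, `PendingOp`, `sendingCallbacks`, and `delivered` are hypothetical names, and the queue is a simplified stand-in for the ledger handle's pending add operations. The re-check after releasing the flag is an assumption added in this sketch, so that an entry completed just as the draining thread exits is not left stranded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

class CallbackGuardSketch {
    // Hypothetical stand-in for a pending add operation.
    static final class PendingOp {
        final long entryId;
        volatile boolean completed;

        PendingOp(long entryId) {
            this.entryId = entryId;
        }
    }

    private final Queue<PendingOp> pendingOps = new ConcurrentLinkedQueue<>();
    // True while some thread is inside the drain loop.
    private final AtomicBoolean sendingCallbacks = new AtomicBoolean(false);
    // Entry ids in callback order; only mutated while the guard is held.
    final List<Long> delivered = new ArrayList<>();

    void add(PendingOp op) {
        pendingOps.add(op);
    }

    void sendAddSuccessCallbacks() {
        do {
            // Non-blocking guard: if another thread is already draining, return directly.
            if (!sendingCallbacks.compareAndSet(false, true)) {
                return;
            }
            try {
                PendingOp head;
                // Callbacks fire strictly in order: stop at the first incomplete op.
                while ((head = pendingOps.peek()) != null && head.completed) {
                    pendingOps.remove();
                    delivered.add(head.entryId);
                }
            } finally {
                sendingCallbacks.set(false);
            }
            // Re-check: an op may have completed while we were releasing the guard.
        } while (headIsCompleted());
    }

    private boolean headIsCompleted() {
        PendingOp head = pendingOps.peek();
        return head != null && head.completed;
    }

    public static void main(String[] args) {
        CallbackGuardSketch s = new CallbackGuardSketch();
        PendingOp[] ops = { new PendingOp(0), new PendingOp(1), new PendingOp(2) };
        for (PendingOp op : ops) {
            s.add(op);
        }
        ops[0].completed = true;
        ops[1].completed = true;
        s.sendAddSuccessCallbacks();
        System.out.println(s.delivered); // [0, 1]
        ops[2].completed = true;
        s.sendAddSuccessCallbacks();
        System.out.println(s.delivered); // [0, 1, 2]
    }
}
```

Compared with marking the method `synchronized`, a compare-and-set flag never blocks the calling thread, which is the property the Rejected Alternatives section is concerned with. The trade-off is that a caller who loses the race relies on the thread already draining to deliver its entry.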
