TakaHiR07 opened a new pull request, #4467: URL: https://github.com/apache/bookkeeper/pull/4467
Fix https://github.com/apache/bookkeeper/issues/4457, https://github.com/apache/pulsar/issues/22986 ### Motivation There is a dead lock issue in pulsar-3.0, which is easily to reproduce when some bookies become unavailable. The issue occurred when openLedger and do recoverAdd in ReadOnlyLedgerHandle. - this issue is not exist in branch-4.14, but exist after branch-4.16 - this issue only exist in ReadOnlyLedgerHandle, but not exist in LedgerHandle. Because LedgerHandle#handleBookieFailure's implementation is different from ReadOnlyLedgerHandle. It would not lock metadataLock and then lock other pendingAddOp of the same ledger. We have analysed the stack in issue. The deadlock process can be simplify as: 1. In thread pulsar-io-2, request2 lock pendingAddOp2 in writeComplete(), because of channel disconnect 2. In thread pulsar-io-1, request1 lock pendingAddOp1 in writeComplete(), because of channel disconnect. And lock metadataLock in ReadOnlyLedgerHandle#handleBookieFailure. 3. In thread pulsar-io-2, request2 try to apply metadataLock 4. In thread pulsar-io-1, request1 try to apply pendingAddOp2 because of unsetSuccessAndSendWriteRequest Therefore, the deadlock occur. And I guess it occur after this pr https://github.com/apache/bookkeeper/pull/3784, because the pr add synchronized in pendingAddOp. But I think this is not the root reason. As our known for bookkeeper, handleBookieFailure and the other operations of entry should always run in specific thread for ledgerId. However, we saw that the two pendingAddOps of the same ledger is run in different "pulsar-io" thread. This is not reasonable. After diving into code, we can find that "pulsar-io" threadPool is used for PerChannelBookieClient connect, and "BookKeeperClientWorker" threadPool is used for handling write or read entry. But if connect fail and throw exception, bookieClient.completeAdd still use "pulsar-io" to execute. That is the root reason of the dead lock issue. If we can make sure all the pendingAddOps of the same ledger run in the specific executor, the dead lock will not occur. ### Changes If channel is not ready for addEntry, execute completeAddOperation in specific orderedExecutor of ledger, instead of in eventLoopGroup. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
