sijie opened a new pull request #7506: URL: https://github.com/apache/pulsar/pull/7506
*Motivation*

The broker currently has a timeout mechanism for loading topics, but the underlying managed ledger library does not provide one. This leads to the following situation: a TopicLoad operation times out after 30 seconds, but the CompletableFuture for opening the managed ledger stays in the cache of the managed ledger factory and never completes. All subsequent topic lookups then fail, because every attempt to load the topic finds the cached future and never tries to re-open the managed ledger.

*Modification*

Introduce a timeout mechanism in the managed ledger factory. If a managed ledger is not opened within a given timeout period, its CompletableFuture is removed from the cache, so that subsequent attempts to load the topic can try to open the managed ledger again.

*Tests*

This problem can be consistently reproduced in a chaos test in Kubernetes by killing k8s worker nodes. It can leave a producer stuck forever until the owner broker pod is restarted. The change has been verified in a chaos testing environment.
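The idea described under *Modification* can be sketched as follows. This is a minimal, hypothetical illustration and not the code from this pull request: the class `TimedOpenCache`, its `open` method, and the `timeoutMillis` parameter are invented names, and the real managed ledger factory may structure this differently. The sketch caches one pending `CompletableFuture` per name and, if the open does not finish in time, evicts the entry before failing the future so that a later call can start a fresh open.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch: a future cache whose entries are evicted on timeout or failure.
class TimedOpenCache<V> {
    private final ConcurrentHashMap<String, CompletableFuture<V>> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler;
    private final long timeoutMillis;

    TimedOpenCache(ScheduledExecutorService scheduler, long timeoutMillis) {
        this.scheduler = scheduler;
        this.timeoutMillis = timeoutMillis;
    }

    // Returns the cached future for `name`, or starts a new open via `opener`.
    CompletableFuture<V> open(String name, Supplier<CompletableFuture<V>> opener) {
        CompletableFuture<V> fresh = new CompletableFuture<>();
        CompletableFuture<V> existing = pending.putIfAbsent(name, fresh);
        if (existing != null) {
            return existing; // another caller already started (or finished) the open
        }

        // Forward the outcome of the real open operation into the cached future.
        // If the timeout already fired, complete() is a no-op and the value is dropped.
        opener.get().whenComplete((value, error) -> {
            if (error != null) {
                fail(name, fresh, error);
            } else {
                fresh.complete(value);
            }
        });

        // Fail and evict the entry if the open is still pending after the timeout.
        // Benign race: an open that succeeds between isDone() and the eviction is
        // still delivered to current holders, but the next call will re-open.
        ScheduledFuture<?> timer = scheduler.schedule(() -> {
            if (!fresh.isDone()) {
                fail(name, fresh, new TimeoutException("open timed out: " + name));
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);

        // Cancel the timer once the future completes for any reason.
        fresh.whenComplete((value, error) -> timer.cancel(false));
        return fresh;
    }

    // Evict first, then fail, so a caller that observes the failure and retries
    // never finds the stale future in the cache.
    private void fail(String name, CompletableFuture<V> future, Throwable error) {
        pending.remove(name, future);
        future.completeExceptionally(error);
    }
}
```

With this shape, a first call whose opener never completes fails with a `TimeoutException` after `timeoutMillis`, and a second call for the same name runs its opener again instead of waiting on the stuck future. Successful futures stay cached, which is a design choice of this sketch rather than something stated in the description above.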
