@sijie, from @merlimat's description:

> After getting ZK callback from ZK event thread, we need to jump to a 
> background thread before doing synchronous call to 
> admin.openLedgerNoRecovery(ledgerId); which will try to make a ZK request and 
> wait for a response (which would be coming through same ZK event thread 
> currently blocked..)

My understanding is that the "admin.openLedgerNoRecovery" call at 
https://github.com/apache/bookkeeper/commit/f782a9d818a12479d08c580a68b2566715da4c89#diff-7525f06ad3a1ad0a00a462df4deb4698L645
 will block every time. That is why I was wondering how we have been OK so far, 
in the 5 years since 
https://github.com/apache/bookkeeper/commit/005b62cc60093dd5b32d4abecd06c2e441bc62ae
 was introduced, since the ZK thread deadlock would eventually leave the Auditor 
non-functional.
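
To make sure I am describing the same problem, here is a minimal sketch of the 
pattern as I understand it. It does not use the ZooKeeper or BookKeeper APIs: 
`eventThread`, `worker`, and `blockingOpen` are hypothetical stand-ins (a 
single-threaded executor plays the ZK event thread, and `blockingOpen` plays a 
synchronous call such as admin.openLedgerNoRecovery). A timeout is used only so 
the sketch terminates instead of hanging.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class EventThreadDeadlockSketch {
    // Daemon threads so the JVM can exit without an explicit shutdown.
    static final ThreadFactory DAEMON = r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    };

    // Stand-in for the single ZK event thread: every callback and every
    // response is delivered on this one thread.
    static final ExecutorService eventThread = Executors.newSingleThreadExecutor(DAEMON);

    // Stand-in for a background thread used for blocking work.
    static final ExecutorService worker = Executors.newSingleThreadExecutor(DAEMON);

    // Hypothetical synchronous call: issues a request whose response is
    // delivered on the event thread, then waits for that response.
    static String blockingOpen(long timeoutMs) throws Exception {
        Future<String> response = eventThread.submit(() -> "ledger-metadata");
        return response.get(timeoutMs, TimeUnit.MILLISECONDS);
    }

    // Problem case: the blocking call is made from a callback that is itself
    // running on the event thread, so the response can never be delivered.
    static String callFromEventThread() throws Exception {
        Future<String> callback = eventThread.submit(() -> {
            try {
                return blockingOpen(200);
            } catch (TimeoutException e) {
                return "blocked"; // would wait forever without the timeout
            }
        });
        return callback.get();
    }

    // Fix: the callback only hands off to the worker and returns, which
    // frees the event thread to deliver the response.
    static String callFromWorker() throws Exception {
        CompletableFuture<String> result = new CompletableFuture<>();
        eventThread.execute(() -> worker.execute(() -> {
            try {
                result.complete(blockingOpen(200));
            } catch (Exception e) {
                result.completeExceptionally(e);
            }
        }));
        return result.get(2, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callFromEventThread()); // prints "blocked"
        System.out.println(callFromWorker());      // prints "ledger-metadata"
    }
}
```

In this sketch the first case blocks on every call, not just occasionally, 
which is the behaviour I was expecting from the description above.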

If you are saying that we only run into this because of a race condition in 
the ZK library, then it makes some sense that the issue has not been fully 
identified so far. That being said, I'm still wondering, at a very high level, 
how likely it is to hit this ZK thread deadlock. Since it effectively makes 
the Auditor non-functional, I would like to ascertain how vulnerable we have 
been so far.

> The race condition can happen at any "checkAllLedgers" run, not 
> necessarily the first one. If you look into the code, for each Auditor 
> checkAllLedgers, a new zookeeper client is established, so the race condition 
> can happen at any checkAllLedgers run. But once it is blocked, no future 
> checkAllLedgers will be run.

[ Full content available at: https://github.com/apache/bookkeeper/pull/1608 ]