Thanks @sijie for the detailed analysis. Despite your detailed analysis it needs more digging/understanding to know exactly what is going on.
I think the reason for confusion/uncertainty/convolutedness is because of the fact that each call of Auditor.checkAllLedgers, partially has it own set of resources but not all. Auditor.checkAllLedgers creates its own 'newzk', 'client', 'admin', and 'checker', but for 'ledgerManager' and 'ledgerUnderreplicationManager' (ProcessLostFragmentsCb in Auditor.checkAllLedgers uses 'ledgerUnderreplicationManager') it uses Auditor's instance variables. Because of this we have 2 ZK clients * 2 threads (ZKClients IOThread and event thread), 2 BK clients * 1 mainWorkerPool (OrderedExecutor for each BKClient) and Auditor's singlethreaded executor are in play here. And our maze of callbacks in this component, makes understanding of control/execution transfer between threads super complicated. I'm not sure if this is intentional, but to reason out any such issues, it needs multiple pair of eyes and multiple hours of debugging. Can you please consider evaluating if Auditor.checkAllLedgers needs its own set of resources or not? If it needs its own set of resources then is it ok to go partial path? Anyhow I'm glad I started this conversation, since I'm not convinced with the original Issue description and description of the fix. [ Full content available at: https://github.com/apache/bookkeeper/pull/1608 ] This message was relayed via gitbox.apache.org for [email protected]
