Thanks @sijie for the detailed analysis. Despite your detailed analysis it 
needs more digging/understanding to know exactly what is going on.

I think the reason for confusion/uncertainty/convolutedness is because of the 
fact that each call of Auditor.checkAllLedgers, partially has it own set of 
resources but not all. Auditor.checkAllLedgers creates its own 'newzk', 
'client', 'admin', and 'checker', but for 'ledgerManager' and 
'ledgerUnderreplicationManager' (ProcessLostFragmentsCb in 
Auditor.checkAllLedgers uses 'ledgerUnderreplicationManager') it uses Auditor's 
instance variables. Because of this we have 2 ZK clients * 2 threads (ZKClients 
IOThread and event thread), 2 BK clients * 1 mainWorkerPool (OrderedExecutor 
for each BKClient) and Auditor's singlethreaded executor are in play here. And 
our maze of callbacks in this component, makes understanding of 
control/execution transfer between threads super complicated. I'm not sure if 
this is intentional, but to reason out any such issues, it needs multiple pair 
of eyes and multiple hours of debugging.

Can you please consider evaluating if Auditor.checkAllLedgers needs its own set 
of resources or not? If it needs its own set of resources then is it ok to go 
partial path?

Anyhow I'm glad I started this conversation, since I'm not convinced with the 
original Issue description and description of the fix.

[ Full content available at: https://github.com/apache/bookkeeper/pull/1608 ]
This message was relayed via gitbox.apache.org for [email protected]

Reply via email to