Thanks @sijie for the detailed analysis. Despite your detailed analysis it 
needs more digging/understanding to know exactly what is going on.

I think the reason for confusion/uncertainty/convolutedness is because of the 
fact that each call of Auditor.checkAllLedgers, partially has it own set of 
resources but not all. Auditor.checkAllLedgers creates its own 'newzk', 
'client', 'admin', and 'checker', but for 'ledgerManager' and 
'ledgerUnderreplicationManager' (ProcessLostFragmentsCb in 
Auditor.checkAllLedgers uses 'ledgerUnderreplicationManager') it uses Auditor's 
instance variables. Because of this we have 2 ZK clients * 2 threads (ZKClients 
IOThread and event thread), 2 BK clients * 1 mainWorkerPool (OrderedExecutor 
for each BKClient) and Auditor's singlethreaded executor in play here. And our 
maze of callbacks in this component, makes understanding of control/execution 
transfer between threads super complicated. I'm not sure if this is 
intentional, but to reason out any such issues, it needs multiple pair of eyes 
and multiple hours of debugging.

Can you please consider evaluating if Auditor.checkAllLedgers needs its own set 
of resources or not? If it needs its own set of resources then is it ok to go 
partial path?

Anyhow I'm glad I raised this conversation, since I'm not convinced with the 
original Issue description and description of the fix.

[ Full content available at: https://github.com/apache/bookkeeper/pull/1608 ]
This message was relayed via gitbox.apache.org for [email protected]

Reply via email to