@sijie @merlimat from the call stack reported in 
https://github.com/apache/bookkeeper/issues/1578, we can say that the Auditor's 
single-threaded executor ('executor') is hung while waiting on 
"processDone.await()" in the checkAllLedgers method. So technically, even with 
this fix, it is still possible for the 'processDone' CountDownLatch to never be 
counted down to zero (for whatever reason). In that case the executor will 
again be blocked and the Auditor will become non-functional. So I believe the 
important fix needed here is to not wait forever on this latch: 
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java#L701
 . Instead, wait with a timeout and move on. Ideally I would also move the 
checkers' functionality to some other thread pool/executor, so that it won't 
impact the core functionality of the Auditor, which is super critical in the 
Autoreplication system. 
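
To make the suggestion concrete, here is a rough sketch of what I mean by a bounded wait, with the check running on its own executor. This is not the actual Auditor code: the class name, the helper `awaitWithTimeout`, the `checkerExecutor` name, and the 200 ms timeout are all made up for illustration. The only real API used is `CountDownLatch.await(long, TimeUnit)`, which returns false if the timeout elapses before the count reaches zero.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class BoundedLatchWait {

    // Hypothetical helper: wait for the latch, but give up after the timeout.
    // Returns true if the latch reached zero, false if the wait timed out.
    static boolean awaitWithTimeout(CountDownLatch processDone, long timeout, TimeUnit unit)
            throws InterruptedException {
        return processDone.await(timeout, unit);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a separate executor for the checker, so a slow or stuck
        // check does not occupy the thread that does the core auditing work.
        ExecutorService checkerExecutor = Executors.newSingleThreadExecutor();

        // Simulates the problem case: nobody ever counts this latch down.
        CountDownLatch processDone = new CountDownLatch(1);

        Callable<Boolean> check =
                () -> awaitWithTimeout(processDone, 200, TimeUnit.MILLISECONDS);
        Future<Boolean> result = checkerExecutor.submit(check);

        // The task returns after the timeout instead of parking forever.
        System.out.println("check completed: " + result.get());

        checkerExecutor.shutdown();
    }
}
```

With a latch that is never counted down, the task returns false after the timeout and the executor thread is free again. What the caller should do on a timeout (log, skip this round, retry later) is a separate decision.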

"AuditorBookie-XXXXX:3181" #40 daemon prio=5 os_prio=0 tid=0x00007f049c117830 nid=0x5da4 waiting on condition [0x00007f0477dfc000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000e04e54f8> (a java.util.concurrent.CountDownLatch$Sync)
..
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at org.apache.bookkeeper.replication.Auditor.checkAllLedgers(Auditor.java:696)
        at org.apache.bookkeeper.replication.Auditor$5.run(Auditor.java:359)

[ Full content available at: https://github.com/apache/bookkeeper/pull/1608 ]