@sijie @merlimat from the call stack reported in
https://github.com/apache/bookkeeper/issues/1578, we can see that the Auditor's
single-threaded executor ('executor') is hung waiting on
"processDone.await()" in the checkAllLedgers method. So technically, even with
this fix, there is still a chance that the 'processDone' CountDownLatch is never
counted down to zero (for whatever reason). In that case the executor will be
blocked again and the Auditor will become non-functional. I believe the
important fix needed here is to not wait forever on this latch:
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java#L701
Instead, wait with a timeout and move on. Ideally I would also move the checker
functionality to a separate thread pool/executor, so that it won't impact the
core functionality of the Auditor, which is critical to the autoreplication
system.

"AuditorBookie-XXXXX:3181" #40 daemon prio=5 os_prio=0 tid=0x00007f049c117830 nid=0x5da4 waiting on condition [0x00007f0477dfc000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000e04e54f8> (a java.util.concurrent.CountDownLatch$Sync)
    ..
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
    at org.apache.bookkeeper.replication.Auditor.checkAllLedgers(Auditor.java:696)
    at org.apache.bookkeeper.replication.Auditor$5.run(Auditor.java:359)

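
To make the timeout suggestion above concrete, here is a rough, standalone sketch of the idea; it is not the actual Auditor code. It assumes a hypothetical helper `awaitBounded` and a hypothetical second executor `checkerExecutor`; only `CountDownLatch.await(long, TimeUnit)` and the `ExecutorService` calls are real JDK APIs. The timeout values are placeholders.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LatchTimeoutSketch {

    /**
     * Hypothetical helper: waits for the latch, but gives up after the
     * timeout instead of parking forever. Returns true only if the latch
     * reached zero within the timeout.
     */
    public static boolean awaitBounded(CountDownLatch processDone, long timeout, TimeUnit unit) {
        try {
            return processDone.await(timeout, unit);
        } catch (InterruptedException e) {
            // Preserve the interrupt flag and treat the wait as not completed.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the Auditor's single-threaded 'executor'.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        // Assumption: a separate executor dedicated to the checkers.
        ExecutorService checkerExecutor = Executors.newSingleThreadExecutor();

        // Simulates a 'processDone' latch that is never counted down.
        CountDownLatch neverCountedDown = new CountDownLatch(1);

        checkerExecutor.submit(() -> {
            if (!awaitBounded(neverCountedDown, 200, TimeUnit.MILLISECONDS)) {
                System.out.println("ledger check timed out, moving on");
            }
        });

        // The core executor is not blocked by the stuck check.
        executor.submit(() -> System.out.println("core audit task still runs")).get();

        checkerExecutor.shutdown();
        executor.shutdown();
        checkerExecutor.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```

With the bounded wait, a check whose latch never reaches zero costs at most one timeout period, and with the checks on their own executor the core audit tasks are never queued behind it.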
[ Full content available at: https://github.com/apache/bookkeeper/pull/1608 ]