@reddycharan : let me start with a summary of my thoughts: As if you read the comment above, I raised the concerns before around AutoRecovery calling sync methods in async callbacks is a very bad practise. so I wouldn't be surprised if there are more other deadlocks found in other places in AutoRecovery. So this bug is not intended to address the whole "calling-sync-methods-in-async-callback" problem in AutoRecovery, the PR is intended to address the problem reported in in #1578 first and get the bugfix available for 4.7.2. so the scope is limited to address the bug reported in #1578.
regarding the test, I think it is a bit hard to reproduce the sequence of this race condition. and the change is limited to scope to address the sync call in the stack trace reported in #1578, moving the sync call to a detailed thread pool without blocking zookeeper callback thread is a simple and straightforward problem. As the aim for this PR is to address the specific stack trace in #1578, this change is okay to go in for a bugfix to address the immediate concerns in AutoRecovery. A long term fix for AutoRecovery is to audit all the sync calls in callbacks and make then async, which I have raised that up before. http://mail-archives.apache.org/mod_mbox/bookkeeper-dev/201411.mbox/%3CCAO2yDyZo5AzYgE%3D%3Dk8C5bifA3Miv-2A9w%2B_-6aS%2Bwa4zxRGcOw%40mail.gmail.com%3E --- more details to your questions: > Is this bug regression or is it been like this since beginning? I believe this has been since beginning not just a regression. as I pointed out, the AutoRecovery has multiple places calling sync calls in callbacks. > Because of this deadlock is it just 'checkAllLedgers' checker which is > blocked? or other components which use 'executor' ("auditBookies" checker and > core Auditor functionality as well? you will only have this issue when you call a sync method (waiting for zookeeper result) in a zookeeper callback thread. not all the "call-sync-methods-in-async-callback" will have this issues. I don't think other components are concerns. However as a good practice, we need to clean up the sync calls in async callbacks. that is a very bad practice. However that is a big problem to address in AutoRecovery. > Which will make 'executor' blocked, since 'executor' is > singleThreadScheduledExecutor, then IFIUC all of the Auditor functionality is > blocked, right? yes. that's why the issue reporter says "Auditor periodic check only run once", because Auditor executor is blocked. > why does Issue description say "Auditor run Periodic check only once", if the > analysis made for this fix is correct then "checkAllLedgers" shouldn't run > even once right? I think "Auditor run periodic check only once" is the reporter observed behavior. The race condition can be happening at any "checkAllLedgers" run, not necessarily to be the first one. if you look into the code, for each Auditor checkAllLedgers, a new zookeeper client is established, so the race condition can happen any any `CheckAllLedgers` run. but once it is blocked, no future checkAllLedgers will be run. > I'm not sure if there is comprehensive testcase for this checker, I think this race condition depends on timing. any timing related stuffs usually very hard to be covered or caught via test cases. > I'm little surprised that this commit is merged / issue is closed, with no > testcase to prove the analysis of the fix and validness of the fix I think it is a bit hard to reproduce the sequence of this race condition. especially this related to timing at zookeeper client callback. and the fix is straightforward by moving the blocking call to a separate thread. that's why we make the exception to merge this for bugfix for 4.7.2. [ Full content available at: https://github.com/apache/bookkeeper/pull/1608 ] This message was relayed via gitbox.apache.org for [email protected]
