@reddycharan : let me start with a summary of my thoughts: As if you read the 
comment above, I raised the concerns before around AutoRecovery calling sync 
methods in async callbacks is a very bad practise. so I wouldn't be surprised 
if there are more other deadlocks found in other places in AutoRecovery. So 
this bug is not intended to address the whole 
"calling-sync-methods-in-async-callback" problem in AutoRecovery, the PR is 
intended to address the problem reported in in #1578  first and get the bugfix 
available for 4.7.2. so the scope is limited to address the bug reported in 
#1578. 

regarding the test, I think it is a bit hard to reproduce the sequence of this 
race condition. and the change is limited to scope to address the sync call in 
the stack trace reported in #1578, moving the sync call to a detailed thread 
pool without blocking zookeeper callback thread is a simple and straightforward 
problem. As the aim for this PR is to address the specific stack trace in 
#1578, this change is okay to go in for a bugfix to address the immediate 
concerns in AutoRecovery. 

A long term fix for AutoRecovery is to audit all the sync calls in callbacks 
and make then async, which I have raised that up before. 
http://mail-archives.apache.org/mod_mbox/bookkeeper-dev/201411.mbox/%3CCAO2yDyZo5AzYgE%3D%3Dk8C5bifA3Miv-2A9w%2B_-6aS%2Bwa4zxRGcOw%40mail.gmail.com%3E

--- 

more details to your questions:

> Is this bug regression or is it been like this since beginning?

I believe this has been since beginning not just a regression. as I pointed 
out, the AutoRecovery has multiple places calling sync calls in callbacks.

> Because of this deadlock is it just 'checkAllLedgers' checker which is 
> blocked? or other components which use 'executor' ("auditBookies" checker and 
> core Auditor functionality as well?

you will only have this issue when you call a sync method (waiting for 
zookeeper result) in a zookeeper callback thread.  not all the 
"call-sync-methods-in-async-callback" will have this issues. I don't think 
other components are concerns. However as a good practice, we need to clean up 
the sync calls in async callbacks. that is a very bad practice. However that is 
a big problem to address in AutoRecovery.

> Which will make 'executor' blocked, since 'executor' is 
> singleThreadScheduledExecutor, then IFIUC all of the Auditor functionality is 
> blocked, right?

yes. that's why the issue reporter says "Auditor periodic check only run once", 
because Auditor executor is blocked.

> why does Issue description say "Auditor run Periodic check only once", if the 
> analysis made for this fix is correct then "checkAllLedgers" shouldn't run 
> even once right?

I think "Auditor run periodic check only once" is the reporter observed 
behavior.

The race condition can be happening at any "checkAllLedgers" run, not 
necessarily to be the first one. if you look into the code, for each Auditor 
checkAllLedgers, a new zookeeper client is established, so the race condition 
can happen any any `CheckAllLedgers` run. but once it is blocked, no future 
checkAllLedgers will be run.

> I'm not sure if there is comprehensive testcase for this checker, 

I think this race condition depends on timing. any timing related stuffs 
usually very hard to be covered or caught via test cases. 

> I'm little surprised that this commit is merged / issue is closed, with no 
> testcase to prove the analysis of the fix and validness of the fix

I think it is a bit hard to reproduce the sequence of this race condition. 
especially this related to timing at zookeeper client callback. and the fix is 
straightforward by moving the blocking call to a separate thread. that's why we 
make the exception to merge this for bugfix for 4.7.2. 


[ Full content available at: https://github.com/apache/bookkeeper/pull/1608 ]
This message was relayed via gitbox.apache.org for [email protected]

Reply via email to