Hi,

I recently got a number of stack traces in which a varying number of threads 
are blocked on "m_bundleLock.wait()" inside Felix.acquireBundleLock. From pure 
code inspection, it should not be possible for the framework to stall there.

It seems that Guillaume Nodet also hit this problem and added code to throw an 
IllegalStateException if the wait() call is interrupted (FELIX-2784 [1]).

At other times I hit a similar issue in the Felix.acquireGlobalLock method, 
for which I reported FELIX-3067 [2].

It looks like both lock acquisition methods are prone to some kind of deadlock. 
The acquireGlobalLock method can signal failure through its return code. The 
acquireBundleLock method has no way to fail other than by throwing an 
exception.

I think that, similar to my FELIX-3067 proposal, the acquireBundleLock method 
should only wait a limited time and then retry acquiring the lock. Only if this 
fails a number of times should acquisition be aborted and the method fail with 
an IllegalStateException.
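
To make the idea concrete, here is a rough sketch of a bounded wait with 
retries. It is not the Felix code: the class BoundedBundleLock, its fields and 
the wait and retry values are made up for illustration, and reentrancy as well 
as the interaction with the global lock are left out.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only; a hypothetical stand-in, not Felix.acquireBundleLock. */
class BoundedBundleLock {
    // Plays the role of m_bundleLock: one monitor guarding all bundle locks.
    private final Object monitor = new Object();
    // Names of the bundles that are currently locked (assumed bookkeeping).
    private final Set<String> locked = new HashSet<>();
    private final long waitMillis;
    private final int maxWaits;

    BoundedBundleLock(long waitMillis, int maxWaits) {
        if (waitMillis <= 0) {
            // wait(0) would block without a timeout, which is what we want to avoid.
            throw new IllegalArgumentException("waitMillis must be positive");
        }
        this.waitMillis = waitMillis;
        this.maxWaits = maxWaits;
    }

    /** Acquires the lock or throws IllegalStateException after maxWaits bounded waits. */
    void acquire(String bundle) {
        synchronized (monitor) {
            int waits = 0;
            while (locked.contains(bundle)) {
                if (waits++ == maxWaits) {
                    throw new IllegalStateException(
                        "Unable to acquire lock for " + bundle + " after " + maxWaits + " waits");
                }
                try {
                    // Returns on notifyAll(), on timeout or spuriously; each return
                    // counts as one attempt and the condition is checked again.
                    monitor.wait(waitMillis);
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for " + bundle, ex);
                }
            }
            locked.add(bundle);
        }
    }

    void release(String bundle) {
        synchronized (monitor) {
            locked.remove(bundle);
            monitor.notifyAll();
        }
    }
}
```

With this shape a caller that can never get the lock fails after roughly 
waitMillis * maxWaits instead of hanging forever, and the resulting exception 
at least tells us which bundle lock was contended.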

WDYT ?

Regards
Felix

PS: We are still running 3.0.8, but the acquireBundleLock and acquireGlobalLock 
methods are the same as in trunk (except for Guillaume's FELIX-2784 fix).

[1] https://issues.apache.org/jira/browse/FELIX-2784
[2] https://issues.apache.org/jira/browse/FELIX-3067
