Carsten Ziegeler created SLING-13132:
----------------------------------------
Summary: ABBA deadlock between JobConsumerManager.topicToConsumerMap and QueueJobCache.cache locks
Key: SLING-13132
URL: https://issues.apache.org/jira/browse/SLING-13132
Project: Sling
Issue Type: Bug
Components: Event
Reporter: Carsten Ziegeler
A concurrency review identified a lock-order inversion (ABBA deadlock) between two locks, the monitors on JobConsumerManager.topicToConsumerMap and on QueueJobCache.cache:
Thread A (OSGi unbind path): JobConsumerManager.unbindService() holds synchronized(topicToConsumerMap) and then calls context.asyncProcessingFinished(), which triggers the async handler chain: finishedJob() -> reschedule() -> requeue() -> cache.reschedule() -> synchronized(cache).
Thread B (job processing): JobQueueImpl.startJobs() calls cache.getNextJob(), which holds synchronized(cache) and then calls jobConsumerManager.getExecutor(), which acquires synchronized(topicToConsumerMap).
Lock ordering:
- Thread A: topicToConsumerMap -> cache
- Thread B: cache -> topicToConsumerMap
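A minimal, self-contained reproduction of the lock ordering above could look like the following sketch. It does not use Sling code. The two `Object` monitors only stand in for `topicToConsumerMap` and `cache`, and the class name, thread names, and `runAndCountDeadlocked()` are hypothetical. A `CountDownLatch` makes sure both threads hold their first lock before asking for the second, which turns the race into a certain deadlock. The JDK's `ThreadMXBean.findMonitorDeadlockedThreads()` then confirms the cycle.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class AbbaDeadlockDemo {
    // Starts two threads that take the same two monitors in opposite order and
    // returns how many of those two threads the JVM reports as monitor-deadlocked.
    static int runAndCountDeadlocked() {
        // Hypothetical stand-ins for the two monitors named in the report.
        Object topicToConsumerMap = new Object();
        Object cache = new Object();
        // Forces both threads to hold their first lock before requesting the second.
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Thread A (unbind path): topicToConsumerMap -> cache
        Thread a = new Thread(() -> {
            synchronized (topicToConsumerMap) {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                synchronized (cache) { /* never reached */ }
            }
        }, "unbind-path");

        // Thread B (job processing path): cache -> topicToConsumerMap
        Thread b = new Thread(() -> {
            synchronized (cache) {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                synchronized (topicToConsumerMap) { /* never reached */ }
            }
        }, "job-path");

        // Daemon threads, so the JVM can still exit while they stay blocked.
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int found = 0;
        // Poll for up to ~5 s until the JVM's deadlock detector reports both threads.
        for (int i = 0; i < 100 && found < 2; i++) {
            sleepQuietly(50);
            found = 0;
            long[] ids = mx.findMonitorDeadlockedThreads(); // null if no deadlock
            if (ids != null) {
                for (long id : ids) {
                    if (id == a.getId() || id == b.getId()) {
                        found++;
                    }
                }
            }
        }
        return found;
    }

    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads: " + runAndCountDeadlocked());
    }
}
```

The two threads are marked as daemons so the demo can exit. In a real container, the blocked threads would stay stuck for good.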
This is a classic ABBA deadlock. If each thread acquires its first lock and then waits for the second, neither can ever proceed. With a retry delay <= 0, the requeue path is synchronous, so the deadlock is reachable in production and can cause a complete system freeze.
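The retry-delay condition matters because a synchronous requeue runs on the caller's thread while that thread still holds the caller's lock. The following sketch illustrates this under stated assumptions. `RequeueDelayDemo`, `scheduleRequeue()`, and `requeueRunsUnderCallerLock()` are hypothetical names, not Sling's implementation. The only behaviour taken from the report is that a delay <= 0 makes the requeue synchronous; the scheduler-based handoff for positive delays is an assumption made for contrast.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RequeueDelayDemo {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "requeue-scheduler");
                t.setDaemon(true);
                return t;
            });

    // Hypothetical dispatcher: a non-positive delay runs the requeue inline on the
    // caller's thread; a positive delay hands it to a separate scheduler thread.
    void scheduleRequeue(long delayMillis, Runnable requeue) {
        if (delayMillis <= 0) {
            requeue.run();
        } else {
            scheduler.schedule(requeue, delayMillis, TimeUnit.MILLISECONDS);
        }
    }

    // Returns whether the requeue callback observed the caller's lock as held.
    static boolean requeueRunsUnderCallerLock(long delayMillis) {
        RequeueDelayDemo demo = new RequeueDelayDemo();
        Object topicToConsumerMap = new Object(); // stand-in for the caller's monitor
        CompletableFuture<Boolean> observed = new CompletableFuture<>();
        synchronized (topicToConsumerMap) {
            demo.scheduleRequeue(delayMillis,
                    () -> observed.complete(Thread.holdsLock(topicToConsumerMap)));
        }
        try {
            return observed.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            demo.scheduler.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("delay 0  -> under caller lock: " + requeueRunsUnderCallerLock(0));
        System.out.println("delay 10 -> under caller lock: " + requeueRunsUnderCallerLock(10));
    }
}
```

With a zero delay, the callback sees the caller's monitor as held. If that callback goes on to take the cache lock, this is exactly the topicToConsumerMap -> cache ordering described for Thread A.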
Proposed fix: in unbindService(), collect the async callback contexts into a local list while holding the topicToConsumerMap lock, then release the lock and call asyncProcessingFinished() on each of them. This breaks the topicToConsumerMap -> cache path and leaves the cache -> topicToConsumerMap path unchanged.
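The "collect under the lock, notify outside it" pattern of the proposed fix might be sketched as follows. This is not the actual Sling patch. The `AsyncCallback` interface, its no-argument `asyncProcessingFinished()`, the `register()` method, and the topic-keyed map layout are all hypothetical simplifications. `callbacksRunOutsideLock()` checks with `Thread.holdsLock()` that no callback runs while the `topicToConsumerMap` monitor is held.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UnbindOutsideLockSketch {
    // Hypothetical stand-in for an async job callback; not Sling's actual type.
    interface AsyncCallback {
        void asyncProcessingFinished();
    }

    // Hypothetical registry: topic -> callbacks of in-flight async jobs.
    private final Map<String, List<AsyncCallback>> topicToConsumerMap = new HashMap<>();

    void register(String topic, AsyncCallback callback) {
        synchronized (topicToConsumerMap) {
            topicToConsumerMap.computeIfAbsent(topic, t -> new ArrayList<>()).add(callback);
        }
    }

    // Collect under the lock, then notify after releasing it, so a callback that
    // goes on to take the cache lock never does so while topicToConsumerMap is held.
    void unbindService(String topic) {
        List<AsyncCallback> toFinish;
        synchronized (topicToConsumerMap) {
            List<AsyncCallback> removed = topicToConsumerMap.remove(topic);
            toFinish = (removed == null) ? List.of() : removed;
        }
        for (AsyncCallback cb : toFinish) {
            cb.asyncProcessingFinished(); // runs with no registry lock held
        }
    }

    // Returns true if every registered callback ran without holding topicToConsumerMap.
    static boolean callbacksRunOutsideLock() {
        UnbindOutsideLockSketch registry = new UnbindOutsideLockSketch();
        List<Boolean> heldDuringCallback = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            registry.register("demo/topic",
                    () -> heldDuringCallback.add(Thread.holdsLock(registry.topicToConsumerMap)));
        }
        registry.unbindService("demo/topic");
        return heldDuringCallback.size() == 3 && !heldDuringCallback.contains(true);
    }

    public static void main(String[] args) {
        System.out.println("callbacks ran outside lock: " + callbacksRunOutsideLock());
    }
}
```

One consequence of this ordering is that callbacks run after the map has already been updated. Any code they call therefore sees the post-unbind state, and the callbacks never wait on, or are waited on through, the registry monitor.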
--
This message was sent by Atlassian Jira
(v8.20.10#820010)