[ 
https://issues.apache.org/jira/browse/SLING-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062506#comment-18062506
 ] 

Carsten Ziegeler commented on SLING-12078:
------------------------------------------

Concurrency review finding related to this issue:

The root cause appears to be in TopologyCapabilities.detectTarget() where 
roundRobinMap is a plain HashMap read and written without synchronization. 
Since detectTarget() is called from JobManagerImpl.addJobInternal() from any 
thread, concurrent get/put on HashMap can corrupt its internal structure 
(infinite loop in bucket chains, lost updates).

Proposed fix: Replace HashMap<String, Integer> with ConcurrentHashMap and use 
compute() for the atomic read-modify-write in detectTarget().

This unsynchronized access likely contributes to the race condition described 
in this issue, where the target slingId determination during addJob can produce 
incorrect results under concurrent topology changes.

> Suspected race condition between TOPOLOGY_INIT and JobManager.addJob
> --------------------------------------------------------------------
>
>                 Key: SLING-12078
>                 URL: https://issues.apache.org/jira/browse/SLING-12078
>             Project: Sling
>          Issue Type: Bug
>          Components: Event
>    Affects Versions: Event 4.3.12
>            Reporter: Stefan Egli
>            Priority: Major
>
> Two regular cases where a job is stored as part of JobManager.addJob():
>  * when a topology is defined, it directly gets stored to the appropriate 
> assigned/target slingId subtree. This is the most frequent case by far.
>  * if no topology is defined (no TOPOLOGY_INIT received) yet, it gets put 
> into the unassigned subtree. Later upon receiving TOPOLOGY_INIT 
> CheckTopologyTask.fullRun() finds such unassigned jobs and moves them to the 
> corresponding assigned subtree.
> There is a suspect race condition (test case to be provided), which happens 
> between the thread doing JobManager.addJob() and the thread handling the 
> TOPOLOGY_INIT:
>  * JobManager.addJob determines the target slingId - which is not yet 
> defined, as TOPOLOGY_INIT is just being handled concurrently
>  * CheckTopologyTask.fullRun(), as part of TOPOLOGY_INIT handling, however 
> does not yet find the above new job in unassigned, as the job is just being 
> stored concurrently.
> The result is a job in the unassigned subtree, which waits until the next 
> TopologyEvent happens - which then invokes CheckTopologyTask.fullRun() - 
> which then finds the unassigned job and re/assigns it accordingly. So the job 
> is never lost, but substantially delayed due to this. (the frequency of 
> TopologyEvents depends on actual cluster/property changes happening in the 
> topology and can thus vary).
> Tasks:
> * provide a test case to reproduce
> * fix the race-condition
> * undo 
> [this|https://github.com/apache/sling-org-apache-sling-event/commit/d16686705908099b26d0a3233f61c4e209880f93]
>  and 
> [this|https://github.com/apache/sling-org-apache-sling-event/commit/dea04990b770a92f29c2504aa33d8158d68da58f]
>  commit



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to