Stefan Egli created SLING-12078:
-----------------------------------

             Summary: Suspected race condition between TOPOLOGY_INIT and 
JobManager.addJob
                 Key: SLING-12078
                 URL: https://issues.apache.org/jira/browse/SLING-12078
             Project: Sling
          Issue Type: Bug
          Components: Event
    Affects Versions: Event 4.3.12
            Reporter: Stefan Egli


Two regular cases where a job is stored as part of JobManager.addJob():
 * when a topology is defined, it directly gets stored to the appropriate 
assigned/target slingId subtree. This is the most frequent case by far.
 * if no topology is defined (no TOPOLOGY_INIT received) yet, it gets put into 
the unassigned subtree. Later upon receiving TOPOLOGY_INIT 
CheckTopologyTask.fullRun() finds such unassigned jobs and moves them to the 
corresponding assigned subtree.

There is a suspect race condition (test case to be provided), which happens 
between the thread doing JobManager.addJob() and the thread handling the 
TOPOLOGY_INIT:
 * JobManager.addJob determines the target slingId - which is not yet defined, 
as TOPOLOGY_INIT is just being handled concurrently
 * CheckTopologyTask.fullRun(), as part of TOPOLOGY_INIT handling, however does 
not yet find the above new job in unassigned, as the job is just being stored 
concurrently.

The result is a job in the unassigned subtree, which waits until the next 
TopologyEvent happens - which then invokes CheckTopologyTask.fullRun() - which 
then finds the unassigned job and re/assigns it accordingly. So the job is 
never lost, but substantially delayed due to this. (the frequency of 
TopologyEvents depends on actual cluster/property changes happening in the 
topology and can thus vary)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to