[jira] [Commented] (SLING-12078) Suspected race condition between TOPOLOGY_INIT and JobManager.addJob

2023-10-05 Thread Stefan Egli (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17772259#comment-17772259
 ] 

Stefan Egli commented on SLING-12078:
-

* additional IT test affected, added [same workaround 
there|https://github.com/apache/sling-org-apache-sling-event/commit/dea04990b770a92f29c2504aa33d8158d68da58f]

> Suspected race condition between TOPOLOGY_INIT and JobManager.addJob
> 
>
> Key: SLING-12078
> URL: https://issues.apache.org/jira/browse/SLING-12078
> Project: Sling
>  Issue Type: Bug
>  Components: Event
>Affects Versions: Event 4.3.12
>Reporter: Stefan Egli
>Priority: Major
>
> Two regular cases where a job is stored as part of JobManager.addJob():
>  * when a topology is defined, it directly gets stored to the appropriate 
> assigned/target slingId subtree. This is the most frequent case by far.
>  * if no topology is defined (no TOPOLOGY_INIT received) yet, it gets put 
> into the unassigned subtree. Later upon receiving TOPOLOGY_INIT 
> CheckTopologyTask.fullRun() finds such unassigned jobs and moves them to the 
> corresponding assigned subtree.
> There is a suspect race condition (test case to be provided), which happens 
> between the thread doing JobManager.addJob() and the thread handling the 
> TOPOLOGY_INIT:
>  * JobManager.addJob determines the target slingId - which is not yet 
> defined, as TOPOLOGY_INIT is just being handled concurrently
>  * CheckTopologyTask.fullRun(), as part of TOPOLOGY_INIT handling, however 
> does not yet find the above new job in unassigned, as the job is just being 
> stored concurrently.
> The result is a job in the unassigned subtree, which waits until the next 
> TopologyEvent happens - which then invokes CheckTopologyTask.fullRun() - 
> which then finds the unassigned job and re/assigns it accordingly. So the job 
> is never lost, but substantially delayed due to this. (the frequency of 
> TopologyEvents depends on actual cluster/property changes happening in the 
> topology and can thus vary).
> Tasks:
> * provide a test case to reproduce
> * fix the race-condition
> * undo 
> [this|https://github.com/apache/sling-org-apache-sling-event/commit/d16686705908099b26d0a3233f61c4e209880f93]
>  and 
> [this|https://github.com/apache/sling-org-apache-sling-event/commit/dea04990b770a92f29c2504aa33d8158d68da58f]
>  commit



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (SLING-12078) Suspected race condition between TOPOLOGY_INIT and JobManager.addJob

2023-10-05 Thread Stefan Egli (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17772242#comment-17772242
 ] 

Stefan Egli commented on SLING-12078:
-

* added a [workaround 
attempt|https://github.com/apache/sling-org-apache-sling-event/commit/d16686705908099b26d0a3233f61c4e209880f93]
 for 2 in/frequently failing tests - that must be reverted as part of this 
ticket, once the race-condition is confirmed/fixed

> Suspected race condition between TOPOLOGY_INIT and JobManager.addJob
> 
>
> Key: SLING-12078
> URL: https://issues.apache.org/jira/browse/SLING-12078
> Project: Sling
>  Issue Type: Bug
>  Components: Event
>Affects Versions: Event 4.3.12
>Reporter: Stefan Egli
>Priority: Major
>
> Two regular cases where a job is stored as part of JobManager.addJob():
>  * when a topology is defined, it directly gets stored to the appropriate 
> assigned/target slingId subtree. This is the most frequent case by far.
>  * if no topology is defined (no TOPOLOGY_INIT received) yet, it gets put 
> into the unassigned subtree. Later upon receiving TOPOLOGY_INIT 
> CheckTopologyTask.fullRun() finds such unassigned jobs and moves them to the 
> corresponding assigned subtree.
> There is a suspect race condition (test case to be provided), which happens 
> between the thread doing JobManager.addJob() and the thread handling the 
> TOPOLOGY_INIT:
>  * JobManager.addJob determines the target slingId - which is not yet 
> defined, as TOPOLOGY_INIT is just being handled concurrently
>  * CheckTopologyTask.fullRun(), as part of TOPOLOGY_INIT handling, however 
> does not yet find the above new job in unassigned, as the job is just being 
> stored concurrently.
> The result is a job in the unassigned subtree, which waits until the next 
> TopologyEvent happens - which then invokes CheckTopologyTask.fullRun() - 
> which then finds the unassigned job and re/assigns it accordingly. So the job 
> is never lost, but substantially delayed due to this. (the frequency of 
> TopologyEvents depends on actual cluster/property changes happening in the 
> topology and can thus vary)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)