[ 
https://issues.apache.org/jira/browse/SLING-7778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Egli updated SLING-7778:
-------------------------------
    Description: 
h2. case 1 : observation faster than addOrUpdateJob can finish
The creation of a scheduled job is handled in 
[{{JobSchedulerImpl.addScheduledJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L501]
 which in turn [calls 
{{scheduledJobHandler.addOrUpdateJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L523].
 There, the job is first persisted to the repository via [calling 
{{writeScheduledJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L213].

Normally the sequence of events is then as follows:
* [A] the [synchronized block in 
{{addOrUpdateJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L216]
 is entered and the job is scheduled with quartz.
* [B] then shortly afterwards, observation is triggered and 
{{handleAddOrUpdate}} is called. This will, as a *side-effect* [add the 
scheduled job to the {{scheduledJobs}} 
map|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L446]
* the result of this being that the scheduled job is now in 
{{ScheduledJobHandler.scheduledJobs}}
* later on, when the scheduled job is finished, {{handleRemove()}} (or similar) 
is called, and it checks if it finds the scheduled job in 
{{ScheduledJobHandler.scheduledJobs}}. if it does, all good. And here we're in 
the all good case, so all is fine

However, there's the following race-condition:
* first [B] happens, ie first the observation event is triggered, so 
{{handleAddOrUpdate}} is called, and again the side-effect hits: the scheduled 
job is added to the {{ScheduledJobHandler.scheduledJobs}}. all good so far
* then [A] continues into the synchronized block. *But* now, the [remove of the 
scheduled 
job|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L217]
 actually does damage: the scheduled job ends up *not* being in the 
{{ScheduledJobHandler.scheduledJobs}} map.
* so in this scenario, when a later {{handleRemove()}} (or similar) happens, it 
can't find the scheduled job (see 
[here|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L370]
 h being null)
* the result being that 
[{{JobScheduler.unscheduleJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L371]
 is *not invoked*, *leaving the scheduled job with the JobSchedulerImpl*.

And the result of this race-condition is that 
{{JobSchedulerImpl.scheduledJobs}} map has a zombie scheduled job, which it 
shouldn't actually have. Note *this situation can be verified in the 
/system/console/slingevent webconsole* under "Scheduled Jobs".

So later on for example, a topology event happens, such as a mere 
PROPERTIES_CHANGED_EVENT. That in turn will trigger 
[{{JobSchedulerImpl.configurationChanged()}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L152]
 which will now *re-schedule the previously scheduled job*. And we have a 
duplicate scheduled job execution.

h2. case 2: scheduled job finishes before observation event triggered
Besides case 1, there's also the *more likely* case 2: where a scheduled job 
finishes (thus calls {{ScheduledJobHandler.remove()}}) before the observation 
event for {{ADDED}} was ever called (leading to 
{{ScheduledJobHandler.handleAddOrUpdate()}}).

In this case, the {{remove()}} fails to delete the scheduled job from the 
(parent) {{JobSchedulerImpl}}, as it doesn't have it in the {{scheduledJob}} 
list yet.

Which results - same as case 1 - in an inconsistency leading to a stale 
{{ScheduledJobInfoImpl}} in {{JobSchedulerImpl.scheduledJobs}} map - which when 
a PROPERTIES_CHANGED_EVENT is called leads to duplicate scheduled job execution.

  was:
The creation of a scheduled job is handled in 
[{{JobSchedulerImpl.addScheduledJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L501]
 which in turn [calls 
{{scheduledJobHandler.addOrUpdateJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L523].
 There, the job is first persisted to the repository via [calling 
{{writeScheduledJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L213].

Normally the sequence of events is then as follows:
* [A] the [synchronized block in 
{{addOrUpdateJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L216]
 is entered and the job is scheduled with quartz.
* [B] then shortly afterwards, observation is triggered and 
{{handleAddOrUpdate}} is called. This will, as a *side-effect* [add the 
scheduled job to the {{scheduledJobs}} 
map|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L446]
* the result of this being that the scheduled job is now in 
{{ScheduledJobHandler.scheduledJobs}}
* later on, when the scheduled job is finished, {{handleRemove()}} (or similar) 
is called, and it checks if it finds the scheduled job in 
{{ScheduledJobHandler.scheduledJobs}}. if it does, all good. And here we're in 
the all good case, so all is fine

However, there's the following race-condition:
* first [B] happens, ie first the observation event is triggered, so 
{{handleAddOrUpdate}} is called, and again the side-effect hits: the scheduled 
job is added to the {{ScheduledJobHandler.scheduledJobs}}. all good so far
* then [A] continues into the synchronized block. *But* now, the [remove of the 
scheduled 
job|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L217]
 actually does damage: the scheduled job ends up *not* being in the 
{{ScheduledJobHandler.scheduledJobs}} map.
* so in this scenario, when a later {{handleRemove()}} (or similar) happens, it 
can't find the scheduled job (see 
[here|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L370]
 h being null)
* the result being that 
[{{JobScheduler.unscheduleJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L371]
 is *not invoked*, *leaving the scheduled job with the JobSchedulerImpl*.

And the result of this race-condition is that 
{{JobSchedulerImpl.scheduledJobs}} map has a zombie scheduled job, which it 
shouldn't actually have. Note *this situation can be verified in the 
/system/console/slingevent webconsole* under "Scheduled Jobs".

So later on for example, a topology event happens, such as a mere 
PROPERTIES_CHANGED_EVENT. That in turn will trigger 
[{{JobSchedulerImpl.configurationChanged()}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L152]
 which will now *re-schedule the previously scheduled job*. And we have a 
duplicate scheduled job execution.


> race-condition in scheduled job creation might keep scheduled job in-memory, 
> and cause a duplicate scheduled job execution
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SLING-7778
>                 URL: https://issues.apache.org/jira/browse/SLING-7778
>             Project: Sling
>          Issue Type: Bug
>          Components: Event
>    Affects Versions: Event 4.2.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Major
>             Fix For: Event 4.2.12
>
>
> h2. case 1 : observation faster than addOrUpdateJob can finish
> The creation of a scheduled job is handled in 
> [{{JobSchedulerImpl.addScheduledJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L501]
>  which in turn [calls 
> {{scheduledJobHandler.addOrUpdateJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L523].
>  There, the job is first persisted to the repository via [calling 
> {{writeScheduledJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L213].
> Normally the sequence of events is then as follows:
> * [A] the [synchronized block in 
> {{addOrUpdateJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L216]
>  is entered and the job is scheduled with quartz.
> * [B] then shortly afterwards, observation is triggered and 
> {{handleAddOrUpdate}} is called. This will, as a *side-effect* [add the 
> scheduled job to the {{scheduledJobs}} 
> map|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L446]
> * the result of this being that the scheduled job is now in 
> {{ScheduledJobHandler.scheduledJobs}}
> * later on, when the scheduled job is finished, {{handleRemove()}} (or 
> similar) is called, and it checks if it finds the scheduled job in 
> {{ScheduledJobHandler.scheduledJobs}}. if it does, all good. And here we're 
> in the all good case, so all is fine
> However, there's the following race-condition:
> * first [B] happens, ie first the observation event is triggered, so 
> {{handleAddOrUpdate}} is called, and again the side-effect hits: the 
> scheduled job is added to the {{ScheduledJobHandler.scheduledJobs}}. all good 
> so far
> * then [A] continues into the synchronized block. *But* now, the [remove of 
> the scheduled 
> job|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L217]
>  actually does damage: the scheduled job ends up *not* being in the 
> {{ScheduledJobHandler.scheduledJobs}} map.
> * so in this scenario, when a later {{handleRemove()}} (or similar) happens, 
> it can't find the scheduled job (see 
> [here|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L370]
>  h being null)
> * the result being that 
> [{{JobScheduler.unscheduleJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L371]
>  is *not invoked*, *leaving the scheduled job with the JobSchedulerImpl*.
> And the result of this race-condition is that 
> {{JobSchedulerImpl.scheduledJobs}} map has a zombie scheduled job, which it 
> shouldn't actually have. Note *this situation can be verified in the 
> /system/console/slingevent webconsole* under "Scheduled Jobs".
> So later on for example, a topology event happens, such as a mere 
> PROPERTIES_CHANGED_EVENT. That in turn will trigger 
> [{{JobSchedulerImpl.configurationChanged()}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L152]
>  which will now *re-schedule the previously scheduled job*. And we have a 
> duplicate scheduled job execution.
> h2. case 2: scheduled job finishes before observation event triggered
> Besides case 1, there's also the *more likely* case 2: where a scheduled job 
> finishes (thus calls {{ScheduledJobHandler.remove()}}) before the observation 
> event for {{ADDED}} was ever called (leading to 
> {{ScheduledJobHandler.handleAddOrUpdate()}}).
> In this case, the {{remove()}} fails to delete the scheduled job from the 
> (parent) {{JobSchedulerImpl}}, as it doesn't have it in the {{scheduledJob}} 
> list yet.
> Which results - same as case 1 - in an inconsistency leading to a stale 
> {{ScheduledJobInfoImpl}} in {{JobSchedulerImpl.scheduledJobs}} map - which 
> when a PROPERTIES_CHANGED_EVENT is called leads to duplicate scheduled job 
> execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to