[
https://issues.apache.org/jira/browse/SLING-7778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stefan Egli updated SLING-7778:
-------------------------------
Description:
h2. case 1 : (extremely unlikely) observation faster than addOrUpdateJob can
even finish
The creation of a scheduled job is handled in
[{{JobSchedulerImpl.addScheduledJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L501]
which in turn [calls
{{scheduledJobHandler.addOrUpdateJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L523].
There, the job is first persisted to the repository via [calling
{{writeScheduledJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L213].
Normally the sequence of events is then as follows:
* [A] the [synchronized block in
{{addOrUpdateJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L216]
is entered and the job is scheduled with quartz.
* [B] then shortly afterwards, observation is triggered and
{{handleAddOrUpdate}} is called. This will, as a *side-effect* [add the
scheduled job to the {{scheduledJobs}}
map|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L446]
* the result of this being that the scheduled job is now in
{{ScheduledJobHandler.scheduledJobs}}
* later on, when the scheduled job is finished, {{handleRemove()}} (or
similar) is called, and it checks if it finds the scheduled job in
{{ScheduledJobHandler.scheduledJobs}}. if it does, all good. And here we're in
the all good case, so all is fine
However, there's the following race-condition:
* first [B] happens, ie first the observation event is triggered, so
{{handleAddOrUpdate}} is called, and again the side-effect hits: the scheduled
job is added to the {{ScheduledJobHandler.scheduledJobs}}. all good so far
* then [A] continues into the synchronized block. *But* now, the [remove of
the scheduled
job|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L217]
actually does damage: the scheduled job ends up *not* being in the
{{ScheduledJobHandler.scheduledJobs}} map.
* so in this scenario, when a later {{handleRemove()}} (or similar) happens,
it can't find the scheduled job (see
[here|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L370]
h being null)
* the result being that
[{{JobScheduler.unscheduleJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L371]
is *not invoked*, *leaving the scheduled job with the JobSchedulerImpl*.
And the result of this race-condition is that
{{JobSchedulerImpl.scheduledJobs}} map has a zombie scheduled job, which it
shouldn't actually have. Note *this situation can be verified in the
/system/console/slingevent webconsole* under "Scheduled Jobs".
So later on for example, a topology event happens, such as a mere
PROPERTIES_CHANGED_EVENT. That in turn will trigger
[{{JobSchedulerImpl.configurationChanged()}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L152]
which will now *re-schedule the previously scheduled job*. And we have a
duplicate scheduled job execution.
h2. case 2: (likely) scheduled job finishes before observation event triggered
Besides case 1, there's also the *more likely* case 2: where a scheduled job
finishes (thus calls {{ScheduledJobHandler.remove()}}) before the observation
event for {{ADDED}} was ever called (leading to
{{ScheduledJobHandler.handleAddOrUpdate()}}).
In this case, the {{remove()}} fails to delete the scheduled job from the
(parent) {{JobSchedulerImpl}}, as it doesn't have it in the {{scheduledJob}}
list yet.
Which results - same as case 1 - in an inconsistency leading to a stale
{{ScheduledJobInfoImpl}} in {{JobSchedulerImpl.scheduledJobs}} map - which when
a PROPERTIES_CHANGED_EVENT is called leads to duplicate scheduled job execution.
Note that this is only likely to happen if the schedule is set for 'very soon
now', eg {{at(NOW+1000)}} or something along those lines. If the schedule is
couple minutes in the future, the likelyhood of observation not triggering
before drops towards zero.
So overall, this scenario is likely to happen only with such 'immediate
schedules' and could be mitigated by increasing the delay to eg 1 or a couple
of minutes.
was:
h2. case 1 : (extremely unlikely) observation faster than addOrUpdateJob can
even finish
The creation of a scheduled job is handled in
[{{JobSchedulerImpl.addScheduledJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L501]
which in turn [calls
{{scheduledJobHandler.addOrUpdateJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L523].
There, the job is first persisted to the repository via [calling
{{writeScheduledJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L213].
Normally the sequence of events is then as follows:
* [A] the [synchronized block in
{{addOrUpdateJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L216]
is entered and the job is scheduled with quartz.
* [B] then shortly afterwards, observation is triggered and
{{handleAddOrUpdate}} is called. This will, as a *side-effect* [add the
scheduled job to the {{scheduledJobs}}
map|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L446]
* the result of this being that the scheduled job is now in
{{ScheduledJobHandler.scheduledJobs}}
* later on, when the scheduled job is finished, {{handleRemove()}} (or
similar) is called, and it checks if it finds the scheduled job in
{{ScheduledJobHandler.scheduledJobs}}. if it does, all good. And here we're in
the all good case, so all is fine
However, there's the following race-condition:
* first [B] happens, ie first the observation event is triggered, so
{{handleAddOrUpdate}} is called, and again the side-effect hits: the scheduled
job is added to the {{ScheduledJobHandler.scheduledJobs}}. all good so far
* then [A] continues into the synchronized block. *But* now, the [remove of
the scheduled
job|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L217]
actually does damage: the scheduled job ends up *not* being in the
{{ScheduledJobHandler.scheduledJobs}} map.
* so in this scenario, when a later {{handleRemove()}} (or similar) happens,
it can't find the scheduled job (see
[here|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L370]
h being null)
* the result being that
[{{JobScheduler.unscheduleJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L371]
is *not invoked*, *leaving the scheduled job with the JobSchedulerImpl*.
And the result of this race-condition is that
{{JobSchedulerImpl.scheduledJobs}} map has a zombie scheduled job, which it
shouldn't actually have. Note *this situation can be verified in the
/system/console/slingevent webconsole* under "Scheduled Jobs".
So later on for example, a topology event happens, such as a mere
PROPERTIES_CHANGED_EVENT. That in turn will trigger
[{{JobSchedulerImpl.configurationChanged()}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L152]
which will now *re-schedule the previously scheduled job*. And we have a
duplicate scheduled job execution.
h2. case 2: (likely) scheduled job finishes before observation event triggered
Besides case 1, there's also the *more likely* case 2: where a scheduled job
finishes (thus calls {{ScheduledJobHandler.remove()}}) before the observation
event for {{ADDED}} was ever called (leading to
{{ScheduledJobHandler.handleAddOrUpdate()}}).
In this case, the {{remove()}} fails to delete the scheduled job from the
(parent) {{JobSchedulerImpl}}, as it doesn't have it in the {{scheduledJob}}
list yet.
Which results - same as case 1 - in an inconsistency leading to a stale
{{ScheduledJobInfoImpl}} in {{JobSchedulerImpl.scheduledJobs}} map - which when
a PROPERTIES_CHANGED_EVENT is called leads to duplicate scheduled job execution.
> race-condition in scheduled job creation might keep scheduled job in-memory,
> and cause a duplicate scheduled job execution
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: SLING-7778
> URL: https://issues.apache.org/jira/browse/SLING-7778
> Project: Sling
> Issue Type: Bug
> Components: Event
> Affects Versions: Event 4.2.0
> Reporter: Stefan Egli
> Assignee: Stefan Egli
> Priority: Major
> Fix For: Event 4.2.12
>
>
> h2. case 1 : (extremely unlikely) observation faster than addOrUpdateJob can
> even finish
> The creation of a scheduled job is handled in
> [{{JobSchedulerImpl.addScheduledJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L501]
> which in turn [calls
> {{scheduledJobHandler.addOrUpdateJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L523].
> There, the job is first persisted to the repository via [calling
> {{writeScheduledJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L213].
> Normally the sequence of events is then as follows:
> * [A] the [synchronized block in
> {{addOrUpdateJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L216]
> is entered and the job is scheduled with quartz.
> * [B] then shortly afterwards, observation is triggered and
> {{handleAddOrUpdate}} is called. This will, as a *side-effect* [add the
> scheduled job to the {{scheduledJobs}}
> map|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L446]
> * the result of this being that the scheduled job is now in
> {{ScheduledJobHandler.scheduledJobs}}
> * later on, when the scheduled job is finished, {{handleRemove()}} (or
> similar) is called, and it checks if it finds the scheduled job in
> {{ScheduledJobHandler.scheduledJobs}}. if it does, all good. And here we're
> in the all good case, so all is fine
> However, there's the following race-condition:
> * first [B] happens, ie first the observation event is triggered, so
> {{handleAddOrUpdate}} is called, and again the side-effect hits: the
> scheduled job is added to the {{ScheduledJobHandler.scheduledJobs}}. all good
> so far
> * then [A] continues into the synchronized block. *But* now, the [remove of
> the scheduled
> job|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L217]
> actually does damage: the scheduled job ends up *not* being in the
> {{ScheduledJobHandler.scheduledJobs}} map.
> * so in this scenario, when a later {{handleRemove()}} (or similar) happens,
> it can't find the scheduled job (see
> [here|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L370]
> h being null)
> * the result being that
> [{{JobScheduler.unscheduleJob}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/ScheduledJobHandler.java#L371]
> is *not invoked*, *leaving the scheduled job with the JobSchedulerImpl*.
> And the result of this race-condition is that
> {{JobSchedulerImpl.scheduledJobs}} map has a zombie scheduled job, which it
> shouldn't actually have. Note *this situation can be verified in the
> /system/console/slingevent webconsole* under "Scheduled Jobs".
> So later on for example, a topology event happens, such as a mere
> PROPERTIES_CHANGED_EVENT. That in turn will trigger
> [{{JobSchedulerImpl.configurationChanged()}}|https://github.com/apache/sling-org-apache-sling-event/blob/ca009499cbe30ae30df5069effe051e0bef9ef72/src/main/java/org/apache/sling/event/impl/jobs/scheduling/JobSchedulerImpl.java#L152]
> which will now *re-schedule the previously scheduled job*. And we have a
> duplicate scheduled job execution.
> h2. case 2: (likely) scheduled job finishes before observation event triggered
> Besides case 1, there's also the *more likely* case 2: where a scheduled job
> finishes (thus calls {{ScheduledJobHandler.remove()}}) before the observation
> event for {{ADDED}} was ever called (leading to
> {{ScheduledJobHandler.handleAddOrUpdate()}}).
> In this case, the {{remove()}} fails to delete the scheduled job from the
> (parent) {{JobSchedulerImpl}}, as it doesn't have it in the {{scheduledJob}}
> list yet.
> Which results - same as case 1 - in an inconsistency leading to a stale
> {{ScheduledJobInfoImpl}} in {{JobSchedulerImpl.scheduledJobs}} map - which
> when a PROPERTIES_CHANGED_EVENT is called leads to duplicate scheduled job
> execution.
> Note that this is only likely to happen if the schedule is set for 'very soon
> now', eg {{at(NOW+1000)}} or something along those lines. If the schedule is
> couple minutes in the future, the likelyhood of observation not triggering
> before drops towards zero.
> So overall, this scenario is likely to happen only with such 'immediate
> schedules' and could be mitigated by increasing the delay to eg 1 or a couple
> of minutes.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)