-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65680/
-----------------------------------------------------------

(Updated Feb. 17, 2018, 1:22 a.m.)


Review request for Aurora, David McLaughlin, Renan DelValle, and Santhosh Kumar 
Shanmugham.


Changes
-------

Fix PMD rule violation


Repository: aurora


Description
-------

There is a pretty rare situation that can occur that will cause the scheduler 
to crash.

Stacktrace:
```
Feb 16, 2018 12:22:36 AM 
com.google.common.util.concurrent.ServiceManager$ServiceListener failed
SEVERE: Service CronBatchWorker [FAILED] has failed in the RUNNING state.
java.lang.IllegalArgumentException: Instance ID collision detected.
        at 
org.apache.aurora.scheduler.state.StateManagerImpl.insertPendingTasks(StateManagerImpl.java:126)
        at 
org.apache.aurora.scheduler.cron.quartz.AuroraCronJob.lambda$null$0(AuroraCronJob.java:213)
        at 
org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210)
        at 
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
        at 
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
        at 
org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199)
        at 
org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:89)
        at 
org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
        at 
org.apache.aurora.scheduler.storage.durability.DurableStorage.doInTransaction(DurableStorage.java:198)
        at 
org.apache.aurora.scheduler.storage.durability.DurableStorage.write(DurableStorage.java:221)
        at 
org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:132)
        at 
org.apache.aurora.scheduler.BatchWorker.processBatch(BatchWorker.java:207)
        at org.apache.aurora.scheduler.BatchWorker.run(BatchWorker.java:199)
        at 
com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
        at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
        at java.lang.Thread.run(Thread.java:748)
```

The steps are:
```
1. Schedule and start a cron (runs every minute, graceful shutdown period > 1 
minute)
2. Perform 2 runs of the cron
3. Deschedule the cron
4. Reschedule the cron
5. Perform 3 runs of the cron
6. Scheduler will crash on the 3rd run due to an ID collision between the 
already running cron and a new cron trying to start
```
The reason for this bug is that some state is persisted between cron 
scheduling/descheduling via `killFollowups`. We use Quartz `JobDataMap` to hold 
a "work in progress" token, while the `killFollowups` set indicates 
"completion" in order to ensure there are no concurrent runs. Descheduling a 
cron will remove the "work in progress" token while ignoring the "completion" 
token in `killFollowups`. Later, a "work in progress" token may be added and a 
"completion" token may be seen mistakenly from a previous schedule, causing a 
concurrent run.

For the example above, the runs in step 2 will add the key to the set to show 
that all runs are finished and another run can start. The 3rd run in step 5 
will mistakenly see that the 2nd run has started and finished since the 
"completion" token was preserved from the first set of runs in step 2. This 
will erroneously trigger a concurrent run causing a ID collision.

We should not preserve any state between cron scheduling/descheduling outside 
of the given Quartz `JobDataMap` abstraction. We can use the presence of a 
value here to achieve the same thing as `killFollowups`.


Diffs (updated)
-----

  src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
3604dd4c84043def3fe098ca0f87bf8bab99b3db 
  src/test/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJobTest.java 
5083fcd407a069617634b82db1ad799b5d4d4551 


Diff: https://reviews.apache.org/r/65680/diff/3/

Changes: https://reviews.apache.org/r/65680/diff/2-3/


Testing
-------

Repro'd bug using steps above. Validated code changes solved issue.

Added small change in unit test to ensure state is now preserved in the Quartz 
job context.

`./gradlew test`


Thanks,

Jordan Ly

Reply via email to