----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/65680/#review197687 -----------------------------------------------------------
Master (59f66ff) is red with this patch. ./build-support/jenkins/build.sh :distTar :distZip :assemble :compileTestJavaNote: Some input files use or override a deprecated API. Note: Recompile with -Xlint:deprecation for details. Note: /home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/storage/durability/DurableStorageTest.java uses unchecked or unsafe operations. Note: Recompile with -Xlint:unchecked for details. :processTestResources :testClasses :compileJmhJavaNote: /home/jenkins/jenkins-slave/workspace/AuroraBot/src/jmh/java/org/apache/aurora/benchmark/fakes/FakeSchedulerDriver.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. :processJmhResources NO-SOURCE :jmhClasses :checkstyleJmh :checkstyleMain :checkstyleTest :licenseJmh UP-TO-DATE :licenseMain UP-TO-DATE :licenseTest UP-TO-DATE :license UP-TO-DATE :pmdJmh :pmdMain /home/jenkins/jenkins-slave/workspace/AuroraBot/src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java:149: Avoid if (x != y) ..; else ..; :pmdMain FAILED FAILURE: Build failed with an exception. * What went wrong: Execution failed for task ':pmdMain'. > 1 PMD rule violations were found. See the report at: > file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/pmd/main.html * Try: Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. * Get more help at https://help.gradle.org BUILD FAILED in 5m 5s 38 actionable tasks: 29 executed, 9 up-to-date I will refresh this build result if you post a review containing "@ReviewBot retry" - Aurora ReviewBot On Feb. 16, 2018, 7:45 p.m., Jordan Ly wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/65680/ > ----------------------------------------------------------- > > (Updated Feb. 16, 2018, 7:45 p.m.) > > > Review request for Aurora, David McLaughlin, Renan DelValle, and Santhosh > Kumar Shanmugham. > > > Repository: aurora > > > Description > ------- > > There is a pretty rare situation that can occur that will cause the scheduler > to crash. > > Stacktrace: > ``` > Feb 16, 2018 12:22:36 AM > com.google.common.util.concurrent.ServiceManager$ServiceListener failed > SEVERE: Service CronBatchWorker [FAILED] has failed in the RUNNING state. > java.lang.IllegalArgumentException: Instance ID collision detected. > at > org.apache.aurora.scheduler.state.StateManagerImpl.insertPendingTasks(StateManagerImpl.java:126) > at > org.apache.aurora.scheduler.cron.quartz.AuroraCronJob.lambda$null$0(AuroraCronJob.java:213) > at > org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210) > at > org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144) > at > org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139) > at > org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199) > at > org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:89) > at > org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) > at > org.apache.aurora.scheduler.storage.durability.DurableStorage.doInTransaction(DurableStorage.java:198) > at > org.apache.aurora.scheduler.storage.durability.DurableStorage.write(DurableStorage.java:221) > at > org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:132) > at > org.apache.aurora.scheduler.BatchWorker.processBatch(BatchWorker.java:207) > at org.apache.aurora.scheduler.BatchWorker.run(BatchWorker.java:199) > at > com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66) > at > com.google.common.util.concurrent.Callables$4.run(Callables.java:122) > at java.lang.Thread.run(Thread.java:748) > ``` > > The steps are: > ``` > 1. Schedule and start a cron (runs every minute, graceful shutdown period > 1 > minute) > 2. Perform 2 runs of the cron > 3. Deschedule the cron > 4. Reschedule the cron > 5. Perform 3 runs of the cron > 6. Scheduler will crash on the 3rd run due to an ID collision between the > already running cron and a new cron trying to start > ``` > The reason for this bug is that some state is persisted between cron > scheduling/descheduling via `killFollowups`. We use Quartz `JobDataMap` to > hold a "work in progress" token, while the `killFollowups` set indicates > "completion" in order to ensure there are no concurrent runs. Descheduling a > cron will remove the "work in progress" token while ignoring the "completion" > token in `killFollowups`. Later, a "work in progress" token may be added and > a "completion" token may be seen mistakenly from a previous schedule, causing > a concurrent run. > > For the example above, the runs in step 2 will add the key to the set to show > that all runs are finished and another run can start. The 3rd run in step 5 > will mistakenly see that the 2nd run has started and finished since the > "completion" token was preserved from the first set of runs in step 2. This > will erroneously trigger a concurrent run causing a ID collision. > > We should not preserve any state between cron scheduling/descheduling outside > of the given Quartz `JobDataMap` abstraction. We can use the presence of a > value here to achieve the same thing as `killFollowups`. > > > Diffs > ----- > > src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java > 3604dd4c84043def3fe098ca0f87bf8bab99b3db > > src/test/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJobTest.java > 5083fcd407a069617634b82db1ad799b5d4d4551 > > > Diff: https://reviews.apache.org/r/65680/diff/2/ > > > Testing > ------- > > Repro'd bug using steps above. Validated code changes solved issue. > > Added small change in unit test to ensure state is now preserved in the > Quartz job context. > > `./gradlew test` > > > Thanks, > > Jordan Ly > >