----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/61804/#review183414 -----------------------------------------------------------
Master (aae2b0d) is red with this patch. ./build-support/jenkins/build.sh :generateBuildProperties :processResources :classes :jar :startScripts :distTar :distZip :assemble :compileJmhJavaNote: /home/jenkins/jenkins-slave/workspace/AuroraBot/src/jmh/java/org/apache/aurora/benchmark/fakes/FakeSchedulerDriver.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. :processJmhResources UP-TO-DATE :jmhClasses :checkstyleJmh :jsHint :checkstyleMain :compileTestJava/home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/thrift/aop/MockDecoratedThrift.java:38: Note: Wrote forwarder org.apache.aurora.scheduler.thrift.aop.MockDecoratedThriftForwarder @Forward(AnnotatedAuroraAdmin.class) ^ Note: Some input files use or override a deprecated API. Note: Recompile with -Xlint:deprecation for details. :processTestResources :testClasses :checkstyleTest[ant:checkstyle] [ERROR] /home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java:544: File contains a sequence of empty lines. [RegexpMultiline] [ant:checkstyle] [ERROR] /home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java:547:5: Redundant 'public' modifier. [RedundantModifier] FAILED FAILURE: Build failed with an exception. * What went wrong: Execution failed for task ':checkstyleTest'. > Checkstyle rule violations were found. See the report at: > file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/checkstyle/test.html * Try: Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. BUILD FAILED Total time: 1 mins 40.897 secs I will refresh this build result if you post a review containing "@ReviewBot retry" - Aurora ReviewBot On Aug. 21, 2017, 5:38 p.m., Jordan Ly wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/61804/ > ----------------------------------------------------------- > > (Updated Aug. 21, 2017, 5:38 p.m.) > > > Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, > Stephan Erb, and Zameer Manji. > > > Bugs: AURORA-1945 > https://issues.apache.org/jira/browse/AURORA-1945 > > > Repository: aurora > > > Description > ------- > > The current race condition for offers is possible: > ``` > 1. Scheduler receives an offer and adds it to the executor queue for > processing. > 2. The executor processes the offer and adds it to the HostOffers list. > 3. Scheduler receives a rescind for that offer and adds it to the executor > queue for processing. However, there is a lot of load on the executor so > there might be a delay between receiving the rescind and processing it. > 4. Scheduler accepts the offer before the rescind is processed by the > executor. This will result in launching a task with an invalid offer leading > to TASK_LOST. > ``` > The following logs show this in action: > > Mesos: > ``` > I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with > revocable resources... > W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X > since it is no longer valid > W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers > '[ OFFER_X ]': Offer OFFER_X is no longer valid > I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST > for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid' > ``` > Aurora: > ``` > I0810 14:28:45.676 [SchedulerImpl-0, > MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X > I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] > Accepting offer OFFER_X with ops [LAUNCH] > I0810 14:34:24.186 [Thread-4471585, > MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for > task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: > Task launched with invalid offers: Offer_X is no longer valid > I0810 14:34:32.972 [SchedulerImpl-0, > MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X > W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to > cancel offer: OFFER_X. > ``` > I would like to temporarily ban offers if we receive a rescind but the offer > has not yet been added (ie. still in the executor queue). Then, when we > actually process the offer we will not assign it to tasks since we know it > has been rescinded already. When we ban the offer, we will also add a command > to unban the offer to the executor queue so that future offers will not be > affected. This solution should also avoid the race condition fixed in: > https://issues.apache.org/jira/browse/AURORA-1933 > > > Diffs > ----- > > src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java > 6f2ca35c5d83dde29c24865b4826d4932e96da80 > src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java > 2a42cac651729b8edec839c86ce406f76b17f810 > src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java > a55f8add763f1d5ffbd964afd6e4615ff0021ea5 > src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java > 25399e4a4b8f290065eacaf1e3ec1a36c131266b > > src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java > b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e > src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java > be02449eee97643b258792127521445a2c7fc0d3 > > src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java > 25c1137920553774c32047088ace34279a71bbda > > > Diff: https://reviews.apache.org/r/61804/diff/1/ > > > Testing > ------- > > `./gradlew test` > > Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully. > > I will verify this patch on a live cluster as well before submitting. > > > Thanks, > > Jordan Ly > >
