-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183514
-----------------------------------------------------------

Ship it!

- David McLaughlin


On Aug. 22, 2017, 5:05 p.m., Jordan Ly wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
>
> (Updated Aug. 22, 2017, 5:05 p.m.)
>
>
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham,
> Stephan Erb, and Zameer Manji.
>
>
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
>
>
> Repository: aurora
>
>
> Description
> -------
>
> The following race condition for offers is currently possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for
>    processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor
>    queue for processing. However, there is a lot of load on the executor so
>    there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the
>    executor. This will result in launching a task with an invalid offer,
>    leading to TASK_LOST.
> ```
> The following logs show this in action:
>
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH]
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ```
> I would like to temporarily ban offers if we receive a rescind but the offer
> has not yet been added (i.e., it is still in the executor queue). Then, when
> we actually process the offer, we will not assign it to tasks since we know
> it has already been rescinded. When we ban the offer, we will also add a
> command to unban the offer to the executor queue so that future offers will
> not be affected.
> This solution should also avoid the race condition fixed in:
> https://issues.apache.org/jira/browse/AURORA-1933
>
>
> Diffs
> -----
>
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda
>
>
> Diff: https://reviews.apache.org/r/61804/diff/3/
>
>
> Testing
> -------
>
> `./gradlew test`
>
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
>
> I will verify this patch on a live cluster as well before submitting.
>
>
> Thanks,
>
> Jordan Ly
>
>
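
The ban-then-unban idea from the description can be sketched roughly as follows. This is only an illustration under assumptions, not the patch under review: the class `OfferBanSketch` and all of its method names are hypothetical, offers are reduced to plain string IDs, and the executor is modeled as an in-memory queue that is drained by hand so the interleaving is deterministic (no real threads, no Mesos driver).

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of banning rescinded offers; not Aurora's OfferManager. */
final class OfferBanSketch {
  // Stand-in for the scheduler's executor queue (drained manually here).
  private final Queue<Runnable> executorQueue = new ArrayDeque<>();
  // Offers currently available for task assignment (the "HostOffers" list).
  // Simplification: a real scheduler would need to synchronize access to this.
  private final Set<String> hostOffers = new LinkedHashSet<>();
  // IDs whose rescind has been received but not yet processed by the executor.
  private final Set<String> bannedOffers = ConcurrentHashMap.newKeySet();

  /** Offer callback: adding the offer is deferred to the executor queue. */
  void onOffer(String offerId) {
    executorQueue.add(() -> {
      // Skip offers that were rescinded while they waited in the queue.
      if (!bannedOffers.contains(offerId)) {
        hostOffers.add(offerId);
      }
    });
  }

  /** Rescind callback: ban immediately, then queue the removal and the unban. */
  void onRescind(String offerId) {
    bannedOffers.add(offerId);
    executorQueue.add(() -> {
      hostOffers.remove(offerId);
      bannedOffers.remove(offerId); // later offers are not affected
    });
  }

  /** Returns the first offer that is not banned, if there is one. */
  Optional<String> pickOffer() {
    return hostOffers.stream().filter(id -> !bannedOffers.contains(id)).findFirst();
  }

  /** Runs every queued command, as the executor eventually would. */
  void drainExecutor() {
    Runnable command;
    while ((command = executorQueue.poll()) != null) {
      command.run();
    }
  }

  boolean isBanned(String offerId) {
    return bannedOffers.contains(offerId);
  }

  public static void main(String[] args) {
    OfferBanSketch sketch = new OfferBanSketch();
    sketch.onOffer("OFFER_X");
    sketch.drainExecutor();      // steps 1-2: offer is now in hostOffers
    sketch.onRescind("OFFER_X"); // step 3: rescind received, cleanup still queued
    System.out.println(sketch.pickOffer().isPresent()); // prints "false"
    sketch.drainExecutor();      // the delayed rescind is finally processed
    System.out.println(sketch.isBanned("OFFER_X"));     // prints "false"
  }
}
```

In this sketch the ban is recorded synchronously in the rescind callback, so an assignment attempt made while the executor is backlogged (step 4 of the race) no longer sees the offer. The unban travels through the same queue as the removal, so the ban lasts only as long as the rescind is pending.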