-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/#review177079
-----------------------------------------------------------

Ship it!


Master (2cbaeec) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On June 6, 2017, 7:42 p.m., Zameer Manji wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59853/
> -----------------------------------------------------------
>
> (Updated June 6, 2017, 7:42 p.m.)
>
>
> Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.
>
>
> Bugs: AURORA-1933
>     https://issues.apache.org/jira/browse/AURORA-1933
>
>
> Repository: aurora
>
>
> Description
> -------
>
> In a production environment I was able to observe the following:
> ```
> I0606 00:31:32.510 [Thread-77638,
> MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded:
> 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:32.903 [SchedulerImpl-0,
> MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer:
> 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:34.815 [TaskGroupBatchWorker,
> VersionedSchedulerDriverService:123] Accepting offer
> 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
> ```
>
> Notice that the offer rescind was processed before the actual offer. This is
> possible because there is a race in the `MesosCallbackHandlerImpl`. The offer
> is processed in the executor (to prevent blocking) and the rescind is handled
> directly. This means the offer processing thread (`SchedulerImpl-0`) is
> racing against the callback thread (`Thread-77638`).
>
> In normal operation, there will be seconds to minutes between a rescind and an
> offer, but in some cases an offer can be rescinded very quickly in clusters
> that use oversubscription modules.
>
> To fix this, we move the rescind processing into the same executor as the
> offer processing to ensure they are processed in the order they are received.
> Without fixing this, the rescinded offer exists in the offer manager and can
> be used later to launch a task. This task will immediately fail to launch
> because the offer is invalid.
>
> In this patch, I have also added a metric and logging to record when we fail
> to remove an offer from the offer manager, and cleaned up the logging to allow
> operators to see when an offer was received. With this logging, an operator
> can grep for the offer id and see the entire lifecycle of the offer in the
> scheduler.
>
>
> Diffs
> -----
>
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java
> 5a5281aeaea1e2a4e0eab67069605838ee809c6c
>   src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java
> 5e86504c70083065278864e6ab1cc85c83a45a28
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java
> 17e577b069df9232d57cde171a078d9f6db707ea
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java
> 97febf25cea2024e0ca43366b3d4578e67734884
>
>
> Diff: https://reviews.apache.org/r/59853/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Zameer Manji
>
>
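
For readers following the ordering argument in the quoted description, here is a minimal sketch of the idea. It is not the code from this patch. It assumes the executor is single-threaded (as the `SchedulerImpl-0` thread name suggests), and `OrderedOfferCallbacks`, its method names, and the plain set standing in for the offer manager are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only; not the MesosCallbackHandlerImpl from the patch.
final class OrderedOfferCallbacks {
  // Stand-in for the offer manager: ids of the offers currently held.
  // Only ever touched from the executor thread.
  private final Set<String> offers = new HashSet<>();

  // Assumption: a single-threaded executor, so tasks run in submission order.
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // Called on the callback thread when an offer arrives.
  void resourceOffer(String offerId) {
    executor.execute(() -> offers.add(offerId));
  }

  // Called on the callback thread when an offer is rescinded. The work is
  // queued behind any pending offer task instead of being handled directly.
  void offerRescinded(String offerId) {
    executor.execute(() -> {
      if (!offers.remove(offerId)) {
        // The description says the patch records a metric and a log line here.
        System.err.println("Failed to remove rescinded offer: " + offerId);
      }
    });
  }

  // Snapshot taken on the executor thread, after all previously queued work.
  Set<String> liveOffers() throws InterruptedException, ExecutionException {
    Callable<Set<String>> snapshot = () -> new HashSet<>(offers);
    return executor.submit(snapshot).get();
  }

  void shutdown() {
    executor.shutdown();
  }

  public static void main(String[] args) throws Exception {
    OrderedOfferCallbacks handler = new OrderedOfferCallbacks();
    handler.resourceOffer("offer-1");
    handler.offerRescinded("offer-1");
    System.out.println("Live offers: " + handler.liveOffers());
    handler.shutdown();
  }
}
```

Because both callbacks enqueue onto the same single-threaded executor, a rescind submitted after its offer can never run before it, so the stale offer does not linger in the set. With a multi-threaded executor this guarantee would not hold.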
