-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183514
-----------------------------------------------------------


Ship it!




Ship It!

- David McLaughlin


On Aug. 22, 2017, 5:05 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 5:05 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, 
> Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for 
> processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor 
> queue for processing. However, there is a lot of load on the executor so 
> there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the 
> executor. This will result in launching a task with an invalid offer leading 
> to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with 
> revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X 
> since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers 
> '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST 
> for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, 
> MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] 
> Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, 
> MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for 
> task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: 
> Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, 
> MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to 
> cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer 
> has not yet been added (ie. still in the executor queue). Then, when we 
> actually process the offer we will not assign it to tasks since we know it 
> has been rescinded already. When we ban the offer, we will also add a command 
> to unban the offer to the executor queue so that future offers will not be 
> affected. This solution should also avoid the race condition fixed in: 
> https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 
> 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 
> 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
> a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   
> src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java 
> b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
> be02449eee97643b258792127521445a2c7fc0d3 
>   
> src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 
> 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/3/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>

Reply via email to