> On Aug. 21, 2017, 9:33 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java
> > Line 237 (original), 247-248 (patched)
> > <https://reviews.apache.org/r/61804/diff/2/?file=1801326#file1801326line252>
> >
> >     Maybe add a metric to track the unban-s?
> >     
> >     Do we really need to `unban` the offer before `cancelling` it? Can we 
> > fold `unbanOffer` into the `cancelOffer` method?
> 
> Jordan Ly wrote:
>     Added metrics to check bans and unbans.
>     
>     I would prefer not to fold `unbanOffer` into the `cancelOffer` method as 
> I believe they are distinct enough to leave them as separate calls. I can see 
> use cases where a developer would call one but not the other.
> 
> David McLaughlin wrote:
>     Is cancelOffer currently being called anywhere were we wouldn't want to 
> then ban it if it returns false?

I do not think we should be developing code for future usecases that may or may 
not arise.

I see `cancelOffer` as the equivalent of clearing a particular offer. In that 
case we should clean up all the state related to that particular offer. Leaving 
it in the banned set means, we are expecting developers to do the clean up 
themselves.


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183442
-----------------------------------------------------------


On Aug. 22, 2017, 10:05 a.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 10:05 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, 
> Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for 
> processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor 
> queue for processing. However, there is a lot of load on the executor so 
> there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the 
> executor. This will result in launching a task with an invalid offer leading 
> to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with 
> revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X 
> since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers 
> '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST 
> for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, 
> MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] 
> Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, 
> MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for 
> task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: 
> Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, 
> MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to 
> cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer 
> has not yet been added (ie. still in the executor queue). Then, when we 
> actually process the offer we will not assign it to tasks since we know it 
> has been rescinded already. When we ban the offer, we will also add a command 
> to unban the offer to the executor queue so that future offers will not be 
> affected. This solution should also avoid the race condition fixed in: 
> https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 
> 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 
> 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
> a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   
> src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java 
> b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
> be02449eee97643b258792127521445a2c7fc0d3 
>   
> src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 
> 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/3/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>

Reply via email to