> On Aug. 21, 2017, 9:33 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java
> > Line 237 (original), 247-248 (patched)
> > <https://reviews.apache.org/r/61804/diff/2/?file=1801326#file1801326line252>
> >
> >     Maybe add a metric to track the unban-s?
> >     
> >     Do we really need to `unban` the offer before `cancelling` it? Can we 
> > fold `unbanOffer` into the `cancelOffer` method?
> 
> Jordan Ly wrote:
>     Added metrics to check bans and unbans.
>     
>     I would prefer not to fold `unbanOffer` into the `cancelOffer` method as 
> I believe they are distinct enough to leave them as separate calls. I can see 
> use cases where a developer would call one but not the other.
> 
> David McLaughlin wrote:
>     Is cancelOffer currently being called anywhere were we wouldn't want to 
> then ban it if it returns false?
> 
> Santhosh Kumar Shanmugham wrote:
>     I do not think we should be developing code for future usecases that may 
> or may not arise.
>     
>     I see `cancelOffer` as the equivalent of clearing a particular offer. In 
> that case we should clean up all the state related to that particular offer. 
> Leaving it in the banned set means, we are expecting developers to do the 
> clean up themselves.
> 
> Jordan Ly wrote:
>     That makes sense. Is it possible for an offer to not be sent to the 
> framework, but a rescind to be sent? In this case, I don't think cancelOffer 
> would have enough signal to determine whether to ban or unban an offer, it 
> would return false both times as the offer will never be in HostOffers.
> 
> Jordan Ly wrote:
>     Please ignore the above, RB posted my unrevised comment :(
>     
>     What I originally meant to post:
>     ```
>     I do not see it as developing potentially unneeded code but rather 
> maintaining the abstraction around OfferManager. I do not think a developer 
> would know to call `cancelOffer` twice in order to undo some hidden state for 
> a particular implementation -- I would rather make it explicit. On the other 
> hand, this could also mean that the abstraction I am looking to create is 
> flawed and can be done better/simpler.
>     
>     I definitely understand where you are coming from. We do not want to add 
> superfluous code that future developers must think about and maintain. Is it 
> possible for an offer to not be sent to the framework, but a rescind to be 
> sent? In this case, I don't think cancelOffer would have enough signal to 
> determine whether to ban or unban an offer, it would return false both times 
> as the offer will never be in HostOffers. However, if not, it is definitely 
> possible to do this refactor.
>     ```

We are not relying on the return value during the second `cancelOffer` call.

We already have `banOfferForTaskGroup` without a corresponding 
`unbanOfferForTaskGroup` and we are now setting a different pattern.


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183442
-----------------------------------------------------------


On Aug. 22, 2017, 1:33 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 1:33 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, 
> Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for 
> processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor 
> queue for processing. However, there is a lot of load on the executor so 
> there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the 
> executor. This will result in launching a task with an invalid offer leading 
> to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with 
> revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X 
> since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers 
> '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST 
> for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, 
> MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] 
> Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, 
> MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for 
> task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: 
> Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, 
> MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to 
> cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer 
> has not yet been added (ie. still in the executor queue). Then, when we 
> actually process the offer we will not assign it to tasks since we know it 
> has been rescinded already. When we ban the offer, we will also add a command 
> to unban the offer to the executor queue so that future offers will not be 
> affected. This solution should also avoid the race condition fixed in: 
> https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 
> 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 
> 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
> a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   
> src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java 
> b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
> be02449eee97643b258792127521445a2c7fc0d3 
>   
> src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 
> 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/4/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>

Reply via email to