> On April 19, 2017, 1 a.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java
> > Lines 98-106 (original), 100-108 (patched)
> > <https://reviews.apache.org/r/58259/diff/2/?file=1689849#file1689849line100>
> >
> >     In order to reduce the duration the offer is help in the 
> > `UpdateAgentReserver`, wonder if this might be a better place to add to the 
> > reserver? Killing a task can take arbitrarily long and we risk expiring the 
> > offer.
> 
> David McLaughlin wrote:
>     There are race conditions that appear when moving the logic to ADD_TASK. 
> Some other pending task could be assigned into the offer in the gap between 
> the task being killed by Mesos and the JobUpdateController processing it. We 
> use batch workers to batch up work, so there's definitely gaps where Mesos 
> can send the resources freed up by the kill to Aurora and they are used for 
> some other pending task during this window. 
>     
>     It might seem like a rare race condition, but we've seen at Twitter 
> issues where a bad host causes chronic issues because there are no slots for 
> some large tasks. So in low capacity situations, we'd see things like this 
> happen a lot.  
>     
>     Also note: killing a task doesn't have unbounded execution. There is an 
> upper bound of 5s at the executor level. One other issue here is the 
> turnaround from a task being killed to Mesos sending Aurora a new offer with 
> the freed up resources. Obviously there are best-effort network calls 
> involved in both killing and the offer management, so there are definitely 
> issues where offers could expire unnecessarily due to latency. 
>     
>     At least by reserving the agent upfront when you learn that you're going 
> to be sending a future ADD_TASK, you have the control to extend the timeout 
> to increase cache hits (although I'd be careful extending it too long). 
>     
>     We will have run several simulations in our scale test environment to 
> confirm some of this before it ships.

Let us scale test to confirm the assumptions. Otherwise I am satisfied with the 
patch.


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58259/#review172324
-----------------------------------------------------------


On April 12, 2017, 12:51 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58259/
> -----------------------------------------------------------
> 
> (Updated April 12, 2017, 12:51 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer 
> Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In the Dynamic Reservations review (and on the mailing list), I mentioned 
> that we could implement update affinity with less complexity using the same 
> technique as preemption. Here is how that would work. 
> 
> This just adds a simple wrapper around the preemptor's BiCache structure and 
> then optimistically tries to keep an agent free for a task during the update 
> process. 
> 
> 
> Note: I don't bother even checking the resources before reserving the agent. 
> I figure there is a chance the agent has enough room, and if not we'll catch 
> it when we attempt to veto the offer. We need to always check the offer like 
> this anyway in case constraints change. In the worst case it adds some delay 
> in the rare cases you increase resources. 
> 
> We also don't persist the reservations, so if the Scheduler fails over during 
> an update, the worst case is that any instances between the KILLED and 
> ASSIGNED in-flight batch need to fall back to the current first-fit 
> scheduling algorithm.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 
> f0b148cd158d61cd89cc51dca9f3fa4c6feb1b49 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
> 203f62bacc47470545d095e4d25f7e0f25990ed9 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> a177b301203143539b052524d14043ec8a85a46d 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceAction.java 
> b4cd01b3e03029157d5ca5d1d8e79f01296b57c2 
>   
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
> f25dc0c6d9c05833b9938b023669c9c36a489f68 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
>  e14112479807b4477b82554caf84fe733f62cf58 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdateAgentReserver.java 
> PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java 
> 13cbdadad606d9acaadc541320b22b0ae538cc5e 
>   
> src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java
>  fa1a81785802b82542030e1aae786fe9570d9827 
>   src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java 
> cf2d25ec2e407df7159e0021ddb44adf937e1777 
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 
> b2c4c66850dd8f35e06a631809530faa3b776252 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> c78c7fbd7d600586136863c99ce3d7387895efee 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
> 30b44f88a5b8477e917da21d92361aea1a39ceeb 
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 
> 833fd62c870f96b96343ee5e0eed0d439536381f 
>   
> src/test/java/org/apache/aurora/scheduler/updater/NullAgentReserverTest.java 
> PRE-CREATION 
>   
> src/test/java/org/apache/aurora/scheduler/updater/UpdateAgentReserverImplTest.java
>  PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/58259/diff/2/
> 
> 
> Testing
> -------
> 
> ./gradlew build
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>

Reply via email to