Re: Review Request 51929: Scheduling multiple tasks per round.

Maxim Khutornenko Thu, 15 Sep 2016 18:55:31 -0700


> On Sept. 16, 2016, 1:20 a.m., Aurora ReviewBot wrote:
> > Master (783baae) is red with this patch.
> >   ./build-support/jenkins/build.sh
> > 
> >                              # Create file stdout for capturing output. We can't use StringIO mock
> >                              # because TestProcess is running fork.
> >                              with open(os.path.join(td, 'sys_stdout'), 'w+') as stdout:
> >                                with open(os.path.join(td, 'sys_stderr'), 'w+') as stderr:
> >                                  with mutable_sys():
> >                                    sys.stdout, sys.stderr = stdout, stderr
> > 
> >                                    p = TestProcess('process', 'echo hello world; echo >&2 hello stderr', 0,
> >                                                    taskpath, sandbox, logger_destination=LoggerDestination.BOTH)
> >                                    p.start()
> >                                    rc = wait_for_rc(taskpath.getpath('process_checkpoint'))
> > 
> >                                    assert rc == 0
> >                                    # Check log files were created in std path with correct content
> >                      >             assert_log_content(taskpath, 'stdout', 'hello world\n')
> >                      
> >                      
> > src/test/python/apache/thermos/core/test_process.py:487: 
> >                      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> >                      
> >                      taskpath = <apache.thermos.common.path.TaskPath object 
> > at 0x7fdd3cd73b10>
> >                      log_name = 'stdout'
> >                      expected_content = 'hello world\n'
> >                      
> >                          def assert_log_content(taskpath, log_name, expected_content):
> >                            log = taskpath.with_filename(log_name).getpath('process_logdir')
> >                            assert os.path.exists(log)
> >                            with open(log, 'r') as fp:
> >                      >       assert fp.read() == expected_content
> >                      E       assert '' == 'hello world\n'
> >                      E         + hello world
> >                      
> >                      
> > src/test/python/apache/thermos/core/test_process.py:313: AssertionError
> >                       generated xml file: 
> > /home/jenkins/jenkins-slave/workspace/AuroraBot/dist/test-results/415337499eb72578eab327a6487c1f5c9452b3d6.xml
> >  
> >                       1 failed, 710 passed, 6 skipped, 1 warnings in 226.09 seconds
> >                      
> > FAILURE
> > 
> > 
> > 01:19:57 04:18   [complete]
> >                FAILURE
> > 
> > 
> > I will refresh this build result if you post a review containing 
> > "@ReviewBot retry"


@ReviewBot retry


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51929/#review149162
-----------------------------------------------------------


On Sept. 16, 2016, 12:51 a.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51929/
> -----------------------------------------------------------
> 
> (Updated Sept. 16, 2016, 12:51 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is phase 2 of the scheduling perf improvement effort started in 
> https://reviews.apache.org/r/51759/.
> 
> We can now take a configurable number of task IDs from a given `TaskGroup` 
> per scheduling round. The idea is to go deeper through the offer queue and 
> assign more than one task if possible. This approach delivers substantially 
> better MTTA and still ensures fairness across multiple `TaskGroups`. We have 
> observed an almost linear improvement in MTTA (4x+ with 5 tasks per round), 
> which suggests that `max_tasks_per_schedule_attempt` can be set even higher 
> if the majority of cluster jobs have a large number of instances and/or 
> large update batch sizes.
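To illustrate why taking several task IDs per round helps, here is a minimal sketch of an idealized model. It is not the Aurora implementation: `rounds_to_drain` and the pending-task counts are hypothetical, groups are served strictly round-robin, and every task taken is assumed to find a matching offer (the best case).

```python
from collections import deque


def rounds_to_drain(group_sizes, max_tasks_per_attempt):
    """Count scheduling rounds needed to assign every pending task.

    Each round serves one group (round-robin) and takes at most
    max_tasks_per_attempt task IDs from it. Assumes every task taken
    finds a matching offer, which is the best case.
    """
    queue = deque(size for size in group_sizes if size > 0)
    rounds = 0
    while queue:
        remaining = queue.popleft() - max_tasks_per_attempt
        rounds += 1
        if remaining > 0:
            # The group goes to the back of the queue so other groups get a turn.
            queue.append(remaining)
    return rounds


pending = [100, 20, 5]  # hypothetical pending-task counts for three groups
print(rounds_to_drain(pending, 1))  # one task per round
print(rounds_to_drain(pending, 5))  # five tasks per round
```

In this model the round count drops from 125 to 25 when going from 1 to 5 tasks per round. That 5x is an upper bound for the idealized case, in line with the 4x+ MTTA improvement reported for real clusters.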
> 
> As far as single-round perf goes, we can consider the following 2 
> worst-case scenarios:
> - master: a single task scheduling attempt fails after trying all offers in 
> the queue
> - this patch: N tasks are launched with the very last N offers in the queue 
> + `(N x single_task_launch_latency)`
> 
> Assuming that matching N tasks against M offers takes exactly the same time 
> as matching 1 task against M offers (as they all share the same 
> `TaskGroup`), the only measurable difference comes from the additional 
> `N x single_task_launch_latency` overhead. Based on real cluster 
> observations, `single_task_launch_latency` is less than 1% of a single task 
> scheduling attempt, which is far less than the savings from the avoided 
> additional scheduling rounds.
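The cost argument above can be put into numbers with a rough sketch. The function names and the unit `match_cost` are hypothetical. The 1% launch fraction is the upper bound quoted in the description. The one-task-per-round baseline assumes each of its rounds also scans the whole offer queue.

```python
def worst_case_round_cost(n_tasks, match_cost=1.0, launch_fraction=0.01):
    """Worst-case cost of one round that launches n_tasks.

    match_cost: cost of matching against all M offers, assumed equal for
        1 or N tasks from the same TaskGroup (as the description states).
    launch_fraction: single_task_launch_latency as a fraction of match_cost
        (0.01 is the "less than 1%" upper bound from the description).
    """
    return match_cost + n_tasks * launch_fraction * match_cost


def separate_rounds_cost(n_tasks, match_cost=1.0, launch_fraction=0.01):
    """Cost of launching n_tasks one per round, each round scanning all offers."""
    return n_tasks * (match_cost + launch_fraction * match_cost)


n = 5
batched = worst_case_round_cost(n)
separate = separate_rounds_cost(n)
print(f"batched: {batched:.2f}, separate: {separate:.2f}, "
      f"ratio: {separate / batched:.2f}")
```

Under these assumptions, with N = 5 the batched round costs 1.05 units against 5.05 units for five separate rounds. The added launch overhead is small next to the rounds it avoids.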
> 
> As far as jmh results go, the new approach (batching + multiple tasks per 
> round) is only slightly more demanding (~8%). Both results though are MUCH 
> higher than the real cluster perf, which just confirms we are not bound by 
> CPU time here:
> 
> Master:
> ```
> Benchmark                                                                    
> Mode  Cnt      Score     Error  Units
> SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  
> thrpt   10  17126.183 ± 488.425  ops/s
> ```
> 
> This patch:
> ```
> Benchmark                                                                    
> Mode  Cnt      Score     Error  Units
> SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  
> thrpt   10  15838.051 ± 187.890  ops/s
> ```
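For reference, the "~8%" figure can be checked from the two scores above. This is a small arithmetic sketch; reading the figure as extra time per operation is an assumption about how it was derived.

```python
master_ops = 17126.183  # ops/s on master, from the JMH output above
patch_ops = 15838.051   # ops/s with this patch, from the JMH output above

# Relative throughput drop, and the equivalent increase in time per operation.
drop = (master_ops - patch_ops) / master_ops
per_op_increase = master_ops / patch_ops - 1

print(f"throughput drop: {drop:.1%}")
print(f"time per op increase: {per_op_increase:.1%}")
```

Throughput drops by about 7.5%, which corresponds to about 8.1% more time per operation.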
> 
> NOTE: this will not apply cleanly as it branched off of 
> https://reviews.apache.org/r/51765, which itself depends on 
> https://reviews.apache.org/r/51759/.
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java 
> 9d0d40b82653fb923bed16d06546288a1576c21d 
>   src/main/java/org/apache/aurora/scheduler/filter/AttributeAggregate.java 
> 87b9e1928ab2d44668df1123f32ffdc4197c0c70 
>   src/main/java/org/apache/aurora/scheduler/scheduling/SchedulingModule.java 
> 11e8033438ad0808e446e41bb26b3fa4c04136c7 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroup.java 
> 5d319557057e27fd5fc6d3e553e9ca9139399c50 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> c044ebe6f72183a67462bbd8e5be983eb592c3e9 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
> d266f6a25ae2360db2977c43768a19b1f1efe8ff 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> 7f7b4358ef05c0f0d0e14daac1a5c25488467dc9 
>   
> src/test/java/org/apache/aurora/scheduler/events/NotifyingSchedulingFilterTest.java
>  ece476b918e6f2c128039e561eea23a94d8ed396 
>   
> src/test/java/org/apache/aurora/scheduler/filter/AttributeAggregateTest.java 
> 209f9298a1d55207b9b41159f2ab366f92c1eb70 
>   
> src/test/java/org/apache/aurora/scheduler/filter/SchedulingFilterImplTest.java
>  0cf23df9f373c0d9b27e55a12adefd5f5fd81ba5 
>   src/test/java/org/apache/aurora/scheduler/http/AbstractJettyTest.java 
> c2ceb4e7685a9301f8014a9183e02fbad65bca26 
>   
> src/test/java/org/apache/aurora/scheduler/preemptor/PreemptionVictimFilterTest.java
>  ee5c6528af89cc62a35fdb314358c489556d8131 
>   src/test/java/org/apache/aurora/scheduler/preemptor/PreemptorImplTest.java 
> 98048fabc00f233925b6cca015c2525980556e2b 
>   
> src/test/java/org/apache/aurora/scheduler/preemptor/PreemptorModuleTest.java 
> 2c3e5f32c774be07a5fa28c8bcf3b9a5d88059a1 
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskGroupsTest.java 
> 95cf25eda0a5bfc0cc4c46d1439ebe9d5359ce79 
>   
> src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java
>  72562e6bd9a9860c834e6a9faa094c28600a8fed 
>   src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java 
> b4d27f69ad5d4cce03da9f04424dc35d30e8af29 
> 
> Diff: https://reviews.apache.org/r/51929/diff/
> 
> 
> Testing
> -------
> 
> All types of testing including deploying to test and production clusters.
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>
