Re: Review Request 42126: Accept resource offers from multiple framework roles.

Zhitao Li Thu, 14 Jan 2016 08:31:41 -0800


> On Jan. 14, 2016, 1:06 a.m., Dmitriy Shirchenko wrote:
> > src/test/java/org/apache/aurora/scheduler/mesos/MesosTaskFactoryImplTest.java,
> >  lines 216-217
> > <https://reviews.apache.org/r/42126/diff/8/?file=1196146#file1196146line216>
> >
> >     Is this intentionally commented out?
> 
> Zhitao Li wrote:
>     Yes this is intentially removed. Old behavior of changing this `config` 
> here resulted in `resources` in static final object `OFFER` not sufficient to 
> be allocated, and I don't see any point to use a different executor for this 
> test case, so I've deleted this code.
>     
>     If any reviewer thinks there is a reason to test this with a different 
> executor, I can reconstruct an `Offer` object with correct resource and add 
> this back.
> 
> Bill Farner wrote:
>     This is slightly suspicious...are you sure resource accounting wasn't 
> impacted in some way?  I would expect old test cases to logically work with 
> this feature.
> 
> Zhitao Li wrote:
>     I recalled why I commented this out in the first pass now (this was a bit 
> tricky and I mistakenly thought it's logical bug while it's due to floating 
> point precision).
>     
>     In TaskExecutors.java, `SOME_OVERHEAD_EXECUTOR` uses `0.01` cpus:
>     ```
>       public static final ExecutorSettings SOME_OVERHEAD_EXECUTOR =
>           TestExecutorSettings.thermosOnlyWithOverhead(
>               new ResourceSlot(0.01, Amount.of(256L, Data.MB), Amount.of(0L, 
> Data.MB), 0));
>     ```
>     Unfortunately, `5.01 - 5 - 0.01` **is not zero** in floating point math 
> of Java, so the test keeps throwing `InsufficientResourcesException` even if 
> I constructed an `Offer` whose resource is correctly equivalent to 
> `ResourceSlot.from(TASK_CONFIG).add(TaskExecutors.SOME_OVERHEAD_EXECUTOR.getExecutorOverhead())`.
>     
>     **My proposal is to change all comparison with zero in `AcceptedOffer` 
> class (3 places in current patch) to compare with an `EPSILON` whose value is 
> `1e-6` or something even smaller.** I don't think there is any reasonable 
> usage of resource unit smaller than this granularity, and the floating point 
> math in the class would be more robust. We can document this behavior 
> somewhere if it sounds interesting.
>     
>     The default resource usage for `THERMOS_EXECUTOR` is 0.25 so I don't 
> encounter this elsewhere.
>     
>     Some alternatives which I liked less:
>     1. Change `TaskExecutors.SOME_OVERHEAD_EXECUTOR`'s cpus to a value which 
> could be precisely represented by floating point (0.125 or something similar);
>     2. Manually add a bit more cpu resource to the `Offer` object to trick 
> the test to pass.
>     
>     If the above proposed solution sounds fine, I can send an update to the 
> patch.
> 
> Bill Farner wrote:
>     I'm in favor of using an EPSILON value.  Please be sure to extract a 
> function to handle the equivalence check.
>     
>     It's too bad mesos doesn't use fixed point for scalar resources to avoid 
> these issues.  Looks like there's been recent momentum to make that change: 
> https://issues.apache.org/jira/browse/MESOS-3997
>     
>     They recently addressed an instance of this issue with this change:
>     
> https://github.com/apache/mesos/commit/5b5dd566aea71f654c51b2706a958ca1b5cb07a3
>     
>     ```
>     CHECK_NEAR(result.cpus().get(), cpus().get(), MIN_CPUS);
>     ```
>     
>     For reference, `MIN_CPUS` is 0.01.


Will do with a `nearZero()` helper function.

For the value of `ESPILON`, I took a quick look at related Mesos code, and the 
minimum gradunalar which could make a difference is `MIN_CPUS` / 
`CPU_SHARES_PER_CPU` (1024) in cgroup based isolation, which is still larger 
than `1e-6`, so using the latter value should ensure Aurora user won't lose 
precision even at extreme scheduling condition.


- Zhitao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42126/#review114353
-----------------------------------------------------------


On Jan. 14, 2016, 2:46 p.m., Zhitao Li wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42126/
> -----------------------------------------------------------
> 
> (Updated Jan. 14, 2016, 2:46 p.m.)
> 
> 
> Review request for Aurora, Maxim Khutornenko, Dmitriy Shirchenko, and Bill 
> Farner.
> 
> 
> Bugs: AURORA-1109
>     https://issues.apache.org/jira/browse/AURORA-1109
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This review is a prototype for introducing multiple role support in Aurora.
> This creates a new class OfferAllocation, which allcoates resources to 
> resources field in TaskInfo and ExecutorInfo from an offer.
> 
> Current implementation prefers reserved resources over shared resources ('*' 
> role) if both are present
> 
> Several caveats:
> 1. This performs the allocate after scheduling decision in 
> TaskAssigner.maybeAssign is done, which leaves possibility of inconsistency 
> and late failure.
> 
> 
> Diffs
> -----
> 
>   NEWS acaff9eb2ab184b0ef750f8b8a00c20131997f6b 
>   src/main/java/org/apache/aurora/scheduler/AcceptedOffer.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/ResourceSlot.java 
> 7c3d681c216b78eeecebbe950186e5a79c6fe982 
>   src/main/java/org/apache/aurora/scheduler/Resources.java 
> db422a959ee7b982c2a44323de41ad75d1a40754 
>   
> src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
>  2255dd407cd1810c7df5baf17cfa85f79bfffeb8 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosTaskFactory.java 
> 8fdadda67478bb3110aa442b7d78493cf9c3edb4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> 7e8e456e288986eb0ce92a123b294e1e25d8ed18 
>   src/test/java/org/apache/aurora/scheduler/AcceptedOfferTest.java 
> PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/ResourceSlotTest.java 
> e4ae943303823ac4bfbe999ed22f5999484462d8 
>   
> src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java
>  33149ab415292eff04f38b61f2b1d1eac79f347a 
>   
> src/test/java/org/apache/aurora/scheduler/mesos/MesosTaskFactoryImplTest.java 
> a5793bffabf4e5d6195b1b99f2363d241c0cecf9 
>   src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java 
> 3cbe9acd75def14ae2e0986914ba621fb164b3e4 
> 
> Diff: https://reviews.apache.org/r/42126/diff/
> 
> 
> Testing
> -------
> 
> 1. Unit tested with old and new tests;
> 2. vagrant integration tests: I manually separate out the vagrant box's cpu 
> and memory between 'aurora-test' role and '*' and verified that jobs can 
> still be launched (I can post the vagrant change in another follow upon 
> request).
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>

Re: Review Request 42126: Accept resource offers from multiple framework roles.

Reply via email to