> On Jan. 14, 2016, 1:06 a.m., Dmitriy Shirchenko wrote: > > src/test/java/org/apache/aurora/scheduler/mesos/MesosTaskFactoryImplTest.java, > > lines 216-217 > > <https://reviews.apache.org/r/42126/diff/8/?file=1196146#file1196146line216> > > > > Is this intentionally commented out? > > Zhitao Li wrote: > Yes this is intentially removed. Old behavior of changing this `config` > here resulted in `resources` in static final object `OFFER` not sufficient to > be allocated, and I don't see any point to use a different executor for this > test case, so I've deleted this code. > > If any reviewer thinks there is a reason to test this with a different > executor, I can reconstruct an `Offer` object with correct resource and add > this back. > > Bill Farner wrote: > This is slightly suspicious...are you sure resource accounting wasn't > impacted in some way? I would expect old test cases to logically work with > this feature. > > Zhitao Li wrote: > I recalled why I commented this out in the first pass now (this was a bit > tricky and I mistakenly thought it's logical bug while it's due to floating > point precision). > > In TaskExecutors.java, `SOME_OVERHEAD_EXECUTOR` uses `0.01` cpus: > ``` > public static final ExecutorSettings SOME_OVERHEAD_EXECUTOR = > TestExecutorSettings.thermosOnlyWithOverhead( > new ResourceSlot(0.01, Amount.of(256L, Data.MB), Amount.of(0L, > Data.MB), 0)); > ``` > Unfortunately, `5.01 - 5 - 0.01` **is not zero** in floating point math > of Java, so the test keeps throwing `InsufficientResourcesException` even if > I constructed an `Offer` whose resource is correctly equivalent to > `ResourceSlot.from(TASK_CONFIG).add(TaskExecutors.SOME_OVERHEAD_EXECUTOR.getExecutorOverhead())`. > > **My proposal is to change all comparison with zero in `AcceptedOffer` > class (3 places in current patch) to compare with an `EPSILON` whose value is > `1e-6` or something even smaller.** I don't think there is any reasonable > usage of resource unit smaller than this granularity, and the floating point > math in the class would be more robust. We can document this behavior > somewhere if it sounds interesting. > > The default resource usage for `THERMOS_EXECUTOR` is 0.25 so I don't > encounter this elsewhere. > > Some alternatives which I liked less: > 1. Change `TaskExecutors.SOME_OVERHEAD_EXECUTOR`'s cpus to a value which > could be precisely represented by floating point (0.125 or something similar); > 2. Manually add a bit more cpu resource to the `Offer` object to trick > the test to pass. > > If the above proposed solution sounds fine, I can send an update to the > patch. > > Bill Farner wrote: > I'm in favor of using an EPSILON value. Please be sure to extract a > function to handle the equivalence check. > > It's too bad mesos doesn't use fixed point for scalar resources to avoid > these issues. Looks like there's been recent momentum to make that change: > https://issues.apache.org/jira/browse/MESOS-3997 > > They recently addressed an instance of this issue with this change: > > https://github.com/apache/mesos/commit/5b5dd566aea71f654c51b2706a958ca1b5cb07a3 > > ``` > CHECK_NEAR(result.cpus().get(), cpus().get(), MIN_CPUS); > ``` > > For reference, `MIN_CPUS` is 0.01.
Will do with a `nearZero()` helper function. For the value of `ESPILON`, I took a quick look at related Mesos code, and the minimum gradunalar which could make a difference is `MIN_CPUS` / `CPU_SHARES_PER_CPU` (1024) in cgroup based isolation, which is still larger than `1e-6`, so using the latter value should ensure Aurora user won't lose precision even at extreme scheduling condition. - Zhitao ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/42126/#review114353 ----------------------------------------------------------- On Jan. 14, 2016, 2:46 p.m., Zhitao Li wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/42126/ > ----------------------------------------------------------- > > (Updated Jan. 14, 2016, 2:46 p.m.) > > > Review request for Aurora, Maxim Khutornenko, Dmitriy Shirchenko, and Bill > Farner. > > > Bugs: AURORA-1109 > https://issues.apache.org/jira/browse/AURORA-1109 > > > Repository: aurora > > > Description > ------- > > This review is a prototype for introducing multiple role support in Aurora. > This creates a new class OfferAllocation, which allcoates resources to > resources field in TaskInfo and ExecutorInfo from an offer. > > Current implementation prefers reserved resources over shared resources ('*' > role) if both are present > > Several caveats: > 1. This performs the allocate after scheduling decision in > TaskAssigner.maybeAssign is done, which leaves possibility of inconsistency > and late failure. > > > Diffs > ----- > > NEWS acaff9eb2ab184b0ef750f8b8a00c20131997f6b > src/main/java/org/apache/aurora/scheduler/AcceptedOffer.java PRE-CREATION > src/main/java/org/apache/aurora/scheduler/ResourceSlot.java > 7c3d681c216b78eeecebbe950186e5a79c6fe982 > src/main/java/org/apache/aurora/scheduler/Resources.java > db422a959ee7b982c2a44323de41ad75d1a40754 > > src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java > 2255dd407cd1810c7df5baf17cfa85f79bfffeb8 > src/main/java/org/apache/aurora/scheduler/mesos/MesosTaskFactory.java > 8fdadda67478bb3110aa442b7d78493cf9c3edb4 > src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java > 7e8e456e288986eb0ce92a123b294e1e25d8ed18 > src/test/java/org/apache/aurora/scheduler/AcceptedOfferTest.java > PRE-CREATION > src/test/java/org/apache/aurora/scheduler/ResourceSlotTest.java > e4ae943303823ac4bfbe999ed22f5999484462d8 > > src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java > 33149ab415292eff04f38b61f2b1d1eac79f347a > > src/test/java/org/apache/aurora/scheduler/mesos/MesosTaskFactoryImplTest.java > a5793bffabf4e5d6195b1b99f2363d241c0cecf9 > src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java > 3cbe9acd75def14ae2e0986914ba621fb164b3e4 > > Diff: https://reviews.apache.org/r/42126/diff/ > > > Testing > ------- > > 1. Unit tested with old and new tests; > 2. vagrant integration tests: I manually separate out the vagrant box's cpu > and memory between 'aurora-test' role and '*' and verified that jobs can > still be launched (I can post the vagrant change in another follow upon > request). > > > Thanks, > > Zhitao Li > >