----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/45961/#review141757 -----------------------------------------------------------
Adding to the previous partial review after discussion. ## Invariant checking and documentation An overall comment is that I think since we keep track of shared resource across Master, allocator and sorter, each for a different aspect (allocated vs. used vs. available, etc.) and with different semantics and expactations on uniquesness/number of copies etc., we should very explicitly document these variables and method arguments and check their invariants. Specifically, the following diagram helped me understand the relationship between these variables. Sorter | |- allocations: hashmap<std::string, Allocation> | |- resources: hashmap<SlaveID, Resources> |- scalarQuantities `Allocation.resources` expects 1 copy of each shared resource because it's tracking whether a client has been allocated this resource and doesn't care about how many times it's been allocated or used by the client. Therefore distinct shared resources is an invariant that we should check in `allocated()`. Allocator | |- slaves: hashmap<SlaveID, Slave> | |- allocated: Resources `Slave.allocated` expects 1 copy of each shared resource **per client** and it's grouped into one Resources object. Master | |- Slave | | | |- usedResources: hashmap<FrameworkID, Resources> | |- Framework | |- usedResources: hashmap<SlaveID, Resources> `Slave.usedResources` and `Framework.usedResources` expect 1 copy of each shared resource **per task** as they are tracking "usage". src/master/allocator/mesos/hierarchical.cpp (lines 450 - 459) <https://reviews.apache.org/r/45961/#comment207146> Seems like in `slaves[slaveId].allocated` we should be maintaining one copy of each shared resource **per framework** but this if condition here assumes one copy of each shared resource across framework. src/master/allocator/mesos/hierarchical.cpp (lines 1386 - 1400) <https://reviews.apache.org/r/45961/#comment207150> Ignore my previous comment on this. It wasn't clear to me what invariant for `slaves[slaveId].allocated` is. Now I have looked again, we define shared resources in `slaves[slaveId].allocated` as: We maintain one copy of shared resource in `slaves[slaveId].allocated` **for each framework it is allocated to**. The first question is why do we do this? I can understand that this way if a shared resource is allocated to N frameworks, when it's recovered from all N frameworks it disappears from `slaves[slaveId].allocated`. We should document this where this variable is defined and check this invariant wherever appropriate. And in here, is this equivalent to the block of code? ``` slaves[slaveId].allocated += resources - allocation.shared(); ``` src/master/allocator/mesos/hierarchical.cpp (lines 1406 - 1409) <https://reviews.apache.org/r/45961/#comment207160> Since `allocated` above already excludes shared resources already allocated to this framework, why couldn't be just pass it along to `frameworkSorters[role]->add()|allocated()`? For `roleSorter->allocated()` and `quotaRoleSorter->allocated()` we can use `resources - slaves[slaveId].allocated.shared()` right? src/master/allocator/mesos/hierarchical.cpp (lines 1570 - 1571) <https://reviews.apache.org/r/45961/#comment207164> To expand on this: I understand that if shared resources are offered out in the 1st stage, it'll be exluded from `remainingClusterResources` so by excluding shared resources from `scalarQuantity` we are comparing apples to apples. On the other hand it's possible for some shared resources to be not offered in the 1st stage so it'll remain in `remainingClusterResources`, if we exclude these shared resources from `scalarQuantity` we could be over-allocating nonshared resources. Perhaps we should be consistently checking nonshared resources to prevent over-allocation in stage 2. i.e., ``` // We exlude shared resources from over-allocation check because shared resources are alwas allocatable. if (!remainingClusterResources.nonShared().contains( (allocatedStage2 + scalarQuantity).nonShared())) { continue; } ``` src/master/allocator/sorter/drf/sorter.hpp (line 168) <https://reviews.apache.org/r/45961/#comment207137> Some comments about shared resources would be nice. ``` // We maintain one copy of each shared resource allocated to a client here. A shared resource may be offered to the same client multiple times but here we are only concerned with whether it's in the client's allocation or not. ``` src/master/allocator/sorter/drf/sorter.hpp (line 171) <https://reviews.apache.org/r/45961/#comment207127> Add to this comment about omitting sharedness. src/master/allocator/sorter/drf/sorter.cpp (lines 163 - 170) <https://reviews.apache.org/r/45961/#comment207140> Adding to it, I wonder if it would be cleaner to require the caller not to call `allocated()` with shared resources already allocated to it. If so we should document this in the API and also check the invariant here: ``` const Resources allocatedShared = allocations[name].resources[slaveId].shared(); foreach (const Resource& resource, resources.shared()) { CHECK(!allocatedShared.contains(resource)); } ``` If the invariant holds, then for `scalarQuantities` the below is trivially guranteed: ``` allocations[name].scalarQuantities += resources.createStrippedScalarQuantity(); ``` src/master/master.hpp (line 274) <https://reviews.apache.org/r/45961/#comment207215> Chatted offline. We should refactor this into a `Master::recoverResources()` call to encapsulate this logic and call `allocator->recoverResources()` inside with adjusted arguments. src/master/master.hpp (line 343) <https://reviews.apache.org/r/45961/#comment207217> As discussed, it takes some work to account for `pendingResources` correctly when we could alternatively add task to `usedResources` and remove the task if validation fails. If we choose do it later as a separate review let's make sure we don't change too much and make it hard to refactor later and at least add a TODO here. ``` TODO: In stead of tracking `pendingResources` separately, consider adding tasks (and consider it as STAGING when the accept call arrives and removing it if it fails authorization. This way we only need to keep track of `usedResources`. ``` src/master/validation.hpp (lines 157 - 158) <https://reviews.apache.org/r/45961/#comment207252> Seems like we only need to change 4 lines in master_validation_tests.cpp due to adding the additional arguments as non-optional. Sounds reasonable to me to fix them. src/tests/master_validation_tests.cpp (line 600) <https://reviews.apache.org/r/45961/#comment207270> s/is done only if shared count is 0/is only valid when the volumes are no longer in use/. src/tests/master_validation_tests.cpp (lines 605 - 607) <https://reviews.apache.org/r/45961/#comment207261> Don't use duplicated persistent ID here? src/tests/master_validation_tests.cpp (lines 630 - 632) <https://reviews.apache.org/r/45961/#comment207269> Here we are testing that can't destroy a nonshared persistent volume `disk2` because it's in use? Perhaps it's still reasonable to test it (even thought it can't happen right now) but it's not relevant to this test? src/tests/sorter_tests.cpp (lines 493 - 494) <https://reviews.apache.org/r/45961/#comment207278> The same volume is used a couple of times, use a variable instead of creating it each time? src/tests/sorter_tests.cpp (lines 499 - 500) <https://reviews.apache.org/r/45961/#comment207273> This is not a realistic example. "id1" is already a volume, you shouldn't claim 1/10 of a volume. (I don't see us validating this though) src/tests/sorter_tests.cpp (lines 520 - 521) <https://reviews.apache.org/r/45961/#comment207276> Even if a and b both use 1/10 of the persistent volume, shouldn't they each get 0.1 share? src/tests/sorter_tests.cpp (line 541) <https://reviews.apache.org/r/45961/#comment207280> Where's the volume gone? One blank line above. - Jiang Yan Xu On July 7, 2016, 8:44 a.m., Anindya Sinha wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/45961/ > ----------------------------------------------------------- > > (Updated July 7, 2016, 8:44 a.m.) > > > Review request for mesos, Benjamin Mahler, Joris Van Remoortere, Michael > Park, and Jiang Yan Xu. > > > Bugs: MESOS-4431 > https://issues.apache.org/jira/browse/MESOS-4431 > > > Repository: mesos > > > Description > ------- > > o Each shared resource is accouted via its share count. This count is > updated based on the resource operations (such as add and subtract) > in various scenarios such as task launch and terminate at multiple > modules such as master, allocator, sorter, etc. > o Only allow DESTROY if there are no running or pending tasks using > the volume. However, if the volume is in a pending offer to one or > more frameworks, resind that offer before processing the DESTROY. > > > Diffs > ----- > > src/common/resources.cpp f6ff92b591c15bc8e93fd85e1896349c3a7bb968 > src/common/resources_utils.cpp 8e881a09500b0966b577e514077460f724868a8d > src/master/allocator/mesos/hierarchical.cpp > c1e00039606164599e25ff5f76245e4d35ec3112 > src/master/allocator/sorter/drf/sorter.hpp > e29feebd70277c79f7c3f6fb233e7a36501cf220 > src/master/allocator/sorter/drf/sorter.cpp > 7df4dd641b21ea0705368861bf4679fed1ef078d > src/master/http.cpp bff1fd53462bfec19e4a025c49a00e2855faf4f3 > src/master/master.hpp 60efd22280c62801b080365986fe9773737ca563 > src/master/master.cpp 79e3d78ba45060bc2f2532fdc3d105d1cc888d0f > src/master/validation.hpp 43d876b00511d4ef652c2685d4950c994b5eef31 > src/master/validation.cpp 50ba372d1301d16d1738a3e4f4882b51f9ce06cd > src/tests/master_validation_tests.cpp > 9eb82a569f7c48caa96d4d54a93b199ccac74385 > src/tests/sorter_tests.cpp 0a921347586808863e615ca3dcc23fae92b629f5 > src/v1/resources.cpp 8c3f2d1c1529915a59d47fe37bb3fc7a3267079a > > Diff: https://reviews.apache.org/r/45961/diff/ > > > Testing > ------- > > Tests successful. > > > Thanks, > > Anindya Sinha > >