Good point. YARN seems to have added this admission control as part of YARN-2604 and YARN-3079. YARN-2604 seems to have been added because of a genuine problem: an app's AM container size can exceed the size of the largest NM node in the cluster. They also have a configurable interval that controls how long admission control is relaxed after the RM's startup (yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms). This was added to avoid rejecting apps that are submitted after the RM (re)starts but before any NMs register with the RM.
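For reference, that interval is set in yarn-site.xml like any other RM property; a minimal snippet (the 10-minute value here is purely illustrative, not a recommendation):

    <property>
      <name>yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms</name>
      <!-- e.g., 10 minutes; value is illustrative, tune per cluster -->
      <value>600000</value>
    </property>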
One option is to use a larger value for the above configuration parameter in Myriad-based YARN clusters. However, it's worth examining the effects of doing that in detail, since the same config param is also used by the "work preserving RM restart" feature. Another option is to add a flag to disable admission control in the RM and push that change into YARN.

In addition to (or irrespective of) the above, I think the following problems should still be fixed in Myriad:

a. FGS shouldn't set an NM's capacity to (0G,0CPU) during registration: If an NM is launched with a "medium" profile and FGS sets its capacity to (0G,0CPU), the RM will never schedule containers on that NM unless FGS expands the capacity with additional Mesos offers. Essentially, the capacity used for launching the NM will not be utilized at all. On the other hand, not setting the capacity to (0G,0CPU) is also a problem, because once the RM allocates containers, FGS can't (easily) tell whether the containers were allocated due to the NM's initial capacity or due to additional offers received from Mesos.

b. Configuration to enable/disable FGS: Currently, there is no configuration/semantics that controls whether Myriad uses coarse grained scaling or fine grained scaling. If you run Myriad off of the "phase1" branch, you get coarse grained scaling (CGS). If you run off of "branch_14", you get FGS. As we want branch_14 to be merged into phase1 at some point, we need to think of a way to enable/disable FGS. One option might be to add a configuration that says whether CGS or FGS should be enabled. However, I feel both of these features are pretty useful, and a co-existence of both would be ideal. Hence, introducing a new "zero" profile (or we could name it a "verticallyScalable" profile or similar) and making FGS applicable only to that profile gives the admin a way to use just FGS, just CGS, or a combination of both. (See the sketch below this list.)

c. Specify (profiles, instances) at startup: Currently, "flexup" is the only way to add more NMs. It would be convenient to make the number of instances of each profile configurable in the .yml file. If an admin chooses to have a few NMs with FGS and a few with CGS, it's a lot easier to specify that before starting the RM. Myriad could also exploit this configuration to provide a reasonable workaround for the admission control problem: enforce at least 1 NM of non-zero size. (A possible config shape is sketched below.)
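To make the gating in (b) concrete, here's a rough sketch; the class and method names are made up for illustration and don't exist in Myriad today:

    import org.apache.hadoop.yarn.api.records.Resource;

    // Hypothetical helper, not existing Myriad code.
    public final class FgsPolicy {
      private FgsPolicy() {}

      // FGS applies only to NMs that registered with zero capacity,
      // i.e., NMs launched with the proposed "zero" profile. Every
      // container on such a node is then attributable purely to
      // Mesos offers, which sidesteps the attribution problem in (a).
      public static boolean isFineGrainedScalable(Resource registeredCapacity) {
        return registeredCapacity.getMemory() == 0
            && registeredCapacity.getVirtualCores() == 0;
      }
    }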
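And for (c), the .yml section could look something like the following (key and profile names are hypothetical, just to show the shape):

    nmInstances:   # NMs to launch at RM startup, before any flexup
      zero: 3      # FGS-managed; register with (0G,0CPU)
      low: 1       # >= 1 non-zero NM works around the admission control problem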
Thanks,
Santosh

On Fri, Jul 10, 2015 at 12:32 AM, Adam Bordelon <[email protected]> wrote:

> Why not just add a flag to disable the admission control logic in RMs? This
> same concern came up in the Kubernetes-Mesos framework, which uses a
> similar "placeholder task" architecture to grow/shrink the executor's
> container as new tasks/pods are launched. We spoke to the K8s team, and
> they agreed that the admission control check is not critical to the
> functionality of their API server (task launch API), and it was kept behind
> a flag.
> I know we don't want to depend on forks of either project, but we can push
> changes into Mesos/YARN when necessary.
>
> On Thu, Jul 9, 2015 at 1:59 PM, Santosh Marella <[email protected]>
> wrote:
>
> > With hadoop-2.7, RM rejects app submissions when the capacity required to
> > run the app master exceeds the cluster capacity. Fine Grained Scaling
> > (FGS) is affected by the above problem. This is because FGS sets the Node
> > Manager's capacity to (0G,0CPU) when the NodeManager registers with RM
> > and expands the NM's capacity with resource offers from mesos. Thus, as
> > each NM's capacity is set to (0G,0CPU), the "cluster capacity" stays at
> > (0G,0CPU), causing the submitted apps to be rejected by RM. Although FGS
> > expands the NM's capacity with mesos offers, the probability of the
> > cluster capacity exceeding the AM container's capacity at the instant the
> > app is submitted is still very low.
> >
> > A couple of options were evaluated to fix the above problem:
> >
> > *Option #1*
> > - Let FGS not set the NM's capacity to (0G,0CPU) during the NM's
> > registration with RM. Let FGS use mesos offers to expand the NM's
> > capacity beyond its initial capacity (this is what FGS does already).
> > When the mesos offered capacity is used/relinquished by Myriad, the NM's
> > capacity is brought down to its initial capacity.
> >
> > Pros:
> > - App submissions won't be rejected, as NMs always have a certain minimum
> > capacity (== profile size).
> > - NM capacities are flexible. NMs start with some initial capacity, grow
> > in size with mesos offers, and shrink back to the initial capacity.
> >
> > Cons:
> > - Hard to implement. The main problem is this:
> > Let's say an NM registered with RM with an initial capacity of (3G,2CPU)
> > and Myriad subsequently receives a new offer worth (3G,1CPU). If Myriad
> > sets the NM's capacity to (6G,3CPU) and allows RM to perform scheduling,
> > then RM can potentially allocate 3 containers of (2G,1CPU) each. Once the
> > containers are allocated, Myriad needs to figure out which of these
> > containers were
> > a) allocated purely due to the NM's initial capacity.
> > b) allocated purely due to additional mesos offers.
> > c) allocated due to a combination of the NM's initial capacity and
> > additional mesos offers.
> >
> > (c) is especially complex, since Myriad has to figure out the partial
> > resources consumed from the mesos offers and hold on to these resources
> > as long as the YARN containers utilizing these resources are alive.
> >
> > *Option #2*
> > 1. Introduce the notion of a new "zero" profile for NMs. NMs launched
> > with this profile register with RM with (0G,0CPU). Existing profile
> > definitions (low/medium/high) are left intact.
> > 2. Allow FGS to be applicable only if an NM registers with (0G,0CPU)
> > capacity. With this, all the containers allocated to a zero profile NM
> > are always due to resources offered by mesos.
> > 3. Let Myriad start a configured number of NMs (default==1) with a
> > configured profile (default==low). This helps the "cluster capacity"
> > never be (0G,0CPU) and prevents rejection of apps.
> >
> > Pros:
> > - App submissions won't be rejected, as the "cluster capacity" is never
> > (0G,0CPU).
> > - The YARN cluster always has a certain minimum capacity (== sum of the
> > capacities of NMs launched with non-zero profiles).
> > - YARN cluster capacity remains flexible, since the non-zero NMs grow and
> > shrink in size.
> >
> > Cons:
> > - Not a huge con, but one concern is that since some NMs are of fixed
> > size and some NMs are flexible, an admin might want to be able to control
> > the NM placement wisely. We already have an issue raised to track this,
> > perhaps for a different context, but it's certainly applicable here as
> > well. The issue is: https://github.com/mesos/myriad/issues/105
> >
> > I tried Option #1 during the last week and abandoned it for its
> > complexity. I started implementing #2 (point 3 above is still pending).
> >
> > I'm happy to include any feedback from folks before sending out the code
> > for review.
> >
> > Thanks,
> > Santosh
