> The only options I can imagine are to a) use fixed-size NMs that cannot grow, alongside the elastic zero-profile NMs; or b) disable admission control in the RM so this isn't a problem. I'd vote for b), but depending on how long that takes, you may want to implement a) in the meantime.
Agreed. (a) is implemented in PR: https://github.com/mesos/myriad/pull/116

Santosh

On Tue, Jul 14, 2015 at 4:10 PM, Adam Bordelon <[email protected]> wrote:

Ok, this makes sense now. With zero profile, tracking these will be much easier, since each YARN container would have a placeholder task of the same size.

But with an initial/minimum capacity, you'd need to do extra bookkeeping to know how many resources belong to each task, what the initial NM capacity was, and its current size. Then, when a task completes, you'll see how many resources it was using, and determine whether some/all of those resources should be freed and given back to Mesos, or whether they just go back to idle minimum capacity for the NM. However, since Mesos doesn't (yet) support resizeTask, you'd have to kill the placeholder task that best matches the size of the completed task (even though that task may have originally launched in the minimum capacity). Tricky indeed.

So, I like the idea of the zero-profile NM in that case, but it still doesn't solve the problem of admission control of AMs/containers that are bigger than the current cluster capacity. If we keep some minimum-capacity NMs that can resize with placeholder tasks, you run into the same problem as above. The only options I can imagine are to a) use fixed-size NMs that cannot grow, alongside the elastic zero-profile NMs; or b) disable admission control in the RM so this isn't a problem. I'd vote for b), but depending on how long that takes, you may want to implement a) in the meantime.

On Tue, Jul 14, 2015 at 2:02 PM, Santosh Marella <[email protected]> wrote:

> Why do you worry that Myriad needs to figure out which container is associated with which offer/profile

The framework needs to figure out the size of the "placeholder task" that it needs to launch corresponding to a YARN container. The size of the "placeholder" is not always 1:1 with the size of the YARN container (the zero profile is trying to make it 1:1).

Let's take an example flow:

1. Let's say the NM's initial capacity was (4G,4CPU) and YARN wants to launch a container of size (2G,2CPU). No problem. The NM already has capacity to accommodate it. There is no need to wait for more offers or to launch placeholder mesos tasks. Just launch the YARN container via the NM's heartbeat.

2. Let's say the NM's initial capacity was (4G,4CPU) and (2G,2CPU) is in use by a previously launched YARN container. If the RM's next request requires a container of (3G,3CPU), that container doesn't get allocated to this NM, since the NM doesn't have enough capacity. No problem here either.

3. Let's say mesos offers (1G,1CPU) at this point. The NM has (2G,2CPU) available and Myriad allows adding (1G,1CPU) to it. Thus, the RM believes the NM now has (3G,3CPU) and allocates a (3G,3CPU) container on the NM. At this point, Myriad needs to use the launchTasks() API to launch a "placeholder" task with (1G,1CPU).

Thanks,
Santosh
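To make step 3 concrete, here is a rough sketch of what launching such a placeholder via the Mesos Java API could look like. This is not Myriad's actual code: the class name, the scalar() helper and the executor wiring are hypothetical; only the launchTasks() call is the API referred to above.

```java
import java.util.Collections;

import org.apache.mesos.Protos.*;
import org.apache.mesos.SchedulerDriver;

class PlaceholderLauncherSketch {

  // Hypothetical helper: builds a scalar Mesos resource such as "cpus" or "mem".
  private static Resource scalar(String name, double value) {
    return Resource.newBuilder()
        .setName(name)
        .setType(Value.Type.SCALAR)
        .setScalar(Value.Scalar.newBuilder().setValue(value))
        .build();
  }

  // Launch a no-op "placeholder" task sized to the slice of the offer that was
  // folded into the NM's capacity, so Mesos keeps that (1G,1CPU) allocated to
  // the framework while the corresponding YARN container runs.
  static void launchPlaceholder(SchedulerDriver driver, Offer offer,
                                double cpus, double memMB, ExecutorInfo nmExecutor) {
    TaskInfo placeholder = TaskInfo.newBuilder()
        .setName("yarn-container-placeholder")
        .setTaskId(TaskID.newBuilder()
            .setValue("placeholder_" + offer.getId().getValue()))
        .setSlaveId(offer.getSlaveId())
        .addResources(scalar("cpus", cpus))   // e.g. 1 CPU from the new offer
        .addResources(scalar("mem", memMB))   // e.g. 1024 MB from the new offer
        .setExecutor(nmExecutor)              // runs under the NM's executor
        .build();
    driver.launchTasks(Collections.singletonList(offer.getId()),
                       Collections.singletonList(placeholder));
  }
}
```

When the corresponding YARN container finishes, the framework would kill this placeholder task so the (1G,1CPU) flows back to Mesos.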
On Tue, Jul 14, 2015 at 1:12 AM, Adam Bordelon <[email protected]> wrote:

Ah, I'm understanding better now. Leaving the 2G,1CPU unused is certainly flawed and undesirable. I'm unopposed to the idea of an initial/minimum profile size that grows and shrinks but never goes below its initial/minimum capacity.

As for your concern, a recently completed task will give up its unnamed resources like cpu and memory without knowing/caring where they go. There is no distinction between the cpu from one task and the cpu from another. First priority goes to maintaining the minimum capacity. Anything beyond that can be offered back to Mesos (perhaps after some timeout to promote reuse). The only concern might be with named resources like ports or persistent volumes. Why do you worry that Myriad needs to figure out which container is associated with which offer/profile? Is it not already tracking the YARN containers? How else does it know when to release resources?

That said, a zero profile also makes sense, as does mixing profiles of different sizes (including zero/non-zero) within a cluster. You could restrict dynamic NM resizing to zero-profile NMs for starters, but I'd imagine we'd want them all to be resizable in the future.

On Fri, Jul 10, 2015 at 6:47 PM, Santosh Marella <[email protected]> wrote:

> a) Give the executor at least a minimal 0.01cpu, 1MB RAM

Myriad does this already. The problem is not with respect to the executor's capacity.

> b) ... I don't think I understand your "zero profile" use case

Let's take an example. Say the "low" profile corresponds to (2G,1CPU). When Myriad wants to launch a NM with the "low" profile, it waits for a mesos offer that can hold an executor + a java process for the NM + a (2G,1CPU) capacity that the NM can advertise to the RM for launching future YARN containers. With CGS, when the NM registers with the RM, the YARN scheduler believes the NM has (2G,1CPU) and hence can allocate containers worth (2G,1CPU) when apps require containers.

With FGS, the YARN scheduler believes the NM has (0G,0CPU). This is because FGS intercepts the NM's registration with the RM and sets the NM's advertised capacity to (0G,0CPU), although the NM originally started with (2G,1CPU). At this point, the YARN scheduler cannot allocate containers to this NM. Subsequently, when mesos offers resources on the same slave node, FGS increases the capacity of the NM and notifies the RM that the NM now has capacity available. For example, if (5G,4CPU) is offered to Myriad, then FGS notifies the RM that the NM now has (5G,4CPU). The RM can now allocate containers worth (5G,4CPU) on this NM. If you now count the total resources Myriad has consumed from the given slave node, we observe that Myriad never utilizes the (2G,1CPU) ["low" profile size] that was obtained at the NM's launch time. The notion of a "zero" profile tries to eliminate this wastage by allowing the NM to be launched with an advertisable capacity of (0G,0CPU) in the first place.

Why does FGS change the NM's initial capacity from (2G,1CPU) to (0G,0CPU)? That's the way it had been until now, but it need not be. FGS can choose not to reset the NM's capacity to (0G,0CPU) and instead allow the NM to grow beyond its initial capacity of (2G,1CPU) and shrink back to (2G,1CPU).
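To make the two behaviours concrete, here is a small hypothetical bookkeeping sketch (not Myriad's implementation): the capacity advertised to the RM starts at the profile size, which is zero for a "zero" profile, grows as mesos offers are folded in, and shrinks back but never below the profile floor.

```java
// Hypothetical sketch of the advertised-capacity bookkeeping discussed above;
// not Myriad's actual code.
class AdvertisedCapacity {
  private final double floorMemMB;  // profile size, e.g. 2048 MB for "low", 0 for "zero"
  private final double floorCpus;   // e.g. 1.0 for "low", 0.0 for "zero"
  private double memMB;
  private double cpus;

  AdvertisedCapacity(double profileMemMB, double profileCpus) {
    // What the RM sees right after NM registration:
    //   non-zero profile: the profile size itself;
    //   "zero" profile  : (0G,0CPU), so nothing can be scheduled yet.
    this.floorMemMB = profileMemMB;
    this.floorCpus = profileCpus;
    this.memMB = profileMemMB;
    this.cpus = profileCpus;
  }

  // A mesos offer is folded into the NM; the RM is then told the new capacity.
  void grow(double offerMemMB, double offerCpus) {
    memMB += offerMemMB;
    cpus += offerCpus;
  }

  // Resources backing finished containers are released, but the advertised
  // capacity never drops below the profile floor (the "initial/minimum" idea).
  void shrink(double releasedMemMB, double releasedCpus) {
    memMB = Math.max(floorMemMB, memMB - releasedMemMB);
    cpus = Math.max(floorCpus, cpus - releasedCpus);
  }

  double memMB() { return memMB; }
  double cpus() { return cpus; }
}
```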
I tried this approach recently, but there are other problems if we do that (mentioned under option #1 in my first email) that seemed more complex than going with a "zero" profile.

> c) ... We should still investigate pushing a disable flag into YARN.

Absolutely. It totally makes sense to turn off the admission restriction for auto-scaling YARN clusters.

FWIW, I will be sending out a PR shortly from my private issue_14 branch with the changes I made so far. Comments/suggestions are welcome!

Thanks,
Santosh

On Fri, Jul 10, 2015 at 11:44 AM, Adam Bordelon <[email protected]> wrote:

a) Give the executor at least a minimal 0.01cpu, 1MB RAM, since the executor itself will use some resources, and Mesos gets confused when the executor claims no resources. See https://issues.apache.org/jira/browse/MESOS-1807

b) I agree 100% with needing a way to enable/disable FGS vs. CGS, but I don't think I understand your "zero profile" use case. I'd recommend going with a simple enable/disable flag for the MVP, and then we can extend it later if/when necessary.

c) Interesting. Seems like a hacky workaround for the admission control problem, but I'm intrigued by its complexities and capabilities for other scenarios. We should still investigate pushing a disable flag into YARN.

> YARN-2604, YARN-3079. YARN-2604 seems to have been added because of a genuine problem where an app's AM container size exceeds the size of the largest NM node in the cluster.

This still needs a way to be disabled, because an auto-scaling Hadoop cluster wouldn't worry about insufficient capacity. It would just make more.
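As a rough illustration of point (a), an executor spec can carry a token 0.01 CPU and 1 MB of its own, so Mesos never sees a resource-less executor. The class and method names below are hypothetical, and how the NM/executor is actually started (the CommandInfo) is left to the caller.

```java
import org.apache.mesos.Protos.*;

class NmExecutorSpecSketch {
  // Build an ExecutorInfo that claims a token 0.01 CPU and 1 MB of memory,
  // on top of whatever the NM itself advertises (see MESOS-1807 above).
  static ExecutorInfo minimalNmExecutor(String executorId, CommandInfo command) {
    return ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue(executorId))
        .setCommand(command)
        .addResources(Resource.newBuilder()
            .setName("cpus").setType(Value.Type.SCALAR)
            .setScalar(Value.Scalar.newBuilder().setValue(0.01)))
        .addResources(Resource.newBuilder()
            .setName("mem").setType(Value.Type.SCALAR)
            .setScalar(Value.Scalar.newBuilder().setValue(1.0)))  // MB
        .build();
  }
}
```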
On Fri, Jul 10, 2015 at 11:13 AM, Santosh Marella <[email protected]> wrote:

Good point. YARN seems to have added this admission control as part of YARN-2604, YARN-3079. YARN-2604 seems to have been added because of a genuine problem where an app's AM container size exceeds the size of the largest NM node in the cluster. They also have a configurable interval that controls how long the admission control should be relaxed after the RM's startup (yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms). This was added to avoid rejection of apps submitted after the RM (re)starts and before any NMs register with the RM.

One option is to use a larger value for the above configuration parameter for Myriad-based YARN clusters. However, it might be worth examining in detail the effects of doing that, since the same config param is also used in the "work preserving RM restart" feature.

Another option is to add a flag to disable admission control in the RM and push the change into YARN.

In addition to (or irrespective of) the above, I think the following problems should still be fixed in Myriad:

a. FGS shouldn't set the NM's capacity to (0G,0CPU) during registration:
This is because, if a NM is launched with a "medium" profile and FGS sets its capacity to (0G,0CPU), the RM will never schedule containers on this NM unless FGS expands the capacity with additional mesos offers. Essentially, the capacity used for launching the NM will not be utilized at all. On the other hand, not setting the capacity to (0G,0CPU) is also a problem, because once the RM allocates containers, FGS can't (easily) tell whether the containers were allocated due to the NM's initial capacity or due to additional offers received from Mesos.

b. Configuration to enable/disable FGS:
Currently, there is no configuration/semantics that controls whether Myriad uses coarse grained scaling or fine grained scaling. If you run Myriad off of the "phase1" branch, you get coarse grained scaling (CGS). If you run off of "branch_14", you get FGS. As we want branch_14 to be merged into phase1 at some point, we need to think of a way to enable/disable FGS. One option might be a configuration flag that tells whether CGS or FGS should be enabled. However, I feel both of these features are pretty useful and a co-existence of both would be ideal. Hence, introducing a new "zero profile" (or we could name it a "verticallyScalable" profile or similar) and allowing FGS to apply only to this profile gives us a way for the admin to use just FGS, just CGS, or a combination of both.

c. Specify (profiles, instances) at startup:
Currently, "flexup" is the only way to add more NMs. It would be convenient to make the number of instances of each profile configurable in the .yml file. If the admin chooses to have a few NMs with FGS and a few with CGS, it's a lot easier to specify it before starting the RM. Myriad could exploit this configuration to provide a reasonable workaround to the admission control problem: enforce at least 1 NM of non-zero size.

Thanks,
Santosh

On Fri, Jul 10, 2015 at 12:32 AM, Adam Bordelon <[email protected]> wrote:

Why not just add a flag to disable the admission control logic in RMs? This same concern came up in the Kubernetes-Mesos framework, which uses a similar "placeholder task" architecture to grow/shrink the executor's container as new tasks/pods are launched. We spoke to the K8s team, and they agreed that the admission control check is not critical to the functionality of their API server (task launch API), and it was kept behind a flag. I know we don't want to depend on forks of either project, but we can push changes into Mesos/YARN when necessary.
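Item (c) in Santosh's list above asks for (profiles, instances) to be configurable at startup. A purely hypothetical sketch of what such a section of the .yml could look like (the key names and values are invented for illustration and are not Myriad's actual schema):

```yaml
# Hypothetical sketch only; not Myriad's actual configuration schema.
profiles:
  zero:        # FGS-only NMs; capacity comes entirely from mesos offers
    cpu: 0
    mem: 0
  low:         # fixed-size NM; keeps cluster capacity non-zero for admission control
    cpu: 1
    mem: 2048
nmInstances:   # NMs launched at startup, before any "flexup"
  low: 1       # at least one non-zero NM so app submissions aren't rejected
  zero: 3
```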
On Thu, Jul 9, 2015 at 1:59 PM, Santosh Marella <[email protected]> wrote:

With hadoop-2.7, the RM rejects app submissions when the capacity required to run the app master exceeds the cluster capacity. Fine Grained Scaling (FGS) is affected by the above problem. This is because FGS sets the Node Manager's capacity to (0G,0CPU) when the NodeManager registers with the RM and expands the NM's capacity with resource offers from mesos. Thus, as each NM's capacity is set to (0G,0CPU), the "cluster capacity" stays at (0G,0CPU), causing the submitted apps to be rejected by the RM. Although FGS expands the NM's capacity with mesos offers, the probability of the cluster capacity exceeding the AM container's capacity at the instant the app is submitted is still very low.

A couple of options were evaluated to fix the above problem:

*Option #1*
- Let FGS not set the NM's capacity to (0G,0CPU) during the NM's registration with the RM. Let FGS use mesos offers to expand the NM's capacity beyond its initial capacity (this is what FGS does already). When the mesos-offered capacity is used/relinquished by Myriad, the NM's capacity is brought down to its initial capacity.

Pros:
- App submissions won't be rejected, as NMs always have a certain minimum capacity (== profile size).
- NM capacities are flexible. NMs start with some initial capacity, grow in size with mesos offers and shrink back to the initial capacity.

Cons:
- Hard to implement. The main problem is this: let's say an NM registered with the RM with an initial capacity of (3G,2CPU) and Myriad subsequently receives a new offer worth (3G,1CPU). If Myriad sets the NM's capacity to (6G,3CPU) and allows the RM to perform scheduling, then the RM can potentially allocate 3 containers of (2G,1CPU) each. Once the containers are allocated, Myriad needs to figure out which of these containers are
  a) allocated purely due to the NM's initial capacity,
  b) allocated purely due to additional mesos offers,
  c) allocated due to a combination of the NM's initial capacity and additional mesos offers.

(c) is especially complex, since Myriad has to figure out the partial resources consumed from the mesos offers and hold on to these resources as long as the YARN containers utilizing them are alive.

*Option #2*
1. Introduce the notion of a new "zero" profile for NMs. NMs launched with this profile register with the RM with (0G,0CPU). Existing profile definitions (low/medium/high) are left intact.
2. Allow FGS to be applicable only if a NM registers with (0G,0CPU) capacity. With this, all the containers allocated to a zero-profile NM are always due to resources offered by mesos.
3. Let Myriad start a configured number of NMs (default==1) with a configured profile (default==low). This helps the "cluster capacity" never be (0G,0CPU) and prevents rejection of apps.

Pros:
- App submissions won't be rejected, as the "cluster capacity" is never (0G,0CPU).
- The YARN cluster always has a certain minimum capacity (== sum of capacities of NMs launched with non-zero profiles).
- YARN cluster capacity remains flexible, since the non-zero NMs grow and shrink in size.

Cons:
- Not a huge con, but one concern is that since some NMs are of fixed size and some NMs are flexible, the admin might want to be able to control the NM placement wisely. We already have an issue raised to track this, perhaps for a different context, but it's certainly applicable here as well. The issue is: https://github.com/mesos/myriad/issues/105

I tried Option #1 during the last week and abandoned it for its complexity. I started implementing #2 (point 3 above is still pending).

I'm happy to include any feedback from folks before sending out the code for review.

Thanks,
Santosh
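For reference, the admission behaviour described at the top of this mail boils down to a comparison like the one below. This is only an illustration of the idea, not YARN's actual check (the real logic added under YARN-2604 lives in the RM and is more involved): with an all-zero-profile cluster the AM never fits, while keeping at least one non-zero NM around (Option #2, point 3) lets it through.

```java
// Hypothetical illustration of the admission check discussed in this thread;
// sizes are in MB and vcores.
class AdmissionCheckSketch {
  static boolean wouldRejectAm(int amMemMB, int amVcores,
                               int clusterMemMB, int clusterVcores) {
    // If every NM registered with (0G,0CPU), cluster capacity is zero and any
    // application master request exceeds it, so the app is rejected at submission.
    return amMemMB > clusterMemMB || amVcores > clusterVcores;
  }

  public static void main(String[] args) {
    // All zero-profile NMs: a (1536 MB, 1 vcore) AM is rejected.
    System.out.println(wouldRejectAm(1536, 1, 0, 0));     // true
    // One "low" (2G,1CPU) NM kept at startup: the same AM fits.
    System.out.println(wouldRejectAm(1536, 1, 2048, 1));  // false
  }
}
```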
