Good point. YARN seems to have added this admission control as part of YARN-2604 and YARN-3079. YARN-2604 seems to have been added because of a genuine problem: an app's AM container size can exceed the size of the largest NM node in the cluster. They also have a configurable interval that controls how long admission control is relaxed after the RM's startup (yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms). This was added to avoid rejecting apps that are submitted after the RM (re)starts but before any NMs register with the RM.
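For reference, that interval is set in yarn-site.xml like any other RM property; a minimal snippet (the 10-minute value here is purely illustrative, not a recommendation):

    <property>
      <name>yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms</name>
      <!-- e.g., 10 minutes; value is illustrative, tune per cluster -->
      <value>600000</value>
    </property>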
One option is to use a larger value for the above configuration parameter in Myriad-based YARN clusters. However, it's worth examining the effects of doing that in detail, since the same config param is also used by the "work preserving RM restart" feature. Another option is to add a flag to disable admission control in the RM and push that change into YARN.

In addition to (or irrespective of) the above, I think the following problems should still be fixed in Myriad:

a. FGS shouldn't set an NM's capacity to (0G,0CPU) during registration: If an NM is launched with a "medium" profile and FGS sets its capacity to (0G,0CPU), the RM will never schedule containers on that NM unless FGS expands the capacity with additional Mesos offers. Essentially, the capacity used for launching the NM will not be utilized at all. On the other hand, not setting the capacity to (0G,0CPU) is also a problem, because once the RM allocates containers, FGS can't (easily) tell whether the containers were allocated due to the NM's initial capacity or due to additional offers received from Mesos.

b. Configuration to enable/disable FGS: Currently, there is no configuration/semantics that controls whether Myriad uses coarse grained scaling or fine grained scaling. If you run Myriad off of the "phase1" branch, you get coarse grained scaling (CGS). If you run off of "branch_14", you get FGS. As we want branch_14 to be merged into phase1 at some point, we need to think of a way to enable/disable FGS. One option might be to add a configuration that says whether CGS or FGS should be enabled. However, I feel both of these features are pretty useful, and a co-existence of both would be ideal. Hence, introducing a new "zero" profile (or we could name it a "verticallyScalable" profile or similar) and making FGS applicable only to that profile gives the admin a way to use just FGS, just CGS, or a combination of both. (See the sketch below this list.)

c. Specify (profiles, instances) at startup: Currently, "flexup" is the only way to add more NMs. It would be convenient to make the number of instances of each profile configurable in the .yml file. If an admin chooses to have a few NMs with FGS and a few with CGS, it's a lot easier to specify that before starting the RM. Myriad could also exploit this configuration to provide a reasonable workaround for the admission control problem: enforce at least 1 NM of non-zero size. (A possible config shape is sketched below.)
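To make the gating in (b) concrete, here's a rough sketch; the class and method names are made up for illustration and don't exist in Myriad today:

    import org.apache.hadoop.yarn.api.records.Resource;

    // Hypothetical helper, not existing Myriad code.
    public final class FgsPolicy {
      private FgsPolicy() {}

      // FGS applies only to NMs that registered with zero capacity,
      // i.e., NMs launched with the proposed "zero" profile. Every
      // container on such a node is then attributable purely to
      // Mesos offers, which sidesteps the attribution problem in (a).
      public static boolean isFineGrainedScalable(Resource registeredCapacity) {
        return registeredCapacity.getMemory() == 0
            && registeredCapacity.getVirtualCores() == 0;
      }
    }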
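And for (c), the .yml section could look something like the following (key and profile names are hypothetical, just to show the shape):

    nmInstances:   # NMs to launch at RM startup, before any flexup
      zero: 3      # FGS-managed; register with (0G,0CPU)
      low: 1       # >= 1 non-zero NM works around the admission control problem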
Thanks,
Santosh

On Fri, Jul 10, 2015 at 12:32 AM, Adam Bordelon <[email protected]> wrote:

> Why not just add a flag to disable the admission control logic in RMs? This
> same concern came up in the Kubernetes-Mesos framework, which uses a
> similar "placeholder task" architecture to grow/shrink the executor's
> container as new tasks/pods are launched. We spoke to the K8s team, and
> they agreed that the admission control check is not critical to the
> functionality of their API server (task launch API), and it was kept behind
> a flag.
> I know we don't want to depend on forks of either project, but we can push
> changes into Mesos/YARN when necessary.
>
> On Thu, Jul 9, 2015 at 1:59 PM, Santosh Marella <[email protected]>
> wrote:
>
> > With hadoop-2.7, RM rejects app submissions when the capacity required to
> > run the app master exceeds the cluster capacity. Fine Grained Scaling
> > (FGS) is affected by the above problem. This is because FGS sets the Node
> > Manager's capacity to (0G,0CPU) when the NodeManager registers with RM
> > and expands the NM's capacity with resource offers from mesos. Thus, as
> > each NM's capacity is set to (0G,0CPU), the "cluster capacity" stays at
> > (0G,0CPU), causing the submitted apps to be rejected by RM. Although FGS
> > expands the NM's capacity with mesos offers, the probability of the
> > cluster capacity exceeding the AM container's capacity at the instant the
> > app is submitted is still very low.
> >
> > A couple of options were evaluated to fix the above problem:
> >
> > *Option #1*
> > - Let FGS not set the NM's capacity to (0G,0CPU) during the NM's
> > registration with RM. Let FGS use mesos offers to expand the NM's
> > capacity beyond its initial capacity (this is what FGS does already).
> > When the mesos offered capacity is used/relinquished by Myriad, the NM's
> > capacity is brought down to its initial capacity.
> >
> > Pros:
> > - App submissions won't be rejected, as NMs always have a certain minimum
> > capacity (== profile size).
> > - NM capacities are flexible. NMs start with some initial capacity, grow
> > in size with mesos offers, and shrink back to the initial capacity.
> >
> > Cons:
> > - Hard to implement. The main problem is this:
> > Let's say an NM registered with RM with an initial capacity of (3G,2CPU)
> > and Myriad subsequently receives a new offer worth (3G,1CPU). If Myriad
> > sets the NM's capacity to (6G,3CPU) and allows RM to perform scheduling,
> > then RM can potentially allocate 3 containers of (2G,1CPU) each. Once the
> > containers are allocated, Myriad needs to figure out which of these
> > containers were
> > a) allocated purely due to the NM's initial capacity.
> > b) allocated purely due to additional mesos offers.
> > c) allocated due to a combination of the NM's initial capacity and
> > additional mesos offers.
> >
> > (c) is especially complex, since Myriad has to figure out the partial
> > resources consumed from the mesos offers and hold on to these resources
> > as long as the YARN containers utilizing these resources are alive.
> >
> > *Option #2*
> > 1. Introduce the notion of a new "zero" profile for NMs. NMs launched
> > with this profile register with RM with (0G,0CPU). Existing profile
> > definitions (low/medium/high) are left intact.
> > 2. Allow FGS to be applicable only if an NM registers with (0G,0CPU)
> > capacity. With this, all the containers allocated to a zero profile NM
> > are always due to resources offered by mesos.
> > 3. Let Myriad start a configured number of NMs (default==1) with a
> > configured profile (default==low). This helps the "cluster capacity"
> > never be (0G,0CPU) and prevents rejection of apps.
> >
> > Pros:
> > - App submissions won't be rejected, as the "cluster capacity" is never
> > (0G,0CPU).
> > - The YARN cluster always has a certain minimum capacity (== sum of the
> > capacities of NMs launched with non-zero profiles).
> > - YARN cluster capacity remains flexible, since the non-zero NMs grow and
> > shrink in size.
> >
> > Cons:
> > - Not a huge con, but one concern is that since some NMs are of fixed
> > size and some NMs are flexible, an admin might want to be able to control
> > the NM placement wisely. We already have an issue raised to track this,
> > perhaps for a different context, but it's certainly applicable here as
> > well. The issue is: https://github.com/mesos/myriad/issues/105
> >
> > I tried Option #1 during the last week and abandoned it for its
> > complexity. I started implementing #2 (point 3 above is still pending).
> >
> > I'm happy to include any feedback from folks before sending out the code
> > for review.
> >
> > Thanks,
> > Santosh
