> The only options I can imagine are to a) use fixed-size NMs that cannot grow, alongside the elastic zero-profile NMs; or b) disable admission control in the RM so this isn't a problem. I'd vote for b), but depending on how long that takes, you may want to implement a) in the meantime.
Agreed. (a) is implemented in PR: https://github.com/mesos/myriad/pull/116

Santosh

On Tue, Jul 14, 2015 at 4:10 PM, Adam Bordelon <[email protected]> wrote:

Ok, this makes sense now. With zero profile, tracking these will be much easier, since each YARN container would have a placeholder task of the same size.

But with an initial/minimum capacity, you'd need to do extra bookkeeping to know how many resources belong to each task, what the initial NM capacity was, and its current size. Then, when a task completes, you'll see how many resources it was using, and determine whether some/all of those resources should be freed and given back to Mesos, or whether they just go back to idle minimum capacity for the NM. However, since Mesos doesn't (yet) support resizeTask, you'd have to kill the placeholder task that best matches the size of the completed task (even though that task may have originally launched in the minimum capacity). Tricky indeed.

So, I like the idea of the zero-profile NM in that case, but it still doesn't solve the problem of admission control of AMs/containers that are bigger than the current cluster capacity. If we keep some minimum-capacity NMs that can resize with placeholder tasks, you run into the same problem as above. The only options I can imagine are to a) use fixed-size NMs that cannot grow, alongside the elastic zero-profile NMs; or b) disable admission control in the RM so this isn't a problem. I'd vote for b), but depending on how long that takes, you may want to implement a) in the meantime.

On Tue, Jul 14, 2015 at 2:02 PM, Santosh Marella <[email protected]> wrote:

> Why do you worry that Myriad needs to figure out which container is associated with which offer/profile

The framework needs to figure out the size of the "placeholder task" that it needs to launch corresponding to a YARN container. The size of the "placeholder" is not always 1:1 with the size of the YARN container (the zero profile is trying to make it 1:1).

Let's take an example flow:

1. Let's say the NM's initial capacity was (4G,4CPU) and YARN wants to launch a container of size (2G,2CPU). No problem. The NM already has capacity to accommodate it. There is no need to wait for more offers or to launch placeholder mesos tasks. Just launch the YARN container via the NM's heartbeat.

2. Let's say the NM's initial capacity was (4G,4CPU) and (2G,2CPU) is in use by a previously launched YARN container. If the RM's next request requires a container of (3G,3CPU), that container doesn't get allocated to this NM, since the NM doesn't have enough capacity. No problem here either.

3. Let's say mesos offers (1G,1CPU) at this point. The NM has (2G,2CPU) available and Myriad allows adding (1G,1CPU) to it. Thus, the RM believes the NM now has (3G,3CPU) and allocates a (3G,3CPU) container on the NM. At this point, Myriad needs to use the launchTasks() API to launch a "placeholder" task with (1G,1CPU).

Thanks,
Santosh
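To make step 3 concrete, here is a rough sketch of what launching such a placeholder via the Mesos Java API could look like. This is not Myriad's actual code: the class name, the scalar() helper and the executor wiring are hypothetical; only the launchTasks() call is the API referred to above.

```java
import java.util.Collections;

import org.apache.mesos.Protos.*;
import org.apache.mesos.SchedulerDriver;

class PlaceholderLauncherSketch {

  // Hypothetical helper: builds a scalar Mesos resource such as "cpus" or "mem".
  private static Resource scalar(String name, double value) {
    return Resource.newBuilder()
        .setName(name)
        .setType(Value.Type.SCALAR)
        .setScalar(Value.Scalar.newBuilder().setValue(value))
        .build();
  }

  // Launch a no-op "placeholder" task sized to the slice of the offer that was
  // folded into the NM's capacity, so Mesos keeps that (1G,1CPU) allocated to
  // the framework while the corresponding YARN container runs.
  static void launchPlaceholder(SchedulerDriver driver, Offer offer,
                                double cpus, double memMB, ExecutorInfo nmExecutor) {
    TaskInfo placeholder = TaskInfo.newBuilder()
        .setName("yarn-container-placeholder")
        .setTaskId(TaskID.newBuilder()
            .setValue("placeholder_" + offer.getId().getValue()))
        .setSlaveId(offer.getSlaveId())
        .addResources(scalar("cpus", cpus))   // e.g. 1 CPU from the new offer
        .addResources(scalar("mem", memMB))   // e.g. 1024 MB from the new offer
        .setExecutor(nmExecutor)              // runs under the NM's executor
        .build();
    driver.launchTasks(Collections.singletonList(offer.getId()),
                       Collections.singletonList(placeholder));
  }
}
```

When the corresponding YARN container finishes, the framework would kill this placeholder task so the (1G,1CPU) flows back to Mesos.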
On Tue, Jul 14, 2015 at 1:12 AM, Adam Bordelon <[email protected]> wrote:

Ah, I'm understanding better now. Leaving the 2G,1CPU unused is certainly flawed and undesirable. I'm unopposed to the idea of an initial/minimum profile size that grows and shrinks but never goes below its initial/minimum capacity.

As for your concern, a recently completed task will give up its unnamed resources like cpu and memory without knowing/caring where they go. There is no distinction between the cpu from one task and the cpu from another. First priority goes to maintaining the minimum capacity. Anything beyond that can be offered back to Mesos (perhaps after some timeout to promote reuse). The only concern might be with named resources like ports or persistent volumes. Why do you worry that Myriad needs to figure out which container is associated with which offer/profile? Is it not already tracking the YARN containers? How else does it know when to release resources?

That said, a zero profile also makes sense, as does mixing profiles of different sizes (including zero/non-zero) within a cluster. You could restrict dynamic NM resizing to zero-profile NMs for starters, but I'd imagine we'd want them all to be resizable in the future.

On Fri, Jul 10, 2015 at 6:47 PM, Santosh Marella <[email protected]> wrote:

> a) Give the executor at least a minimal 0.01cpu, 1MB RAM

Myriad does this already. The problem is not with respect to the executor's capacity.

> b) ... I don't think I understand your "zero profile" use case

Let's take an example. Say the "low" profile corresponds to (2G,1CPU). When Myriad wants to launch a NM with the "low" profile, it waits for a mesos offer that can hold an executor + a java process for the NM + a (2G,1CPU) capacity that the NM can advertise to the RM for launching future YARN containers. With CGS, when the NM registers with the RM, the YARN scheduler believes the NM has (2G,1CPU) and hence can allocate containers worth (2G,1CPU) when apps require containers.

With FGS, the YARN scheduler believes the NM has (0G,0CPU). This is because FGS intercepts the NM's registration with the RM and sets the NM's advertised capacity to (0G,0CPU), although the NM originally started with (2G,1CPU). At this point, the YARN scheduler cannot allocate containers to this NM. Subsequently, when mesos offers resources on the same slave node, FGS increases the capacity of the NM and notifies the RM that the NM now has capacity available. For example, if (5G,4CPU) is offered to Myriad, then FGS notifies the RM that the NM now has (5G,4CPU). The RM can now allocate containers worth (5G,4CPU) on this NM. If you now count the total resources Myriad has consumed from the given slave node, we observe that Myriad never utilizes the (2G,1CPU) ["low" profile size] that was obtained at the NM's launch time. The notion of a "zero" profile tries to eliminate this wastage by allowing the NM to be launched with an advertisable capacity of (0G,0CPU) in the first place.

Why does FGS change the NM's initial capacity from (2G,1CPU) to (0G,0CPU)? That's the way it had been until now, but it need not be. FGS can choose not to reset the NM's capacity to (0G,0CPU) and instead allow the NM to grow beyond its initial capacity of (2G,1CPU) and shrink back to (2G,1CPU).
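To make the two behaviours concrete, here is a small hypothetical bookkeeping sketch (not Myriad's implementation): the capacity advertised to the RM starts at the profile size, which is zero for a "zero" profile, grows as mesos offers are folded in, and shrinks back but never below the profile floor.

```java
// Hypothetical sketch of the advertised-capacity bookkeeping discussed above;
// not Myriad's actual code.
class AdvertisedCapacity {
  private final double floorMemMB;  // profile size, e.g. 2048 MB for "low", 0 for "zero"
  private final double floorCpus;   // e.g. 1.0 for "low", 0.0 for "zero"
  private double memMB;
  private double cpus;

  AdvertisedCapacity(double profileMemMB, double profileCpus) {
    // What the RM sees right after NM registration:
    //   non-zero profile: the profile size itself;
    //   "zero" profile  : (0G,0CPU), so nothing can be scheduled yet.
    this.floorMemMB = profileMemMB;
    this.floorCpus = profileCpus;
    this.memMB = profileMemMB;
    this.cpus = profileCpus;
  }

  // A mesos offer is folded into the NM; the RM is then told the new capacity.
  void grow(double offerMemMB, double offerCpus) {
    memMB += offerMemMB;
    cpus += offerCpus;
  }

  // Resources backing finished containers are released, but the advertised
  // capacity never drops below the profile floor (the "initial/minimum" idea).
  void shrink(double releasedMemMB, double releasedCpus) {
    memMB = Math.max(floorMemMB, memMB - releasedMemMB);
    cpus = Math.max(floorCpus, cpus - releasedCpus);
  }

  double memMB() { return memMB; }
  double cpus() { return cpus; }
}
```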
I tried this approach recently, but there are other problems if we do that (mentioned under option #1 in my first email) that seemed more complex than going with a "zero" profile.

> c) ... We should still investigate pushing a disable flag into YARN.

Absolutely. It totally makes sense to turn off the admission restriction for auto-scaling YARN clusters.

FWIW, I will be sending out a PR shortly from my private issue_14 branch with the changes I made so far. Comments/suggestions are welcome!

Thanks,
Santosh

On Fri, Jul 10, 2015 at 11:44 AM, Adam Bordelon <[email protected]> wrote:

a) Give the executor at least a minimal 0.01cpu, 1MB RAM, since the executor itself will use some resources, and Mesos gets confused when the executor claims no resources. See https://issues.apache.org/jira/browse/MESOS-1807

b) I agree 100% with needing a way to enable/disable FGS vs. CGS, but I don't think I understand your "zero profile" use case. I'd recommend going with a simple enable/disable flag for the MVP, and then we can extend it later if/when necessary.

c) Interesting. Seems like a hacky workaround for the admission control problem, but I'm intrigued by its complexities and capabilities for other scenarios. We should still investigate pushing a disable flag into YARN.

> YARN-2604, YARN-3079. YARN-2604 seems to have been added because of a genuine problem where an app's AM container size exceeds the size of the largest NM node in the cluster.

This still needs a way to be disabled, because an auto-scaling Hadoop cluster wouldn't worry about insufficient capacity. It would just make more.
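As a rough illustration of point (a), an executor spec can carry a token 0.01 CPU and 1 MB of its own, so Mesos never sees a resource-less executor. The class and method names below are hypothetical, and how the NM/executor is actually started (the CommandInfo) is left to the caller.

```java
import org.apache.mesos.Protos.*;

class NmExecutorSpecSketch {
  // Build an ExecutorInfo that claims a token 0.01 CPU and 1 MB of memory,
  // on top of whatever the NM itself advertises (see MESOS-1807 above).
  static ExecutorInfo minimalNmExecutor(String executorId, CommandInfo command) {
    return ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue(executorId))
        .setCommand(command)
        .addResources(Resource.newBuilder()
            .setName("cpus").setType(Value.Type.SCALAR)
            .setScalar(Value.Scalar.newBuilder().setValue(0.01)))
        .addResources(Resource.newBuilder()
            .setName("mem").setType(Value.Type.SCALAR)
            .setScalar(Value.Scalar.newBuilder().setValue(1.0)))  // MB
        .build();
  }
}
```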
On Fri, Jul 10, 2015 at 11:13 AM, Santosh Marella <[email protected]> wrote:

Good point. YARN seems to have added this admission control as part of YARN-2604, YARN-3079. YARN-2604 seems to have been added because of a genuine problem where an app's AM container size exceeds the size of the largest NM node in the cluster. They also have a configurable interval that controls how long the admission control should be relaxed after the RM's startup (yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms). This was added to avoid rejection of apps submitted after the RM (re)starts and before any NMs register with the RM.

One option is to use a larger value for the above configuration parameter for Myriad-based YARN clusters. However, it might be worth examining in detail the effects of doing that, since the same config param is also used in the "work preserving RM restart" feature.

Another option is to add a flag to disable admission control in the RM and push the change into YARN.

In addition to (or irrespective of) the above, I think the following problems should still be fixed in Myriad:

a. FGS shouldn't set the NM's capacity to (0G,0CPU) during registration:
This is because, if a NM is launched with a "medium" profile and FGS sets its capacity to (0G,0CPU), the RM will never schedule containers on this NM unless FGS expands the capacity with additional mesos offers. Essentially, the capacity used for launching the NM will not be utilized at all. On the other hand, not setting the capacity to (0G,0CPU) is also a problem, because once the RM allocates containers, FGS can't (easily) tell whether the containers were allocated due to the NM's initial capacity or due to additional offers received from Mesos.

b. Configuration to enable/disable FGS:
Currently, there is no configuration/semantics that controls whether Myriad uses coarse grained scaling or fine grained scaling. If you run Myriad off of the "phase1" branch, you get coarse grained scaling (CGS). If you run off of "branch_14", you get FGS. As we want branch_14 to be merged into phase1 at some point, we need to think of a way to enable/disable FGS. One option might be a configuration flag that tells whether CGS or FGS should be enabled. However, I feel both of these features are pretty useful and a co-existence of both would be ideal. Hence, introducing a new "zero profile" (or we could name it a "verticallyScalable" profile or similar) and allowing FGS to apply only to this profile gives us a way for the admin to use just FGS, just CGS, or a combination of both.

c. Specify (profiles, instances) at startup:
Currently, "flexup" is the only way to add more NMs. It would be convenient to make the number of instances of each profile configurable in the .yml file. If the admin chooses to have a few NMs with FGS and a few with CGS, it's a lot easier to specify it before starting the RM. Myriad could exploit this configuration to provide a reasonable workaround to the admission control problem: enforce at least 1 NM of non-zero size.

Thanks,
Santosh

On Fri, Jul 10, 2015 at 12:32 AM, Adam Bordelon <[email protected]> wrote:

Why not just add a flag to disable the admission control logic in RMs? This same concern came up in the Kubernetes-Mesos framework, which uses a similar "placeholder task" architecture to grow/shrink the executor's container as new tasks/pods are launched. We spoke to the K8s team, and they agreed that the admission control check is not critical to the functionality of their API server (task launch API), and it was kept behind a flag. I know we don't want to depend on forks of either project, but we can push changes into Mesos/YARN when necessary.
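Item (c) in Santosh's list above asks for (profiles, instances) to be configurable at startup. A purely hypothetical sketch of what such a section of the .yml could look like (the key names and values are invented for illustration and are not Myriad's actual schema):

```yaml
# Hypothetical sketch only; not Myriad's actual configuration schema.
profiles:
  zero:        # FGS-only NMs; capacity comes entirely from mesos offers
    cpu: 0
    mem: 0
  low:         # fixed-size NM; keeps cluster capacity non-zero for admission control
    cpu: 1
    mem: 2048
nmInstances:   # NMs launched at startup, before any "flexup"
  low: 1       # at least one non-zero NM so app submissions aren't rejected
  zero: 3
```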
On Thu, Jul 9, 2015 at 1:59 PM, Santosh Marella <[email protected]> wrote:

With hadoop-2.7, the RM rejects app submissions when the capacity required to run the app master exceeds the cluster capacity. Fine Grained Scaling (FGS) is affected by the above problem. This is because FGS sets the Node Manager's capacity to (0G,0CPU) when the NodeManager registers with the RM and expands the NM's capacity with resource offers from mesos. Thus, as each NM's capacity is set to (0G,0CPU), the "cluster capacity" stays at (0G,0CPU), causing the submitted apps to be rejected by the RM. Although FGS expands the NM's capacity with mesos offers, the probability of the cluster capacity exceeding the AM container's capacity at the instant the app is submitted is still very low.

A couple of options were evaluated to fix the above problem:

*Option #1*
- Let FGS not set the NM's capacity to (0G,0CPU) during the NM's registration with the RM. Let FGS use mesos offers to expand the NM's capacity beyond its initial capacity (this is what FGS does already). When the mesos-offered capacity is used/relinquished by Myriad, the NM's capacity is brought down to its initial capacity.

Pros:
- App submissions won't be rejected, as NMs always have a certain minimum capacity (== profile size).
- NM capacities are flexible. NMs start with some initial capacity, grow in size with mesos offers and shrink back to the initial capacity.

Cons:
- Hard to implement. The main problem is this: let's say an NM registered with the RM with an initial capacity of (3G,2CPU) and Myriad subsequently receives a new offer worth (3G,1CPU). If Myriad sets the NM's capacity to (6G,3CPU) and allows the RM to perform scheduling, then the RM can potentially allocate 3 containers of (2G,1CPU) each. Once the containers are allocated, Myriad needs to figure out which of these containers are
  a) allocated purely due to the NM's initial capacity,
  b) allocated purely due to additional mesos offers,
  c) allocated due to a combination of the NM's initial capacity and additional mesos offers.

(c) is especially complex, since Myriad has to figure out the partial resources consumed from the mesos offers and hold on to these resources as long as the YARN containers utilizing them are alive.

*Option #2*
1. Introduce the notion of a new "zero" profile for NMs. NMs launched with this profile register with the RM with (0G,0CPU). Existing profile definitions (low/medium/high) are left intact.
2. Allow FGS to be applicable only if a NM registers with (0G,0CPU) capacity. With this, all the containers allocated to a zero-profile NM are always due to resources offered by mesos.
3. Let Myriad start a configured number of NMs (default==1) with a configured profile (default==low). This helps the "cluster capacity" never be (0G,0CPU) and prevents rejection of apps.

Pros:
- App submissions won't be rejected, as the "cluster capacity" is never (0G,0CPU).
- The YARN cluster always has a certain minimum capacity (== sum of capacities of NMs launched with non-zero profiles).
- YARN cluster capacity remains flexible, since the non-zero NMs grow and shrink in size.

Cons:
- Not a huge con, but one concern is that since some NMs are of fixed size and some NMs are flexible, the admin might want to be able to control the NM placement wisely. We already have an issue raised to track this, perhaps for a different context, but it's certainly applicable here as well. The issue is: https://github.com/mesos/myriad/issues/105

I tried Option #1 during the last week and abandoned it for its complexity. I started implementing #2 (point 3 above is still pending).

I'm happy to include any feedback from folks before sending out the code for review.

Thanks,
Santosh
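For reference, the admission behaviour described at the top of this mail boils down to a comparison like the one below. This is only an illustration of the idea, not YARN's actual check (the real logic added under YARN-2604 lives in the RM and is more involved): with an all-zero-profile cluster the AM never fits, while keeping at least one non-zero NM around (Option #2, point 3) lets it through.

```java
// Hypothetical illustration of the admission check discussed in this thread;
// sizes are in MB and vcores.
class AdmissionCheckSketch {
  static boolean wouldRejectAm(int amMemMB, int amVcores,
                               int clusterMemMB, int clusterVcores) {
    // If every NM registered with (0G,0CPU), cluster capacity is zero and any
    // application master request exceeds it, so the app is rejected at submission.
    return amMemMB > clusterMemMB || amVcores > clusterVcores;
  }

  public static void main(String[] args) {
    // All zero-profile NMs: a (1536 MB, 1 vcore) AM is rejected.
    System.out.println(wouldRejectAm(1536, 1, 0, 0));     // true
    // One "low" (2G,1CPU) NM kept at startup: the same AM fits.
    System.out.println(wouldRejectAm(1536, 1, 2048, 1));  // false
  }
}
```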
