Ok, this makes sense now. With a zero profile, tracking these will be much
easier, since each YARN container would have a placeholder task of the same
size.
But with an initial/minimum capacity, you'd need to do extra bookkeeping to
know how many resources belong to each task, what the initial NM capacity
was, and its current size. Then, when a task completes, you'll see how many
resources it was using, and determine whether some/all of those resources
should be freed and given back to Mesos, or whether they just go back to
the NM's idle minimum capacity. However, since Mesos doesn't (yet)
support resizeTask, you'd have to kill the placeholder task that best
matches the size of the completed task (even though that task may have
originally been launched within the minimum capacity). Tricky indeed (rough sketch below).
So, I like the idea of the zero-profile NM in that case, but it still
doesn't solve the problem of admission control of AMs/containers that are
bigger than the current cluster capacity. If we keep some minimum capacity
NMs that can resize with placeholder tasks, you run into the same problem
as above. The only options I can imagine are to a) use fixed-size NMs that
cannot grow, alongside the elastic zero-profile NMs; or b) disable
admission control in the RM so this isn't a problem. I'd vote for b), but
depending on how long that takes, you may want to implement a) in the
meantime.
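
To make the bookkeeping above concrete, here is a rough sketch of the "kill the
best-matching placeholder" step. All the names (PlaceholderTask, releaseFor) are
made up for illustration; only killTask() is the actual Mesos API call.

  import java.util.Collection;
  import org.apache.mesos.Protos.TaskID;
  import org.apache.mesos.SchedulerDriver;

  final class PlaceholderBookkeeper {
    // Hypothetical record of a placeholder task launched for a YARN container.
    static final class PlaceholderTask {
      final TaskID id;
      final double cpus;
      final double memMB;
      PlaceholderTask(TaskID id, double cpus, double memMB) {
        this.id = id; this.cpus = cpus; this.memMB = memMB;
      }
    }

    // When a YARN container of (cpus, memMB) completes, kill the live placeholder
    // whose size best matches the freed resources; if none is large enough, the
    // resources simply fall back into the NM's idle minimum capacity.
    void releaseFor(double cpus, double memMB,
                    Collection<PlaceholderTask> livePlaceholders,
                    SchedulerDriver driver) {
      PlaceholderTask best = null;
      double bestSlack = Double.MAX_VALUE;
      for (PlaceholderTask p : livePlaceholders) {
        if (p.cpus >= cpus && p.memMB >= memMB) {
          double slack = (p.cpus - cpus) + (p.memMB - memMB) / 1024.0;
          if (slack < bestSlack) { bestSlack = slack; best = p; }
        }
      }
      if (best != null) {
        driver.killTask(best.id);   // the freed resources go back to Mesos with the task
      }
    }
  }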

On Tue, Jul 14, 2015 at 2:02 PM, Santosh Marella <smare...@maprtech.com>
wrote:

> >Why do you worry that Myriad needs to figure out which container is
> associated with which offer/profile
>
> The framework needs to figure out the size of the "placeholder task" that
> it needs to launch corresponding
> to a YARN container. The size of the "placeholder" is not always 1:1 with
> the size of the YARN
> container (zero profile is trying to make it 1:1).
>
> Let's take an example flow:
>
> 1. Let's say the NM's initial capacity was (4G,4CPU) and YARN wants to
> launch a container with size (2G,2CPU). No problem. The NM already has
> capacity
> to accommodate it. No need to wait for more offers or to launch placeholder
> mesos tasks.
> Just launch the YARN container via NM's HB.
>
> 2. Let's say the NM's initial capacity was (4G,4CPU) and (2G,2CPU) is under
> use due to a previously launched YARN container. If the RM's next request
> requires a container with (3G,3CPU), that container doesn't get allocated
> to
> this NM, since the NM doesn't have enough capacity. No problem here either.
>
> 3. Let's say mesos offers (1G,1CPU) at this point. The NM has (2G,2CPU)
> available and Myriad allows adding (1G,1CPU) to it. Thus, RM believes
> NM now has (3G,3CPU) and allocates a (3G,3CPU) container on the NM.
> At this point, Myriad needs to use the launchTasks() API to
> launch a "placeholder" task with (1G,1CPU).
>
> Thanks,
> Santosh
>
> On Tue, Jul 14, 2015 at 1:12 AM, Adam Bordelon <a...@mesosphere.io> wrote:
>
> > Ah, I'm understanding better now. Leaving the 2G,1CPU unused is certainly
> > flawed and undesirable.
> > I'm unopposed to the idea of an initial/minimum profile size that grows
> and
> > shrinks but never goes below its initial/minimum capacity. As for your
> > concern, a recently completed task will give up its unnamed resources
> like
> > cpu and memory, without knowing/caring where they go. There is no
> > distinction between the cpu from one task and the cpu from another. First
> > priority goes to maintaining the minimum capacity. Anything beyond that
> can
> > be offered back to Mesos (perhaps after some timeout for promoting
> reuse).
> > The only concern might be with named resources like ports or persistent
> > volumes. Why do you worry that Myriad needs to figure out which container
> > is associated with which offer/profile? Is it not already tracking the
> YARN
> > containers? How else does it know when to release resources?
> > That said, a zero profile also makes sense, as does mixing profiles of
> > different sizes (including zero/non-zero) within a cluster. You could
> > restrict dynamic NM resizing to zero-profile NMs for starters, but I'd
> > imagine we'd want them all to be resizable in the future.
> >
> > On Fri, Jul 10, 2015 at 6:47 PM, Santosh Marella <smare...@maprtech.com>
> > wrote:
> >
> > > > a) Give the executor at least a minimal 0.01cpu, 1MB RAM
> > >
> > > Myriad does this already. The problem is not with respect to the executor's
> > > capacity.
> > >
> > > > b) ... I don't think I understand your "zero profile" use case
> > >
> > > Let's take an example. Let's say the "low" profile corresponds to
> > > (2G,1CPU). When
> > > Myriad wants to launch an NM with the "low" profile, it waits for a mesos
> > offer
> > > that can
> > > hold an executor + a java process for NM + a (2G,1CPU) capacity that NM
> > > can advertise to RM for launching future YARN containers. With CGS,
> > > when NM registers with RM, YARN scheduler believes the NM has (2G,1CPU)
> > > and hence can allocate containers worth (2G,1CPU) when apps require
> > > containers.
> > >
> > > With FGS, the YARN scheduler believes the NM has (0G,0CPU). This is because
> FGS
> > > intercepts NM's registration with RM and sets NM's advertised capacity
> to
> > > (0G,0CPU),
> > > although the NM originally started with (2G,1CPU). At this point, YARN
> > > scheduler
> > > cannot allocate containers to this NM. Subsequently, when mesos offers
> > > resources
> > > on the same slave node, FGS increases the capacity of the NM and
> notifies
> > > RM
> > > that the NM now has capacity available. For example, if (5G,4CPU) are offered
> to
> > > Myriad,
> > > then FGS notifies RM that the NM now has (5G,4CPU). RM can now allocate
> > > containers worth (5G,4CPU) for this NM. If we now count the total
> > > resources Myriad has consumed from the given slave node, we observe
> that
> > > Myriad
> > > never utilizes the (2G,1CPU) ["low" profile size] that was obtained at
> > NM's
> > > launch time.
> > > The notion of a "zero" profile tries to eliminate this wastage by
> > allowing
> > > NM to
> > > be launched with an advertisable capacity of (0G,0CPU) in the first
> > place.
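> > >
> > > (Illustration only: at registration time, FGS effectively does something like
> > > the sketch below; the class/method names are made up and not the real code
> > > path.)
> > >
> > >   import org.apache.hadoop.yarn.api.records.Resource;
> > >
> > >   final class CapacityOnRegistration {
> > >     // What the NM advertises to the RM when it registers. With FGS the profile
> > >     // capacity (e.g. (2G,1CPU) for "low") is hidden and mesos offers grow it
> > >     // later; otherwise the profile capacity is advertised as-is (CGS behavior).
> > >     static Resource advertisedCapacity(Resource profileCapacity,
> > >                                        boolean fineGrainedScaling) {
> > >       return fineGrainedScaling ? Resource.newInstance(0, 0) : profileCapacity;
> > >     }
> > >   }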
> > >
> > > Why does FGS change NM's initial capacity from (2G,1CPU) to (0G,0CPU)?
> > > That's the way it had been until now, but it need not be. FGS can
> choose
> > to
> > > not reset
> > > NM's capacity to (0G,0CPU) and instead allow the NM to grow beyond its initial
> > > capacity of
> > > (2G,1CPU) and shrink back to (2G,1CPU). I tried this approach recently,
> > but
> > > there
> > > are other problems with it (mentioned under option #1 in my first
> > > email) that
> > > seem more complex than going with a "zero" profile.
> > >
> > > > c)... We should still investigate pushing a disable flag into YARN.
> > > Absolutely. It totally makes sense to turn off admission restriction
> > > for auto-scaling YARN clusters.
> > >
> > > FWIW, I will be sending out a PR shortly from my private issue_14
> branch
> > > with the changes I made so far. Comments/suggestions are welcome!
> > >
> > > Thanks,
> > > Santosh
> > >
> > > On Fri, Jul 10, 2015 at 11:44 AM, Adam Bordelon <a...@mesosphere.io>
> > > wrote:
> > >
> > > > a) Give the executor at least a minimal 0.01cpu, 1MB RAM, since the
> > > > executor itself will use some resources, and Mesos gets confused when
> > the
> > > > executor claims no resources. See
> > > > https://issues.apache.org/jira/browse/MESOS-1807
> > > > b) I agree 100% with needing a way to enable/disable FGS vs. CGS,
> but I
> > > > don't think I understand your "zero profile" use case. I'd recommend
> > > going
> > > > with a simple enable/disable flag for the MVP, and then we can extend
> > it
> > > > later if/when necessary.
> > > > c) Interesting. Seems like a hacky workaround for the admission
> control
> > > > problem, but I'm intrigued by its complexities and capabilities for
> > other
> > > > scenarios. We should still investigate pushing a disable flag into
> > YARN.
> > > > > YARN-2604, YARN-3079. YARN-2604 seems to have been added because
> of a
> > > > > genuine problem where an app's AM container size exceeds the size
> of
> > > the
> > > > > largest NM node in the cluster.
> > > > This still needs a way to be disabled, because an auto-scaling Hadoop
> > > > cluster wouldn't worry about insufficient capacity. It would just
> make
> > > > more.
> > > >
> > > > On Fri, Jul 10, 2015 at 11:13 AM, Santosh Marella <
> > smare...@maprtech.com
> > > >
> > > > wrote:
> > > >
> > > > > Good point. YARN seems to have added this admission control as part
> > of
> > > > > YARN-2604, YARN-3079. YARN-2604 seems to have been added because
> of a
> > > > > genuine problem where an app's AM container size exceeds the size
> of
> > > the
> > > > > largest NM node in the cluster. They also have a configurable
> > interval
> > > > that
> > > > > controls how long the admission control should be relaxed after
> > > RM's
> > > > > startup
> > > > (yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms).
> > > > > This was added to avoid rejection of apps submitted after RM
> > (re)starts
> > > > and
> > > > > before any NMs register with RM.
> > > > >
> > > > > One option is to have a larger value for the above configuration
> > > > parameter
> > > > > for Myriad-based YARN clusters. However, it might be worth examining
> in
> > > > detail
> > > > > the effects of doing that, since the same config param is also used
> > in
> > > > > "work preserving RM restart" feature.
> > > > >
> > > > > Another option is to add a flag to disable admission control in RM
> > and
> > > > push
> > > > > the change into YARN.
> > > > >
> > > > > In addition to (or irrespective of) the above, I think the
> following
> > > > > problems should still be fixed in Myriad:
> > > > > a. FGS shouldn't set NM's capacity to (0G,0CPU) during
> registration:
> > > > > This is because if an NM is launched with a "medium" profile and
> FGS
> > > sets
> > > > > its capacity to (0G,0CPU), RM will never schedule containers on
> this
> > > NM
> > > > > unless FGS expands the capacity with additional mesos offers.
> > > > Essentially,
> > > > > the capacity used for launching the NM will not be utilized at all.
> > > > > On the other hand, not setting the capacity to (0G,0CPU) is also a
> > > > problem
> > > > > because once RM allocates containers, FGS can't (easily) tell
> whether
> > > the
> > > > > containers were allocated due to NM's initial capacity or due to
> > > > additional
> > > > > offers received from Mesos.
> > > > >
> > > > > b. Configuration to enable/disable FGS:
> > > > > Currently, there is no configuration/semantics that control whether
> > > > Myriad
> > > > > uses coarse grained scaling or fine grained scaling. If you run
> > Myriad
> > > > off
> > > > > of "phase1" branch, you get coarse grained scaling (CGS). If you
> run
> > > off
> > > > of
> > > > > "branch_14", you get FGS. As we want branch_14 to be merged into
> > phase1
> > > > at
> > > > > some point, we need to think of a way to enable/disable FGS. One
> > option
> > > > > might be to add a configuration that tells whether CGS should be
> enabled
> > or
> > > > FGS
> > > > > should be enabled. However, I feel both of these features are
> pretty
> > > > useful
> > > > > and a co-existence of both would be ideal. Hence, introducing a new
> > > "zero
> > > > > profile" (or we should name it "verticallyScalable" profile or
> > similar)
> > > > and
> > > > > allowing FGS to be applicable only to this profile helps us define
> a
> > > way
> > > > > where the admin can use just FGS, just CGS, or a combination of both.
> > > > >
> > > > > c. Specify (profiles, instances) at startup:
> > > > > Currently, "flexup" is the only way to add more NMs. It's
> convenient
> > > make
> > > > > the number of instances of each profile configurable in .yml file.
> If
> > > > the admin
> > > > > chooses to have a few NMs with FGS and a few with CGS, it's a lot
> > > easier
> > > > to
> > > > > specify it before starting RM. Myriad could exploit this
> > configuration
> > > to
> > > > > provide a reasonable workaround to the admission control problem:
> > > enforce
> > > > > at least 1 NM of non-zero size.
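> > > > >
> > > > > A sketch of what that startup path could look like (FlexOps/flexUpCluster
> > > > > are stand-ins for the real config object and flex-up entry point, not
> > > > > actual Myriad classes):
> > > > >
> > > > >   import java.util.Map;
> > > > >
> > > > >   final class StartupFlexer {
> > > > >     // Stand-in for Myriad's flex-up entry point; the real signature may differ.
> > > > >     interface FlexOps {
> > > > >       void flexUpCluster(String profile, int instances);
> > > > >     }
> > > > >
> > > > >     // e.g. {"low": 1, "zero": 4} read from the .yml: at least one non-zero NM
> > > > >     // guarantees RM never sees a (0G,0CPU) cluster at app-submission time.
> > > > >     static void flexUpConfiguredNMs(Map<String, Integer> nmInstancesPerProfile,
> > > > >                                     FlexOps ops) {
> > > > >       for (Map.Entry<String, Integer> e : nmInstancesPerProfile.entrySet()) {
> > > > >         ops.flexUpCluster(e.getKey(), e.getValue());
> > > > >       }
> > > > >     }
> > > > >   }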
> > > > >
> > > > > Thanks,
> > > > > Santosh
> > > > >
> > > > > On Fri, Jul 10, 2015 at 12:32 AM, Adam Bordelon <
> a...@mesosphere.io>
> > > > > wrote:
> > > > >
> > > > > > Why not just add a flag to disable the admission control logic in
> > > RMs?
> > > > > This
> > > > > > same concern came up in the Kubernetes-Mesos framework, which
> uses
> > a
> > > > > > similar "placeholder task" architecture to grow/shrink the
> > executor's
> > > > > > container as new tasks/pods are launched. We spoke to the K8s
> team,
> > > and
> > > > > > they agreed that the admission control check is not critical to
> the
> > > > > > functionality of their API server (task launch API), and it was
> > kept
> > > > > behind
> > > > > > a flag.
> > > > > > I know we don't want to depend on forks of either project, but we
> > can
> > > > > push
> > > > > > changes into Mesos/YARN when necessary.
> > > > > >
> > > > > > On Thu, Jul 9, 2015 at 1:59 PM, Santosh Marella <
> > > smare...@maprtech.com
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > With hadoop-2.7, RM rejects app submissions when the capacity
> > > > required
> > > > > to
> > > > > > > run the app master exceeds the cluster capacity. Fine Grained
> > > Scaling
> > > > > > (FGS)
> > > > > > > is affected by the above problem. This is because FGS sets the
> > > Node
> > > > > > > Manager's capacity to (0G,0CPU) when the NodeManager registers
> > with
> > > > RM
> > > > > > and
> > > > > > > expands NM's capacity with resource offers from mesos. Thus, as
> > > each
> > > > > NM's
> > > > > > > capacity is set to (0G,0CPU), the "cluster capacity" stays at
> > > > (0G,0CPU)
> > > > > > > causing the submitted apps to be rejected by RM. Although FGS
> > > expands
> > > > > the
> > > > > > > NM's capacity with mesos offers, the probability of the cluster
> > > > > capacity
> > > > > > > exceeding the AM container's capacity at the instant the app is
> > > > > submitted
> > > > > > > is still very low.
> > > > > > >
> > > > > > > A couple of options were evaluated to fix the above problem:
> > > > > > >
> > > > > > > *Option #1*
> > > > > > > - Let FGS not set NM's capacity to (0G,0CPU) during NM's
> > > registration
> > > > > > with
> > > > > > > RM. Let FGS use mesos offers to expand NM's capacity beyond
> its
> > > > > initial
> > > > > > > capacity (this is what FGS does already). When the mesos
> offered
> > > > > capacity
> > > > > > > is used/relinquished by Myriad, the NM's capacity is brought
> down
> > > to
> > > > > > > its
> > > > > > > initial capacity.
> > > > > > >
> > > > > > > Pros:
> > > > > > >   - App submissions won't be rejected as NMs always have
> a certain
> > > > > minimum
> > > > > > > capacity (== profile size).
> > > > > > >   - NMs' capacities are flexible. NMs start with some initial
> > > > capacity,
> > > > > > grow
> > > > > > > in size with mesos offers and shrink back to the initial
> > capacity.
> > > > > > >
> > > > > > > Cons:
> > > > > > >   - Hard to implement. The main problem is this:
> > > > > > >    Let's say an NM registered with RM with an initial capacity
> of
> > > > > > (3G,2CPU)
> > > > > > > and Myriad subsequently receives a new offer worth (3G,1CPU).
> If
> > > > Myriad
> > > > > > > sets the NM's capacity to (6G,3CPU) and allows RM to perform
> > > > scheduling,
> > > > > > > then RM can potentially allocate 3 containers of (2G,1CPU)
> each.
> > > Once
> > > > > the
> > > > > > > containers are allocated, Myriad needs to figure out which of
> > these
> > > > > > > containers are
> > > > > > >       a) allocated purely due to NM's initial capacity.
> > > > > > >       b) allocated purely due to additional mesos offers.
> > > > > > >       c) allocated by combining NM's initial
> capacity
> > > with
> > > > > > > additional mesos offers.
> > > > > > >
> > > > > > >     (c) is especially complex, since Myriad has to figure out
> the
> > > > > partial
> > > > > > > resources consumed from the mesos offers and hold on to these
> > > > resources
> > > > > > as
> > > > > > > long as the YARN containers utilizing these resources are
> alive.
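> > > > > > >
> > > > > > > Roughly, the extra bookkeeping that (c) would force on Myriad looks
> > > > > > > like the sketch below (illustrative names, not real Myriad classes):
> > > > > > >
> > > > > > >   import java.util.Map;
> > > > > > >   import java.util.concurrent.ConcurrentHashMap;
> > > > > > >   import org.apache.hadoop.yarn.api.records.ContainerId;
> > > > > > >   import org.apache.hadoop.yarn.api.records.Resource;
> > > > > > >
> > > > > > >   final class OfferPortionTracker {
> > > > > > >     // For each allocated container, the slice that came out of mesos offers
> > > > > > >     // (the remainder came out of the NM's initial/profile capacity).
> > > > > > >     private final Map<ContainerId, Resource> offerBacked = new ConcurrentHashMap<>();
> > > > > > >
> > > > > > >     void recordAllocation(ContainerId id, Resource fromOffers) {
> > > > > > >       offerBacked.put(id, fromOffers);   // hold until the container finishes
> > > > > > >     }
> > > > > > >
> > > > > > >     Resource onContainerFinished(ContainerId id) {
> > > > > > >       // this slice can be released back to Mesos; the rest simply becomes
> > > > > > >       // idle capacity on the NM again
> > > > > > >       return offerBacked.remove(id);
> > > > > > >     }
> > > > > > >   }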
> > > > > > >
> > > > > > > *Option #2*
> > > > > > > 1. Introduce the notion of a new "zero" profile for NMs. NMs
> > > launched
> > > > > > with
> > > > > > > this profile register with RM with (0G,0CPU). Existing profile
> > > > > > definitions
> > > > > > > (low/medium/high) are left intact.
> > > > > > > 2. Allow FGS to be applicable only if an NM registers with
> > (0G,0CPU)
> > > > > > > capacity. With this, all the containers allocated to a zero
> > profile
> > > > NM
> > > > > > are
> > > > > > > always due to resources offered by mesos.
> > > > > > > 3. Let Myriad start a configured number of NMs (default==1)
> with
> > a
> > > > > > > configured profile (default==low). This ensures the "cluster
> > > > > capacity"
> > > > > > > never drops to (0G,0CPU) and prevents rejection of apps.
> > > > > > >
> > > > > > > Pros:
> > > > > > >   - App submissions won't be rejected as the "cluster capacity"
> > is
> > > > > never
> > > > > > > (0G,0CPU).
> > > > > > >   - The YARN cluster would always have a certain minimum capacity (==
> > sum
> > > > of
> > > > > > > capacities of NMs launched with non-zero profiles).
> > > > > > >   - YARN cluster capacity remains flexible, since the non-zero
> > NMs
> > > > grow
> > > > > > and
> > > > > > > shrink in size.
> > > > > > >
> > > > > > > Cons:
> > > > > > >   - Not a huge con, but one concern is that since some NMs are
> of
> > > > fixed
> > > > > > > size and some NMs are flexible, the admin might want to be able to
> > > > control
> > > > > > the
> > > > > > > NM placement wisely. We already have an issue raised to track
> > this,
> > > > > > perhaps
> > > > > > > for a different context. But it's certainly applicable here as
> > > well.
> > > > > The
> > > > > > > issue is: https://github.com/mesos/myriad/issues/105
> > > > > > >
> > > > > > > I tried Option #1 last week and abandoned it for its
> > > > > complexity. I
> > > > > > > started implementing #2 (Point 3 above is still pending).
> > > > > > >
> > > > > > > I'm happy to include any feedback from folks before sending out
> > the
> > > > > code
> > > > > > > for review.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Santosh
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
