Re: Fine Grained Scaling and Hadoop-2.7.

2015-07-14 Thread Adam Bordelon
Ah, I'm understanding better now. Leaving the 2G,1CPU unused is certainly
flawed and undesirable.
I'm unopposed to the idea of an initial/minimum profile size that grows and
shrinks but never goes below its initial/minimum capacity. As for your
concern, a recently completed task will give up its unnamed resources like
cpu and memory, without knowing/caring where they go. There is no
distinction between the cpu from one task and the cpu from another. First
priority goes to maintaining the minimum capacity. Anything beyond that can
be offered back to Mesos (perhaps after some timeout for promoting reuse).
The only concern might be with named resources like ports or persistent
volumes. Why do you worry that Myriad needs to figure out which container
is associated with which offer/profile? Is it not already tracking the YARN
containers? How else does it know when to release resources?
That said, a zero profile also makes sense, as does mixing profiles of
different sizes (including zero/non-zero) within a cluster. You could
restrict dynamic NM resizing to zero-profile NMs for starters, but I'd
imagine we'd want them all to be resizable in the future.
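
To make that concrete, here's a rough sketch of the bookkeeping I'm picturing
(made-up names, not actual Myriad code):

    // Sketch: an NM that can grow above, but never shrink below, its
    // initial/minimum profile.
    class NodeManagerCapacity {
      final double minCpu;    // initial/minimum profile, e.g. 1 CPU
      final double minMemMb;  // e.g. 2048 MB
      double curCpu;          // capacity currently advertised to the RM
      double curMemMb;

      NodeManagerCapacity(double minCpu, double minMemMb) {
        this.minCpu = minCpu;
        this.minMemMb = minMemMb;
        this.curCpu = minCpu;
        this.curMemMb = minMemMb;
      }

      // A YARN container finished and freed (cpu, memMb); cpu and memory are
      // fungible, so only totals are tracked. Returns how much can be released
      // back to Mesos (perhaps after a reuse timeout); the rest stays as idle
      // capacity so the NM never drops below its minimum.
      double[] onContainerFinished(double cpu, double memMb) {
        double releaseCpu = Math.min(cpu, Math.max(0, curCpu - minCpu));
        double releaseMem = Math.min(memMb, Math.max(0, curMemMb - minMemMb));
        curCpu -= releaseCpu;
        curMemMb -= releaseMem;
        return new double[] { releaseCpu, releaseMem };
      }
    }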

On Fri, Jul 10, 2015 at 6:47 PM, Santosh Marella smare...@maprtech.com
wrote:

  a) Give the executor at least a minimal 0.01cpu, 1MB RAM

 Myriad does this already. The problem is not with respect to executor's
 capacity.

  b) ... I don't think I understand your zero profile use case

 Let's take an example. Let's say the low profile corresponds to
 (2G,1CPU). When
 Myriad wants to launch a NM with low profile, it waits for a mesos offer
 that can
 hold an executor + a java process for NM + a (2G,1CPU) capacity that NM
 can advertise to RM for launching future YARN containers. With CGS,
 when NM registers with RM, YARN scheduler believes the NM has (2G,1CPU)
 and hence can allocate containers worth (2G,1CPU) when apps require
 containers.

 With FGS, YARN scheduler believes NM has (0G,0CPU). This is because FGS
 intercepts NM's registration with RM and sets NM's advertised capacity to
 (0G,0CPU),
 although NM has originally started with (2G,1CPU).  At this point, YARN
 scheduler
 cannot allocate containers to this NM. Subsequently, when mesos offers
 resources
 on the same slave node, FGS increases the capacity of the NM and notifies
 RM
 that NM now has capacity available. For example, if (5G,4CPU) are offered to
 Myriad,
 then FGS notifies RM that the NM now has (5G,4CPU). RM can now allocate
 containers worth (5G,4CPU) for this NM. If you now count the total
 resources Myriad has consumed from the given slave node, we observe that
 Myriad
 never utilizes the (2G,1CPU) [low profile size] that was obtained at NM's
 launch time.
 The notion of a zero profile tries to eliminate this wastage by allowing
 NM to
 be launched with an advertisable capacity of (0G,0CPU) in the first place.
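
 (Illustrating the effect only; this is not the actual interception code in
 Myriad, just what the RM ends up believing:)

     import org.apache.hadoop.yarn.api.records.Resource;

     // Sketch: the capacity the RM "sees" for a freshly registered NM.
     class AdvertisedCapacity {
       static Resource seenByRM(Resource nmStartedWith, boolean fineGrainedScaling) {
         return fineGrainedScaling
             ? Resource.newInstance(0, 0)   // FGS: RM starts from (0G,0CPU)
             : nmStartedWith;               // CGS: RM sees e.g. (2048 MB, 1 vcore)
       }
     }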

 Why does FGS change NM's initial capacity from (2G,1CPU) to (0G,0CPU)?
 That's the way it had been until now, but it need not be. FGS can choose to
 not reset
 NM's capacity to (0G,0CPU) and instead allow NM to grow beyond initial
 capacity of
 (2G,1CPU) and shrink back to (2G,1CPU). I tried this approach recently, but
 there
 are other problems if we do that (mentioned under option#1 in my first
 email) that
 seemed more complex than going with a zero profile.

  c)... We should still investigate pushing a disable flag into YARN.
 Absolutely. It totally makes sense to turn off admission control
 for auto-scaling YARN clusters.

 FWIW, I will be sending out a PR shortly from my private issue_14 branch
 with the changes I made so far. Comments/suggestions are welcome!

 Thanks,
 Santosh

 On Fri, Jul 10, 2015 at 11:44 AM, Adam Bordelon a...@mesosphere.io
 wrote:

  a) Give the executor at least a minimal 0.01cpu, 1MB RAM, since the
  executor itself will use some resources, and Mesos gets confused when the
  executor claims no resources. See
  https://issues.apache.org/jira/browse/MESOS-1807
  b) I agree 100% with needing a way to enable/disable FGS vs. CGS, but I
  don't think I understand your zero profile use case. I'd recommend
 going
  with a simple enable/disable flag for the MVP, and then we can extend it
  later if/when necessary.
  c) Interesting. Seems like a hacky workaround for the admission control
  problem, but I'm intrigued by its complexities and capabilities for other
  scenarios. We should still investigate pushing a disable flag into YARN.
   YARN-2604, YARN-3079. YARN-2604 seems to have been added because of a
   genuine problem where an app's AM container size exceeds the size of
 the
   largest NM node in the cluster.
  This still needs a way to be disabled, because an auto-scaling Hadoop
  cluster wouldn't worry about insufficient capacity. It would just make
  more.
 
  On Fri, Jul 10, 2015 at 11:13 AM, Santosh Marella smare...@maprtech.com
 
  wrote:
 
   Good point. YARN seems to have added this admission control as part of
   YARN-2604, YARN-3079. 

Re: Fine Grained Scaling and Hadoop-2.7.

2015-07-14 Thread Santosh Marella
 The only options I can imagine are to a) use fixed-size NMs that
cannot grow, alongside the elastic zero-profile NMs; or b) disable
admission control in the RM so this isn't a problem. I'd vote for b), but
depending on how long that takes, you may want to implement a) in the
meantime.

Agreed. (a) is implemented in PR: https://github.com/mesos/myriad/pull/116


Santosh

On Tue, Jul 14, 2015 at 4:10 PM, Adam Bordelon a...@mesosphere.io wrote:

 Ok, this makes sense now. With zero profile, tracking these will be much
 easier, since each YARN container would have a placeholder task of the same
 size.
 But with an initial/minimum capacity, you'd need to do extra bookkeeping to
 know how many resources belong to each task, what the initial NM capacity
 was, and its current size. Then, when a task completes, you'll see how many
 resources it was using, and determine whether some/all of those resources
 should be freed and given back to Mesos, or whether they just go back to
 idle minimum capacity for the NM. However, since Mesos doesn't (yet)
 support resizeTask, you'd have to kill the placeholder task that best
 matches the size of the completed task (even though that task may have
 originally been launched within the minimum capacity). Tricky indeed.
 So, I like the idea of the zero-profile NM in that case, but it still
 doesn't solve the problem of admission control of AMs/containers that are
 bigger than the current cluster capacity. If we keep some minimum capacity
 NMs that can resize with placeholder tasks, you run into the same problem
 as above. The only options I can imagine are to a) use fixed-size NMs that
 cannot grow, alongside the elastic zero-profile NMs; or b) disable
 admission control in the RM so this isn't a problem. I'd vote for b), but
 depending on how long that takes, you may want to implement a) in the
 meantime.

 On Tue, Jul 14, 2015 at 2:02 PM, Santosh Marella smare...@maprtech.com
 wrote:

  Why do you worry that Myriad needs to figure out which container is
  associated with which offer/profile
 
  The framework needs to figure out the size of the placeholder task that
  it needs to launch corresponding
  to a YARN container. The size of the placeholder is not always 1:1 with
  the size of the YARN
  container (zero profile is trying to make it 1:1).
 
  Let's take an example flow:
 
  1. Let's say the NM's initial capacity was (4G,4CPU) and YARN wants to
  launch a container with size (2G, 2CPU). No problem. NM already has
  capacity
  to accommodate it. No need to wait for more offers or to launch
 placeholder
  mesos tasks.
  Just launch the YARN container via NM's HB.
 
  2. Let's say the NM's initial capacity was (4G,4CPU) and (2G,2CPU) is
 under
  use due to a previously launched YARN container. If the RM's next request
  requires a container with (3G,3CPU), that container doesn't get allocated
  to
  this NM, since NM doesn't have enough capacity. No problem here either.
 
  3. Let's say mesos offers a (1G,1CPU) at this point. NM has (2G,2CPU)
  available and Myriad allows adding (1G,1CPU) to it. Thus, RM believes
  NM now has (3G,3CPU) and allocates a (3G,3CPU) container on the NM.
  At this point, Myriad needs to use the launchTasks() API to
  launch a placeholder task with (1G,1CPU).
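
  (A rough sketch of that arithmetic, with made-up names:)

      // Sketch: the placeholder only needs to cover the part of the container
      // that the NM's free capacity can't, and that part comes out of the offer.
      class PlaceholderMath {
        static double placeholderSize(double containerSize, double nmFreeSize) {
          return Math.max(0, containerSize - nmFreeSize);
        }
        // e.g. placeholderSize(3, 2) == 1  ->  launch a (1G,1CPU) placeholder task
      }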
 
  Thanks,
  Santosh
 
  On Tue, Jul 14, 2015 at 1:12 AM, Adam Bordelon a...@mesosphere.io
 wrote:
 
   Ah, I'm understanding better now. Leaving the 2G,1CPU unused is
 certainly
   flawed and undesirable.
   I'm unopposed to the idea of an initial/minimum profile size that grows
  and
   shrinks but never goes below its initial/minimum capacity. As for your
   concern, a recently completed task will give up its unnamed resources
  like
   cpu and memory, without knowing/caring where they go. There is no
   distinction between the cpu from one task and the cpu from another.
 First
   priority goes to maintaining the minimum capacity. Anything beyond that
  can
   be offered back to Mesos (perhaps after some timeout for promoting
  reuse).
   The only concern might be with named resources like ports or persistent
   volumes. Why do you worry that Myriad needs to figure out which
 container
   is associated with which offer/profile? Is it not already tracking the
  YARN
   containers? How else does it know when to release resources?
   That said, a zero profile also makes sense, as does mixing profiles of
   different sizes (including zero/non-zero) within a cluster. You could
   restrict dynamic NM resizing to zero-profile NMs for starters, but I'd
   imagine we'd want them all to be resizable in the future.
  
   On Fri, Jul 10, 2015 at 6:47 PM, Santosh Marella 
 smare...@maprtech.com
   wrote:
  
 a) Give the executor at least a minimal 0.01cpu, 1MB RAM
   
Myriad does this already. The problem is not with respect to
 executor's
capacity.
   
 b) ... I don't think I understand your zero profile use case
   
Let's take an example. Let's say the low profile corresponds to

Re: Fine Grained Scaling and Hadoop-2.7.

2015-07-14 Thread Adam Bordelon
Ok, this makes sense now. With zero profile, tracking these will be much
easier, since each YARN container would have a placeholder task of the same
size.
But with an initial/minimum capacity, you'd need to do extra bookkeeping to
know how many resources belong to each task, what the initial NM capacity
was, and its current size. Then, when a task completes, you'll see how many
resources it was using, and determine whether some/all of those resources
should be freed and given back to Mesos, or whether they just go back to
idle minimum capacity for the NM. However, since Mesos doesn't (yet)
support resizeTask, you'd have to kill the placeholder task that best
matches the size of the completed task (even though that task may have
originally been launched within the minimum capacity). Tricky indeed.
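
Roughly, the matching I have in mind would look something like this
(hypothetical names, just a sketch):

    import java.util.List;

    // Sketch: without resizeTask, free capacity by killing the placeholder
    // task whose size is closest to what the finished YARN container freed up.
    class PlaceholderMatcher {
      static class Placeholder {
        final String taskId;
        final double cpu;
        final double memMb;
        Placeholder(String taskId, double cpu, double memMb) {
          this.taskId = taskId;
          this.cpu = cpu;
          this.memMb = memMb;
        }
      }

      static Placeholder bestMatch(List<Placeholder> running,
                                   double freedCpu, double freedMemMb) {
        Placeholder best = null;
        double bestScore = Double.MAX_VALUE;
        for (Placeholder p : running) {
          // distance between the placeholder's size and the freed resources
          double score = Math.abs(p.cpu - freedCpu)
              + Math.abs(p.memMb - freedMemMb) / 1024.0;
          if (score < bestScore) {
            bestScore = score;
            best = p;
          }
        }
        return best;  // the caller would then kill this task to hand resources back
      }
    }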
So, I like the idea of the zero-profile NM in that case, but it still
doesn't solve the problem of admission control of AMs/containers that are
bigger than the current cluster capacity. If we keep some minimum capacity
NMs that can resize with placeholder tasks, you run into the same problem
as above. The only options I can imagine are to a) use fixed-size NMs that
cannot grow, alongside the elastic zero-profile NMs; or b) disable
admission control in the RM so this isn't a problem. I'd vote for b), but
depending on how long that takes, you may want to implement a) in the
meantime.

On Tue, Jul 14, 2015 at 2:02 PM, Santosh Marella smare...@maprtech.com
wrote:

 Why do you worry that Myriad needs to figure out which container is
 associated with which offer/profile

 The framework needs to figure out the size of the placeholder task that
 it needs to launch corresponding
 to a YARN container. The size of the placeholder is not always 1:1 with
 the size of the YARN
 container (zero profile is trying to make it 1:1).

 Let's take an example flow:

 1. Let's say the NM's initial capacity was (4G,4CPU) and YARN wants to
 launch a container with size (2G, 2CPU). No problem. NM already has
 capacity
 to accommodate it. No need to wait for more offers or to launch placeholder
 mesos tasks.
 Just launch the YARN container via NM's HB.

 2. Let's say the NM's initial capacity was (4G,4CPU) and (2G,2CPU) is under
 use due to a previously launched YARN container. If the RM's next request
 requires a container with (3G,3CPU), that container doesn't get allocated
 to
 this NM, since NM doesn't have enough capacity. No problem here either.

 3. Let's say mesos offers a (1G,1CPU) at this point. NM has (2G,2CPU)
 available and Myriad allows adding (1G,1CPU) to it. Thus, RM believes
 NM now has (3G,3CPU) and allocates a (3G,3CPU) container on the NM.
 At this point, Myriad needs to use the launchTasks() API to
 launch a placeholder task with (1G,1CPU).

 Thanks,
 Santosh

 On Tue, Jul 14, 2015 at 1:12 AM, Adam Bordelon a...@mesosphere.io wrote:

  Ah, I'm understanding better now. Leaving the 2G,1CPU unused is certainly
  flawed and undesirable.
  I'm unopposed to the idea of an initial/minimum profile size that grows
 and
  shrinks but never goes below its initial/minimum capacity. As for your
  concern, a recently completed task will give up its unnamed resources
 like
  cpu and memory, without knowing/caring where they go. There is no
  distinction between the cpu from one task and the cpu from another. First
  priority goes to maintaining the minimum capacity. Anything beyond that
 can
  be offered back to Mesos (perhaps after some timeout for promoting
 reuse).
  The only concern might be with named resources like ports or persistent
  volumes. Why do you worry that Myriad needs to figure out which container
  is associated with which offer/profile? Is it not already tracking the
 YARN
  containers? How else does it know when to release resources?
  That said, a zero profile also makes sense, as does mixing profiles of
  different sizes (including zero/non-zero) within a cluster. You could
  restrict dynamic NM resizing to zero-profile NMs for starters, but I'd
  imagine we'd want them all to be resizable in the future.
 
  On Fri, Jul 10, 2015 at 6:47 PM, Santosh Marella smare...@maprtech.com
  wrote:
 
a) Give the executor at least a minimal 0.01cpu, 1MB RAM
  
   Myriad does this already. The problem is not with respect to executor's
   capacity.
  
b) ... I don't think I understand your zero profile use case
  
   Let's take an example. Let's say the low profile corresponds to
   (2G,1CPU). When
   Myriad wants to launch a NM with low profile, it waits for a mesos
  offer
   that can
   hold an executor + a java process for NM + a (2G,1CPU) capacity that NM
   can advertise to RM for launching future YARN containers. With CGS,
   when NM registers with RM, YARN scheduler believes the NM has (2G,1CPU)
   and hence can allocate containers worth (2G,1CPU) when apps require
   containers.
  
   With FGS, YARN scheduler believes NM has (0G,0CPU). This is because
 FGS
   intercepts NM's registration with RM and sets NM's 

Re: Fine Grained Scaling and Hadoop-2.7.

2015-07-10 Thread Santosh Marella
 a) Give the executor at least a minimal 0.01cpu, 1MB RAM

Myriad does this already. The problem is not with respect to executor's
capacity.
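
(For reference, the executor already gets a token amount of resources; roughly
along these lines, sketched against the Mesos Java API rather than copied from
Myriad:)

    import org.apache.mesos.Protos;

    // Sketch only: give the executor a token (0.01 CPU, 1 MB) so Mesos never
    // sees a resource-less executor (cf. MESOS-1807).
    class ExecutorResources {
      static Protos.Resource scalar(String name, double value) {
        return Protos.Resource.newBuilder()
            .setName(name)
            .setType(Protos.Value.Type.SCALAR)
            .setScalar(Protos.Value.Scalar.newBuilder().setValue(value))
            .build();
      }

      static Protos.ExecutorInfo.Builder withMinimalResources(
          Protos.ExecutorInfo.Builder executor) {
        return executor
            .addResources(scalar("cpus", 0.01))
            .addResources(scalar("mem", 1));  // MB
      }
    }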

 b) ... I don't think I understand your zero profile use case

Let's take an example. Let's say the low profile corresponds to
(2G,1CPU). When
Myriad wants to launch a NM with low profile, it waits for a mesos offer
that can
hold an executor + a java process for NM + a (2G,1CPU) capacity that NM
can advertise to RM for launching future YARN containers. With CGS,
when NM registers with RM, YARN scheduler believes the NM has (2G,1CPU)
and hence can allocate containers worth (2G,1CPU) when apps require
containers.

With FGS, YARN scheduler believes NM has (0G,0CPU). This is because FGS
intercepts NM's registration with RM and sets NM's advertised capacity to
(0G,0CPU),
although NM has originally started with (2G,1CPU).  At this point, YARN
scheduler
cannot allocate containers to this NM. Subsequently, when mesos offers
resources
on the same slave node, FGS increases the capacity of the NM and notifies RM
that NM now has capacity available. For example, if (5G,4CPU) are offered to
Myriad,
then FGS notifies RM that the NM now has (5G,4CPU). RM can now allocate
containers worth (5G,4CPU) for this NM. If you now count the total
resources Myriad has consumed from the given slave node, we observe that
Myriad
never utilizes the (2G,1CPU) [low profile size] that was obtained at NM's
launch time.
The notion of a zero profile tries to eliminate this wastage by allowing
NM to
be launched with an advertisable capacity of (0G,0CPU) in the first place.
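
(To spell out the accounting, here is a sketch with simplified names, not the
actual Myriad code:)

    import org.apache.hadoop.yarn.api.records.Resource;

    // Sketch: what Myriad asks Mesos for at NM launch. Under FGS the RM sees
    // (0G,0CPU) at registration regardless of the profile, so with the low
    // profile the (2048 MB, 1 vcore) portion below is requested from Mesos but
    // never used; a zero profile never requests it in the first place.
    class NodeManagerLaunch {
      static Resource mesosRequestFor(Resource profile,
                                      Resource executorOverhead,
                                      Resource nmJvmOverhead) {
        // profile: low = (2048 MB, 1 vcore), zero = (0 MB, 0 vcores)
        return Resource.newInstance(
            profile.getMemory() + executorOverhead.getMemory()
                + nmJvmOverhead.getMemory(),
            profile.getVirtualCores() + executorOverhead.getVirtualCores()
                + nmJvmOverhead.getVirtualCores());
      }
    }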

Why does FGS change NM's initial capacity from (2G,1CPU) to (0G,0CPU)?
That's the way it had been until now, but it need not be. FGS can choose to
not reset
NM's capacity to (0G,0CPU) and instead allow NM to grow beyond initial
capacity of
(2G,1CPU) and shrink back to (2G,1CPU). I tried this approach recently, but
there
are other problems if we do that (mentioned under option#1 in my first
email) that
seemed more complex than going with a zero profile.

 c)... We should still investigate pushing a disable flag into YARN.
Absolutely. It totally makes sense to turn off admission control
for auto-scaling YARN clusters.

FWIW, I will be sending out a PR shortly from my private issue_14 branch
with the changes I made so far. Comments/suggestions are welcome!

Thanks,
Santosh

On Fri, Jul 10, 2015 at 11:44 AM, Adam Bordelon a...@mesosphere.io wrote:

 a) Give the executor at least a minimal 0.01cpu, 1MB RAM, since the
 executor itself will use some resources, and Mesos gets confused when the
 executor claims no resources. See
 https://issues.apache.org/jira/browse/MESOS-1807
 b) I agree 100% with needing a way to enable/disable FGS vs. CGS, but I
 don't think I understand your zero profile use case. I'd recommend going
 with a simple enable/disable flag for the MVP, and then we can extend it
 later if/when necessary.
 c) Interesting. Seems like a hacky workaround for the admission control
 problem, but I'm intrigued by its complexities and capabilities for other
 scenarios. We should still investigate pushing a disable flag into YARN.
  YARN-2604, YARN-3079. YARN-2604 seems to have been added because of a
  genuine problem where an app's AM container size exceeds the size of the
  largest NM node in the cluster.
 This still needs a way to be disabled, because an auto-scaling Hadoop
 cluster wouldn't worry about insufficient capacity. It would just make
 more.

 On Fri, Jul 10, 2015 at 11:13 AM, Santosh Marella smare...@maprtech.com
 wrote:

  Good point. YARN seems to have added this admission control as part of
  YARN-2604, YARN-3079. YARN-2604 seems to have been added because of a
  genuine problem where an app's AM container size exceeds the size of the
  largest NM node in the cluster. They also have a configurable interval
 that
  controls how long admission control should remain relaxed after RM's
  startup
 (yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms).
  This was added to avoid rejection of apps submitted after RM (re)starts
 and
  before any NMs register with RM.
 
  One option is to have a larger value for the above configuration
 parameter
  for Myriad-based YARN clusters. However, it might be worth examining in
 detail
  the effects of doing that, since the same config param is also used in
  work preserving RM restart feature.
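
  (Sketched programmatically for clarity; in practice it would just be a
  yarn-site.xml override of the same property:)

      import org.apache.hadoop.yarn.conf.YarnConfiguration;

      // Sketch: keep admission control relaxed for longer after RM startup.
      class RelaxedAdmissionControl {
        static YarnConfiguration configure(long waitMs) {
          YarnConfiguration conf = new YarnConfiguration();
          conf.setLong(
              "yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms",
              waitMs);  // e.g. several minutes for a Myriad-managed cluster
          return conf;
        }
      }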
 
  Another option is to add a flag to disable admission control in RM and
 push
  the change into YARN.
 
  In addition to (or irrespective of) the above, I think the following
  problems should still be fixed in Myriad:
  a. FGS shouldn't set NM's capacity to (0G,0CPU) during registration:
  This is because, if a NM is launched with a medium profile and FGS sets
  its capacity to (0G,0CPU), RM will never schedule containers on this NM
  unless FGS expands the capacity with additional mesos offers.
 Essentially,
  the capacity used for launching the NM will not be utilized at