To me both use cases appear to be generic resource management use cases. For example, a randomly rebooting node is not good for any purpose, especially long-running apps, so it is a bit of a stretch to imagine that such a node would be acceptable for some batch jobs in Yarn. Such a node should be marked “Bad” or “Unavailable” in Yarn itself.
The second use case is also a typical anti-affinity use case, which ideally should be implemented in Yarn; Milind’s example can also apply to non-Apex batch jobs. In any case it looks like Yarn still doesn’t have it (https://issues.apache.org/jira/browse/YARN-1042), so if Apex needs it we will need to do it ourselves.

On 11/30/16, 10:39 AM, "Munagala Ramanath" <r...@datatorrent.com> wrote:

But then, what's the solution to the 2 problem scenarios that Milind describes?

Ram

On Wed, Nov 30, 2016 at 10:34 AM, Sanjay Pujare <san...@datatorrent.com> wrote:

> I think “exclude nodes” and such is really the job of the resource manager,
> i.e. Yarn. So I am not sure taking over some of these tasks in Apex would
> be very useful.
>
> I agree with Amol that apps should be node neutral. Resource management in
> Yarn together with fault tolerance in Apex should minimize the need for
> this feature, although I am sure one can find use cases.
>
> On 11/29/16, 10:41 PM, "Amol Kekre" <a...@datatorrent.com> wrote:
>
> We do have this feature in Yarn, but that applies to all applications. I am
> not sure if Yarn has anti-affinity. This feature may be used, but in
> general there is danger in an application taking over resource allocation.
> Another quirk is that big data apps should ideally be node-neutral. This is
> a good idea, if we are able to carve out something where the need is app
> specific.
>
> Thks
> Amol
>
> On Tue, Nov 29, 2016 at 10:00 PM, Milind Barve <mili...@gmail.com> wrote:
>
> > We have seen the 2 cases mentioned below, where it would have been nice if
> > Apex allowed us to exclude a node from the cluster for an application.
> >
> > 1. A node in the cluster had gone bad (was randomly rebooting) and so an
> > Apex app should not use it - other apps can use it as they were batch jobs.
> > 2. A node is being used for a mission critical app (could be an Apex app
> > itself), but another Apex app which is mission critical should not be using
> > resources on that node.
> >
> > Can we have a way in which Stram and YARN can coordinate with each
> > other to not use a set of nodes for the application? It can be done in 2
> > ways -
> >
> > 1. Have a list of "exclude" nodes with Stram - when YARN allocates resources
> > on any of these, STRAM rejects them and gets resources allocated again from
> > YARN.
> > 2. Have a list of nodes that can be used for an app - this can be a part of
> > the config. However, I don't think this would be the right way to do it, as
> > we will need support from YARN as well. Further, this might be difficult to
> > change at runtime if need be.
> >
> > Any thoughts?
> >
> > --
> > ~Milind bee at gee mail dot com
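
For reference, the first option in Milind's mail (an exclude list held by Stram, with allocations on excluded nodes rejected and requested again) could look roughly like the sketch below. This is only an illustration under stated assumptions: the class name `ExcludeListSketch`, the `Decision` type, and the `filter` method are hypothetical and are not part of Apex or YARN, and hosts are represented as plain strings rather than real container objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only: none of these names exist in Apex or YARN.
public class ExcludeListSketch {

    /** Result of checking one batch of allocations against the exclude list. */
    public static final class Decision {
        /** Hosts the application would keep. */
        public final List<String> accepted = new ArrayList<>();
        /** Hosts the application would hand back and request again. */
        public final List<String> rejected = new ArrayList<>();
    }

    /**
     * Splits the allocated hosts into those that may be used and those that
     * fall on the exclude list. Order of the input list is preserved.
     */
    public static Decision filter(List<String> allocatedHosts, Set<String> excludedHosts) {
        Decision decision = new Decision();
        for (String host : allocatedHosts) {
            if (excludedHosts.contains(host)) {
                decision.rejected.add(host);
            } else {
                decision.accepted.add(host);
            }
        }
        return decision;
    }

    public static void main(String[] args) {
        // Assumed example data: node2 is the "randomly rebooting" node.
        Set<String> excluded = new HashSet<>(Arrays.asList("node2"));
        List<String> allocated = Arrays.asList("node1", "node2", "node3");

        Decision decision = filter(allocated, excluded);
        System.out.println("keep: " + decision.accepted);
        System.out.println("release and request again: " + decision.rejected);
    }
}
```

In a real application master the rejected entries would have to be released back to the resource manager and replaced with fresh requests, which is where the concern raised earlier in the thread applies: nothing stops the resource manager from offering the same node again, so the loop may need a retry limit.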