I have created a JIRA for adding a list of blacklisted nodes: https://issues.apache.org/jira/browse/APEXCORE-584

On Wed, Nov 30, 2016 at 11:06 PM Sanjay Pujare <san...@datatorrent.com> wrote:

> Yes, Ram explained to me that in practice this would be a useful feature
> for Apex devops who typically have no control over Hadoop/Yarn cluster.
>
> On 11/30/16, 9:22 PM, "Mohit Jotwani" <mo...@datatorrent.com> wrote:
>
> This is a practical scenario where developers would be required to exclude
> certain nodes as they might be required for some mission critical
> applications. It would be good to have this feature.
>
> I understand that Stram should not get into resourcing and still rely on
> Yarn, however, as the App Master it should have the right to reject the
> nodes offered by Yarn and request for other resources.
>
> Regards,
> Mohit
>
> On Thu, Dec 1, 2016 at 2:34 AM, Sandesh Hegde <sand...@datatorrent.com> wrote:
>
> > Apex has automatic blacklisting of the troublesome nodes, please take a
> > look at the following attributes,
> >
> > MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST
> > https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/Context.DAGContext.html#MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST
> >
> > BLACKLISTED_NODE_REMOVAL_TIME_MILLIS
> >
> > Thanks
> >
> > On Wed, Nov 30, 2016 at 12:56 PM Munagala Ramanath <r...@datatorrent.com> wrote:
> >
> > Not sure if this is what Milind had in mind but we often run into
> > situations where the dev group working with Apex has no control over
> > cluster configuration -- to make any changes to the cluster they need to
> > go through an elaborate process that can take many days.
> >
> > Meanwhile, if they notice that a particular node is consistently causing
> > problems for their app, having a simple way to exclude it would be very
> > helpful since it gives them a way to bypass communication and process
> > issues within their own organization.
> >
> > Ram
> >
> > On Wed, Nov 30, 2016 at 10:58 AM, Sanjay Pujare <san...@datatorrent.com> wrote:
> >
> > > To me both use cases appear to be generic resource management use cases.
> > > For example, a randomly rebooting node is not good for any purpose, esp.
> > > long running apps, so it is a bit of a stretch to imagine that these nodes
> > > will be acceptable for some batch jobs in Yarn. So such a node should be
> > > marked “Bad” or Unavailable in Yarn itself.
> > >
> > > The second use case is also a typical anti-affinity use case, which ideally
> > > should be implemented in Yarn – Milind’s example can also apply to non-Apex
> > > batch jobs. In any case it looks like Yarn still doesn’t have it
> > > (https://issues.apache.org/jira/browse/YARN-1042), so if Apex needs it we
> > > will need to do it ourselves.
> > >
> > > On 11/30/16, 10:39 AM, "Munagala Ramanath" <r...@datatorrent.com> wrote:
> > >
> > > But then, what's the solution to the 2 problem scenarios that Milind
> > > describes?
> > >
> > > Ram
> > >
> > > On Wed, Nov 30, 2016 at 10:34 AM, Sanjay Pujare <san...@datatorrent.com> wrote:
> > >
> > > > I think “exclude nodes” and such is really the job of the resource manager,
> > > > i.e. Yarn. So I am not sure taking over some of these tasks in Apex would
> > > > be very useful.
> > > >
> > > > I agree with Amol that apps should be node neutral. Resource management in
> > > > Yarn together with fault tolerance in Apex should minimize the need for
> > > > this feature, although I am sure one can find use cases.
> > > >
> > > > On 11/29/16, 10:41 PM, "Amol Kekre" <a...@datatorrent.com> wrote:
> > > >
> > > > We do have this feature in Yarn, but that applies to all applications. I am
> > > > not sure if Yarn has anti-affinity. This feature may be used, but in
> > > > general there is a danger in an application taking over resource
> > > > allocation.
> > > > Another quirk is that big data apps should ideally be node-neutral. This is
> > > > a good idea, if we are able to carve out something where the need is app
> > > > specific.
> > > >
> > > > Thks
> > > > Amol
> > > >
> > > > On Tue, Nov 29, 2016 at 10:00 PM, Milind Barve <mili...@gmail.com> wrote:
> > > >
> > > > > We have seen the 2 cases mentioned below, where it would have been nice if
> > > > > Apex allowed us to exclude a node from the cluster for an application.
> > > > >
> > > > > 1. A node in the cluster had gone bad (was randomly rebooting) and so an
> > > > > Apex app should not use it - other apps can use it as they were batch jobs.
> > > > > 2. A node is being used for a mission critical app (could be an Apex app
> > > > > itself), but another Apex app which is mission critical should not be using
> > > > > resources on that node.
> > > > >
> > > > > Can we have a way in which Stram and YARN can coordinate with each
> > > > > other to not use a set of nodes for the application? It can be done in 2
> > > > > ways -
> > > > >
> > > > > 1. Have a list of "exclude" nodes with Stram - when YARN allocates resources
> > > > > on any of these, STRAM rejects them and gets resources allocated again from
> > > > > YARN.
> > > > > 2. Have a list of nodes that can be used for an app - this can be a part of
> > > > > config. However, I don't think this would be the right way to do it, as we
> > > > > will need support from YARN as well. Further, this might be difficult to
> > > > > change at runtime if need be.
> > > > >
> > > > > Any thoughts?
> > > > >
> > > > > --
> > > > > ~Milind bee at gee mail dot com
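
For readers following the thread, option 1 in Milind's original message (an "exclude" list held by Stram) can be illustrated with a small, self-contained Java sketch. This is not Apex or YARN code and not the design tracked in APEXCORE-584: `ExcludeListFilter`, `Offer`, and `Decision` are hypothetical names, and a container offer is reduced to an ID and a host name. A real application master would release each rejected container and ask the resource manager for a replacement; that step is only described in comments here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names come from Apex or YARN.
final class ExcludeListFilter {

  /** A container offer, reduced to the two fields this sketch needs. */
  static final class Offer {
    final String containerId;
    final String host;

    Offer(String containerId, String host) {
      this.containerId = containerId;
      this.host = host;
    }
  }

  /** Outcome of one allocation round: offers to keep and offers to hand back. */
  static final class Decision {
    final List<Offer> accepted = new ArrayList<>();
    final List<Offer> rejected = new ArrayList<>();
  }

  private final Set<String> excludedHosts;

  ExcludeListFilter(Set<String> excludedHosts) {
    // Defensive copy so later changes by the caller do not affect the filter.
    this.excludedHosts = new HashSet<>(excludedHosts);
  }

  /**
   * Splits offers by host. In a real app master, every rejected offer would
   * be released and a replacement requested from the resource manager.
   */
  Decision filter(List<Offer> offers) {
    Decision decision = new Decision();
    for (Offer offer : offers) {
      if (excludedHosts.contains(offer.host)) {
        decision.rejected.add(offer);
      } else {
        decision.accepted.add(offer);
      }
    }
    return decision;
  }

  public static void main(String[] args) {
    ExcludeListFilter filter =
        new ExcludeListFilter(new HashSet<>(Arrays.asList("node3.example.com")));
    Decision decision = filter.filter(Arrays.asList(
        new Offer("container_01", "node1.example.com"),
        new Offer("container_02", "node3.example.com")));
    for (Offer o : decision.accepted) {
      System.out.println("accept " + o.containerId + " on " + o.host);
    }
    for (Offer o : decision.rejected) {
      System.out.println("reject " + o.containerId + " on " + o.host);
    }
  }
}
```

Rejecting after allocation costs an extra round trip per bad offer. As far as I know, YARN's `AMRMClient` also has an `updateBlacklist(additions, removals)` call that asks the resource manager not to offer the listed nodes to this application at all, which may be a better fit where it is available.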