Shouldn't this already be covered by anti-affinity? Today users can specify
multiple affinity rules, and for each rule they can specify positive or
negative affinity, locality and operator selection. If an affinity rule
specifying negative affinity, node locality and all operators does not
work, then let's fix that scenario instead of creating a new option.

On Thu, Dec 1, 2016 at 11:17 AM, Sandesh Hegde <sand...@datatorrent.com>
wrote:

> I have created a JIRA for adding a list of blacklisted nodes:
> https://issues.apache.org/jira/browse/APEXCORE-584
>
> On Wed, Nov 30, 2016 at 11:06 PM Sanjay Pujare <san...@datatorrent.com>
> wrote:
>
> > Yes, Ram explained to me that in practice this would be a useful feature
> > for Apex devops who typically have no control over Hadoop/Yarn cluster.
> >
> > On 11/30/16, 9:22 PM, "Mohit Jotwani" <mo...@datatorrent.com> wrote:
> >
> >     This is a practical scenario where developers would be required to
> >     exclude certain nodes as they might be required for some mission
> >     critical applications. It would be good to have this feature.
> >
> >     I understand that Stram should not get into resourcing and still rely
> >     on Yarn, however, as the App Master it should have the right to reject
> >     the nodes offered by Yarn and request for other resources.
> >
> >     Regards,
> >     Mohit
> >
> >     On Thu, Dec 1, 2016 at 2:34 AM, Sandesh Hegde <sand...@datatorrent.com>
> >     wrote:
> >
> >     > Apex has automatic blacklisting of the troublesome nodes, please
> >     > take a look at the following attributes,
> >     >
> >     > MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST
> >     > https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/Context.DAGContext.html#MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST
> >     >
> >     > BLACKLISTED_NODE_REMOVAL_TIME_MILLIS
> >     >
> >     > Thanks
> >     >
> >     >
> >     >
> >     > On Wed, Nov 30, 2016 at 12:56 PM Munagala Ramanath <r...@datatorrent.com>
> >     > wrote:
> >     >
> >     > Not sure if this is what Milind had in mind but we often run into
> >     > situations where the dev group working with Apex has no control over
> >     > cluster configuration -- to make any changes to the cluster they need
> >     > to go through an elaborate process that can take many days.
> >     >
> >     > Meanwhile, if they notice that a particular node is consistently
> >     > causing problems for their app, having a simple way to exclude it
> >     > would be very helpful since it gives them a way to bypass
> >     > communication and process issues within their own organization.
> >     >
> >     > Ram
> >     >
> >     > On Wed, Nov 30, 2016 at 10:58 AM, Sanjay Pujare <san...@datatorrent.com>
> >     > wrote:
> >     >
> >     > > To me both use cases appear to be generic resource management use
> >     > > cases. For example, a randomly rebooting node is not good for any
> >     > > purpose esp. long running apps so it is a bit of a stretch to imagine
> >     > > that these nodes will be acceptable for some batch jobs in Yarn. So
> >     > > such a node should be marked “Bad” or Unavailable in Yarn itself.
> >     > >
> >     > > Second use case is also typical anti-affinity use case which ideally
> >     > > should be implemented in Yarn – Milind’s example can also apply to
> >     > > non-Apex batch jobs. In any case it looks like Yarn still doesn’t have
> >     > > it (https://issues.apache.org/jira/browse/YARN-1042) so if Apex needs
> >     > > it we will need to do it ourselves.
> >     > >
> >     > > On 11/30/16, 10:39 AM, "Munagala Ramanath" <r...@datatorrent.com> wrote:
> >     > >
> >     > >     But then, what's the solution to the 2 problem scenarios that
> >     > >     Milind describes?
> >     > >
> >     > >     Ram
> >     > >
> >     > >     On Wed, Nov 30, 2016 at 10:34 AM, Sanjay Pujare <san...@datatorrent.com>
> >     > >     wrote:
> >     > >
> >     > >     > I think “exclude nodes” and such is really the job of the resource
> >     > >     > manager i.e. Yarn. So I am not sure taking over some of these tasks
> >     > >     > in Apex would be very useful.
> >     > >     >
> >     > >     > I agree with Amol that apps should be node neutral. Resource
> >     > >     > management in Yarn together with fault tolerance in Apex should
> >     > >     > minimize the need for this feature although I am sure one can find
> >     > >     > use cases.
> >     > >     >
> >     > >     >
> >     > >     > On 11/29/16, 10:41 PM, "Amol Kekre" <a...@datatorrent.com> wrote:
> >     > >     >
> >     > >     >     We do have this feature in Yarn, but that applies to all
> >     > >     >     applications. I am not sure if Yarn has anti-affinity. This
> >     > >     >     feature may be used, but in general there is danger in an
> >     > >     >     application taking over resource allocation. Another quirk is
> >     > >     >     that big data apps should ideally be node-neutral. This is a
> >     > >     >     good idea, if we are able to carve out something where the need
> >     > >     >     is app specific.
> >     > >     >
> >     > >     >     Thks
> >     > >     >     Amol
> >     > >     >
> >     > >     >
> >     > >     >     On Tue, Nov 29, 2016 at 10:00 PM, Milind Barve <mili...@gmail.com>
> >     > >     >     wrote:
> >     > >     >
> >     > >     >     > We have seen 2 cases mentioned below, where it would have been
> >     > >     >     > nice if Apex allowed us to exclude a node from the cluster for
> >     > >     >     > an application.
> >     > >     >     >
> >     > >     >     > 1. A node in the cluster had gone bad (was randomly rebooting)
> >     > >     >     > and so an Apex app should not use it - other apps can use it
> >     > >     >     > as they were batch jobs.
> >     > >     >     > 2. A node is being used for a mission critical app (could be
> >     > >     >     > an Apex app itself), but another Apex app which is mission
> >     > >     >     > critical should not be using resources on that node.
> >     > >     >     >
> >     > >     >     > Can we have a way in which Stram and YARN can coordinate with
> >     > >     >     > each other to not use a set of nodes for the application? It
> >     > >     >     > can be done in 2 ways -
> >     > >     >     >
> >     > >     >     > 1. Have a list of "exclude" nodes with Stram - when YARN
> >     > >     >     > allocates resources on any of these, Stram rejects them and
> >     > >     >     > gets resources allocated again from YARN.
> >     > >     >     > 2. Have a list of nodes that can be used for an app - this
> >     > >     >     > can be a part of config. However, I don't think this would be
> >     > >     >     > the right way to do so as we will need support from YARN as
> >     > >     >     > well. Further, this might be difficult to change at runtime if
> >     > >     >     > need be.
> >     > >     >     >
> >     > >     >     > Any thoughts?
> >     > >     >     >
> >     > >     >     >
> >     > >     >     > --
> >     > >     >     > ~Milind bee at gee mail dot com
> >     > >     >     >
> >     > >     >
> >     > >     >
> >     > >     >
> >     > >     >
> >     > >
> >     > >
> >     > >
> >     > >
> >     >
> >
> >
> >
> >
>
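
For illustration only, here is a rough Java sketch of option 1 from Milind's
original mail: Stram keeps an "exclude" list and checks the containers that
Yarn offers against it. Nothing below is Apex or Yarn API. Offer, Decision
and filter are hypothetical names, the host names are made up, and the step
of releasing rejected containers and asking Yarn for replacements is left
out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExcludeNodesSketch {

    /** Hypothetical stand-in for a container offered by the resource manager. */
    static final class Offer {
        final String containerId;
        final String host;

        Offer(String containerId, String host) {
            this.containerId = containerId;
            this.host = host;
        }
    }

    /** Result of checking a batch of offers against the exclude list. */
    static final class Decision {
        final List<Offer> accepted = new ArrayList<>();
        final List<Offer> rejected = new ArrayList<>();
    }

    /**
     * Splits the offers into accepted and rejected, based only on the host.
     * A real app master would release the rejected containers and request
     * the same amount of resources again.
     */
    static Decision filter(Collection<Offer> offers, Set<String> excludedHosts) {
        Decision decision = new Decision();
        for (Offer offer : offers) {
            if (excludedHosts.contains(offer.host)) {
                decision.rejected.add(offer);
            } else {
                decision.accepted.add(offer);
            }
        }
        return decision;
    }

    public static void main(String[] args) {
        // Assumed exclude list, e.g. the randomly rebooting node from case 1.
        Set<String> excluded = new HashSet<>(Arrays.asList("node-07"));
        List<Offer> offers = Arrays.asList(
                new Offer("c1", "node-03"),
                new Offer("c2", "node-07"));

        Decision decision = filter(offers, excluded);
        System.out.println("accepted=" + decision.accepted.size()
                + " rejected=" + decision.rejected.size());
    }
}
```

The check itself is trivial; the open questions in this thread are where the
list comes from and whether it can change at runtime, which the sketch does
not try to answer.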
