Thanks a lot Steve,

So the suggestion is to use the AMRMClient.updateBlacklist() method to
blacklist *nodes where the component is already running + unreliable nodes*
before the container is requested, instead of discarding allocated containers.

I will take a closer look and update SLIDER-82 with my approach.
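Before updating SLIDER-82, here is a minimal, self-contained sketch of the blacklist computation I have in mind. The `computeBlacklist` helper and its class are hypothetical names for illustration; only `AMRMClient.updateBlacklist(additions, removals)` (mentioned in the comment) is the real YARN API, and the actual Slider wiring against RoleHistory/NodeInstance would differ:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: compute the hosts to blacklist for an anti-affine role
// before issuing the container request. Plain strings stand in for
// Slider's NodeInstance entries.
public class AntiAffinityBlacklist {

    /**
     * Hosts to blacklist = hosts already running an instance of the
     * role + hosts considered unreliable, deduplicated.
     */
    static List<String> computeBlacklist(Collection<String> activeNodes,
                                         Collection<String> unreliableNodes) {
        Set<String> blacklist = new LinkedHashSet<>(activeNodes);
        blacklist.addAll(unreliableNodes);
        return new ArrayList<>(blacklist);
    }

    public static void main(String[] args) {
        List<String> active = Arrays.asList("node1", "node3");
        List<String> unreliable = Arrays.asList("node3", "node7");
        List<String> toBlacklist = computeBlacklist(active, unreliable);
        System.out.println(toBlacklist); // prints [node1, node3, node7]
        // In the AM this list would then be passed to
        //   amRmClient.updateBlacklist(toBlacklist, java.util.Collections.emptyList());
        // before calling addContainerRequest() for the role.
    }
}
```

The point of computing the list up front is that the RM never offers those hosts, so no allocated containers need to be released afterwards.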

Thanks,
Rajesh


On Tue, May 19, 2015 at 12:34 PM, Steve Loughran <[email protected]>
wrote:

>
> > On 18 May 2015, at 23:41, Rajesh Kartha <[email protected]> wrote:
> >
> > Hello,
> >
> > One of the early requests we got for improving Slider was to have a
> > way of *ensuring
> > only a single process* of the given application runs on any given node. I
> > have read about the ANTI-AFFINITY flag but was not fully sure about its
> > implementation.
> >
> > Hence have been  trying to piece things together based on the comments
> in:
> >
> > - SLIDER-82
> > - Steve's blog at
> >
> http://steveloughran.blogspot.co.uk/2015/05/dynamic-datacentre-applications.html
> > - The Slider wiki at -
> > http://slider.incubator.apache.org/design/rolehistory.html
> > - Looking at the code
> >
> > Today the flag PlacementPolicy.ANTI_AFFINITY_REQUIRED seems like a
> > placeholder and is not being used currently in the flow.
>
> I think it's used on restart, where explicit requests for nodes on hosts
> are not made if there's already an instance on that node and the
> anti-affinity flag is set.
>
> >
> > Also, as I understand it, the main method where the check on containers
> > happens is in the event handler:
> >
> > *AppState.onContainersAllocated()*
> >
> > Since this method makes the decision on the allocated containers before
> > launching the role, I was thinking of a simple approach where we could:
> >
> > 1) check that RoleStatus.isAntiAffinePlacement() is true
> > 2) check whether the NodeInstance on which the current container is
> > allocated is either in RoleHistory.listActiveNodes(roleId) or known to
> > be unreliable
> > 3) if so, discard the container without decrementing the request count
> > for the role
> > 4) otherwise, proceed with the flow and continue with the launch
> >
> > The launching of the role via the launchService happens after this check,
> > so I would hope these checks may not be that expensive.
> >
>
> I thought about this, but wasn't happy with it.
>
> We will end up discarding a lot of containers. These may seem like simple
> idle capacity, but in a cluster with pre-emption enabled a Slider app could
> be killing other work to get containers it then discards.
>
> Even without that, there's a risk that you end up getting back those same
> hosts again and again. The same goes for unreliable hosts.
>
>
> > One other potential area for such a check is
> > RoleHistory.findNodeForNewInstance(role), during the iteration of the
> > list of NodeInstances from getNodesForRoleId(), but based on my
> > experiments the results of listActiveNodes() and getNodesForRoleId()
> > seemed mutually exclusive, hence this check may not be needed there.
> >
> > Again, I'm not sure if the above can address the different scenarios that
> > are expected from the ANTI-AFFINITY flag, but was wondering if this was
> > feasible as a first approach to having some ANTI-AFFINITY support.
>
>
> I think it's a step in the right direction, but we really need to make the
> leap to doing what Twill did and use the blacklist to exclude nodes where
> we either have active containers or whose reliability is considered too low.
>
> I'm not planning to do any work in that area in the near future, so if you
> want to sit down and start doing it, feel free!
>