Thanks a lot, Steve. So the suggestion is to use the AMRMClient.updateBlacklist() method to include *nodes where the component is running + unreliable nodes* in the blacklist before the container is requested, instead of discarding allocated containers.
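To make sure I follow, here is a minimal sketch of that blacklist computation using plain Java collections. The class and method names are mine, not Slider's actual API; the real update would go through AMRMClient.updateBlacklist(blacklistAdditions, blacklistRemovals) before the next container request.

```java
import java.util.*;

// Hypothetical sketch: before asking YARN for a new container for an
// anti-affine role, blacklist every node that already hosts an instance
// of the role, plus any node whose failure count marks it as unreliable.
public class AntiAffinityBlacklist {

    /**
     * Compute the hosts to add to the blacklist for the next request.
     *
     * @param activeHosts   hosts already running an instance of the role
     * @param failureCounts container-failure count per host
     * @param failureLimit  failures after which a host is deemed unreliable
     */
    public static Set<String> blacklistAdditions(Set<String> activeHosts,
                                                 Map<String, Integer> failureCounts,
                                                 int failureLimit) {
        Set<String> blacklist = new TreeSet<>(activeHosts);
        for (Map.Entry<String, Integer> e : failureCounts.entrySet()) {
            if (e.getValue() >= failureLimit) {
                blacklist.add(e.getKey());
            }
        }
        return blacklist;
    }

    public static void main(String[] args) {
        Set<String> active = new HashSet<>(Arrays.asList("node1", "node2"));
        Map<String, Integer> failures = new HashMap<>();
        failures.put("node3", 5);   // flaky host
        failures.put("node4", 0);   // healthy host
        // With these inputs, node1/node2 (active) and node3 (unreliable)
        // are blacklisted; node4 stays eligible.
        System.out.println(blacklistAdditions(active, failures, 3));
        // prints [node1, node2, node3]
    }
}
```

The nice property, as I understand it, is that YARN then never allocates on those hosts in the first place, so nothing has to be discarded.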
I will take a closer look and update SLIDER-82 with my approach.

Thanks,
Rajesh

On Tue, May 19, 2015 at 12:34 PM, Steve Loughran <[email protected]> wrote:

>
> > On 18 May 2015, at 23:41, Rajesh Kartha <[email protected]> wrote:
> >
> > Hello,
> >
> > One of the early requests we got for improving Slider was to have a way
> > of *ensuring only a single process* of the given application runs on any
> > given node. I have read about the ANTI-AFFINITY flag but was not fully
> > sure about its implementation.
> >
> > Hence I have been trying to piece things together based on the comments
> > in:
> >
> > - SLIDER-82
> > - Steve's blog at
> >   http://steveloughran.blogspot.co.uk/2015/05/dynamic-datacentre-applications.html
> > - The Slider wiki at
> >   http://slider.incubator.apache.org/design/rolehistory.html
> > - Looking at the code
> >
> > Today the flag PlacementPolicy.ANTI_AFFINITY_REQUIRED seems like a
> > placeholder and is not currently used in the flow.
>
> I think it's used on restart, where explicit requests for containers on
> hosts are not made if there's already an instance on that node and the
> anti-affinity flag is set.
>
> > Also, as I understand it, the main method where the check on containers
> > happens is the event handler:
> >
> > *AppState.onContainersAllocated()*
> >
> > Since this method makes the decision on the allocated containers before
> > launching the role, I was thinking of a simple approach where we could:
> >
> > 1) check for RoleStatus.isAntiAffinePlacement() to be true
> > 2) check if the NodeInstance on which the current container is allocated
> >    is either in RoleHistory.listActiveNodes(roleId) or found to be
> >    unreliable
> > 3) discard the container without decrementing the request count for the
> >    role
> > 4) if the container does not meet the check in #2, proceed with the flow
> >    and continue with the launch
> >
> > The launching of the role via the launchService happens after this
> > check, so I would hope these checks may not be that expensive.
>
> I thought about this, but wasn't happy with it.
>
> We will end up discarding a lot of containers. These may seem like simply
> idle capacity, but in a cluster with pre-emption enabled a Slider app
> could be killing other work to get containers it then discards.
>
> Even without that, there's a risk that you end up getting back those same
> hosts again and again. The same goes for unreliable hosts.
>
> > One other potential area for such a check is
> > RoleHistory.findNodeForNewInstance(role), during the iteration of the
> > list of NodeInstances from getNodesForRoleId(), but based on my
> > experiments listActiveNodes() and getNodesForRoleId() seemed mutually
> > exclusive, hence this check may not be needed there.
> >
> > Again, not sure if the above can address the different scenarios that
> > are expected from the ANTI-AFFINITY flag, but I was wondering if this
> > was feasible as a first approach to having some ANTI-AFFINITY support.
>
> I think it's a step in the right direction, but we really need to make
> the leap to doing what Twill did and use the blacklist to exclude nodes
> where we either have active containers or their reliability is considered
> too low.
>
> I'm not planning to do any work in that area in the near future, so if
> you want to sit down and start doing it, feel free!
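For reference, the discard check described in steps 1-4 of the quoted proposal boils down to something like the sketch below. The names are illustrative stand-ins, not Slider's actual classes; in the real code the inputs would come from RoleStatus.isAntiAffinePlacement(), RoleHistory.listActiveNodes(roleId) and the node-reliability data.

```java
import java.util.*;

// Illustrative sketch of the proposed check in AppState.onContainersAllocated():
// discard a freshly allocated container if the role is anti-affine and the
// container's host either already runs the role or is known to be unreliable.
public class AntiAffinityCheck {

    public static boolean shouldDiscard(boolean antiAffinePlacement,
                                        String allocatedHost,
                                        Set<String> activeHosts,
                                        Set<String> unreliableHosts) {
        if (!antiAffinePlacement) {
            return false;   // normal placement policy: always launch
        }
        return activeHosts.contains(allocatedHost)
            || unreliableHosts.contains(allocatedHost);
    }

    public static void main(String[] args) {
        Set<String> active = new HashSet<>(Arrays.asList("node1"));
        Set<String> unreliable = new HashSet<>(Arrays.asList("node9"));
        System.out.println(shouldDiscard(true, "node1", active, unreliable)); // true: already hosts the role
        System.out.println(shouldDiscard(true, "node2", active, unreliable)); // false: safe to launch
        System.out.println(shouldDiscard(true, "node9", active, unreliable)); // true: unreliable host
    }
}
```

This is the check Steve's objection applies to: every `true` result means a container that was allocated (possibly at the cost of pre-empting other work) and then thrown away, which is what the blacklist approach avoids.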
