https://issues.apache.org/jira/browse/SLIDER-743 it is then.

On 8 January 2015 at 14:26, Jon Maron <[email protected]> wrote:

> +1.  A good way to provide the functionality while leveraging existing
> mechanisms
>
> On Jan 8, 2015, at 8:46 AM, Gour Saha <[email protected]> wrote:
>
> > +1 on that
> >
> > That's also what I meant when I said -
> >>> I don't think we have a logic where we apply data locality and then
> >>> upon a certain no of failures (threshold) try with "no data locality"
> >>> at least once before giving up. It will be a good idea to file a JIRA
> >>> with this requirement.
> >
> > -Gour
> >
> > - Sent from my iPhone
> >
> >> On Jan 8, 2015, at 3:30 AM, Steve Loughran <[email protected]> wrote:
> >>
> >> thinking about this some more, we could use our tracking of node
> >> reliability to tune our placement decisions.
> >>
> >>
> >>  1. We add a "recent failures" field to the node entries, alongside the
> >>  "total failures"
> >>  2. Our scheduled failure count resetter will set that field to zero,
> >>  alongside the component failures
> >>  3. When Slider has to request a new container, unless the placement
> >>  policy is STRICT, we will continue to use the (persisted) placement
> >>  history
> >>  4. Except now, if a node has a recent failure count above some
> >>  threshold, we don't ask for a container on that node...we just ask for
> >>  "anywhere" placement.
> >>
> >> What do people think?
> >>
> >>> On 7 January 2015 at 09:50, Steve Loughran <[email protected]> wrote:
> >>>
> >>> The history of where things were is retained in the RoleHistory
> >>> structures, persisted to HDFS and reread on startup. For each component
> >>> type, it's sorted by most-recent-first.
> >>>
> >>> When a container is needed, the AM looks in that history first, going
> >>> through the list of "previously used nodes for that component type" and
> >>> skipping any that already have an instance of that component running.
> >>> The chosen node is taken off the list, so there are no duplicates.
> >>> (Exception: the component type doesn't have any locality, in which case
> >>> although the history is tracked, it's not used for placement.)
> >>>
> >>>
> >>>
> >>> When a placement on the node comes in, it's taken off the "pending
> >>> list".
> >>>
> >>> There's one small issue here: no way to tie requests to allocations. We
> >>> don't really care which request allocates a component to a node, we
> >>> just like to track outstanding requests for explicit nodes. The
> >>> algorithm is:
> >>> - allocation to a requested node: remove node from "list of outstanding
> >>> explicit requests"
> >>> - allocation to another node: do nothing while there are outstanding
> >>> requests
> >>> - all outstanding requests satisfied: clean the list of outstanding
> >>> "placed" requests.
> >>>
> >>> Now, fun happens when a container fails on a newly allocated node, and
> >>> it's here there may be some policy tuning required.
> >>>
> >>> It comes down to this: what is the best way to react when a component
> >>> fails to start, either immediately, or shortly after startup? This can
> >>> be a sign of a major problem ("node doesn't run my app"), or something
> >>> transient ("port still considered in use").
> >>>
> >>> If it's a transient problem, there's no harm in asking again.
> >>>
> >>> If it's a permanent problem, we need to make the decision that this
> >>> node is bad, at least for that specific component.
> >>>
> >>> I think right now, on a startup/launch time failure, the failing node
> >>> is placed at the back of the list of recently used nodes, and the
> >>> failure counts of both the node and the component are incremented.
> >>> Although there's a YARN API where an application can provide blacklist
> >>> hints to YARN, we're not currently using it.
> >>>
> >>> I think what you may be seeing is that Slider is repeatedly asking for
> >>> the same node: it's failing and going to the back of the list of
> >>> previously used nodes, but as there is only one, it's being asked for
> >>> again.
> >>>
> >>> We can tune this -maybe- but it gets complex.
> >>>
> >>> 1. If the placement policy is STRICT, then we must ask for that
> >>> previously used node. (Though thinking about it, the component must
> >>> have started at least once at some point in the past...I don't know if
> >>> the special case of "previously allocated but never started" is
> >>> detected and handled.)
> >>>
> >>> 2. If the placement is location-preferred (the default), how best to
> >>> react to a launch failure? Completely cut that node off the list of
> >>> suitable targets? Or try again a few more times? If it's a transient
> >>> problem, retry gives locality without over-reacting. If it's a
> >>> permanent problem, then retrying is the wrong policy.
> >>>
> >>> What should we do here? We are tracking failures in NodeEntry entries,
> >>> in a map of the cluster built up (NodeMap), but not currently using
> >>> failure counts there to make decisions. If we do think about using it,
> >>> we'll have to think about not just keeping the count of failures, but
> >>> resetting it on an interval, the way we now do with component failure
> >>> counts.
> >>>
> >>> -steve
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>> On 7 January 2015 at 02:50, Gour Saha <[email protected]> wrote:
> >>>>
> >>>> Nitin,
> >>>>
> >>>> I don't think we have a logic where we apply data locality and then
> >>>> upon a certain no of failures (threshold) try with "no data locality"
> >>>> at least once before giving up. It will be a good idea to file a JIRA
> >>>> with this requirement.
> >>>>
> >>>> -Gour
> >>>>
> >>>>
> >>>> On Tue, Jan 6, 2015 at 5:12 PM, Nitin Aggarwal <[email protected]> wrote:
> >>>>
> >>>>> I am running an HBase application, and I prefer data locality. I
> >>>>> don't want to give up locality by default. It's ok to lose locality
> >>>>> in rare scenarios, where something is wrong with one of the local
> >>>>> nodes. It's more of a fail-safe that I am looking for, to give up
> >>>>> locality if it cannot be satisfied.
> >>>>>
> >>>>> Thanks
> >>>>> Nitin
> >>>>>
> >>>>>
> >>>>>> On Tue, Jan 6, 2015 at 4:52 PM, Ted Yu <[email protected]> wrote:
> >>>>>>
> >>>>>> Here is the meaning of 2 (see PlacementPolicy):
> >>>>>>
> >>>>>> /**
> >>>>>>  * No data locality; do not bother trying to ask for any location
> >>>>>>  */
> >>>>>> public static final int NO_DATA_LOCALITY = 2;
> >>>>>>
> >>>>>> On Tue, Jan 6, 2015 at 4:15 PM, Gour Saha <[email protected]> wrote:
> >>>>>>
> >>>>>>> Try setting property *yarn.component.placement.policy* to 2 for the
> >>>>>>> component, something like this -
> >>>>>>>
> >>>>>>>   "HBASE_MASTER": {
> >>>>>>>     "yarn.role.priority": "1",
> >>>>>>>     "yarn.component.instances": "1",
> >>>>>>>     "yarn.memory": "1500",
> >>>>>>>     "yarn.component.placement.policy": "2"
> >>>>>>>   },
> >>>>>>>
> >>>>>>> -Gour
> >>>>>>>
> >>>>>>> On Tue, Jan 6, 2015 at 3:33 PM, Nitin Aggarwal <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> We keep running into a scenario where one of the nodes in the
> >>>>>>>> cluster has gone bad (due to clock out of sync, no disk space,
> >>>>>>>> etc.). As a result the container fails to start, and due to
> >>>>>>>> locality, the container is assigned to the same machine again and
> >>>>>>>> again, and it fails again and again. After a few failures, when the
> >>>>>>>> failure threshold is reached (which is currently also not reset
> >>>>>>>> correctly, see SLIDER-629), it triggers instance shut-down.
> >>>>>>>>
> >>>>>>>> Is there a way to give up locality, in case of multiple failures,
> >>>>>>>> to avoid this scenario?
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Nitin Aggarwal
> >>>>>>>
> >>>>>>> --
> >>>>>>> CONFIDENTIALITY NOTICE
> >>>>>>> NOTICE: This message is intended for the use of the individual or
> >>>>> entity
> >>>>>> to
> >>>>>>> which it is addressed and may contain information that is
> >>>> confidential,
> >>>>>>> privileged and exempt from disclosure under applicable law. If the
> >>>>> reader
> >>>>>>> of this message is not the intended recipient, you are hereby
> >>>> notified
> >>>>>> that
> >>>>>>> any printing, copying, dissemination, distribution, disclosure or
> >>>>>>> forwarding of this communication is strictly prohibited. If you
> have
> >>>>>>> received this communication in error, please contact the sender
> >>>>>> immediately
> >>>>>>> and delete it from your system. Thank You.
> >>>>
> >>
> >
>
>
>
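
For reference, the four-step proposal quoted above (a per-node "recent failures" count, zeroed by the scheduled resetter, with a fallback to "anywhere" placement once it passes a threshold) could look roughly like the sketch below. This is only an illustration under assumptions: NodeRecord, PlacementSketch, Policy, targetFor and the threshold parameter are hypothetical names invented here, not Slider's NodeEntry/RoleHistory code, and an empty Optional stands in for an "anywhere" request.

```java
import java.util.Optional;

/** Hypothetical per-node record; not Slider's actual NodeEntry class. */
final class NodeRecord {
    final String hostname;
    int totalFailures;   // lifetime count, never reset
    int recentFailures;  // zeroed by the scheduled failure-count resetter

    NodeRecord(String hostname) {
        this.hostname = hostname;
    }

    /** Record a container failure on this node in both counters. */
    void onContainerFailure() {
        totalFailures++;
        recentFailures++;
    }

    /** Would be invoked on the same schedule as the component failure reset. */
    void resetRecentFailures() {
        recentFailures = 0;
    }
}

final class PlacementSketch {
    /** Illustrative policies only; the real constants live in PlacementPolicy. */
    enum Policy { DEFAULT, STRICT }

    /**
     * Decide where to ask for a container, given the candidate node taken
     * from the placement history.
     *
     * @return the hostname to request, or empty for "anywhere" placement
     */
    static Optional<String> targetFor(NodeRecord candidate, Policy policy, int threshold) {
        if (policy == Policy.STRICT) {
            // STRICT must always go back to the previously used node.
            return Optional.of(candidate.hostname);
        }
        if (candidate.recentFailures > threshold) {
            // Node looks unreliable right now: give up locality for this request.
            return Optional.empty();
        }
        return Optional.of(candidate.hostname);
    }
}
```

With a threshold of 2, a node that has just failed three times gets an "anywhere" request under the default policy, and becomes a placement target again once the resetter has run. The total failure count is left untouched by the reset.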

