Re: About stream manager's quitting logic on connection failures

Ning Wang Mon, 05 Feb 2018 13:40:51 -0800

Cool. Thanks!

On Mon, Feb 5, 2018 at 11:01 AM, Karthik Ramasamy <kramas...@gmail.com>
wrote:


> Ning - let us get this rolled out soon.
>
> Cheers
> /karthik
>
> > On Feb 5, 2018, at 10:57 AM, Sanjeev Kulkarni <sanjee...@gmail.com> wrote:
> >
> > This sounds good to me!
> >
> > On Mon, Feb 5, 2018 at 1:08 AM, Ning Wang <wangnin...@gmail.com> wrote:
> >
> >> Yeah. That is an option too. In fact, it was my first try:
> >> https://github.com/twitter/heron/pull/2693 (just an initial attempt, not
> >> completed; a count map should be used instead of a single total count)
> >>
> >> In most cases, I think both solutions should have the same result. A few
> >> reasons I changed to a tmaster check:
> >> - With tmaster, there is only one source of truth, and tmaster is more
> >> critical anyway. If the tmaster link is not healthy, stmgrs won't work
> >> correctly: the topology may have created replacement nodes, but the
> >> disconnected nodes could keep going by themselves.
> >> - It is more straightforward. The logic is the same as the current one.
> >> On the other hand, if we use an array for all remote stmgrs, we could
> >> have smarter logic (which is good), but it could make stmgrs more
> >> complicated and less straightforward (bad). I left the stmgr counters
> >> there, so if we decide to add this feature in the future, it should be
> >> easy to add. There is a gap between "errors from all" and "errors from a
> >> few", and this is not a simple/quick question.
> >>
> >>
> >>
> >>
> >> On Sun, Feb 4, 2018 at 6:48 PM, Sanjeev Kulkarni <sanjee...@gmail.com>
> >> wrote:
> >>
> >>> I couldn't add comments to the document, so I am posting my comments to
> >>> the mailing list.
> >>> One more approach could be to keep the current measurement as it is, but
> >>> instead of leaving the quitting decision to the stmgrclient, have
> >>> stmgrclientmgr make the decision. Every time a stmgr client detects
> >>> connection issues, it informs stmgrclientmgr, which keeps a map of
> >>> peerstmgrid to error count. It can then tell whether it is seeing
> >>> connection errors from all stmgrs or whether only a few of them are
> >>> having issues, and make a better decision.
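
For illustration only, the per-peer bookkeeping described above could be sketched roughly as follows. This is not Heron code: the class name `PeerErrorTracker`, its methods, and the `error_threshold` parameter are all hypothetical, and the real stmgr is not written in Python. The sketch only shows how a map of peer stmgr id to error count lets the manager tell "errors from all" apart from "errors from a few"; what to do in each case is left to the caller, since the thread does not settle that.

```python
class PeerErrorTracker:
    """Hypothetical stand-in for the bookkeeping stmgrclientmgr would do."""

    def __init__(self, peer_ids, error_threshold=3):
        # peer_ids: ids of all remote stmgrs this stmgr connects to.
        # error_threshold: assumed number of errors before a peer counts
        # as failing; the thread does not specify a value.
        self.peer_ids = set(peer_ids)
        self.error_threshold = error_threshold
        self.error_counts = {}  # peer stmgr id -> connection error count

    def record_error(self, peer_id):
        # Called by a stmgr client when it detects a connection issue.
        self.error_counts[peer_id] = self.error_counts.get(peer_id, 0) + 1

    def record_success(self, peer_id):
        # A successful (re)connection clears that peer's error count.
        self.error_counts.pop(peer_id, None)

    def failing_peers(self):
        # Known peers whose error count has reached the threshold.
        return {
            p for p in self.peer_ids
            if self.error_counts.get(p, 0) >= self.error_threshold
        }

    def all_peers_failing(self):
        # True only when there is at least one peer and every peer is failing.
        return bool(self.peer_ids) and self.failing_peers() == self.peer_ids
```

A caller would invoke `record_error()` from each client's failure path and then check `all_peers_failing()` before making any quitting decision; errors from only some peers would show up in `failing_peers()` instead.
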
> >>>
> >>> On Sat, Feb 3, 2018 at 8:11 PM, Ning Wang <wangnin...@gmail.com> wrote:
> >>>
> >>>> Hi, heron devs~
> >>>>
> >>>> I think the current stream manager's quitting logic on connection
> >>>> failures is problematic. We have seen a few internal cases at Twitter
> >>>> where this logic could cause extra issues.
> >>>>
> >>>> Here is a doc with more details:
> >>>>
> >>>> https://docs.google.com/document/d/1WHNc2NEp2gVL9ge2QVKp9t4Hpd4U9sAbzBqCu4-iDUM/edit#
> >>>>
> >>>> Comments and feedback are welcome!
> >>>>
> >>>> Thanks.
> >>>> --ning
> >>>>
> >>>
> >>
>
>
