Cool. Thanks!

On Mon, Feb 5, 2018 at 11:01 AM, Karthik Ramasamy <kramas...@gmail.com> wrote:
> Ning - let us get this rolled out soon.
>
> Cheers
> /karthik
>
>
> On Feb 5, 2018, at 10:57 AM, Sanjeev Kulkarni <sanjee...@gmail.com> wrote:
> >
> > This sounds good to me!
> >
> > On Mon, Feb 5, 2018 at 1:08 AM, Ning Wang <wangnin...@gmail.com> wrote:
> >
> >> Yeah. That is an option too. In fact it was my first try:
> >> https://github.com/twitter/heron/pull/2693 (just an initiative, not
> >> completed, a count map should be used instead of a single total count)
> >>
> >> In most cases, I think both solutions should have the same result. A few
> >> reasons I changed to a tmaster check:
> >> - with tmaster, there is only one source of truth and tmaster is more
> >> critical anyway. If the tmaster link is not healthy, stmgrs won't work
> >> correctly: topology may have created replacement nodes but the
> >> disconnected nodes could keep going by themselves.
> >> - it is more straightforward. The logic is the same as the current one.
> >> On the other side, if we use an array for all remote stmgrs, we could
> >> have a smarter logic (which is good) but it could make stmgrs more
> >> complicated and less straightforward (bad). I left the stmgr counters
> >> there so if in the future we decide to add this feature, it should be
> >> easy to add. There is a gap between "errors from all" and "errors from
> >> a few" and this is not a simple/quick question.
> >>
> >>
> >> On Sun, Feb 4, 2018 at 6:48 PM, Sanjeev Kulkarni <sanjee...@gmail.com>
> >> wrote:
> >>
> >>> I couldn't add comments to the document, thus am posting my comments
> >>> to the mailing list.
> >>> One more approach could be to do the current measurement as it is, but
> >>> instead of leaving the quitting decision to the stmgrclient, have
> >>> stmgrclientmgr make the decision. Thus every time a stmgr client
> >>> detects connection issues, it informs stmgrclientmgr, which keeps a
> >>> map of peerstmgrid to error count. Thus it is able to decide things
> >>> like: am I seeing connection errors from all stmgrs, or are only a few
> >>> of them having issues? Then it can make better decisions.
> >>>
> >>> On Sat, Feb 3, 2018 at 8:11 PM, Ning Wang <wangnin...@gmail.com> wrote:
> >>>
> >>>> Hi, heron devs~
> >>>>
> >>>> I think the current stream manager's quitting logic on connection
> >>>> failures is problematic. We saw a few internal cases in Twitter where
> >>>> this logic could cause extra issues.
> >>>>
> >>>> Here is a doc with more details:
> >>>>
> >>>> https://docs.google.com/document/d/1WHNc2NEp2gVL9ge2QVKp9t4Hpd4U9sAbzBqCu4-iDUM/edit#
> >>>>
> >>>> Comments and feedback are welcome!
> >>>>
> >>>> Thanks.
> >>>> --ning
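
For reference, the per-peer bookkeeping suggested in the quoted thread (stmgrclientmgr keeping a map of peer stmgr id to error count, so it can tell "errors from all" apart from "errors from a few") could be sketched roughly as below. This is only an illustration in Python, not Heron's C++ code and not what PR 2693 does: the class name PeerErrorTracker, the threshold parameter, and the classify method are all hypothetical, and the idea that a successful connection resets a peer's count is an assumption.

```python
class PeerErrorTracker:
    """Hypothetical stand-in for the map stmgrclientmgr would keep."""

    def __init__(self, peer_ids, threshold=3):
        # peer_ids: ids of the remote stmgrs this stmgr talks to.
        # threshold: assumed tunable, the number of connection errors
        # after which a peer is treated as failing.
        self.peer_ids = set(peer_ids)
        self.threshold = threshold
        self.error_counts = {}  # peer stmgr id -> error count

    def record_error(self, peer_id):
        # Called whenever a stmgr client reports a connection issue.
        self.error_counts[peer_id] = self.error_counts.get(peer_id, 0) + 1

    def record_success(self, peer_id):
        # Assumption: a healthy connection clears the peer's count.
        self.error_counts.pop(peer_id, None)

    def failing_peers(self):
        # Known peers whose error count has reached the threshold.
        return {
            p for p in self.peer_ids
            if self.error_counts.get(p, 0) >= self.threshold
        }

    def classify(self):
        # "healthy": no peer is failing
        # "all":     every known peer is failing
        # "few":     some, but not all, peers are failing
        failing = self.failing_peers()
        if not failing:
            return "healthy"
        if failing == self.peer_ids:
            return "all"
        return "few"


tracker = PeerErrorTracker(["stmgr-1", "stmgr-2"], threshold=2)
tracker.record_error("stmgr-1")
tracker.record_error("stmgr-1")
print(tracker.classify())
```

The sketch deliberately stops at classification. What a stmgr should actually do in the "few" case is the open question raised in the thread, and the tmaster check sidesteps it by relying on a single source of truth.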