I couldn't add comments to the document, so I am posting my comments to the mailing list. One more approach could be to keep the current measurement as it is, but instead of leaving the quitting decision to the stmgrclient, have stmgrclientmgr make the decision. Every time a stmgr client detects connection issues, it informs stmgrclientmgr, which keeps a map of peer stmgr id to error count. stmgrclientmgr can then tell whether it is seeing connection errors from all stmgrs or whether only a few of them are having issues, and so it can make a better decision.
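To make the idea concrete, here is a rough sketch of the bookkeeping I have in mind. It is written in Python only for brevity and is not taken from the Heron code base: the class name `ConnectionErrorTracker`, its methods, the `error_threshold` parameter, and the three result strings are all hypothetical. What to actually do for each outcome (quit, retry, keep running) is left open here.

```python
from collections import defaultdict


class ConnectionErrorTracker:
    """Hypothetical stand-in for the bookkeeping stmgrclientmgr would do."""

    def __init__(self, peer_ids, error_threshold=3):
        # All peer stmgr ids this manager holds clients for.
        self._peers = set(peer_ids)
        # Assumed policy: a peer counts as failing once it has reported
        # at least this many errors since its last success.
        self._threshold = error_threshold
        # Map of peer stmgr id -> error count.
        self._errors = defaultdict(int)

    def report_error(self, peer_id):
        """Called by a stmgr client when it detects a connection issue."""
        self._errors[peer_id] += 1

    def report_success(self, peer_id):
        """Called by a stmgr client once its connection is healthy again."""
        self._errors.pop(peer_id, None)

    def failing_peers(self):
        """Return the set of known peers at or above the error threshold."""
        return {
            peer for peer in self._peers
            if self._errors.get(peer, 0) >= self._threshold
        }

    def decide(self):
        """Classify the overall connection state across all peers."""
        failing = self.failing_peers()
        if not failing:
            return "healthy"
        if failing == self._peers:
            # Errors from every peer hint that the problem is local.
            return "all_peers_failing"
        # Errors from only a few peers hint that those peers are at fault.
        return "some_peers_failing"
```

Each stmgr client would call `report_error` or `report_success` with its peer id, and the manager would call `decide()` to tell "every peer is unreachable" apart from "a few peers are unreachable" before choosing what to do.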
On Sat, Feb 3, 2018 at 8:11 PM, Ning Wang <wangnin...@gmail.com> wrote:
> Hi, heron devs~
>
> I think the current stream manager's quitting logic on connection failures
> is problematic. We saw a few internal cases in Twitter that this logic
> could cause extra issue.
>
> Here is a doc with more details:
>
> https://docs.google.com/document/d/1WHNc2NEp2gVL9ge2QVKp9t4Hpd4U9sAbzBqCu4-iDUM/edit#
>
> Comments and feedbacks are welcome!
>
> Thanks.
> --ning