On Mon, Dec 21, 2020 at 12:58:14PM +0100, Ilya Maximets wrote:
> On 12/19/20 3:11 AM, Ben Pfaff wrote:
> > On Fri, Dec 18, 2020 at 10:11:15PM +0100, Ilya Maximets wrote:
> >> On 12/18/20 6:31 AM, Ben Pfaff wrote:
> >>> Open vSwitch has a few different jsonrpc-based protocols that depend on
> >>> jsonrpc_session to make sure that the connection is up and working.
> >>> In turn, jsonrpc_session uses the "reconnect" state machine to send
> >>> probes if nothing is received.  This works fine in normal circumstances.
> >>> In unusual circumstances, though, it can happen that the program is
> >>> busy and doesn't even try to receive anything for a long time.  Then the
> >>> timer can time out without a good reason; if it had tried to receive
> >>> something, it would have.
> >>>
> >>> There's a solution that the clients of jsonrpc_session could adopt.
> >>> Instead of first calling jsonrpc_session_run(), which is what calls into
> >>> "reconnect" to deal with timing out, and then calling into
> >>> jsonrpc_session_recv(), which is what tries to receive something, they
> >>> could use the opposite order.  That would make sure that the timeout
> >>> was always based on a recent attempt to receive something.  Great.
> >>>
> >>> The actual code in OVS that uses jsonrpc_session, though, tends to use
> >>> the opposite order, and there are enough users and this is a subtle
> >>> enough issue that it could get flipped back around even if we fixed it
> >>> now.  So this commit takes a different approach.  Instead of fixing
> >>> this in the users of jsonrpc_session, we fix it in the users of
> >>> reconnect: make them tell when they've tried to receive something (or
> >>> disable this particular feature).
> >>>
> >>> This commit fixes the problem that way.  It's kind of hard to reproduce
> >>> but I'm pretty sure that I've seen it a number of times in testing.
> >>>
> >>> Signed-off-by: Ben Pfaff <[email protected]>
> >>> ---
> >>>  lib/jsonrpc.c          |  5 ++++-
> >>>  lib/reconnect.c        | 25 +++++++++++++++++++++++--
> >>>  lib/reconnect.h        |  1 +
> >>>  tests/test-reconnect.c |  1 +
> >>>  4 files changed, 29 insertions(+), 3 deletions(-)
> >>
> >> Thanks for the fix!
> >>
> >> I'd still prefer to s/reconnect_try_receive/reconnect_receive_attempt/
> >> or some other name since the 'try_receive' part makes me think that
> >> we're actually asking it to receive something.
> > 
> > I am sorry that I forgot about that request from the OVN meeting.  I
> > made that change just now.
> > 
> > I decided to call it reconnect_receive_attempted(), to indicate that the
> > received attempt has already happened.
> > 
> >> It might also be good to add a specific unit test to test-reconnect
> >> that will check that timeout is not set.  This should not be hard, but,
> >> I guess, this could be a separate patch.
> > 
> > I wasn't sure how to test this.  Could you explain your idea?  I didn't
> > quite follow.
> 
> I mean pure reconnect test in tests/reconnect.at.
> Idea is to add new command 'do_receive_attempted' to test-reconnect.c and
> check that in pair with 'advance' we will have particular timeouts or
> 'no timeout' if there was no receive attempted.  Does that make sense?

OK.

I sent a patch:
https://mail.openvswitch.org/pipermail/ovs-dev/2020-December/378965.html

> BTW, do we need the same fix for the python 'reconnect' library?

I added a Python implementation in the same patch.

> >> for this patch with or without above changes:
> >>
> >> Acked-by: Ilya Maximets <[email protected]>
> > .
> >> It might make sense to backport this to 2.13 as OVN users might hit
> >> this issue on a highly loaded cluster.
> > 
> > Thanks.  I applied this to master and branch-2.13.
> 
> We should, probably, apply this to 2.14 too to avoid feature/fix gap
> during upgrade.

I forgot about 2.14.  Thanks, I did that backport (forward-port?) too
now.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to