On Fri, Dec 18, 2020 at 10:11:15PM +0100, Ilya Maximets wrote:
> On 12/18/20 6:31 AM, Ben Pfaff wrote:
> > Open vSwitch has a few different jsonrpc-based protocols that depend on
> > jsonrpc_session to make sure that the connection is up and working.
> > In turn, jsonrpc_session uses the "reconnect" state machine to send
> > probes if nothing is received.  This works fine in normal circumstances.
> > In unusual circumstances, though, it can happen that the program is
> > busy and doesn't even try to receive anything for a long time.  Then the
> > timer can time out without a good reason; if it had tried to receive
> > something, it would have.
> > 
> > There's a solution that the clients of jsonrpc_session could adopt.
> > Instead of first calling jsonrpc_session_run(), which is what calls into
> > "reconnect" to deal with timing out, and then calling into
> > jsonrpc_session_recv(), which is what tries to receive something, they
> > could use the opposite order.  That would make sure that the timeout
> > was always based on a recent attempt to receive something.  Great.
> > 
> > The actual code in OVS that uses jsonrpc_session, though, tends to use
> > the opposite order, and there are enough users and this is a subtle
> > enough issue that it could get flipped back around even if we fixed it
> > now.  So this commit takes a different approach.  Instead of fixing
> > this in the users of jsonrpc_session, we fix it in the users of
> > reconnect: make them tell when they've tried to receive something (or
> > disable this particular feature).
> > 
> > This commit fixes the problem that way.  It's kind of hard to reproduce
> > but I'm pretty sure that I've seen it a number of times in testing.
> > 
> > Signed-off-by: Ben Pfaff <[email protected]>
> > ---
> >  lib/jsonrpc.c          |  5 ++++-
> >  lib/reconnect.c        | 25 +++++++++++++++++++++++--
> >  lib/reconnect.h        |  1 +
> >  tests/test-reconnect.c |  1 +
> >  4 files changed, 29 insertions(+), 3 deletions(-)
> 
> Thanks for the fix!
> 
> I'd still prefer to s/reconnect_try_receive/reconnect_receive_attempt/
> or some other name since the 'try_receive' part makes me think that
> we're actually asking it to receive something.

I am sorry that I forgot about that request from the OVN meeting.  I
made that change just now.

I decided to call it reconnect_receive_attempted(), to indicate that the
received attempt has already happened.

> It might also be good to add a specific unit test to test-reconnect
> that will check that timeout is not set.  This should not be hard, but,
> I guess, this could be a separate patch.

I wasn't sure how to test this.  Could you explain your idea?  I didn't
quite follow.

> For this patch with or without above changes:
> 
> Acked-by: Ilya Maximets <[email protected]>
.
> It might make sense to backport this to 2.13 as OVN users might hit
> this issue on a highly loaded cluster.

Thanks.  I applied this to master and branch-2.13.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to