On 12/18/20 6:31 AM, Ben Pfaff wrote:
> Open vSwitch has a few different jsonrpc-based protocols that depend on
> jsonrpc_session to make sure that the connection is up and working.
> In turn, jsonrpc_session uses the "reconnect" state machine to send
> probes if nothing is received.  This works fine in normal circumstances.
> In unusual circumstances, though, it can happen that the program is
> busy and doesn't even try to receive anything for a long time.  Then the
> timer can time out without a good reason; if it had tried to receive
> something, it would have.
> 
> There's a solution that the clients of jsonrpc_session could adopt.
> Instead of first calling jsonrpc_session_run(), which is what calls into
> "reconnect" to deal with timing out, and then calling into
> jsonrpc_session_recv(), which is what tries to receive something, they
> could use the opposite order.  That would make sure that the timeout
> was always based on a recent attempt to receive something.  Great.
> 
> The actual code in OVS that uses jsonrpc_session, though, tends to use
> the opposite order, and there are enough users and this is a subtle
> enough issue that it could get flipped back around even if we fixed it
> now.  So this commit takes a different approach.  Instead of fixing
> this in the users of jsonrpc_session, we fix it in the users of
> reconnect: make them tell when they've tried to receive something (or
> disable this particular feature).
> 
> This commit fixes the problem that way.  It's kind of hard to reproduce
> but I'm pretty sure that I've seen it a number of times in testing.
> 
> Signed-off-by: Ben Pfaff <[email protected]>
> ---
>  lib/jsonrpc.c          |  5 ++++-
>  lib/reconnect.c        | 25 +++++++++++++++++++++++--
>  lib/reconnect.h        |  1 +
>  tests/test-reconnect.c |  1 +
>  4 files changed, 29 insertions(+), 3 deletions(-)

Thanks for the fix!

I'd still prefer to s/reconnect_try_receive/reconnect_receive_attempt/
or some other name since the 'try_receive' part makes me think that
we're actually asking it to receive something.

It might also be good to add a specific unit test to test-reconnect
that will check that timeout is not set.  This should not be hard, but,
I guess, this could be a separate patch.

For this patch with or without above changes:

Acked-by: Ilya Maximets <[email protected]>

It might make sense to backport this to 2.13 as OVN users might hit
this issue on a highly loaded cluster.

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to