On 12/18/20 6:31 AM, Ben Pfaff wrote: > Open vSwitch has a few different jsonrpc-based protocols that depend on > jsonrpc_session to make sure that the connection is up and working. > In turn, jsonrpc_session uses the "reconnect" state machine to send > probes if nothing is received. This works fine in normal circumstances. > In unusual circumstances, though, it can happen that the program is > busy and doesn't even try to receive anything for a long time. Then the > timer can time out without a good reason; if it had tried to receive > something, it would have. > > There's a solution that the clients of jsonrpc_session could adopt. > Instead of first calling jsonrpc_session_run(), which is what calls into > "reconnect" to deal with timing out, and then calling into > jsonrpc_session_recv(), which is what tries to receive something, they > could use the opposite order. That would make sure that the timeout > was always based on a recent attempt to receive something. Great. > > The actual code in OVS that uses jsonrpc_session, though, tends to use > the opposite order, and there are enough users and this is a subtle > enough issue that it could get flipped back around even if we fixed it > now. So this commit takes a different approach. Instead of fixing > this in the users of jsonrpc_session, we fix it in the users of > reconnect: make them tell when they've tried to receive something (or > disable this particular feature). > > This commit fixes the problem that way. It's kind of hard to reproduce > but I'm pretty sure that I've seen it a number of times in testing. > > Signed-off-by: Ben Pfaff <[email protected]> > --- > lib/jsonrpc.c | 5 ++++- > lib/reconnect.c | 25 +++++++++++++++++++++++-- > lib/reconnect.h | 1 + > tests/test-reconnect.c | 1 + > 4 files changed, 29 insertions(+), 3 deletions(-)
Thanks for the fix! I'd still prefer to s/reconnect_try_receive/reconnect_receive_attempt/ or some other name since the 'try_receive' part makes me think that we're actually asking it to receive something. It might also be good to add a specific unit test to test-reconnect that will check that timeout is not set. This should not be hard, but, I guess, this could be a separate patch. For this patch with or without above changes: Acked-by: Ilya Maximets <[email protected]> It might make sense to backport this to 2.13 as OVN users might hit this issue on a highly loaded cluster. Best regards, Ilya Maximets. _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
