On Fri, Dec 18, 2020 at 10:11:15PM +0100, Ilya Maximets wrote: > On 12/18/20 6:31 AM, Ben Pfaff wrote: > > Open vSwitch has a few different jsonrpc-based protocols that depend on > > jsonrpc_session to make sure that the connection is up and working. > > In turn, jsonrpc_session uses the "reconnect" state machine to send > > probes if nothing is received. This works fine in normal circumstances. > > In unusual circumstances, though, it can happen that the program is > > busy and doesn't even try to receive anything for a long time. Then the > > timer can time out without a good reason; if it had tried to receive > > something, it would have. > > > > There's a solution that the clients of jsonrpc_session could adopt. > > Instead of first calling jsonrpc_session_run(), which is what calls into > > "reconnect" to deal with timing out, and then calling into > > jsonrpc_session_recv(), which is what tries to receive something, they > > could use the opposite order. That would make sure that the timeout > > was always based on a recent attempt to receive something. Great. > > > > The actual code in OVS that uses jsonrpc_session, though, tends to use > > the opposite order, and there are enough users and this is a subtle > > enough issue that it could get flipped back around even if we fixed it > > now. So this commit takes a different approach. Instead of fixing > > this in the users of jsonrpc_session, we fix it in the users of > > reconnect: make them tell when they've tried to receive something (or > > disable this particular feature). > > > > This commit fixes the problem that way. It's kind of hard to reproduce > > but I'm pretty sure that I've seen it a number of times in testing. > > > > Signed-off-by: Ben Pfaff <[email protected]> > > --- > > lib/jsonrpc.c | 5 ++++- > > lib/reconnect.c | 25 +++++++++++++++++++++++-- > > lib/reconnect.h | 1 + > > tests/test-reconnect.c | 1 + > > 4 files changed, 29 insertions(+), 3 deletions(-) > > Thanks for the fix! > > I'd still prefer to s/reconnect_try_receive/reconnect_receive_attempt/ > or some other name since the 'try_receive' part makes me think that > we're actually asking it to receive something.
I am sorry that I forgot about that request from the OVN meeting. I made that change just now. I decided to call it reconnect_receive_attempted(), to indicate that the received attempt has already happened. > It might also be good to add a specific unit test to test-reconnect > that will check that timeout is not set. This should not be hard, but, > I guess, this could be a separate patch. I wasn't sure how to test this. Could you explain your idea? I didn't quite follow. > For this patch with or without above changes: > > Acked-by: Ilya Maximets <[email protected]> . > It might make sense to backport this to 2.13 as OVN users might hit > this issue on a highly loaded cluster. Thanks. I applied this to master and branch-2.13. _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
