On Mon, Dec 21, 2020 at 12:58:14PM +0100, Ilya Maximets wrote: > On 12/19/20 3:11 AM, Ben Pfaff wrote: > > On Fri, Dec 18, 2020 at 10:11:15PM +0100, Ilya Maximets wrote: > >> On 12/18/20 6:31 AM, Ben Pfaff wrote: > >>> Open vSwitch has a few different jsonrpc-based protocols that depend on > >>> jsonrpc_session to make sure that the connection is up and working. > >>> In turn, jsonrpc_session uses the "reconnect" state machine to send > >>> probes if nothing is received. This works fine in normal circumstances. > >>> In unusual circumstances, though, it can happen that the program is > >>> busy and doesn't even try to receive anything for a long time. Then the > >>> timer can time out without a good reason; if it had tried to receive > >>> something, it would have. > >>> > >>> There's a solution that the clients of jsonrpc_session could adopt. > >>> Instead of first calling jsonrpc_session_run(), which is what calls into > >>> "reconnect" to deal with timing out, and then calling into > >>> jsonrpc_session_recv(), which is what tries to receive something, they > >>> could use the opposite order. That would make sure that the timeout > >>> was always based on a recent attempt to receive something. Great. > >>> > >>> The actual code in OVS that uses jsonrpc_session, though, tends to use > >>> the opposite order, and there are enough users and this is a subtle > >>> enough issue that it could get flipped back around even if we fixed it > >>> now. So this commit takes a different approach. Instead of fixing > >>> this in the users of jsonrpc_session, we fix it in the users of > >>> reconnect: make them tell when they've tried to receive something (or > >>> disable this particular feature). > >>> > >>> This commit fixes the problem that way. It's kind of hard to reproduce > >>> but I'm pretty sure that I've seen it a number of times in testing. > >>> > >>> Signed-off-by: Ben Pfaff <[email protected]> > >>> --- > >>> lib/jsonrpc.c | 5 ++++- > >>> lib/reconnect.c | 25 +++++++++++++++++++++++-- > >>> lib/reconnect.h | 1 + > >>> tests/test-reconnect.c | 1 + > >>> 4 files changed, 29 insertions(+), 3 deletions(-) > >> > >> Thanks for the fix! > >> > >> I'd still prefer to s/reconnect_try_receive/reconnect_receive_attempt/ > >> or some other name since the 'try_receive' part makes me think that > >> we're actually asking it to receive something. > > > > I am sorry that I forgot about that request from the OVN meeting. I > > made that change just now. > > > > I decided to call it reconnect_receive_attempted(), to indicate that the > > received attempt has already happened. > > > >> It might also be good to add a specific unit test to test-reconnect > >> that will check that timeout is not set. This should not be hard, but, > >> I guess, this could be a separate patch. > > > > I wasn't sure how to test this. Could you explain your idea? I didn't > > quite follow. > > I mean pure reconnect test in tests/reconnect.at. > Idea is to add new command 'do_receive_attempted' to test-reconnect.c and > check that in pair with 'advance' we will have particular timeouts or > 'no timeout' if there was no receive attempted. Does that make sense?
OK. I sent a patch: https://mail.openvswitch.org/pipermail/ovs-dev/2020-December/378965.html > BTW, do we need the same fix for the python 'reconnect' library? I added a Python implementation in the same patch. > >> for this patch with or without above changes: > >> > >> Acked-by: Ilya Maximets <[email protected]> > > . > >> It might make sense to backport this to 2.13 as OVN users might hit > >> this issue on a highly loaded cluster. > > > > Thanks. I applied this to master and branch-2.13. > > We should, probably, apply this to 2.14 too to avoid feature/fix gap > during upgrade. I forgot about 2.14. Thanks, I did that backport (forward-port?) too now. _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
