> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
>> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
>> > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff <blp at ovn.org> wrote:
>> > >
>> > > Until now, rconn_get_version() has only reported the OpenFlow version in
>> > > use when the rconn is actually connected. This makes sense, but it has a
>> > > harsh consequence. Consider code like this:
>> > >
>> > > if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) {
>> > > for (int i = 0; i < 2; i++) {
>> > > struct ofpbuf *b = ofputil_encode_echo_request(
>> > > rconn_get_version(rconn));
>> > > rconn_send(rconn, b, NULL);
>> > > }
>> > > }
>> > >
>> > > Maybe not the smartest code in the world, and probably no one would write
>> > > this exact code in any case, but it doesn't look too risky or crazy.
>> > >
>> > > But it is. The second trip through the loop can assert-fail inside
>> > > ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1
>> > > instead of a valid OpenFlow version. That happens if the first call to
>> > > rconn_send() encounters an error while sending the message and therefore
>> > > destroys the underlying vconn and disconnects so that rconn_get_version()
>> > > doesn't have a vconn to query for its version.
>> > >
>> > > In a case like this where all the code to send the messages is close by,
>> > we
>> > > could just check rconn_get_version() in each loop iteration. We could
>> > even
>> > > go through the tree and convince ourselves that individual bits of code
>> > are
>> > > safe, or be conservative and check rconn_get_version() >= 0 in the iffy
>> > > cases. But this seems to me like an ongoing source of risk and a way to
>> > > get things wrong in corner cases.
>> > >
>> > > This commit takes a different approach. It introduces a new invariant:
>> > > if
>> > > an rconn has ever been connected, then it returns a valid OpenFlow
>> > > version
>> > > from rconn_get_version(). In addition, if an rconn is currently
>> > connected,
>> > > then the OpenFlow version it returns is the correct one (that may be
>> > > obvious, but there were corner cases before where it returned -1 even
>> > > though rconn_is_connected() returned true).
>> > >
>> > > With this commit, the code above would work OK. If the first call to
>> > > rconn_send() encounters an error sending the message, then
>> > > rconn_get_version() in the second iteration will return the same value as
>> > > in the first iteration. The message passed to rconn_send() will end up
>> > > being discarded, but that's much better than either an assertion failure
>> > or
>> > > having to carefully analyze a lot of our code to deal with one unusual
>> > > corner case.
>> > >
>> > > Reported-by: Han Zhou <zhouhan at gmail.com>
>> > > Signed-off-by: Ben Pfaff <blp at ovn.org>
>> > > ---
>> > > lib/learning-switch.c | 2 +-
>> > > lib/rconn.c | 41 ++++++++++++++++-------------------------
>> > > lib/vconn.c | 1 +
>> > > 3 files changed, 18 insertions(+), 26 deletions(-)
>> > >
>> > Acked-by: Han Zhou <hzhou8 at ebay.com>
>>
>> Thanks. I applied this to master. I'll backport it to older versions
>> if no one notices trouble soon.
>
> I backported as far as branch-2.5.
Sorry for being late, but I just tried to use branch-2.9 on my test environment
and found that this patch produces a lot of "connected" logs that are a bit
annoying:
2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected
2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected
2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected
2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected
2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected
2018-06-18T12:55:20.611Z|03065|rconn|INFO|br-int<->unix#2887: connected
On my system it's about 2 log messages on each 2 seconds. It has only
ovsdb-server, ovs-vswitchd and ovsdb-client monitor running.
Could you please check?
Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev