On Mon, Jun 18, 2018 at 09:31:43AM -0700, Han Zhou wrote: > On Mon, Jun 18, 2018 at 8:34 AM, Ilya Maximets <[email protected]> > wrote: > > > > On 18.06.2018 18:07, Ben Pfaff wrote: > > > On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote: > > >>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote: > > >>>> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote: > > >>>>> On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff <blp at ovn.org> wrote: > > >>>>>> > > >>>>>> Until now, rconn_get_version() has only reported the OpenFlow > version in > > >>>>>> use when the rconn is actually connected. This makes sense, but > it has a > > >>>>>> harsh consequence. Consider code like this: > > >>>>>> > > >>>>>> if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= > 0) { > > >>>>>> for (int i = 0; i < 2; i++) { > > >>>>>> struct ofpbuf *b = ofputil_encode_echo_request( > > >>>>>> rconn_get_version(rconn)); > > >>>>>> rconn_send(rconn, b, NULL); > > >>>>>> } > > >>>>>> } > > >>>>>> > > >>>>>> Maybe not the smartest code in the world, and probably no one > would write > > >>>>>> this exact code in any case, but it doesn't look too risky or > crazy. > > >>>>>> > > >>>>>> But it is. The second trip through the loop can assert-fail inside > > >>>>>> ofputil_encode_echo_request() because rconn_get_version(rconn) > returns -1 > > >>>>>> instead of a valid OpenFlow version. That happens if the first > call to > > >>>>>> rconn_send() encounters an error while sending the message and > therefore > > >>>>>> destroys the underlying vconn and disconnects so that > rconn_get_version() > > >>>>>> doesn't have a vconn to query for its version. > > >>>>>> > > >>>>>> In a case like this where all the code to send the messages is > close by, > > >>>>> we > > >>>>>> could just check rconn_get_version() in each loop iteration. We > could > > >>>>> even > > >>>>>> go through the tree and convince ourselves that individual bits of > code > > >>>>> are > > >>>>>> safe, or be conservative and check rconn_get_version() >= 0 in the > iffy > > >>>>>> cases. But this seems to me like an ongoing source of risk and a > way to > > >>>>>> get things wrong in corner cases. > > >>>>>> > > >>>>>> This commit takes a different approach. It introduces a new > invariant: if > > >>>>>> an rconn has ever been connected, then it returns a valid OpenFlow > version > > >>>>>> from rconn_get_version(). In addition, if an rconn is currently > > >>>>> connected, > > >>>>>> then the OpenFlow version it returns is the correct one (that may > be > > >>>>>> obvious, but there were corner cases before where it returned -1 > even > > >>>>>> though rconn_is_connected() returned true). > > >>>>>> > > >>>>>> With this commit, the code above would work OK. If the first call > to > > >>>>>> rconn_send() encounters an error sending the message, then > > >>>>>> rconn_get_version() in the second iteration will return the same > value as > > >>>>>> in the first iteration. The message passed to rconn_send() will > end up > > >>>>>> being discarded, but that's much better than either an assertion > failure > > >>>>> or > > >>>>>> having to carefully analyze a lot of our code to deal with one > unusual > > >>>>>> corner case. > > >>>>>> > > >>>>>> Reported-by: Han Zhou <zhouhan at gmail.com> > > >>>>>> Signed-off-by: Ben Pfaff <blp at ovn.org> > > >>>>>> --- > > >>>>>> lib/learning-switch.c | 2 +- > > >>>>>> lib/rconn.c | 41 > ++++++++++++++++------------------------- > > >>>>>> lib/vconn.c | 1 + > > >>>>>> 3 files changed, 18 insertions(+), 26 deletions(-) > > >>>>>> > > >>>>> Acked-by: Han Zhou <hzhou8 at ebay.com> > > >>>> > > >>>> Thanks. I applied this to master. I'll backport it to older > versions > > >>>> if no one notices trouble soon. > > >>> > > >>> I backported as far as branch-2.5. > > >> > > >> Sorry for being late, but I just tried to use branch-2.9 on my test > environment > > >> and found that this patch produces a lot of "connected" logs that are > a bit annoying: > > >> > > >> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected > > >> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected > > >> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected > > >> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected > > >> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected > > >> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected > > >> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected > > >> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected > > >> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected > > >> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected > > >> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected > > >> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected > > >> 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected > > >> 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected > > >> 2018-06-18T12:55:20.611Z|03065|rconn|INFO|br-int<->unix#2887: connected > > >> > > >> On my system it's about 2 log messages on each 2 seconds. It has only > > >> ovsdb-server, ovs-vswitchd and ovsdb-client monitor running. > > >> > > >> Could you please check? > > > > > > Are you sure that it's this patch that makes a difference? It doesn't > > > change anything in the logging. > > > > I don't know much about this part of OVS. That is wat I have with > reverted patch > > and debug enabled: > > > > 2018-06-18T15:31:09.162Z|00423|rconn|DBG|br-int<->unix#150: entering > ACTIVE > > 2018-06-18T15:31:09.163Z|00424|rconn|DBG|br-int<->unix#150: connection > closed by peer > > 2018-06-18T15:31:09.163Z|00425|rconn|DBG|br-int<->unix#150: entering > DISCONNECTED > > 2018-06-18T15:31:11.162Z|00426|rconn|DBG|br-int<->unix#151: entering > ACTIVE > > 2018-06-18T15:31:11.162Z|00427|rconn|DBG|br-int<->unix#151: connection > closed by peer > > 2018-06-18T15:31:11.162Z|00428|rconn|DBG|br-int<->unix#151: entering > DISCONNECTED > > 2018-06-18T15:31:11.162Z|00429|rconn|DBG|br-int<->unix#152: entering > ACTIVE > > 2018-06-18T15:31:11.163Z|00430|rconn|DBG|br-int<->unix#152: connection > closed by peer > > 2018-06-18T15:31:11.163Z|00431|rconn|DBG|br-int<->unix#152: entering > DISCONNECTED > > > > Without revert (just branch-2.9 with patch applied): > > > > 2018-06-18T15:26:32.366Z|00658|rconn|DBG|br-int<->unix#198: entering > ACTIVE > > 2018-06-18T15:26:32.366Z|00659|rconn|DBG|br-int<->unix#198: connection > closed by peer > > 2018-06-18T15:26:32.366Z|00660|rconn|DBG|br-int<->unix#198: entering > DISCONNECTED > > 2018-06-18T15:26:34.365Z|00661|rconn|DBG|br-int<->unix#199: entering > CONNECTING > > 2018-06-18T15:26:34.365Z|00662|rconn|INFO|br-int<->unix#199: connected > > 2018-06-18T15:26:34.365Z|00663|rconn|DBG|br-int<->unix#199: entering > ACTIVE > > 2018-06-18T15:26:34.365Z|00664|rconn|DBG|br-int<->unix#199: connection > closed by peer > > 2018-06-18T15:26:34.365Z|00665|rconn|DBG|br-int<->unix#199: entering > DISCONNECTED > > 2018-06-18T15:26:34.365Z|00666|rconn|DBG|br-int<->unix#200: entering > CONNECTING > > 2018-06-18T15:26:34.365Z|00667|rconn|INFO|br-int<->unix#200: connected > > 2018-06-18T15:26:34.365Z|00668|rconn|DBG|br-int<->unix#200: entering > ACTIVE > > 2018-06-18T15:26:34.366Z|00669|rconn|DBG|br-int<->unix#200: connection > closed by peer > > 2018-06-18T15:26:34.366Z|00670|rconn|DBG|br-int<->unix#200: entering > DISCONNECTED > > 2018-06-18T15:26:36.365Z|00671|rconn|DBG|br-int<->unix#201: entering > CONNECTING > > 2018-06-18T15:26:36.365Z|00672|rconn|INFO|br-int<->unix#201: connected > > 2018-06-18T15:26:36.365Z|00673|rconn|DBG|br-int<->unix#201: entering > ACTIVE > > 2018-06-18T15:26:36.365Z|00674|rconn|DBG|br-int<->unix#201: connection > closed by peer > > 2018-06-18T15:26:36.365Z|00675|rconn|DBG|br-int<->unix#201: entering > DISCONNECTED > > 2018-06-18T15:26:36.365Z|00676|rconn|DBG|br-int<->unix#202: entering > CONNECTING > > 2018-06-18T15:26:36.366Z|00677|rconn|INFO|br-int<->unix#202: connected > > > > Looks like that patch just changed a workflow a bit introducing > "CONNECTING" > > state in case where was just direct transition to "ACTIVE". > > > Hi Ilya, > > From the debug log it looks like the connection is flapping, even without > the patch. The patch just revealed the problem by an info level log.
Probably these are just connections from "ovs-ofctl" or similar. We try to suppress logging for them because it isn't too useful. _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
