If you can come up with simple reproduction instructions that work for
me, I'm happy to track this down.  It's probably something very simple.

On Tue, Jan 17, 2017 at 08:50:20AM -0800, Vidyasagara Guntaka wrote:
> This issue happened on our in-use systems and we were trying to find a way
> to move forward avoiding this issue so that we do not have to upgrade OVS
> on thousands of our hypervisors causing down time. Our debugging did help
> us avoid the issue for now by installing an explicit rule to to drop
> packets when there is no match and this issue is not seen over many hours
> of test runs.
> 
> We will definitely run this test with latest version.  But, will need more
> time since we are busy with our release related activities.
> 
> Regards,
> Sagar.
> 
> On Tue, Jan 17, 2017 at 8:42 AM, Ben Pfaff <b...@ovn.org> wrote:
> 
> > It would be more helpful to have a simple reproduction case.
> >
> > Why haven't you tried a newer version from branch-2.5?
> >
> > On Tue, Jan 17, 2017 at 07:59:05AM -0800, Vidyasagara Guntaka wrote:
> > > Hi Ben,
> > >
> > > Here i is more debug information related to this incident (still using
> > version 2.5.0):
> > >
> > > Summary :
> > >
> > > We think that there is some race condition involved in processing OF
> > Controller connections and Packet miss processing in ovs-vswitchd.
> > >
> > > Reasoning :
> > >
> > > Please consider the following GDB Debug Session:
> > >
> > > Breakpoint 1, ofconn_set_protocol (ofconn=0x16d5810,
> > protocol=OFPUTIL_P_OF10_STD) at ofproto/connmgr.c:999
> > > (gdb) f 2
> > > #2  0x000000000045f586 in connmgr_wants_packet_in_on_miss
> > (mgr=0x16a6de0) at ofproto/connmgr.c:1613
> > > 1613          enum ofputil_protocol protocol =
> > ofconn_get_protocol(ofconn);
> > > (gdb) p *ofconn
> > > $2 = {node = {prev = 0x16a6e18, next = 0x16a6e18}, hmap_node = {hash =
> > 0, next = 0x0}, connmgr = 0x16a6de0, rconn = 0x16edb50, type =
> > OFCONN_SERVICE, band = OFPROTO_IN_BAND, enable_async_msgs = true,
> > >   role = OFPCR12_ROLE_EQUAL, protocol = OFPUTIL_P_OF10_STD_TID,
> > packet_in_format = NXPIF_OPENFLOW10, packet_in_counter = 0x167a170,
> > schedulers = {0x0, 0x0}, pktbuf = 0x0, miss_send_len = 0,
> > >   controller_id = 0, reply_counter = 0x1673190, master_async_config =
> > {3, 7, 7, 0, 0, 0}, slave_async_config = {0, 7, 0, 0, 0, 0}, n_add = 0,
> > n_delete = 0, n_modify = 0,
> > >   first_op = -9223372036854775808, last_op = -9223372036854775808,
> > next_op_report = 9223372036854775807, op_backoff = -9223372036854775808,
> > monitors = {buckets = 0x16d58f0, one = 0x0, mask = 0,
> > >     n = 0}, monitor_paused = 0, monitor_counter = 0x16759f0, updates =
> > {prev = 0x16d5918, next = 0x16d5918}, sent_abbrev_update = false, bundles =
> > {buckets = 0x16d5938, one = 0x0, mask = 0, n = 0}}
> > > (gdb) bt
> > > #0  ofconn_set_protocol (ofconn=0x16d5810, protocol=OFPUTIL_P_OF10_STD)
> > at ofproto/connmgr.c:999
> > > #1  0x000000000045e194 in ofconn_get_protocol (ofconn=0x16d5810) at
> > ofproto/connmgr.c:982
> > > #2  0x000000000045f586 in connmgr_wants_packet_in_on_miss
> > (mgr=0x16a6de0) at ofproto/connmgr.c:1613
> > > #3  0x0000000000435261 in rule_dpif_lookup_from_table
> > (ofproto=0x16a6880, version=323, flow=0x7f2ace7f86e8, wc=0x7f2ace7f84b0,
> > stats=0x0, table_id=0x7f2ace7f7eda "", in_port=28, may_packet_in=true,
> > >     honor_table_miss=true) at ofproto/ofproto-dpif.c:3973
> > > #4  0x0000000000457ecf in xlate_actions (xin=0x7f2ace7f86e0,
> > xout=0x7f2ace7f8010) at ofproto/ofproto-dpif-xlate.c:5188
> > > #5  0x00000000004481b1 in revalidate_ukey (udpif=0x16a7300,
> > ukey=0x7f2ab80060e0, stats=0x7f2ace7f94e0, odp_actions=0x7f2ace7f8a40,
> > reval_seq=585728, recircs=0x7f2ace7f8a30)
> > >     at ofproto/ofproto-dpif-upcall.c:1866
> > > #6  0x0000000000448fb2 in revalidate (revalidator=0x1691990) at
> > ofproto/ofproto-dpif-upcall.c:2186
> > > #7  0x000000000044593e in udpif_revalidator (arg=0x1691990) at
> > ofproto/ofproto-dpif-upcall.c:862
> > > #8  0x000000000050b93d in ovsthread_wrapper (aux_=0x16f4560) at
> > lib/ovs-thread.c:340
> > > #9  0x00007f2ad75c2184 in start_thread () from /lib/x86_64-linux-gnu/
> > libpthread.so.0
> > > #10 0x00007f2ad6de137d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > (gdb) f 1
> > > #1  0x000000000045e194 in ofconn_get_protocol (ofconn=0x16d5810) at
> > ofproto/connmgr.c:982
> > > 982               ofconn_set_protocol(CONST_CAST(struct ofconn *,
> > ofconn),
> > > (gdb) l
> > > 977   {
> > > 978       if (ofconn->protocol == OFPUTIL_P_NONE &&
> > > 979           rconn_is_connected(ofconn->rconn)) {
> > > 980           int version = rconn_get_version(ofconn->rconn);
> > > 981           if (version > 0) {
> > > 982               ofconn_set_protocol(CONST_CAST(struct ofconn *,
> > ofconn),
> > > 983                                   ofputil_protocol_from_ofp_
> > version(version));
> > > 984           }
> > > 985       }
> > > 986
> > > (gdb) p *ofconn
> > > $3 = {node = {prev = 0x16a6e18, next = 0x16a6e18}, hmap_node = {hash =
> > 0, next = 0x0}, connmgr = 0x16a6de0, rconn = 0x16edb50, type =
> > OFCONN_SERVICE, band = OFPROTO_IN_BAND, enable_async_msgs = true,
> > >   role = OFPCR12_ROLE_EQUAL, protocol = OFPUTIL_P_OF10_STD_TID,
> > packet_in_format = NXPIF_OPENFLOW10, packet_in_counter = 0x167a170,
> > schedulers = {0x0, 0x0}, pktbuf = 0x0, miss_send_len = 0,
> > >   controller_id = 0, reply_counter = 0x1673190, master_async_config =
> > {3, 7, 7, 0, 0, 0}, slave_async_config = {0, 7, 0, 0, 0, 0}, n_add = 0,
> > n_delete = 0, n_modify = 0,
> > >   first_op = -9223372036854775808, last_op = -9223372036854775808,
> > next_op_report = 9223372036854775807, op_backoff = -9223372036854775808,
> > monitors = {buckets = 0x16d58f0, one = 0x0, mask = 0,
> > >     n = 0}, monitor_paused = 0, monitor_counter = 0x16759f0, updates =
> > {prev = 0x16d5918, next = 0x16d5918}, sent_abbrev_update = false, bundles =
> > {buckets = 0x16d5938, one = 0x0, mask = 0, n = 0}}
> > > (gdb) p ofconn
> > > $4 = (const struct ofconn *) 0x16d5810
> > > (gdb) c
> > > Continuing.
> > > [Thread 0x7f2ad79f5980 (LWP 20165) exited]
> > >
> > > From the above GDB Session, ovs-vswitchd is in the middle of processing
> > a packet miss that was read from the data path.
> > > The break point was set inside ofconn_set_protocol inside so that we hit
> > if protocol was already set to other than OFPUTIL_P_NONE and now is being
> > set to OFPUTIL_P_OF10_STD.
> > > Yes, we modified the code in ofconn_set_protocol with this if condition
> > so that we do not hit breakpoint all the time.  From our earlier
> > experiments, we were led towards this assumption and hence added the if
> > condition to see if we were hitting this condition in reality.
> > > Please note the ofconn structure printed towards the end of the session.
> > > ofconn has protocol already set to OFPUTIL_P_OF10_STD_TID
> > > However, the incoming argument to ofconn_set_protocol in frame 0 is
> > OFPUTIL_P_OF10_STD.
> > > Now in Frame 1, in function ofconn_get_protocol, please note the if
> > condition on line 978.
> > > This if check will not be satisfied unless ofconn->protocol is
> > OFPUTIL_P_NONE.
> > > Combining point 5 and point 8 above, it leads us to believe that when
> > the packet miss was being processed, at some point in time initially, the
> > protocol was seen as  OFPUTIL_P_NONE and at some point in time later it was
> > changed to OFPUTIL_P_OF10_STD_TID.  But, later this stack continued its
> > execution and is in the process of setting it to OFPUTIL_P_OF10_STD.
> > >
> > > Taking a cursory look at the code, the code involved in these stack
> > frames seems to acquire the ofproto_mutex appropriately.  This leads us to
> > think of the following two possibilities (there could be more):
> > >
> > > Potential Race Causes:
> > >
> > > The code path in packet miss processing was in race with another thread
> > that was processing connection establishment and protocol negotiation
> > > The connection involved is from the utility ovs-ofctl and since it is a
> > short lived controller (with OpenFlow version 1.0) in our use case as we
> > are issuing one command at a time (as posted by Sam at the beginning of the
> > thread)
> > > Is it possible that the connection was taken down while the packet miss
> > is being processed and same ofconn pointer was used for a new incoming
> > connection (since we are running ovs-ofctl back to back)?
> > >
> > > Both of the above scenarios are purely speculative at this point as we
> > have not yet spent much time debugging this further.
> > >
> > > But, from the GDB stack it does definitely look like there is a race
> > condition involved.
> > >
> > > Please let us know what you think of this analysis.
> > >
> > > Thank you very much, Ben, for taking time in the middle of your busy
> > schedule to sift through this.
> > >
> > > Regards,
> > > Sagar.
> > >
> > > > On Jan 13, 2017, at 12:12 PM, Samuel Jean <sj...@digitalocean.com>
> > wrote:
> > > >
> > > > Thanks for the quick follow up Ben,
> > > >
> > > > So we'll indeed try against latest versions to rule out the
> > possibility of a bug that has been fixed already although I could not find
> > any commit with such mention.  We'll report back here.
> > > >
> > > > At this moment, we can reproduce over and over within minutes.  We've
> > nailed down -- or at least we think -- to something related to race
> > condition or memory overwrite between the time connection negotiation
> > happens and the flow mod packet arrives.  We've been able to run the stress
> > test for hours when we used --flow-format=NXM+table_id as an argument to
> > ovs-ofctl but eventually, we've hit the same error.
> > > >
> > > > Sagar is spending more time debugging this issue so maybe he'll be
> > able to provide more information.
> > > >
> > > > On Fri, Jan 13, 2017 at 1:45 PM, Ben Pfaff <b...@ovn.org <mailto:
> > b...@ovn.org>> wrote:
> > > > On Thu, Jan 12, 2017 at 03:54:42PM -0500, Samuel Jean via dev wrote:
> > > > > It seems that shelling out to ovs-ofctl very quickly can lead to bug
> > where
> > > > > it reports an OFPT_ERROR.
> > > > >
> > > > > We were able to constantly reproduce within minutes of running the
> > above
> > > > > flow modifications on Unbutu.
> > > > >
> > > > > Any help, hints or guidance would be appreciated.  I'd be happy to
> > pursue
> > > > > some debugging that would be required to nail down the issue here.
> > > >
> > > > Thanks for the bug report and especially for the detailed reproduction
> > > > advice.
> > > >
> > > > I've now tried running this reproduction case against Open vSwitch from
> > > > latest master and against the latest versions from the 2.6.x and 2.5.x
> > > > branches, and I can't see any failures even after letting the script
> > run
> > > > for a few minutes.
> > > >
> > > > Maybe you should try 2.5.1 or the latest from branch-2.5 and see if it
> > > > fixes the problem?  And if not, then we'll have to figure out what's
> > > > different between your setup and mine.
> > > >
> > >
> >
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to