If you can come up with simple reproduction instructions that work for me, I'm happy to track this down. It's probably something very simple.
On Tue, Jan 17, 2017 at 08:50:20AM -0800, Vidyasagara Guntaka wrote:
> This issue happened on our in-use systems, and we were trying to find a
> way to move forward that avoids it, so that we do not have to upgrade
> OVS on thousands of our hypervisors and cause downtime. Our debugging
> did help us avoid the issue for now: after installing an explicit rule
> that drops packets when there is no match, the issue has not been seen
> over many hours of test runs.
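>
> For reference, the workaround is just a lowest-priority catch-all drop
> flow along these lines (the bridge name "br0" here is a placeholder for
> our actual bridge):
>
>     ovs-ofctl add-flow br0 "priority=0,actions=drop"
>
> With that rule installed there is never a table miss, so as far as we
> can tell the miss-handling path involved in this problem is simply no
> longer exercised.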
>
> We will definitely run this test with the latest version, but we will
> need more time since we are busy with our release-related activities.
>
> Regards,
> Sagar.
>
> On Tue, Jan 17, 2017 at 8:42 AM, Ben Pfaff <b...@ovn.org> wrote:
>
> > It would be more helpful to have a simple reproduction case.
> >
> > Why haven't you tried a newer version from branch-2.5?
> >
> > On Tue, Jan 17, 2017 at 07:59:05AM -0800, Vidyasagara Guntaka wrote:
> > > Hi Ben,
> > >
> > > Here is more debug information related to this incident (still using
> > > version 2.5.0):
> > >
> > > Summary:
> > >
> > > We think that there is a race condition between OpenFlow controller
> > > connection processing and packet-miss processing in ovs-vswitchd.
> > >
> > > Reasoning:
> > >
> > > Please consider the following GDB debug session:
> > >
> > > Breakpoint 1, ofconn_set_protocol (ofconn=0x16d5810,
> > >     protocol=OFPUTIL_P_OF10_STD) at ofproto/connmgr.c:999
> > > (gdb) f 2
> > > #2  0x000000000045f586 in connmgr_wants_packet_in_on_miss (mgr=0x16a6de0)
> > >     at ofproto/connmgr.c:1613
> > > 1613        enum ofputil_protocol protocol = ofconn_get_protocol(ofconn);
> > > (gdb) p *ofconn
> > > $2 = {node = {prev = 0x16a6e18, next = 0x16a6e18}, hmap_node = {hash = 0,
> > >   next = 0x0}, connmgr = 0x16a6de0, rconn = 0x16edb50, type = OFCONN_SERVICE,
> > >   band = OFPROTO_IN_BAND, enable_async_msgs = true, role = OFPCR12_ROLE_EQUAL,
> > >   protocol = OFPUTIL_P_OF10_STD_TID, packet_in_format = NXPIF_OPENFLOW10,
> > >   packet_in_counter = 0x167a170, schedulers = {0x0, 0x0}, pktbuf = 0x0,
> > >   miss_send_len = 0, controller_id = 0, reply_counter = 0x1673190,
> > >   master_async_config = {3, 7, 7, 0, 0, 0}, slave_async_config = {0, 7, 0,
> > >   0, 0, 0}, n_add = 0, n_delete = 0, n_modify = 0,
> > >   first_op = -9223372036854775808, last_op = -9223372036854775808,
> > >   next_op_report = 9223372036854775807, op_backoff = -9223372036854775808,
> > >   monitors = {buckets = 0x16d58f0, one = 0x0, mask = 0, n = 0},
> > >   monitor_paused = 0, monitor_counter = 0x16759f0, updates = {prev =
> > >   0x16d5918, next = 0x16d5918}, sent_abbrev_update = false, bundles =
> > >   {buckets = 0x16d5938, one = 0x0, mask = 0, n = 0}}
> > > (gdb) bt
> > > #0  ofconn_set_protocol (ofconn=0x16d5810, protocol=OFPUTIL_P_OF10_STD)
> > >     at ofproto/connmgr.c:999
> > > #1  0x000000000045e194 in ofconn_get_protocol (ofconn=0x16d5810)
> > >     at ofproto/connmgr.c:982
> > > #2  0x000000000045f586 in connmgr_wants_packet_in_on_miss (mgr=0x16a6de0)
> > >     at ofproto/connmgr.c:1613
> > > #3  0x0000000000435261 in rule_dpif_lookup_from_table (ofproto=0x16a6880,
> > >     version=323, flow=0x7f2ace7f86e8, wc=0x7f2ace7f84b0, stats=0x0,
> > >     table_id=0x7f2ace7f7eda "", in_port=28, may_packet_in=true,
> > >     honor_table_miss=true) at ofproto/ofproto-dpif.c:3973
> > > #4  0x0000000000457ecf in xlate_actions (xin=0x7f2ace7f86e0,
> > >     xout=0x7f2ace7f8010) at ofproto/ofproto-dpif-xlate.c:5188
> > > #5  0x00000000004481b1 in revalidate_ukey (udpif=0x16a7300,
> > >     ukey=0x7f2ab80060e0, stats=0x7f2ace7f94e0, odp_actions=0x7f2ace7f8a40,
> > >     reval_seq=585728, recircs=0x7f2ace7f8a30)
> > >     at ofproto/ofproto-dpif-upcall.c:1866
> > > #6  0x0000000000448fb2 in revalidate (revalidator=0x1691990)
> > >     at ofproto/ofproto-dpif-upcall.c:2186
> > > #7  0x000000000044593e in udpif_revalidator (arg=0x1691990)
> > >     at ofproto/ofproto-dpif-upcall.c:862
> > > #8  0x000000000050b93d in ovsthread_wrapper (aux_=0x16f4560)
> > >     at lib/ovs-thread.c:340
> > > #9  0x00007f2ad75c2184 in start_thread ()
> > >     from /lib/x86_64-linux-gnu/libpthread.so.0
> > > #10 0x00007f2ad6de137d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > (gdb) f 1
> > > #1  0x000000000045e194 in ofconn_get_protocol (ofconn=0x16d5810)
> > >     at ofproto/connmgr.c:982
> > > 982                 ofconn_set_protocol(CONST_CAST(struct ofconn *, ofconn),
> > > (gdb) l
> > > 977     {
> > > 978         if (ofconn->protocol == OFPUTIL_P_NONE &&
> > > 979             rconn_is_connected(ofconn->rconn)) {
> > > 980             int version = rconn_get_version(ofconn->rconn);
> > > 981             if (version > 0) {
> > > 982                 ofconn_set_protocol(CONST_CAST(struct ofconn *, ofconn),
> > > 983                                     ofputil_protocol_from_ofp_version(version));
> > > 984             }
> > > 985         }
> > > 986
> > > (gdb) p *ofconn
> > > $3 = {node = {prev = 0x16a6e18, next = 0x16a6e18}, hmap_node = {hash = 0,
> > >   next = 0x0}, connmgr = 0x16a6de0, rconn = 0x16edb50, type = OFCONN_SERVICE,
> > >   band = OFPROTO_IN_BAND, enable_async_msgs = true, role = OFPCR12_ROLE_EQUAL,
> > >   protocol = OFPUTIL_P_OF10_STD_TID, packet_in_format = NXPIF_OPENFLOW10,
> > >   packet_in_counter = 0x167a170, schedulers = {0x0, 0x0}, pktbuf = 0x0,
> > >   miss_send_len = 0, controller_id = 0, reply_counter = 0x1673190,
> > >   master_async_config = {3, 7, 7, 0, 0, 0}, slave_async_config = {0, 7, 0,
> > >   0, 0, 0}, n_add = 0, n_delete = 0, n_modify = 0,
> > >   first_op = -9223372036854775808, last_op = -9223372036854775808,
> > >   next_op_report = 9223372036854775807, op_backoff = -9223372036854775808,
> > >   monitors = {buckets = 0x16d58f0, one = 0x0, mask = 0, n = 0},
> > >   monitor_paused = 0, monitor_counter = 0x16759f0, updates = {prev =
> > >   0x16d5918, next = 0x16d5918}, sent_abbrev_update = false, bundles =
> > >   {buckets = 0x16d5938, one = 0x0, mask = 0, n = 0}}
> > > (gdb) p ofconn
> > > $4 = (const struct ofconn *) 0x16d5810
> > > (gdb) c
> > > Continuing.
> > > [Thread 0x7f2ad79f5980 (LWP 20165) exited]
> > >
> > > 1. From the above GDB session, ovs-vswitchd is in the middle of
> > >    processing a packet miss that was read from the datapath.
> > > 2. The breakpoint was set inside ofconn_set_protocol so that we hit it
> > >    only if the protocol was already set to something other than
> > >    OFPUTIL_P_NONE and is now being set to OFPUTIL_P_OF10_STD.
> > > 3. Yes, we modified the code in ofconn_set_protocol with this if
> > >    condition so that we do not hit the breakpoint all the time: our
> > >    earlier experiments led us toward this assumption, so we added the
> > >    condition to see whether we were really hitting this case.
> > > 4. Please note the ofconn structure printed toward the end of the
> > >    session.
> > > 5. ofconn already has protocol set to OFPUTIL_P_OF10_STD_TID.
> > > 6. However, the incoming argument to ofconn_set_protocol in frame 0 is
> > >    OFPUTIL_P_OF10_STD.
> > > 7. Now in frame 1, in function ofconn_get_protocol, please note the if
> > >    condition on line 978.
> > > 8. That if check is not satisfied unless ofconn->protocol is
> > >    OFPUTIL_P_NONE.
> > >
> > > Combining points 5 and 8 above leads us to believe that when the
> > > packet miss was being processed, the protocol was initially seen as
> > > OFPUTIL_P_NONE and was later changed to OFPUTIL_P_OF10_STD_TID; but
> > > this stack continued its execution and is now in the process of
> > > setting it back to OFPUTIL_P_OF10_STD.
> > >
> > > Taking a cursory look at the code, the code involved in these stack
> > > frames seems to acquire ofproto_mutex appropriately. This leads us to
> > > think of the following two possibilities (there could be more).
> > >
> > > Potential race causes:
> > >
> > > 1. The packet-miss processing path raced with another thread that was
> > >    processing connection establishment and protocol negotiation. The
> > >    connection involved is from the ovs-ofctl utility, which in our use
> > >    case is a short-lived controller (speaking OpenFlow 1.0), since we
> > >    issue one command at a time (as posted by Sam at the beginning of
> > >    the thread).
> > > 2. Is it possible that the connection was taken down while the packet
> > >    miss was being processed, and the same ofconn pointer was reused
> > >    for a new incoming connection (since we are running ovs-ofctl back
> > >    to back)?
> > >
> > > Both of the above scenarios are purely speculative at this point, as
> > > we have not yet spent much time debugging this further. But from the
> > > GDB stack it does definitely look like there is a race condition
> > > involved.
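> > >
> > > To make the first scenario concrete, here is a minimal stand-alone
> > > sketch of the interleaving we suspect. This is not OVS code; the
> > > names are simplified stand-ins for the real structures:
> > >
> > >     /* Check-then-act race mirroring ofconn_get_protocol()
> > >      * (ofproto/connmgr.c lines 977-985 above).
> > >      * Build with: gcc -pthread sketch.c -o sketch */
> > >     #include <pthread.h>
> > >     #include <stdio.h>
> > >
> > >     enum protocol { P_NONE, P_OF10_STD, P_OF10_STD_TID };
> > >
> > >     /* Stand-in for ofconn->protocol: read and written by both
> > >      * threads without synchronization. */
> > >     static enum protocol protocol = P_NONE;
> > >
> > >     /* Revalidator-like thread: the lazy initialization in
> > >      * ofconn_get_protocol(). */
> > >     static void *
> > >     revalidator_thread(void *arg)
> > >     {
> > >         (void) arg;
> > >         if (protocol == P_NONE) {      /* (1) observes P_NONE. */
> > >             /* ...suppose the thread is preempted here... */
> > >             protocol = P_OF10_STD;     /* (3) clobbers the newer
> > >                                         * value with plain OF1.0. */
> > >         }
> > >         return NULL;
> > >     }
> > >
> > >     /* Connection-handling thread: negotiation with the short-lived
> > >      * ovs-ofctl controller lands in between (1) and (3). */
> > >     static void *
> > >     negotiation_thread(void *arg)
> > >     {
> > >         (void) arg;
> > >         protocol = P_OF10_STD_TID;     /* (2) upgrades to the TID
> > >                                         * variant. */
> > >         return NULL;
> > >     }
> > >
> > >     int
> > >     main(void)
> > >     {
> > >         pthread_t t1, t2;
> > >         pthread_create(&t1, NULL, revalidator_thread, NULL);
> > >         pthread_create(&t2, NULL, negotiation_thread, NULL);
> > >         pthread_join(t1, NULL);
> > >         pthread_join(t2, NULL);
> > >         /* Under the (1)-(2)-(3) interleaving, the final value is
> > >          * P_OF10_STD even though negotiation already stored
> > >          * P_OF10_STD_TID: exactly the transition our breakpoint
> > >          * caught. */
> > >         printf("final protocol = %d\n", (int) protocol);
> > >         return 0;
> > >     }
> > >
> > > If this is what is happening, the store in (3) is a classic lost
> > > update: the check-then-set in ofconn_get_protocol would need to
> > > happen under a lock (or use an atomic compare-and-swap) to be safe
> > > against a concurrent negotiation.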
> > >
> > > Please let us know what you think of this analysis.
> > >
> > > Thank you very much, Ben, for taking time in the middle of your busy
> > > schedule to sift through this.
> > >
> > > Regards,
> > > Sagar.
> > >
> > > > On Jan 13, 2017, at 12:12 PM, Samuel Jean <sj...@digitalocean.com> wrote:
> > > >
> > > > Thanks for the quick follow-up, Ben.
> > > >
> > > > We'll indeed try against the latest versions to rule out the
> > > > possibility of a bug that has already been fixed, although I could
> > > > not find any commit with such a mention. We'll report back here.
> > > >
> > > > At this moment, we can reproduce over and over within minutes.
> > > > We've nailed it down -- or at least we think so -- to something
> > > > related to a race condition or memory overwrite between the time
> > > > connection negotiation happens and the flow-mod packet arrives.
> > > > We were able to run the stress test for hours when we used
> > > > --flow-format=NXM+table_id as an argument to ovs-ofctl, but
> > > > eventually we hit the same error.
> > > >
> > > > Sagar is spending more time debugging this issue, so maybe he'll
> > > > be able to provide more information.
> > > >
> > > > On Fri, Jan 13, 2017 at 1:45 PM, Ben Pfaff <b...@ovn.org> wrote:
> > > > On Thu, Jan 12, 2017 at 03:54:42PM -0500, Samuel Jean via dev wrote:
> > > > > It seems that shelling out to ovs-ofctl very quickly can lead to
> > > > > a bug where it reports an OFPT_ERROR.
> > > > >
> > > > > We were able to reproduce consistently within minutes of running
> > > > > the above flow modifications on Ubuntu.
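> > > > >
> > > > > (Schematically, the stress test is just back-to-back ovs-ofctl
> > > > > flow mods in a tight loop; the exact script is quoted earlier in
> > > > > the thread, but it has this shape, with the bridge name and the
> > > > > flow used here being placeholders:
> > > > >
> > > > >     while true; do
> > > > >         ovs-ofctl add-flow br0 "priority=100,tcp,actions=drop"
> > > > >         ovs-ofctl del-flows br0 "tcp"
> > > > >     done
> > > > >
> > > > > Note that each ovs-ofctl invocation opens a fresh, short-lived
> > > > > OpenFlow connection to the switch.)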
> > > > >
> > > > > Any help, hints, or guidance would be appreciated. I'd be happy
> > > > > to pursue whatever debugging is required to nail down the issue
> > > > > here.
> > > >
> > > > Thanks for the bug report and especially for the detailed
> > > > reproduction advice.
> > > >
> > > > I've now tried running this reproduction case against Open vSwitch
> > > > from latest master and against the latest versions from the 2.6.x
> > > > and 2.5.x branches, and I can't see any failures even after letting
> > > > the script run for a few minutes.
> > > >
> > > > Maybe you should try 2.5.1 or the latest from branch-2.5 and see if
> > > > it fixes the problem? And if not, then we'll have to figure out
> > > > what's different between your setup and mine.