It would be more helpful to have a simple reproduction case.
Why haven't you tried a newer version from branch-2.5?
On Tue, Jan 17, 2017 at 07:59:05AM -0800, Vidyasagara Guntaka wrote:
> Hi Ben,
>
> Here is more debug information related to this incident (still using
> version 2.5.0):
>
> Summary :
>
> We think there is a race condition between OpenFlow controller connection
> handling and packet-miss processing in ovs-vswitchd.
>
> Reasoning :
>
> Please consider the following GDB Debug Session:
>
> Breakpoint 1, ofconn_set_protocol (ofconn=0x16d5810,
> protocol=OFPUTIL_P_OF10_STD) at ofproto/connmgr.c:999
> (gdb) f 2
> #2 0x000000000045f586 in connmgr_wants_packet_in_on_miss (mgr=0x16a6de0) at
> ofproto/connmgr.c:1613
> 1613 enum ofputil_protocol protocol = ofconn_get_protocol(ofconn);
> (gdb) p *ofconn
> $2 = {node = {prev = 0x16a6e18, next = 0x16a6e18}, hmap_node = {hash = 0,
> next = 0x0}, connmgr = 0x16a6de0, rconn = 0x16edb50, type = OFCONN_SERVICE,
> band = OFPROTO_IN_BAND, enable_async_msgs = true,
> role = OFPCR12_ROLE_EQUAL, protocol = OFPUTIL_P_OF10_STD_TID,
> packet_in_format = NXPIF_OPENFLOW10, packet_in_counter = 0x167a170,
> schedulers = {0x0, 0x0}, pktbuf = 0x0, miss_send_len = 0,
> controller_id = 0, reply_counter = 0x1673190, master_async_config = {3, 7,
> 7, 0, 0, 0}, slave_async_config = {0, 7, 0, 0, 0, 0}, n_add = 0, n_delete =
> 0, n_modify = 0,
> first_op = -9223372036854775808, last_op = -9223372036854775808,
> next_op_report = 9223372036854775807, op_backoff = -9223372036854775808,
> monitors = {buckets = 0x16d58f0, one = 0x0, mask = 0,
> n = 0}, monitor_paused = 0, monitor_counter = 0x16759f0, updates = {prev
> = 0x16d5918, next = 0x16d5918}, sent_abbrev_update = false, bundles =
> {buckets = 0x16d5938, one = 0x0, mask = 0, n = 0}}
> (gdb) bt
> #0 ofconn_set_protocol (ofconn=0x16d5810, protocol=OFPUTIL_P_OF10_STD) at
> ofproto/connmgr.c:999
> #1 0x000000000045e194 in ofconn_get_protocol (ofconn=0x16d5810) at
> ofproto/connmgr.c:982
> #2 0x000000000045f586 in connmgr_wants_packet_in_on_miss (mgr=0x16a6de0) at
> ofproto/connmgr.c:1613
> #3 0x0000000000435261 in rule_dpif_lookup_from_table (ofproto=0x16a6880,
> version=323, flow=0x7f2ace7f86e8, wc=0x7f2ace7f84b0, stats=0x0,
> table_id=0x7f2ace7f7eda "", in_port=28, may_packet_in=true,
> honor_table_miss=true) at ofproto/ofproto-dpif.c:3973
> #4 0x0000000000457ecf in xlate_actions (xin=0x7f2ace7f86e0,
> xout=0x7f2ace7f8010) at ofproto/ofproto-dpif-xlate.c:5188
> #5 0x00000000004481b1 in revalidate_ukey (udpif=0x16a7300,
> ukey=0x7f2ab80060e0, stats=0x7f2ace7f94e0, odp_actions=0x7f2ace7f8a40,
> reval_seq=585728, recircs=0x7f2ace7f8a30)
> at ofproto/ofproto-dpif-upcall.c:1866
> #6 0x0000000000448fb2 in revalidate (revalidator=0x1691990) at
> ofproto/ofproto-dpif-upcall.c:2186
> #7 0x000000000044593e in udpif_revalidator (arg=0x1691990) at
> ofproto/ofproto-dpif-upcall.c:862
> #8 0x000000000050b93d in ovsthread_wrapper (aux_=0x16f4560) at
> lib/ovs-thread.c:340
> #9 0x00007f2ad75c2184 in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x00007f2ad6de137d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> (gdb) f 1
> #1 0x000000000045e194 in ofconn_get_protocol (ofconn=0x16d5810) at
> ofproto/connmgr.c:982
> 982 ofconn_set_protocol(CONST_CAST(struct ofconn *, ofconn),
> (gdb) l
> 977 {
> 978 if (ofconn->protocol == OFPUTIL_P_NONE &&
> 979 rconn_is_connected(ofconn->rconn)) {
> 980 int version = rconn_get_version(ofconn->rconn);
> 981 if (version > 0) {
> 982 ofconn_set_protocol(CONST_CAST(struct ofconn *, ofconn),
> 983
> ofputil_protocol_from_ofp_version(version));
> 984 }
> 985 }
> 986
> (gdb) p *ofconn
> $3 = {node = {prev = 0x16a6e18, next = 0x16a6e18}, hmap_node = {hash = 0,
> next = 0x0}, connmgr = 0x16a6de0, rconn = 0x16edb50, type = OFCONN_SERVICE,
> band = OFPROTO_IN_BAND, enable_async_msgs = true,
> role = OFPCR12_ROLE_EQUAL, protocol = OFPUTIL_P_OF10_STD_TID,
> packet_in_format = NXPIF_OPENFLOW10, packet_in_counter = 0x167a170,
> schedulers = {0x0, 0x0}, pktbuf = 0x0, miss_send_len = 0,
> controller_id = 0, reply_counter = 0x1673190, master_async_config = {3, 7,
> 7, 0, 0, 0}, slave_async_config = {0, 7, 0, 0, 0, 0}, n_add = 0, n_delete =
> 0, n_modify = 0,
> first_op = -9223372036854775808, last_op = -9223372036854775808,
> next_op_report = 9223372036854775807, op_backoff = -9223372036854775808,
> monitors = {buckets = 0x16d58f0, one = 0x0, mask = 0,
> n = 0}, monitor_paused = 0, monitor_counter = 0x16759f0, updates = {prev
> = 0x16d5918, next = 0x16d5918}, sent_abbrev_update = false, bundles =
> {buckets = 0x16d5938, one = 0x0, mask = 0, n = 0}}
> (gdb) p ofconn
> $4 = (const struct ofconn *) 0x16d5810
> (gdb) c
> Continuing.
> [Thread 0x7f2ad79f5980 (LWP 20165) exited]
>
> 1. From the above GDB session, ovs-vswitchd is in the middle of processing
>    a packet miss that was read from the datapath.
> 2. The breakpoint was set inside ofconn_set_protocol so that it only fires
>    when the protocol was already set to something other than OFPUTIL_P_NONE
>    and is now being set to OFPUTIL_P_OF10_STD.
> 3. Yes, we modified ofconn_set_protocol with this if condition so that we
>    do not hit the breakpoint all the time; our earlier experiments pointed
>    us towards this assumption, so we added the condition to see whether it
>    is actually hit in practice.
> 4. Please note the ofconn structure printed towards the end of the session.
> 5. ofconn already has protocol set to OFPUTIL_P_OF10_STD_TID.
> 6. However, the incoming argument to ofconn_set_protocol in frame 0 is
>    OFPUTIL_P_OF10_STD.
> 7. Now in frame 1, in ofconn_get_protocol, please note the if condition on
>    line 978.
> 8. That check is only satisfied when ofconn->protocol is OFPUTIL_P_NONE.
> 9. Combining points 5 and 8, we believe that while the packet miss was
>    being processed the protocol was initially seen as OFPUTIL_P_NONE, was
>    later changed to OFPUTIL_P_OF10_STD_TID by some other code path, and yet
>    this stack continued executing and is now in the process of setting it
>    back to OFPUTIL_P_OF10_STD (a standalone sketch of this interleaving
>    follows below).
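>
> The sketch below is a minimal standalone program (not OVS code; the names
> and values are only illustrative) that shows the interleaving we suspect:
> a getter that lazily writes the shared protocol field can clobber a value
> that the connection-handling thread negotiated in the meantime.
>
>     #include <pthread.h>
>     #include <sched.h>
>     #include <stdio.h>
>
>     enum protocol { P_NONE, P_OF10_STD, P_OF10_STD_TID };
>
>     /* Shared connection state, deliberately unsynchronized, standing in
>      * for ofconn->protocol in the listing above. */
>     static enum protocol protocol = P_NONE;
>
>     /* Revalidator-like thread: if the protocol still looks unset, lazily
>      * write a default -- the same pattern as frame #1 above. */
>     static void *
>     revalidator(void *arg)
>     {
>         (void) arg;
>         if (protocol == P_NONE) {
>             sched_yield();              /* window for the other thread */
>             protocol = P_OF10_STD;      /* stale write */
>         }
>         return NULL;
>     }
>
>     /* Connection-handling thread: negotiation completes and upgrades the
>      * protocol to the table_id-aware variant. */
>     static void *
>     handler(void *arg)
>     {
>         (void) arg;
>         protocol = P_OF10_STD_TID;
>         return NULL;
>     }
>
>     int
>     main(void)
>     {
>         pthread_t a, b;
>
>         pthread_create(&a, NULL, revalidator, NULL);
>         pthread_create(&b, NULL, handler, NULL);
>         pthread_join(a, NULL);
>         pthread_join(b, NULL);
>
>         /* With the bad interleaving this prints 1 (P_OF10_STD): the
>          * negotiated TID value has been lost. */
>         printf("final protocol = %d\n", protocol);
>         return 0;
>     }
>
> Compiled with -pthread and run in a loop, this occasionally prints the
> stale value, which is the same kind of lost update we believe we are
> seeing.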
>
> From a cursory look at the code, the functions involved in these stack
> frames appear to acquire ofproto_mutex appropriately. That leads us to
> think of the following two possibilities (there could be more):
>
> Potential Race Causes:
>
> 1. The packet-miss processing path raced with another thread that was
>    handling connection establishment and protocol negotiation. The
>    connection involved comes from the ovs-ofctl utility, which in our use
>    case is a short-lived controller (speaking OpenFlow 1.0), since we
>    issue one command at a time (as posted by Sam at the beginning of the
>    thread).
> 2. Is it possible that the connection was torn down while the packet miss
>    was being processed and the same ofconn pointer was reused for a new
>    incoming connection (since we run ovs-ofctl back to back)? A rough
>    sketch of how one might check for this follows below.
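>
> The following is only a speculative sketch (plain C, not an actual OVS
> patch; all of the names are made up) of the generation-counter idea we
> have in mind for checking scenario 2: stamp each connection object when
> it is created, have long-running work remember the stamp it started with,
> and treat a mismatch as proof that the pointer was recycled.
>
>     #include <stdint.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     struct conn {
>         uint64_t generation;        /* stamped once at creation */
>         /* ... other per-connection fields ... */
>     };
>
>     static uint64_t next_generation;
>
>     static struct conn *
>     conn_create(void)
>     {
>         struct conn *c = calloc(1, sizeof *c);
>
>         c->generation = ++next_generation;
>         return c;
>     }
>
>     int
>     main(void)
>     {
>         struct conn *c = conn_create();
>         uint64_t seen = c->generation;  /* remembered when the miss
>                                          * handling started */
>
>         /* Simulate the connection being destroyed and a new one being
>          * created, possibly at the same address. */
>         free(c);
>         c = conn_create();
>
>         if (c->generation != seen) {
>             printf("connection object was recycled underneath us\n");
>         }
>         free(c);
>         return 0;
>     }
>
> In ovs-vswitchd the stamp would be set where the ofconn is created and
> compared during miss processing; a mismatch would confirm the reuse
> theory.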
>
> Both of the above scenarios are purely speculative at this point as we have
> not yet spent much time debugging this further.
>
> But from the GDB stack it certainly looks like a race condition is
> involved.
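>
> As a purely speculative mitigation sketch (not a proposed patch; it uses
> plain C11 atomics rather than OVS's own atomics wrappers, and the names
> are illustrative), the lazy set could be turned into a compare-and-swap
> so that it can never downgrade a protocol that another thread has already
> negotiated:
>
>     #include <stdatomic.h>
>
>     enum protocol { P_NONE, P_OF10_STD, P_OF10_STD_TID };
>
>     struct conn {
>         _Atomic enum protocol protocol;
>     };
>
>     static void
>     lazy_set_protocol(struct conn *conn, enum protocol derived)
>     {
>         enum protocol expected = P_NONE;
>
>         /* Install the derived value only if the field is still unset; if
>          * the connection thread has already negotiated, keep its value. */
>         atomic_compare_exchange_strong(&conn->protocol, &expected, derived);
>     }
>
>     int
>     main(void)
>     {
>         struct conn c = { P_NONE };
>
>         /* The connection thread finishes negotiation first... */
>         c.protocol = P_OF10_STD_TID;
>         /* ...so the revalidator's lazy set must not clobber it. */
>         lazy_set_protocol(&c, P_OF10_STD);
>         return c.protocol == P_OF10_STD_TID ? 0 : 1;
>     }
>
> Whether something like this is the right place to address it (as opposed
> to taking the appropriate lock, or only ever setting the protocol from
> the connection thread) is exactly the kind of feedback we are hoping for.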
>
> Please let us know what you think of this analysis.
>
> Thank you very much, Ben, for taking time in the middle of your busy schedule
> to sift through this.
>
> Regards,
> Sagar.
>
> > On Jan 13, 2017, at 12:12 PM, Samuel Jean <[email protected]> wrote:
> >
> > Thanks for the quick follow up Ben,
> >
> > So we'll indeed try against the latest versions to rule out the possibility
> > of a bug that has already been fixed, although I could not find any commit
> > mentioning such a fix. We'll report back here.
> >
> > At this moment, we can reproduce over and over within minutes. We've
> > nailed it down -- or at least we think so -- to something related to a race
> > condition or memory overwrite between the time the connection negotiation
> > happens and the flow mod packet arrives. We were able to run the stress
> > test for hours when we used --flow-format=NXM+table_id as an argument to
> > ovs-ofctl, but eventually we hit the same error.
> >
> > Sagar is spending more time debugging this issue so maybe he'll be able to
> > provide more information.
> >
> > On Fri, Jan 13, 2017 at 1:45 PM, Ben Pfaff <[email protected]
> > <mailto:[email protected]>> wrote:
> > On Thu, Jan 12, 2017 at 03:54:42PM -0500, Samuel Jean via dev wrote:
> > > It seems that shelling out to ovs-ofctl very quickly can lead to a bug
> > > where it reports an OFPT_ERROR.
> > >
> > > We were able to consistently reproduce within minutes of running the above
> > > flow modifications on Ubuntu.
> > >
> > > Any help, hints, or guidance would be appreciated. I'd be happy to do
> > > whatever debugging is required to nail down the issue here.
> >
> > Thanks for the bug report and especially for the detailed reproduction
> > advice.
> >
> > I've now tried running this reproduction case against Open vSwitch from
> > latest master and against the latest versions from the 2.6.x and 2.5.x
> > branches, and I can't see any failures even after letting the script run
> > for a few minutes.
> >
> > Maybe you should try 2.5.1 or the latest from branch-2.5 and see if it
> > fixes the problem? And if not, then we'll have to figure out what's
> > different between your setup and mine.
> >
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev