On 3/11/24 06:17, Han Zhou wrote: > > > On Fri, Mar 8, 2024 at 12:17 AM Tao Liu <[email protected] > <mailto:[email protected]>> wrote: >> >> >> On 3/7/24 5:39 AM, Ilya Maximets wrote: >> > On 2/27/24 20:14, Han Zhou wrote: >> >> For kernel datapath, when a new tunnel port is created in the same >> >> transaction in which an old tunnel port with the same tunnel >> >> configuration is deleted, the new tunnel port creation will fail and >> >> left in an error state. This can be easily reproduced in OVN by >> >> triggering a chassis deletion and addition with the same encap in the >> >> same SB DB transaction, such as: >> >> >> >> ovn-sbctl chassis-add aaaaaa geneve 1.2.3.4 >> >> ovn-sbctl chassis-del aaaaaa -- chassis-add bbbbbb 1.2.3.4 >> >> >> >> ovs-vsctl show | grep error >> >> error: "could not add network device ovn-bbbbbb-0 to ofproto (File >> >> exists)" >> >> >> >> Related logs in OVS: >> >> — >> >> 2024-02-23T05:41:49.978Z|405933|bridge|INFO|bridge br-int: deleted >> >> interface ovn-aaaaaa-0 on port 113 >> >> 2024-02-23T05:41:49.989Z|405935|tunnel|WARN|ovn-bbbbbb-0: attempting to >> >> add tunnel port with same config as port 'ovn-aaaaaa-0' (::->1.2.3.4, >> >> key=flow, legacy_l2, dp port=9) >> >> 2024-02-23T05:41:49.989Z|405936|ofproto|WARN|br-int: could not add port >> >> ovn-bbbbbb-0 (File exists) >> >> 2024-02-23T05:41:49.989Z|405937|bridge|WARN|could not add network device >> >> ovn-bbbbbb-0 to ofproto (File exists) >> >> — >> > >> > Hi, Han. Thanks for the patch! >> > >> >> >> >> Depending on when there are other OVSDB changes, it may take a long time >> >> for the device to be added successfully, triggered by the next OVS >> >> iteration. >> >> >> >> (note: native tunnel ports do not have this problem) >> > >> > I don't think this is correct. The code path is common for both system >> > and native tunnels. I can reproduce the issues in a sandbox with: >> > >> > $ make -j8 sandbox SANDBOXFLAGS="\-\-dummy='system'" >> > [tutorial]$ ovs-vsctl add-port br0 tunnel_port \ >> > -- set Interface tunnel_port \ >> > type=geneve options:remote_ip=flow options:key=123 >> > [tutorial]$ ovs-vsctl del-port tunnel_port \ >> > -- add-port br0 tunnel_port2 \ >> > -- set Interface tunnel_port2 \ >> > type=geneve options:remote_ip=flow options:key=123 >> > ovs-vsctl: Error detected while setting up 'tunnel_port2': >> > could not add network device tunnel_port2 to ofproto (File exists). >> > See ovs-vswitchd log for details. >> > >> > The same should work in a testsuite as well, i.e. we should be able to >> > create a test for this scenario. >> > >> > Note: The --dummy=system prevents OVS from replacing tunnel ports with >> > dummy ones. >> > > > Thanks Ilya for the correction! --dummy=system is very helpful. > >> >> >> >> The problem is how the tunnel port deletion and creation are handled. In >> >> bridge_reconfigure(), port deletion is handled before creation, to avoid >> >> such resource conflict. However, for kernel tunnel ports, the real clean >> >> up is performed at the end of the bridge_reconfigure() in the: >> >> bridge_run__()->ofproto_run()->...->ofproto_dpif:port_destruct() >> >> >> >> We cannot call bridge_run__() at an earlier point before all >> >> reconfigurations are done, so this patch tries a generic approach to >> >> just re-run the bridge_reconfigure() when there are any port creations >> >> encountered "File exists" error, which indicates a possible resource >> >> conflict may have happened due to a later deleted port, and retry may >> >> succeed. >> >> >> >> Signed-off-by: Han Zhou <[email protected] <mailto:[email protected]>> >> >> --- >> >> This is RFC because I am not sure if there is a better way to solve the >> >> problem >> >> more specifically by executing the port_destruct for the old port before >> >> trying >> >> to create the new port. The fix may be more complex though. >> > >> > I don't think re-trying is a good approach in general. We should likely >> > just destroy the tnl_port structure right away, similarly to how we clean >> > up stp, lldp and bundles in ofproto_port_unregister(). Maybe we can add >> > a new callback similar to bundle_remove() and call tnl_port_del() from it? >> > (I didn't try, so I'm not 100% sure this will not cause any issues.) >> > >> > What do you think? >> > >> > Best regards, Ilya Maximets. >> > _______________________________________________ >> > dev mailing list >> > [email protected] <mailto:[email protected]> >> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev >> > <https://mail.openvswitch.org/mailman/listinfo/ovs-dev> >> >> Hi Ilya and Han, >> >> We hit this issue some days ago. As our analysis, it was introduced by >> commit fe83f81df977 ("netdev: Remove netdev from global shash when the >> user is changing interface configuration.") >> >> Reproduce: >> ovs-vsctl add-port br-int vxlan0 \ >> -- set interface vxlan0 type=vxlan options:remote_ip=10.10.10.1 >> >> sleep 2 >> >> ovs-vsctl --if-exists del-port vxlan0 \ >> -- add-port br-int vxlan1 \ >> -- set interface vxlan1 type=vxlan options:remote_ip=10.10.10.1 >> ovs-vsctl: Error detected while setting up 'vxlan1': could not add >> network device vxlan1 to ofproto (File exists). >> >> vswitchd log: >> bridge|INFO|bridge br-int: added interface vxlan0 on port 1106 >> bridge|INFO|bridge br-int: deleted interface vxlan0 on port 1106 >> tunnel|WARN|vxlan1: attempting to add tunnel port with same config as >> port 'vxlan0' (::->10.10.10.1, key=0, legacy_l2, dp port=122) >> ofproto|WARN|br-int: could not add port vxlan1 (File exists) >> bridge|WARN|could not add network device vxlan1 to ofproto (File exists) >> >> CallTrace: >> bridge_reconfigure >> bridge_del_ports >> port_destroy >> iface_destroy__ >> netdev_remove <------ netdev vxlan0 removed >> bridge_delete_or_reconfigure_ports >> OFPROTO_PORT_FOR_EACH >> ofproto_port_dump_next >> port_dump_next >> port_query_by_name <------ netdev_shash do not contain vxlan0 >> ofproto_port_del <------ vxlan0 do not del in ofproto >> bridge_add_ports >> bridge_add_ports__ >> iface_create >> iface_do_create >> ofproto_port_add <------ vxlan1 add failed >> >> >> We tried to fix this by the following diff: >> >> --- >> ofproto/ofproto-dpif.c | 8 +++++--- >> 1 file changed, 5 insertions(+), 3 deletions(-) >> >> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c >> index d7544ef..654468e 100644 >> --- a/ofproto/ofproto-dpif.c >> +++ b/ofproto/ofproto-dpif.c >> @@ -3907,14 +3907,16 @@ port_query_by_name(const struct ofproto >> *ofproto_, const char *devname, >> >> if (sset_contains(&ofproto->ghost_ports, devname)) { >> const char *type = netdev_get_type_from_name(devname); >> + const struct ofport *ofport = >> + shash_find_data(&ofproto->up.port_by_name, >> devname); >> + if (!type && ofport && ofport->netdev) { >> + type = netdev_get_type(ofport->netdev); >> + } >> >> /* We may be called before ofproto->up.port_by_name is >> populated with >> * the appropriate ofport. For this reason, we must get the >> name and >> * type from the netdev layer directly. */ >> if (type) { >> - const struct ofport *ofport; >> - >> - ofport = shash_find_data(&ofproto->up.port_by_name, devname); >> ofproto_port->ofp_port = ofport ? ofport->ofp_port : >> OFPP_NONE; >> ofproto_port->name = xstrdup(devname); >> ofproto_port->type = xstrdup(type); >> >> Best regards, Tao >> > > Thanks Tao for the patch. Your patch is much cleaner. Would you like to send > a formal patch? > Ilya, what do you think?
I didn't test this one, but it does indeed look like a better solution to the problem! port_query_by_name() function will need an update in the comment that talks why we're retrieving the type from the netdev layer directly and maybe moving this comment above the netdev_get_type_from_name call. And we'll need a unit test for this issue. But otherwise the code above looks good. Tao, if you can made these changes and post a formal patch, that would be great. Thanks for the analysis! Best regards, Ilya Maximets. _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
