Hello everyone,

we are currently running OVN 22.12 for our OpenStack environment. We have a large logical switch which is connected to our internet connection. Around 350 logical routers are currently connected to this switch (with more to come).

If our physical switches send an ARP request targeting the IP of one of the logical routers, the request works fine. However, if they send an ARP request targeting an IP that is not assigned, we see packet drops in vswitchd because of "Translation failed (Too many resubmits), packet is dropped.".

The flow that is failing is:

    arp,in_port=1,vlan_tci=0x0000,dl_src=00:1c:73:00:00:99,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=our.physical.switch.ip,arp_tpa=some.unassigned.ip,arp_op=1,arp_sha=00:1c:73:00:00:99,arp_tha=00:00:00:00:00:00

It seems like it is sent to the ingress pipeline of all logical routers based on the following logical flow:

    table=25(ls_in_l2_lkup ), priority=70 , match=(eth.mcast), action=(outport = "_MC_flood"; output;)

This in turn causes around 18 resubmit actions per router and additionally a lot of load on vswitchd and the ovn-controllers.

We currently see a few options for solving the "too many resubmits" problem:

## Option 1:

Prevent unknown ARP requests from being sent to the logical routers by adding the following flow:

    table=25(ls_in_l2_lkup ), priority=72 , match=(eth.mcast && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood_l2"; output;)

Normal ARP requests to the logical routers would still work, as they are already handled by a priority 80 flow in the same table. However, this would break GARPs, since we would no longer forward them to all logical routers. It might therefore make sense to add this as an option on the logical switch instead of making it the default.

We are already using this solution and it seems to solve this specific issue.

## Option 2:

Increase the resubmit limit in OVS to cover these cases. However, we see the following issues:

1. Whatever value we set there, it might still be too low for some cases (e.g. in our other OpenStack environment we currently have ~2k routers on a network; that would be roughly 36000 resubmits for such an ARP request).
2. Too much load on the vswitchd/ovn-controller side:
   1. because we would actually need to run through all of the routers only to find out that we cannot answer the request (if it is an ARP request for an IP that is not assigned)
   2. because we would send all of these ARP requests to the ovn-controller to potentially learn the mac_bindings (if configured)

To reduce the load issue we could use the following flows. They would ensure that GARPs are flooded to all logical routers, while normal ARP requests are only sent to routers that could actually answer them:

    table=25(ls_in_l2_lkup ), priority=72 , match=(eth.mcast && arp.op == 1 && arp.spa != arp.tpa), action=(outport = "_MC_flood_l2"; output;)
    table=25(ls_in_l2_lkup ), priority=72 , match=(eth.mcast && nd_ns), action=(outport = "_MC_flood_l2"; output;)
    table=25(ls_in_l2_lkup ), priority=70 , match=(eth.mcast), action=(outport = "_MC_flood"; output;)

However, that depends on being able to match on "arp.spa != arp.tpa", which to my knowledge is currently not possible (as you cannot match fields against other fields).

-- 
Felix Huettner
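
P.S. As a rough back-of-the-envelope check of the numbers above, here is a minimal Python sketch of the resubmit arithmetic. The figure of 18 resubmits per router is what we observe. The limit of 4096 is an assumption on my side (I believe OVS hardcodes it as 64 * 64 in ofproto-dpif-xlate.c, but please verify for your version). The function names are only illustrative, and the estimate ignores resubmits spent in the logical switch pipeline itself.

```python
# Assumption: OVS aborts translation after this many resubmits.
# Believed to be 64 * 64 in ofproto-dpif-xlate.c; verify for your version.
ASSUMED_RESUBMIT_LIMIT = 4096

# Observed in our environment: resubmits caused per logical router
# when a broadcast ARP request is flooded to it.
RESUBMITS_PER_ROUTER = 18


def estimated_resubmits(routers, per_router=RESUBMITS_PER_ROUTER):
    """Rough number of resubmits for one flooded ARP request."""
    return routers * per_router


def max_routers(limit=ASSUMED_RESUBMIT_LIMIT, per_router=RESUBMITS_PER_ROUTER):
    """Largest router count that stays within the assumed limit."""
    return limit // per_router


if __name__ == "__main__":
    for routers in (350, 2000):
        total = estimated_resubmits(routers)
        exceeded = total > ASSUMED_RESUBMIT_LIMIT
        print(f"{routers} routers: ~{total} resubmits, limit exceeded: {exceeded}")
    print(f"routers that fit under the assumed limit: {max_routers()}")
```

Under these assumptions, both of our environments are well past the limit, which is why simply raising it (Option 2) does not look like a lasting fix to us.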