Hello everyone,

We are currently running OVN 22.12 for our OpenStack environment.
We have a large logical switch that is connected to our internet connection.
Around 350 logical routers are currently connected to this switch (with 
more to come).

If our physical switches send an ARP request targeting the IP of one of 
the logical routers, the request works fine.
However, if they send an ARP request targeting an IP that is not assigned, we see 
packet drops in vswitchd because of "Translation failed (Too many resubmits), 
packet is dropped.".

The flow that is failing is:
arp,in_port=1,vlan_tci=0x0000,dl_src=00:1c:73:00:00:99,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=our.phyiscal.switch.ip,arp_tpa=some.unassigned.ip,arp_op=1,arp_sha=00:1c:73:00:00:99,arp_tha=00:00:00:00:00:00

It seems like the packet is sent to the ingress pipeline of all logical routers 
based on the following logical flow:
table=25(ls_in_l2_lkup      ), priority=70   , match=(eth.mcast), action=(outport = "_MC_flood"; output;)
This in turn causes around 18 resubmit actions per router and additionally a lot 
of load on vswitchd and the ovn-controllers.

We currently see a few options on how to solve the "too many resubmits":

## Option 1:
Prevent ARP requests for unknown IPs from being sent to the logical routers by 
adding the following flow:
table=25(ls_in_l2_lkup      ), priority=72   , match=(eth.mcast && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood_l2"; output;)

This would still allow normal ARP requests to the logical routers to work, as 
they are already handled by a priority 80 flow in the same table.
However, this would break GARPs, since we would no longer forward them to all 
logical routers.
It might therefore make sense to add this as an option on the logical switch 
instead of making it the default.

We are currently already using this solution and it seems to solve this 
specific issue.

## Option 2:
Increase the resubmit limit in OVS to cover these cases.
However, we see the following issues:

1. Whatever value we set there, it might still be too low for 
some cases (e.g. in our other OpenStack environment we currently have ~2k 
routers on one network, which would mean roughly 36000 resubmits for such an 
ARP request).
2. Too much load on the vswitchd/ovn-controller side:
   1. because we would actually need to run through all of the routers only to 
find out that we cannot answer the request (if it is an ARP request for an IP 
that is not assigned)
   2. because we would send all of these ARP requests to the ovn-controller to 
potentially learn the mac_bindings (if configured)
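
For illustration, the estimate from item 1 can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the 18 resubmits per router is the approximate figure observed above, and the limit of 4096 is an assumption about the built-in OVS resubmit limit that should be checked against the OVS version in use. The function names are hypothetical.

```python
# Rough estimate of the resubmits caused by one flooded ARP request.
RESUBMITS_PER_ROUTER = 18       # approximate value observed in this thread
ASSUMED_RESUBMIT_LIMIT = 4096   # assumption: built-in OVS limit, verify for your build


def estimated_resubmits(routers, per_router=RESUBMITS_PER_ROUTER):
    """Total resubmits if the packet enters every router's ingress pipeline."""
    return routers * per_router


def exceeds_limit(routers, limit=ASSUMED_RESUBMIT_LIMIT):
    """True if the estimate is above the (assumed) resubmit limit."""
    return estimated_resubmits(routers) > limit


def max_routers_within_limit(limit=ASSUMED_RESUBMIT_LIMIT,
                             per_router=RESUBMITS_PER_ROUTER):
    """Largest router count whose estimate still fits under the limit."""
    return limit // per_router


if __name__ == "__main__":
    for routers in (350, 2000):
        print(routers, estimated_resubmits(routers), exceeds_limit(routers))
```

Under these assumptions both environments mentioned here are well above the limit, and any fixed limit only moves the router count at which the drops start.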

To reduce the load issue we could use the following flows. They would ensure 
that GARPs are flooded to all logical routers, while normal ARP requests are 
only sent to routers that could actually answer them:
  table=25(ls_in_l2_lkup      ), priority=72   , match=(eth.mcast && arp.op == 1 && arp.spa != arp.tpa), action=(outport = "_MC_flood_l2"; output;)
  table=25(ls_in_l2_lkup      ), priority=72   , match=(eth.mcast && nd_ns), action=(outport = "_MC_flood_l2"; output;)
  table=25(ls_in_l2_lkup      ), priority=70   , match=(eth.mcast), action=(outport = "_MC_flood"; output;)

However, that depends on being able to match on "arp.spa != arp.tpa", which to 
my knowledge is currently not possible (as you cannot match fields against 
other fields).

--
Felix Huettner

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
