Re: [ovs-discuss] OVN: too many resubmits for arp requests

Felix Hüttner via discuss Mon, 27 Feb 2023 08:04:49 -0800

> On 2/22/23 09:41, Felix Hüttner via discuss wrote:
> > Hello everyone,
> >
>
> Hi Felix,
>
> > we are currently running OVN 22.12 for our OpenStack environment.
> > We have a large logical switch which is connected to our internet
> > connection.
> > On this switch there are currently around 350 logical routers connected
> > (with more to come).
> >
> > If our physical switches now send an ARP request targeting the IP of one
> > of the logical routers, the request works fine.
> > However, if they send an ARP request targeting an IP that is not assigned,
> > we see packet drops in vswitchd because of "Translation failed (Too many
> > resubmits), packet is dropped.".
>
> In your case, who is owning this target IP?  Can't the LS proxy ARP
> reply for it if it's assigned to a logical switch port connected to the LS?
>


The target IP is unused, so nobody should answer that ARP request,
which I guess is the hardest case to solve.
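
To put rough numbers on why the unanswered case hurts, here is a small back-of-the-envelope sketch. It uses the ~18 resubmits per router and the router counts (350 and ~2k) quoted further down in this thread. The resubmit limit of 4096 is an assumption made for illustration and is not taken from this thread, so please check the OVS source for your version.

```python
# Rough estimate of the resubmits caused by one flooded ARP request.
RESUBMITS_PER_ROUTER = 18        # approximate figure quoted in this thread
ASSUMED_RESUBMIT_LIMIT = 4096    # assumption, verify against your OVS version


def flooded_resubmits(num_routers, per_router=RESUBMITS_PER_ROUTER):
    """Resubmits needed if the request enters every router pipeline."""
    return num_routers * per_router


def max_routers_within_limit(limit=ASSUMED_RESUBMIT_LIMIT,
                             per_router=RESUBMITS_PER_ROUTER):
    """Largest router count that stays at or below the assumed limit."""
    return limit // per_router


for routers in (350, 2000):
    total = flooded_resubmits(routers)
    print(routers, "routers:", total, "resubmits,",
          "over limit" if total > ASSUMED_RESUBMIT_LIMIT else "within limit")
print("routers that fit under the assumed limit:", max_routers_within_limit())
```

Under these assumptions both environments exceed the limit by a wide margin, which matches the drops we see.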

> >
> > The flow that is failing is:
> > arp,in_port=1,vlan_tci=0x0000,dl_src=00:1c:73:00:00:99,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=our.physical.switch.ip,arp_tpa=some.unassigned.ip,arp_op=1,arp_sha=00:1c:73:00:00:99,arp_tha=00:00:00:00:00:00
> >
> > It seems like it is sent to the ingress pipeline of all logical routers
> > based on the following logical flow:
> > table=25(ls_in_l2_lkup      ), priority=70   , match=(eth.mcast), action=(outport = "_MC_flood"; output;)
> > This in turn causes around 18 resubmit actions per router and additionally
> > a lot of load on the vswitchd/ovn-controllers.
> >
> > We currently see a few options on how to solve the "too many resubmits":
> >
> > ## Option 1:
> > Prevent sending unknown ARP requests to the logical routers by adding the
> > following flow:
> > table=25(ls_in_l2_lkup      ), priority=72   , match=(eth.mcast && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood_l2"; output;)
> >
> > This would still allow normal ARP requests to the logical routers to work,
> > as they are already handled by a priority 80 flow in the same table.
> > However, this would break GARPs, since we would no longer forward them to
> > all logical routers.
> > It might therefore make sense to add this as an option on the logical
> > switch instead of making it the default.
> >
> > We are currently already using this solution and it seems to solve this 
> > specific issue.
>
> Maybe it makes sense to combine this with mac binding aging?  At least
> after a while the routers will try to re-ARP the next-hops so if we
> missed gARPs traffic will still eventually flow correctly.
>

Ah yes, we are already setting mac binding aging. That is a nice effect.

I have also pushed the patch here to the mailing list: 
https://patchwork.ozlabs.org/project/ovn/patch/du0pr10mb5244fc4e6a785694fd84b6f9ea...@du0pr10mb5244.eurprd10.prod.outlook.com/
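
For anyone who wants to reason about the effect of Option 1 without a test setup, here is a toy Python model of the lookup order in ls_in_l2_lkup. It is only a sketch: the priority 80 branch is a simplification of the real flow, and the router IP, port name, and sender addresses are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

ARP_REQUEST = 1


@dataclass(frozen=True)
class Packet:
    eth_mcast: bool
    arp_op: Optional[int] = None
    arp_spa: Optional[str] = None
    arp_tpa: Optional[str] = None
    nd_ns: bool = False


def l2_lkup_outport(pkt: Packet, router_ips: Dict[str, str],
                    option1: bool) -> Optional[str]:
    """Return the outport chosen by the highest-priority matching flow."""
    # priority 80 (simplified): ARP request for an IP owned by a router port
    if pkt.eth_mcast and pkt.arp_op == ARP_REQUEST and pkt.arp_tpa in router_ips:
        return router_ips[pkt.arp_tpa]
    # priority 72 (Option 1): other ARP requests / ND NS skip the routers
    if option1 and pkt.eth_mcast and (pkt.arp_op == ARP_REQUEST or pkt.nd_ns):
        return "_MC_flood_l2"
    # priority 70: every other multicast packet is flooded, routers included
    if pkt.eth_mcast:
        return "_MC_flood"
    return None


ROUTER_IPS = {"192.0.2.1": "lrp-router1"}  # hypothetical router port
unknown = Packet(True, ARP_REQUEST, "192.0.2.200", "192.0.2.99")
known = Packet(True, ARP_REQUEST, "192.0.2.200", "192.0.2.1")
garp = Packet(True, ARP_REQUEST, "192.0.2.200", "192.0.2.200")

for name, pkt in (("unknown", unknown), ("known", known), ("garp", garp)):
    print(name,
          l2_lkup_outport(pkt, ROUTER_IPS, option1=False),
          l2_lkup_outport(pkt, ROUTER_IPS, option1=True))
```

In this model the request for the unassigned IP no longer reaches the routers once Option 1 is enabled, the request for a router IP is unaffected, and the GARP also ends up in "_MC_flood_l2", which is the GARP breakage mentioned above.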

> >
> > ## Option 2:
> > Increase the resubmit limit in ovs to cover these cases.
> > However we see the following issues:
> >
> > 1. Independent of the value we would set there, it might always be too low
> > for some cases (e.g. in our other OpenStack environment we currently have
> > ~2k routers on a network. That would be roughly 36000 resubmits for such
> > an ARP request).
> > 2. Too much load on the vswitchd/ovn-controller side
> >    1. because we would actually need to run through all of the routers only
> >    to find out that we cannot answer the request (if it's an ARP request
> >    for an IP that is not assigned)
> >    2. because we would send all of these ARP requests to the ovn-controller
> >    to potentially learn the mac_bindings (if configured)
> >
> > To reduce the load issue we could use the following flows. They would
> > ensure that GARPs are flooded to all logical routers, while normal ARP
> > requests are only sent to routers that could actually answer them:
> >   table=25(ls_in_l2_lkup      ), priority=72   , match=(eth.mcast && arp.op == 1 && arp.spa != arp.tpa), action=(outport = "_MC_flood_l2"; output;)
> >   table=25(ls_in_l2_lkup      ), priority=72   , match=(eth.mcast && nd_ns), action=(outport = "_MC_flood_l2"; output;)
> >   table=25(ls_in_l2_lkup      ), priority=70   , match=(eth.mcast), action=(outport = "_MC_flood"; output;)
> >
> > However, that depends on being able to match "arp.spa != arp.tpa", which
> > to my knowledge is currently not possible (as you cannot match fields
> > against other fields).
> >
>
> The fact that we can't do "arp.spa != arp.tpa" is unfortunate indeed.
>
> IIRC there was also a discussion at some point to do the learning in a
> single place, on the logical switch and inject mac bindings for all
> connected routers.  I'm not sure how feasible that is though.
>

I really like that idea since it removes a lot of unnecessary load.
Does anyone know of anything that speaks against it? Otherwise I would
take a look at this at some point in the future.
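
Coming back to the "arp.spa != arp.tpa" match from Option 2: the decision those three flows are meant to express is simple when written as ordinary code. The sketch below is plain Python, not something the logical flow match language can currently express (per the discussion above), and the function name and addresses are hypothetical.

```python
from typing import Optional

ARP_REQUEST = 1


def option2_outport(eth_mcast: bool,
                    arp_op: Optional[int] = None,
                    arp_spa: Optional[str] = None,
                    arp_tpa: Optional[str] = None,
                    nd_ns: bool = False) -> Optional[str]:
    """Pick the multicast group the Option 2 flows are meant to select."""
    if not eth_mcast:
        return None
    # priority 72: normal ARP request, sender IP differs from target IP
    if arp_op == ARP_REQUEST and arp_spa != arp_tpa:
        return "_MC_flood_l2"
    # priority 72: IPv6 neighbor solicitation
    if nd_ns:
        return "_MC_flood_l2"
    # priority 70: all remaining multicast, including GARPs (spa == tpa)
    return "_MC_flood"


print(option2_outport(True, ARP_REQUEST, "192.0.2.200", "192.0.2.99"))
print(option2_outport(True, ARP_REQUEST, "192.0.2.200", "192.0.2.200"))
```

The first call models a normal request and stays off the routers, while the second models a GARP and is still flooded to them.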

> > --
> > Felix Huettner
> >
>
> Regards,
> Dumitru

--
Felix Huettner