On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara <dce...@redhat.com> wrote:
> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
> > Hi all
> >
> > Sorry for top posting. I want to thank you all for the discussion and
> > give also some feedback from OpenStack perspective which is affected
> > by the problem described here.
> >
> > In OpenStack, it's kind of common to have a shared external network
> > (logical switch with a localnet port) across many tenants. Each tenant
> > user may create their own router where their instances will be
> > connected to access the external network.
> >
> > In such scenario, we are hitting the issue described here. In
> > particular in our tests we exercise 3K VIFs (with 1 FIP) each spanning
> > 300 LS; each LS connected to a LR (ie. 300 LRs) and that router
> > connected to the public LS. This is creating a huge problem in terms
> > of performance and tons of events due to the MAC_Binding entries
> > generated as a consequence of the GARPs sent for the floating IPs.
> >
>
> Just as an addition to this, GARPs wouldn't be the only reason why all
> routers would learn the MAC_Binding. Even if we wouldn't be sending
> GARPs for the FIPs, when a VM that's behind a FIP would send traffic to
> the outside, the router will generate an ARP request for the next hop
> using the FIP-IP and FIP-MAC. This will be broadcasted to all routers
> connected to the public LS and will trigger them to learn the
> FIP-IP:FIP-MAC binding.
>

Yeah we shouldn't be learning on regular ARP requests.
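To make that concrete, here is a rough Python sketch of the learning rule as it is described in footnote [a] further down the thread. It is only a toy model, not OVN code: the function names, the router names and the 10.0.x.y addresses are all made up for illustration, and it assumes that with the knob left at true every router that receives an ARP request learns from it.

```python
def should_learn(router_ips, arp_tpa, learn_from_arp_request):
    """Decide whether a router learns a MAC binding from an ARP request.

    Assumption: with the knob set to true, any received ARP request is
    learned from. With it set to false, only requests whose target
    address (TPA) belongs to this router are learned from, as in [a].
    """
    if learn_from_arp_request:
        return True
    return arp_tpa in router_ips


def count_new_bindings(routers, sender, arp_tpa, learn_from_arp_request):
    """Count routers (other than the sender) that would add a binding."""
    return sum(
        1
        for name, ips in routers.items()
        if name != sender and should_learn(ips, arp_tpa, learn_from_arp_request)
    )


# Hypothetical setup loosely following Daniel's test: 300 routers on one
# public LS, each owning a single (made-up) address.
routers = {f"lr{i}": {f"10.0.{i // 200}.{i % 200 + 10}"} for i in range(300)}

# lr0 broadcasts an ARP request for an external next hop (made-up address).
learned_default = count_new_bindings(routers, "lr0", "10.0.0.1", True)
learned_knob_off = count_new_bindings(routers, "lr0", "10.0.0.1", False)

# lr0 broadcasts an ARP request for the address owned by lr5.
learned_targeted = count_new_bindings(routers, "lr0", "10.0.0.15", False)
```

In this model a single broadcast request for an external next hop creates a binding on every other router when the knob is true, and on none of them when it is false. A request that targets an address owned by one of the routers is still learned by that router only.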
>
> > Thanks,
> > Daniel
> >
> > On Thu, May 28, 2020 at 10:51 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >>
> >> On 5/28/20 8:34 AM, Han Zhou wrote:
> >>>
> >>> On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >>>>
> >>>> Hi Girish, Han,
> >>>>
> >>>> On 5/26/20 11:51 PM, Han Zhou wrote:
> >>>>>
> >>>>> On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail
> >>>>> <gmoodalb...@gmail.com> wrote:
> >>>>>>
> >>>>>> On Tue, May 26, 2020 at 12:42 PM Han Zhou <zhou...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Girish,
> >>>>>>>
> >>>>>>> Thanks for the summary. I agree with you that GARP request v.s.
> >>>>>>> reply is irrelevant to the problem here.
> >>>>
> >>>> Well, actually I think GARP request vs reply is relevant (at least for
> >>>> case 1 below) because if OVN would be generating GARP replies we
> >>>> wouldn't need the priority 80 flow to determine if an ARP request
> >>>> packet is actually an OVN self originated GARP that needs to be
> >>>> flooded in the L2 broadcast domain.
> >>>>
> >>>> On the other hand, router3 would be learning mac_binding IP2,M2 from
> >>>> the GARP reply originated by router2 and vice versa so we'd have to
> >>>> restrict flooding of GARP replies to non-patch ports.
> >>>>
> >>>
> >>> Hi Dumitru, the point was that, on the external LS, the GRs will have to
> >>> send ARP requests to resolve unknown IPs (at least for the external GW),
> >>> and it has to be broadcasted, which will cause all the GRs to learn all
> >>> MACs of other GRs. This is regardless of the GARP behavior. You are
> >>> right that if we only consider the Join switch then the GARP request
> >>> v.s. reply does make a difference.
> >>> However, GARP request/reply may be
> >>> really needed only on the external LS.
> >>>
> >>
> >> Ok, but do you see an easy way to determine if we need to add the
> >> logical flows that flood self originated GARP packets on a given logical
> >> switch? Right now we add them on all switches.
> >>
> >>>>>>> Please see my comment inline below.
> >>>>>>>
> >>>>>>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail
> >>>>>>> <gmoodalb...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hello Dumitru,
> >>>>>>>>
> >>>>>>>> There are several things that are being discussed on this thread.
> >>>>>>>> Let me see if I can tease them out for clarity.
> >>>>>>>>
> >>>>>>>> 1. All the router IPs are known to OVN (the join switch case)
> >>>>>>>> 2. Some IPs are known and some are not known (the external logical
> >>>>>>>>    switch that connects to physical network case).
> >>>>>>>>
> >>>>>>>> Let us look at each of the case above:
> >>>>>>>>
> >>>>>>>> 1. Join Switch Case
> >>>>>>>>
> >>>>>>>>   +----------------+        +----------------+
> >>>>>>>>   |   l3gateway    |        |   l3gateway    |
> >>>>>>>>   |    router2     |        |    router3     |
> >>>>>>>>   +-------------+--+        +-+--------------+
> >>>>>>>>              IP2,M2         IP3,M3
> >>>>>>>>                 |             |
> >>>>>>>>              +--+-------------+---+
> >>>>>>>>              |    join switch     |
> >>>>>>>>              +---------+----------+
> >>>>>>>>                        |
> >>>>>>>>                     IP1,M1
> >>>>>>>>                +-------+--------+
> >>>>>>>>                |  distributed   |
> >>>>>>>>                |     router     |
> >>>>>>>>                +----------------+
> >>>>>>>>
> >>>>>>>> Say, GR router2 wants to send the packet out to DR and that we
> >>>>>>>> don't have static mappings of MAC to IP in lr_in_arp_resolve table
> >>>>>>>> on GR router2 (with Han's patch of dynamic_neigh_routes=true for
> >>>>>>>> all the Gateway Routers). With this in mind, when an ARP request
> >>>>>>>> is sent out by router2's hypervisor the packet should be directly
> >>>>>>>> sent to the distributed router alone.
> >>>>>>>> Your commit 32f5ebb0622 (ovn-northd: Limit ARP/ND broadcast domain
> >>>>>>>> whenever possible) should have allowed only unicast. However, in
> >>>>>>>> ls_in_l2_lkup table we have
> >>>>>>>>
> >>>>>>>> table=19(ls_in_l2_lkup ), priority=80 , match=(eth.src == { M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
> >>>>>>>> table=19(ls_in_l2_lkup ), priority=75 , match=(flags[1] == 0 && arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2"; output;)
> >>>>>>>>
> >>>>>>>> As you can see, `priority=80` rule will always be hit and sent out
> >>>>>>>> to all the GRs. The `priority=75` rule is never hit. So, we will
> >>>>>>>> see ARP packets on the GENEVE tunnel. So, we need to change
> >>>>>>>> `priority=80` to match GARP request packets. That way, for the
> >>>>>>>> known OVN IPs case we don't do broadcast.
> >>>>>>>
> >>>>>>> Since the solution to case 2) below (i.e.
> >>>>>>> learn_from_arp_request=false) solves the problem of case 1), too, I
> >>>>>>> think we don't need this change just for case 1). As @Dumitru Ceara
> >>>>>>> mentioned, there is some cost because it adds extra flows. It would
> >>>>>>> be significant amount of flows if there are a lot of snat_and_dnat
> >>>>>>> IPs. What do you think?
> >>>>
> >>>> I think the following might be a solution, although with the cost of
> >>>> adding as many flows as dnat_and_snat IPs are configured:
> >>>>
> >>>> - priority 80: explicitly determine if an ARP request is a self
> >>>>   originated GARP for configured IP addresses and dnat_and_snat IPs (by
> >>>>   matching on all eth.src and arp.tpa pairs) and if so flood on all
> >>>>   non-patch ports.
> >>>> - priority 75: if arp.tpa is owned by an OVN logical router port,
> >>>>   "unicast" it only on the patch port towards the router.
> >>>> - priority 1: flood any broadcast packet.
> >>>>
> >>>> Together with the learn_from_arp_request=false knob this would cover
> >>>> both case 1 (join switch) and case 2 (external switch).
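For readers following along, the three-tier lookup proposed just above can be sketched roughly like this. It is a toy Python model of the matching order only, not how ovn-northd generates logical flows. The packet fields are plain dict keys, and the port names (`patch-to-dr` and so on) and the returned action strings are invented for the example.

```python
ARP_REQUEST = 1


def l2_lookup(pkt, garp_pairs, router_port_by_ip):
    """Return (priority, action) of the first matching tier.

    garp_pairs: set of (eth.src, arp.tpa) pairs for configured router IPs
        and dnat_and_snat IPs (hypothetical representation).
    router_port_by_ip: maps an IP owned by an OVN router port to the
        patch port leading to that router (hypothetical names).
    """
    is_arp_request = pkt.get("arp_op") == ARP_REQUEST

    # Priority 80: self originated GARP, flood on non-patch ports only.
    if is_arp_request and (pkt["eth_src"], pkt["arp_tpa"]) in garp_pairs:
        return 80, "flood-non-patch-ports"

    # Priority 75: target IP is owned by an OVN router port, unicast it.
    if is_arp_request and pkt["arp_tpa"] in router_port_by_ip:
        return 75, router_port_by_ip[pkt["arp_tpa"]]

    # Priority 1: any other broadcast packet is flooded.
    return 1, "flood-all-ports"


# Join switch example from earlier in the thread (symbolic addresses).
garp_pairs = {("M2", "IP2"), ("M3", "IP3")}
router_port_by_ip = {
    "IP1": "patch-to-dr",
    "IP2": "patch-to-router2",
    "IP3": "patch-to-router3",
}

# router2 resolving the distributed router's IP1: unicast at priority 75.
resolve_dr = l2_lookup(
    {"eth_src": "M2", "arp_op": ARP_REQUEST, "arp_tpa": "IP1"},
    garp_pairs, router_port_by_ip)

# router2 announcing its own IP2 (GARP): priority 80.
garp = l2_lookup(
    {"eth_src": "M2", "arp_op": ARP_REQUEST, "arp_tpa": "IP2"},
    garp_pairs, router_port_by_ip)

# ARP for an address OVN does not own: falls through to priority 1.
external = l2_lookup(
    {"eth_src": "M2", "arp_op": ARP_REQUEST, "arp_tpa": "10.10.10.1"},
    garp_pairs, router_port_by_ip)
```

The difference from the priority-80 flow quoted earlier is that the top tier matches on the (eth.src, arp.tpa) pair instead of eth.src alone, so an ordinary ARP request from M2 for IP1 is no longer caught by it and reaches the unicast tier.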
> >>>>
> >>>> Wdyt?
> >>>>
> >>> Would the "learn_from_arp_request=false knob" cover both cases? If yes,
> >>> we don't need to add more flows of priority 80, or more accurately:
> >>> whether to update the priority-80 flows is not directly related to the
> >>> current problem.
> >>>
> >>
> >> Yes, it would, except for the fact that the ARP requests would still be
> >> flooded to all routers (and ignored at the destination). Which is afaiu
> >> what Girish was worried about. In order to address that part too I'm
> >> afraid we have to update the priority-80 flows.
> >>
> >> Regards,
> >> Dumitru
> >>
> >>>>>>
> >>>>>> Han, yes it will work. However, my only concern is that we would
> >>>>>> send all these ARP requests via tunnel to each of 1000 hypervisors
> >>>>>> and these hypervisors will just drop them on the floor when they
> >>>>>> see learn_from_arp_request=false.
> >>>>>
> >>>>> I think maybe it is not a problem since it happens only once on the
> >>>>> Join switch. Once the MAC is learned, it won't broadcast again. It may
> >>>>> be more of a problem on the external LS if periodical GARP is required
> >>>>> there. However, I'd suggest to have some test and see if it is really
> >>>>> a problem, before trying to solve it.
> >>>>>
> >>>>>>
> >>>>>> Han, Dumitru,
> >>>>>>
> >>>>>> Why can't we swap the priorities of the above two flows so that the
> >>>>>> ARP request for NextHop IP known to OVN will be always sent via
> >>>>>> `unicast`?
> >>>>>
> >>>>> If swapped, even GARP won't get broadcasted. Maybe that's not the
> >>>>> desired behavior.
> >>>>>
> >>>>
> >>>> This is definitely not desired as we'd be hitting the prio 75 flow that
> >>>> would send the self originated GARP request (IPx) packet back towards
> >>>> the router port that owns IPx.
> >>>>
> >>>>>>
> >>>>>> Regards,
> >>>>>> ~Girish
> >>>>>>
> >>>>>>>>
> >>>>>>>> 2. External Logical Switch Case
> >>>>>>>>
> >>>>>>>>                      10.10.10.0/24
> >>>>>>>>   -------------------------+--------------------------
> >>>>>>>>                            |
> >>>>>>>>                        localnet
> >>>>>>>>                      +-----+-----+
> >>>>>>>>                      | external  |
> >>>>>>>>         +------------+    LS1    +-------------+
> >>>>>>>>         |            +-----+-----+             |
> >>>>>>>>         |                  |                   |
> >>>>>>>>    10.10.10.2         10.10.10.3          10.10.10.4
> >>>>>>>>       SNAT               SNAT                SNAT
> >>>>>>>>   +-----+-----+      +-----+-----+       +-----------+
> >>>>>>>>   | l3gateway |      | l3gateway |       | l3gateway |
> >>>>>>>>   |   node1   |      |   node2   |       |   node3   |
> >>>>>>>>   +-----------+      +-----------+       +-----------+
> >>>>>>>>
> >>>>>>>> In this case, we have some of the IPs in OVN and some in the
> >>>>>>>> physical network. If we fix (1) above, all the ARP requests for the
> >>>>>>>> OVN's router IPs will be unicast. However, all the ARP requests to
> >>>>>>>> external IPs, say 10.10.10.1 on the "physical router", will be
> >>>>>>>> broadcast. Now, we will see these ARP broadcasts on all the L3
> >>>>>>>> gateway routers. With 'learn_from_arp_request=false' [a], then the
> >>>>>>>> MAC_Binding table will not explode for both ARP and GARP requests.
> >>>>>>>>
> >>>>>>>> So, I don't think GARP requests and replies is the issue here?
> >>>>>>>> Furthermore, learning from the GARP replies are blocked on certain
> >>>>>>>> routers. For example:
> >>>>>>>> https://www.juniper.net/documentation/en_US/junose15.1/topics/concept/ip-gratuitous-arps-transmission-overview.html
> >>>>>>>> says "By default, updating the ARP cache on GARP replies is
> >>>>>>>> disabled on the router.". So, our NAT addresses mapping will not
> >>>>>>>> be learnt.
> >>>>
> >>>> Just as a side note, the above doesn't mean Juniper boxes don't support
> >>>> learning from GARP replies, just that they'd need extra configuration.
> >>>> I don't necessarily think that's a bad thing if properly documented in
> >>>> OVN that we would be generating GARP replies.
> >>>>
> >>>> Regards,
> >>>> Dumitru
> >>>>
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> ~Girish
> >>>>>>>>
> >>>>>>>> [a] - From Han's mail, the meaning of learn_from_arp_request=false -->
> >>>>>>>>> if the TPA is on the router, add a new entry (it means the
> >>>>>>>>> remote wants to communicate with this node, so it makes sense to
> >>>>>>>>> learn the remote as well). Otherwise, ignore it and no new
> >>>>>>>>> entry added.
> >>>>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> You received this message because you are subscribed to the Google
> >>>>>> Groups "ovn-kubernetes" group.
> >>>>>> To unsubscribe from this group and stop receiving emails from it,
> >>>>>> send an email to ovn-kubernetes+unsubscr...@googlegroups.com.
> >>>>>> To view this discussion on the web visit
> >>>>>> https://groups.google.com/d/msgid/ovn-kubernetes/CAAF2STRnem2PeSahuwhro1t%2BQJxchZNC7viq8n-ngM9KU%2B%2B-Xw%40mail.gmail.com
> >>
> >> _______________________________________________
> >> discuss mailing list
> >> disc...@openvswitch.org
> >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss