On 10/16/24 13:11, Dumitru Ceara wrote:
> On 10/16/24 12:43, Ilya Maximets wrote:
>> On 10/16/24 09:53, Dumitru Ceara wrote:
>>> Thanks Ilya and Numan, for your replies!
>>>
>>> On 10/15/24 18:57, Numan Siddique wrote:
>>>> On Tue, Oct 15, 2024 at 11:41 AM Ilya Maximets <[email protected]> wrote:
>>>>>
>>>>> On 10/15/24 15:51, Dumitru Ceara wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> We received a bug report from the ovn-kubernetes development team:
>>>>>> https://issues.redhat.com/browse/FDP-871
>>>>>>
>>>>>> Consider a gateway router with load balancers attached to it. If:
>>>>>>
>>>>>> - the gateway router has a router port attached to multiple IP
>>>>>>   networks
>>>>>>
>>>>>> and
>>>>>>
>>>>>> - the router is configured with lb_force_snat_ip=router_ip (SNAT to
>>>>>>   the router IP configured on the interface the traffic is sent out
>>>>>>   on _if the packet has been load balanced_ [0])
>>>>>>
>>>>>> then ovn-northd cannot know which router IP should be used when
>>>>>> SNAT-ing the packets going out. Today ovn-northd chooses the "first"
>>>>>> IP.
>>>>>>
>>>>>> E.g., in a sandbox:
>>>>>>
>>>>>> ovn-nbctl lr-add lr -- set logical_router lr options:chassis=chassis-1 \
>>>>>>     -- set logical_router lr options:lb_force_snat_ip=router_ip \
>>>>>>     -- lrp-add lr lrp 00:00:00:00:00:01 1.1.1.1/24 2.2.2.2/24 \
>>>>>>     -- ls-add ls \
>>>>>>     -- lsp-add ls lsp -- lsp-set-type lsp router \
>>>>>>     -- set logical_switch_port lsp options:router-port=lrp \
>>>>>>     -- lb-add lb 42.42.42.42:80 43.43.43.43:80 tcp \
>>>>>>     -- lr-lb-add lr lb
>>>>>>
>>>>>> northd logs:
>>>>>> 2024-10-15T13:40:02.484Z|00010|northd|WARN|Logical router port "lrp"
>>>>>> is configured with multiple IPv4 addresses. Only the first IP
>>>>>> [1.1.1.1] is considered as SNAT for load balancer
>>>>>>
>>>>>> However, the problem is that NB.Logical_Router_Port.networks is a
>>>>>> _set_ of strings (networks).
>>>>>> Sets are unordered (it happens that ovsdb-server sorts them
>>>>>> lexicographically, but that's no guarantee), so the "first" value is
>>>>>> not really something that can be controlled by the CMS.
>>>>>>
>>>>>> The only idea I could come up with until now to fix this is to add
>>>>>> another version of the lb_force_snat_ip option, but this time on the
>>>>>> router port. This would allow the CMS to explicitly specify the SNAT
>>>>>> IP individually, per port. However, that's an (ugly?) feature and
>>>>>> won't be usable by ovn-kubernetes until OVN 25.03 is released, if we
>>>>>> implement it.
>>>>>>
>>>>>> Does anyone have any other simpler and cleaner ideas about how we
>>>>>> could allow the CMS to choose one of the router port IPs?
>>>>>
>>>>> Should we allow the CMS to choose? Or can we make a decision based on
>>>>> the chosen next hop? Your example above is not good for that, but in
>>>>> a typical setup a router will not route just random addresses; it
>>>>> will route to next hops that match one of the router port networks.
>>>>>
>>>>> So, maybe we can look at the networks and choose the IP that is in
>>>>> the same network as the next hop the packet is going to?
>>>>>
>>>>> Today, in case the next hop doesn't belong to the output port's
>>>>> networks, OVN chooses the "first" one. We can keep doing that, but
>>>>> choose the matching one otherwise.
>>>>>
>>>
>>> I'm afraid this doesn't really help with the use case ovn-kubernetes
>>> has.
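[Editor's note: the "first" element the thread refers to is just the lexicographically smallest string once ovsdb-server sorts the set, which can differ from the order the CMS wrote. A tiny illustration in plain Python, not OVN code; the addresses are made up:]

```python
# ovsdb-server happens to store set elements sorted as strings (no
# guarantee of this behavior), so the "first" network is the
# lexicographically smallest string, not the one the CMS listed first.
networks = ["2.2.2.2/24", "10.0.0.1/24"]  # CMS lists 2.2.2.2 first

first = sorted(networks)[0]
# String comparison: "1" < "2", so "10.0.0.1/24" sorts before
# "2.2.2.2/24" even though numerically 2.2.2.2 < 10.0.0.1.
print(first)  # -> 10.0.0.1/24
```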
>>> Let me depict their topology in more detail:
>>>
>>>             AZ1                            AZ2
>>>
>>>         1.1.0.1/16                     1.1.0.2/16
>>>             │                              │
>>>             │                              │
>>>          ┌──┼───┐                       ┌──┼───┐
>>>          │      │                       │      │
>>>          │ GR1  ┼──┐                ┌───┼ GR2  │
>>>          │      │  │                │   │      │
>>>          └──────┘  │                │   └──────┘
>>>    100.65.0.1/16   │                │   100.65.0.2/16
>>>   10.100.200.1/16  │                │  10.100.200.1/16
>>>                    │                │
>>>                    │    ┌─────┐    │
>>>                    │    │     │    │
>>>                    └────┼ TS  ┼────┘
>>>                         │     │
>>>                         └┬───┬┘
>>>                          │   │
>>>        10.100.200.11/16  │   │  10.100.200.21/16
>>>           ┌──────┐       │   │       ┌──────┐
>>>           │      │       │   │       │      │
>>>           │ POD1 ┼───────┘   └───────┼ POD2 │
>>>           │      │                   │      │
>>>           └──────┘                   └──────┘
>>>
>>> POD1's default gateway is 10.100.200.1.
>>> POD2's default gateway is also 10.100.200.1.
>>>
>>> The unusual thing is that 10.100.200.1 is assigned as an interface
>>> address on both GR1's router port and GR2's router port. In theory
>>> this is wrong because the IP is duplicated in TS's broadcast domain,
>>> but because TS is a transit switch this "works":
>>> - ARP requests from POD-X (AZ-X) are not flooded to all ports;
>>>   instead, TS in AZ-X proxy-ARP replies to them because there's an LRP
>>>   in that zone with the 10.100.200.1 IP
>>> - 10.100.200.1 is never seen as a packet source/destination IP on the
>>>   TS
>>>
>>> Let's also consider there's a load balancer attached to GR2:
>>>   VIP: 42.42.42.42:80 TCP
>>>   backend: 10.100.200.11:80 (POD1)
>>>
>>> TCP traffic entering GR2 (from the top) with destination
>>> 42.42.42.42:80 is load balanced (DNATed) to 10.100.200.11:80.
>>>
>>> Because replies for this traffic must be forced to come back on the
>>> same path, GR2 also has lb_force_snat_ip=router_ip. "router_ip" is
>>> used as the value because that load balancer might also load balance
>>> traffic going out other interfaces of the GR (1.1.0.x above), which
>>> have different IPs than the ones on the port towards TS.
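[Editor's note: a sketch of the GR2 side of the setup described above, in sandbox terms. The names (gr2, gr2-ts, lb-pod1, chassis-az2) and the MAC address are made up for illustration; only the IPs and the load balancer come from the thread. This is a config fragment that assumes a running OVN sandbox:]

```shell
# Illustrative only: recreates GR2's TS-facing configuration, with the
# router port carrying two networks and lb_force_snat_ip=router_ip set.
ovn-nbctl lr-add gr2 \
    -- set logical_router gr2 options:chassis=chassis-az2 \
    -- set logical_router gr2 options:lb_force_snat_ip=router_ip \
    -- lrp-add gr2 gr2-ts 00:00:00:00:02:01 100.65.0.2/16 10.100.200.1/16 \
    -- lb-add lb-pod1 42.42.42.42:80 10.100.200.11:80 tcp \
    -- lr-lb-add gr2 lb-pod1
```

With this configuration, northd emits the "multiple IPv4 addresses" warning shown earlier; which of the two networks supplies the SNAT IP is exactly the ambiguity under discussion.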
>>>
>>> IIUC, ovn-kubernetes' intention was to use the "first" IP configured
>>> on the router port connecting the GRs to TS as the SNAT IP, that is
>>> 100.65.0.2:
>>>
>>> GR2-port IPs: [100.65.0.2/16, 10.100.200.1/16]  # I know this is a set
>>> but it was incorrectly considered a list.
>>>
>>> So, after load balancing, the packet is also SNATed, so that the tuple
>>> now looks like 100.65.0.2:X -> 10.100.200.11:80. That is, when the
>>> packet reaches POD1 it has been masqueraded to look like it originated
>>> from GR2.
>>>
>>> If we use your suggestion and automatically select the IP that matches
>>> the next hop, in this case OVN would choose 10.100.200.1 as the SNAT
>>> IP. As mentioned above, this IP is duplicated in all AZs, so it won't
>>> achieve the result of getting replies back on the same path, and
>>> replies from POD1 will go to GR1 (and get dropped there).
>>>
>>> I am aware that, aside from the potential undefined behavior in OVN,
>>> there's also a network topology issue here because of the duplicated
>>> 10.100.200.1 IP. We are trying to suggest an alternative topology [1]
>>> but that would require a new type of router in OVN (transit router).
>>
>>
>> Ack. Thanks for the extra context, as it is not clear what the full
>> picture is just from the description here and in the linked jira issue.
>>
>> I see two separate issues here:
>>
>> 1. 'lb_force_snat_ip=router_ip' is poorly designed from the beginning,
>>    since router ports can have multiple IPs and it's unclear which one
>>    will be used. Yes, northd warns that it will use the "first" one,
>>    but in the general case it is not predictable which one that will
>>    be.
>>
>
> FWIW, I think that log is very confusing. Maybe we should change it to
> "WARN: undefined behavior ...".
It's not really undefined; northd knows which IP will be used and
notifies the user through this message. So it may be undefined from the
user's perspective, but it is defined from northd's perspective. The
documentation could be updated, though, because it doesn't mention this
aspect at all today.

>
>> 2. ovn-kubernetes' topology is incorrect from the networking
>>    perspective. Having the same IP owned by multiple entities attached
>>    to the same L2 switch is not a correct topology, so issues are
>>    expected.
>>
>>    IIUC, the duplicated IP is there because we want a live-migrated VM
>>    to keep using the same default gateway, but this gateway should be
>>    on the same node where this VM ended up, not the one from the
>>    original node. So, without a "transit router" this is not really a
>>    possibility, unless IPs are duplicated in a hacky way, as is done
>>    here.
>>
>
> Right, I agree; I think it makes complete sense for OVN to implement
> the "transit router" functionality - it's probably useful for other
> users too, not only for ovn-kubernetes. But that's a new feature and it
> will take at least the current release cycle to implement.
>
>> However, stupid problems call for stupid solutions. What happens if,
>> in addition to choosing the SNAT IP by the network that matches the
>> next hop, ovn-kubernetes configures a second IP from the pod subnet,
>> unique across the cluster? E.g.:
>>
>> GR1-port IPs: [10.100.200.1/32, 10.100.200.42/16]
>> GR2-port IPs: [10.100.200.1/32, 10.100.200.43/16]
>>
>> The 10.100.200.1 can still be used as the default gateway for pods,
>> because it is still perfectly reachable and the same on every node.
>> However, only the unique /16 IP from the pod subnet will be chosen for
>> SNAT, because that is the network that contains the LB backend; the
>> /32 network doesn't contain it. Since the second /16 IP is unique
>> across the cluster, the SNATed traffic will return on the same path it
>> arrived on.
>>
>> Such an approach solves the first issue in a reasonable way without
>> the need for extra knobs, but keeps the second issue as it is today,
>> until we have a transit router to solve it.
>>
>> Will that work?
>>
>
> I think we could even avoid the .1 completely if we go that way. We
> might even be able to fix the topology without code changes in OVN.
>
> We could have:
>
> GR1-port IPs: [10.100.200.42/16]
> GR2-port IPs: [10.100.200.43/16]
>
> and enable proxy-arp on GR1-port and GR2-port to reply with the GR
> port's MAC address to all ARP requests targeting 10.100.200.1. I don't
> think there's a strict requirement that the default gateway IP
> (10.100.200.1) replies to any IP traffic. It's probably enough if pods
> can just route traffic through that IP - which we could achieve with
> proxy-arp.

proxy-arp should work, I suppose. But it's also sort of an abuse of the
functionality: proxy-arp is not meant to represent non-existing
addresses; it is meant for addresses from the same L2 domain that for
some reason require routing to reach them. Having the GR own this
address is what we actually want from the networking point of view.

But I agree that this may be a reasonable stopgap until we have a
transit router, since the topology is incorrect networking-wise anyway.

I still think that issue 1 should be addressed, though, by choosing the
appropriate network instead of a random one. If not as a bug fix, then
at least as a new behavior.

>
>> I'm not sure if there will be any problem with overlapping networks on
>> the router port, but it feels like northd ignores that fact, so it
>> should be fine.
>>
>
> IMO the only downside of doing this is that ovn-k will be consuming one
> IP per node from the pod subnet, which might cause issues when clusters
> are scaled up and new nodes are added.

Doesn't sound like a big deal. The pods-to-nodes ratio is typically
high, so consuming an extra 500 addresses from a /16 network on a
500-node cluster should not be a problem, IMO.
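[Editor's note: the selection rule discussed in this thread (prefer the router port IP whose network contains the next hop, falling back to the current "first" behavior otherwise) can be sketched in a few lines. This is illustrative Python, not actual ovn-northd code:]

```python
import ipaddress

def choose_snat_ip(port_networks, next_hop):
    """Pick the port IP whose network contains next_hop; fall back to
    the first entry otherwise. Sketch of the proposed rule only."""
    hop = ipaddress.ip_address(next_hop)
    for cidr in port_networks:
        iface = ipaddress.ip_interface(cidr)
        if hop in iface.network:
            return str(iface.ip)
    # No network matches: keep today's behavior and use the first entry.
    return str(ipaddress.ip_interface(port_networks[0]).ip)

# With the proposed GR2 addressing, the LB backend 10.100.200.11 falls
# inside the /16 network but not the /32, so the unique per-node IP is
# selected:
# choose_snat_ip(["10.100.200.1/32", "10.100.200.43/16"], "10.100.200.11")
# -> "10.100.200.43"
```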
>
>>>
>>>>> This should be backward compatible, as the current behavior is
>>>>> random.
>>>>>
>>>>> WDYT?
>>>>
>>>> How about accepting the router port name as another value to
>>>> options:lb_force_snat_ip?
>>>>
>>>> i.e. ovn-nbctl set logical_router lr \
>>>>          options:lb_force_snat_ip=<router_port_name>
>>>>
>>>
>>> I don't think this would really work either. In the ovn-kubernetes
>>> use case they actually need the lb_force_snat_ip behavior on all
>>> ports that forward load balanced traffic. These ports might have
>>> multiple sets of IPs (more than one IP per port), as shown above.
>>>
>>> Thanks,
>>> Dumitru
>>>
>>>> Thanks,
>>>> Numan
>>>>
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Dumitru
>>>>>>
>>>>>> [0] https://github.com/ovn-org/ovn/commit/c6e21a23bd8c
>>>
>>> [1] https://issues.redhat.com/browse/FDP-872
>>>
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
