Thanks, Ilya and Numan, for your replies!
On 10/15/24 18:57, Numan Siddique wrote:
> On Tue, Oct 15, 2024 at 11:41 AM Ilya Maximets <[email protected]> wrote:
>>
>> On 10/15/24 15:51, Dumitru Ceara wrote:
>>> Hi all,
>>>
>>> We received a bug report from the ovn-kubernetes development team:
>>> https://issues.redhat.com/browse/FDP-871
>>>
>>> Given a gateway router with load balancers attached to it. If:
>>>
>>> - the gateway router has a router port attached to multiple IP networks
>>>
>>> and
>>>
>>> - the router is configured with lb_force_snat_ip=router_ip (SNAT to the
>>> router IP configured on the interface the traffic is sent out on _if the
>>> packet has been load balanced_ [0])
>>>
>>> Then ovn-northd cannot know which router IP should be used when SNAT-ing
>>> the packets going out. Today ovn-northd chooses the "first" IP.
>>>
>>> E.g., in a sandbox:
>>>
>>> ovn-nbctl lr-add lr -- set logical_router lr options:chassis=chassis-1 \
>>> -- set logical_router lr options:lb_force_snat_ip=router_ip \
>>> -- lrp-add lr lrp 00:00:00:00:00:01 1.1.1.1/24 2.2.2.2/24 \
>>> -- ls-add ls \
>>> -- lsp-add ls lsp -- lsp-set-type lsp router \
>>> -- set logical_switch_port lsp options:router-port=lrp \
>>> -- lb-add lb 42.42.42.42:80 43.43.43.43:80 tcp \
>>> -- lr-lb-add lr lb
>>>
>>> northd logs:
>>> 2024-10-15T13:40:02.484Z|00010|northd|WARN|Logical router port "lrp" is
>>> configured with multiple IPv4 addresses. Only the first IP [1.1.1.1] is
>>> considered as SNAT for load balancer
>>>
>>> However, the problem is that NB.Logical_Router_Port.networks is a _set_
>>> of strings (networks). Sets are unordered (it happens that ovsdb-server
>>> sorts them lexicographically but that's no guarantee) so the "first"
>>> value is not really something that can be controlled by the CMS.
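To illustrate how lexicographic order can surprise: sorting textually, an
address starting with "10." comes before one starting with "2." (the
addresses below are made up):

```shell
# Lexicographic sort is not numeric sort: "10." < "2." as strings.
printf '10.0.0.1/24\n2.2.2.2/24\n' | sort
# Prints 10.0.0.1/24 first, even though 2.2.2.2 is numerically smaller.
```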
>>>
>>> The only idea I could come up with so far to fix this is to add
>>> another version of the lb_force_snat_ip option, but this time on the
>>> router port. This would allow the CMS to explicitly specify the SNAT IP
>>> individually, per port. However, that's an (ugly?) feature and, if we
>>> implement it, won't be usable by ovn-kubernetes until OVN 25.03 is
>>> released.
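For what it's worth, such a per-port knob might look something like this
(the option name and placement are hypothetical; nothing like it exists in
OVN today):

```shell
# Hypothetical, NOT an existing OVN option: explicitly pick the SNAT IP
# used for load-balanced traffic leaving this specific port.
ovn-nbctl set logical_router_port lrp options:lb_force_snat_ip=2.2.2.2
```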
>>>
>>> Does anyone have any other simpler and cleaner ideas about how we could
>>> allow the CMS to choose one of the router port IPs?
>>
>> Should we allow the CMS to choose? Or can we make a decision based on
>> the chosen next hop? Your example above is not a good fit for that,
>> but in a typical setup a router will not route just random addresses;
>> it will route to next hops that match one of the router port's
>> networks.
>>
>> So, maybe we can look at the networks and choose IP that is in the same
>> network as the next hop the packet is going to?
>>
>> Today, in case the next hop doesn't belong to the output port's
>> networks, OVN chooses the "first" one. We can keep doing that, but
>> choose the matching one otherwise.
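To make the proposal concrete, here is a rough sketch of that selection
rule in plain shell (this is not OVN code, the function names are mine,
and it handles IPv4 only):

```shell
# Sketch of the proposed rule: pick the router port IP whose network
# contains the next hop, falling back to the "first" IP as today.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# choose_snat_ip NEXT_HOP "CIDR1 CIDR2 ..."
choose_snat_ip() {
    next_hop=$(ip_to_int "$1")
    first=
    for cidr in $2; do
        ip=${cidr%/*}
        plen=${cidr#*/}
        [ -n "$first" ] || first=$ip
        mask=$(( (0xffffffff << (32 - plen)) & 0xffffffff ))
        if [ $(( $(ip_to_int "$ip") & mask )) -eq $(( next_hop & mask )) ]; then
            echo "$ip"
            return
        fi
    done
    echo "$first"  # no network matched: keep today's behavior
}

choose_snat_ip 2.2.2.100 "1.1.1.1/24 2.2.2.2/24"   # -> 2.2.2.2
choose_snat_ip 9.9.9.9   "1.1.1.1/24 2.2.2.2/24"   # -> 1.1.1.1 (fallback)
```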
>>
I'm afraid this doesn't really help with the use case ovn-kubernetes
has. Let me depict their topology in more detail:
AZ1 AZ2
1.1.0.1/16 1.1.0.2/16
│ │
│ │
┌──┼───┐ ┌──┼───┐
│ │ │ │
│ GR1 ┼──┐ ┌───┼ GR2 │
│ │ │ │ │ │
└──────┘ │ │ └──────┘
100.65.0.1/16 │ │ 100.65.0.2/16
10.100.200.1/16 │ │10.100.200.1/16
│ │
│ │
│ ┌─────┐ │
│ │ │ │
└──┼ TS ┼──┘
│ │
└┬───┬┘
│ │
│ │
│ │
│ │
10.100.200.11/16 │ │ 10.100.200.21/16
┌──────┐ │ │ ┌──────┐
│ │ │ │ │ │
│ POD1 ┼──────┘ └───────┼ POD2 │
│ │ │ │
└──────┘ └──────┘
POD1's default gateway is 10.100.200.1.
POD2's default gateway is also 10.100.200.1.

The unusual thing is that 10.100.200.1 is assigned as an interface
address on both GR1's router port and on GR2's router port. In theory
this is wrong because the IP is duplicated in TS's broadcast domain,
but because TS is a transit switch this "works":
- ARP requests from POD-X (AZ-X) are not flooded to all ports; instead,
TS in AZ-X proxy-ARP replies to them because there's an LRP in that
zone with the 10.100.200.1 IP
- 10.100.200.1 is never seen as packet source/destination IP on the TS
Let's also consider there's a load balancer attached to GR2:
VIP: 42.42.42.42:80 TCP
backend: 10.100.200.11:80 (POD1)
TCP traffic entering GR2 (from the top) with destination 42.42.42.42:80
is load balanced (DNATed) to 10.100.200.11:80.
Because replies for this traffic must be forced to come back on the same
path, GR2 also has lb_force_snat_ip=router_ip. "router_ip" is used as
the value because that load balancer might also load balance traffic
going out other interfaces of the GR (1.1.0.x above), which have
different IPs than the ones on the port towards TS.
IIUC ovn-kubernetes' intention was to use the "first" IP configured on
the router port connecting the GRs to TS as the SNAT IP, that is
100.65.0.2:

GR2-port IPs: [100.65.0.2/16, 10.100.200.1/16] # I know this is a set
but it was incorrectly treated as an ordered list.
So, after load balancing, the packet is also SNATed, so that the tuple
now looks like 100.65.0.2:X -> 10.100.200.11:80. That is, when the
packet reaches POD1 it has been masqueraded to look as if it originated
from GR2.
If we use your suggestion and automatically select the IP that matches
the next hop, in this case OVN would choose 10.100.200.1 as the SNAT
IP. As mentioned above, this IP is duplicated in all AZs, so it won't
achieve the result of getting replies back on the same path: replies
from POD1 will go to GR1 (and get dropped there).
I am aware that aside from the potential undefined behavior in OVN
there's also a network topology issue here because of the duplicated
10.100.200.1 IP. We are trying to suggest an alternative topology [1]
but that would require a new type of router in OVN (transit router).
>> This should be backward compatible, as the current behavior is random.
>>
>> WDYT?
>
> How about accepting the router port name as another value to
> options:lb_force_snat_ip
>
> i.e ovn-nbctl set logical_router lr
> options:lb_force_snat_ip=<router_port_name>
>
I don't think this would really work either. In the ovn-kubernetes use
case they actually need the lb_force_snat_ip behavior on all ports that
forward load-balanced traffic, and those ports might each have multiple
IPs (more than one IP per port), as shown above.
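To illustrate with invented names: even with a hypothetical
options:lb_force_snat_ip=<port-name>, a port in this topology still
carries more than one IP, so the ambiguity remains:

```shell
# Hypothetical commands mirroring the topology above (names invented).
# Naming the port doesn't say whether 100.65.0.2 or 10.100.200.1
# should be used as the SNAT IP.
ovn-nbctl lrp-add gr2 gr2-to-ts 00:00:00:00:00:02 100.65.0.2/16 10.100.200.1/16
```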
Thanks,
Dumitru
> Thanks
> Numan
>
>>
>> Best regards, Ilya Maximets.
>>
>>>
>>> Thanks,
>>> Dumitru
>>>
>>> [0] https://github.com/ovn-org/ovn/commit/c6e21a23bd8c
[1] https://issues.redhat.com/browse/FDP-872
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev