On 3/27/26 3:08 PM, Rukomoinikova Aleksandra wrote:
> On 27.03.2026 12:58, Dumitru Ceara wrote:
>> On 3/26/26 10:35 PM, Rukomoinikova Aleksandra wrote:
>>> Hi,
>>>
>> Hi Aleksandra,
> 
> 
> Hi Dumitru! Thank you for your answer!
> 

Hi Aleksandra,

> The main reason I didn't consider conntrack on the gw node is that I'm 
> afraid return traffic might go through a different gateway: the SYN 
> would go through one gateway, but the SYN+ACK from the server might 
> come back through another. In that case, the next client packet would 
> be dropped at the first gateway because conntrack would consider it 
> invalid.
> 

Hmm, I didn't imagine this scenario.  Wouldn't this kind of asymmetric
forwarding cause issues anyway?  If it's a DGP, wouldn't reply traffic
from the backend come via the gateway anyway?

Or do you want to do direct client return on the backend side of things?

> Or do you mean using it not as a "real" conntrack in the usual sense, 
> but just as a convenient base to extract packet metadata from?
> In that case, it is indeed convenient for removing inactive backends, 
> but we would need to clean up conntrack entries ourselves after some 
> time, because we can't rely on the conntrack timers anymore since 
> states won't be updated properly.
> Updating conntrack entries from OVN probably also seems crazy.
> 
>>
>>> I'd like to bring up an idea for discussion regarding the implementation 
>>> of stateless load balancing support. I would really appreciate your 
>>> feedback. I'm also open to alternative approaches or ideas that I may 
>>> not have considered. Thank you in advance for your time and input!
>>>
>> Thanks for researching this, it's very nice work!
>>
>>> So, we would like to move towards stateless traffic load balancing. Here 
>>> is how it would differ from the current approach:
>>> At the moment, when we have a load balancer on a router with DGP ports, 
>>> conntrack state is stored directly on the gateway that is currently 
>>> hosting the DGP. So, we select a backend for the first packet of a 
>>> connection and, after that, based on the existing conntrack entry, we 
>>> no longer perform backend selection. Instead, we rely entirely on the 
>>> stored conntrack record for subsequent packets.
>>>
>>> One of the key limitations of this solution is that gateway nodes cannot 
>>> be horizontally scaled, since conntrack state is stored on a single 
>>> node. Achieving the ability to horizontally scale gateway nodes is 
>>> actually one of our goals.
>>>
>>> Here are a few possible approaches I see:
>>> 1) The idea of synchronizing conntrack state already exists in the 
>>> community, but it seems rather
>>>     outdated and not very promising.
>> Yeah, this on its own has a lot of potential scalability issues so we
>> never really pursued it I guess.
>>
>>> 2) Avoid storing conntrack state on the GW node and instead perform 
>>>     stateless load balancing for every packet in a connection, while 
>>>     keeping conntrack state on the node where the virtual machine 
>>>     resides. This is the approach I am currently leaning toward.
>> Maybe we _also_ need to store conntrack state on the GW node, I'll
>> detail below.
>>
>>> Now in OVN we have the use_stateless_nat option for LBs, but it comes 
>>> with several limitations:
>>> 1) It works by selecting a backend in a stateless manner for every 
>>>     packet, while for return traffic it performs 1:1 SNAT. So, for 
>>>     traffic from a backend (ip.src && src.port), the source IP is 
>>>     rewritten to lb.vip. This only works correctly if a backend belongs 
>>>     to a single lb.vip. Otherwise, there is a risk that return traffic 
>>>     will be SNATed to the wrong lb.vip.
>>> 2) There is also an issue with preserving TCP sessions when the number 
>>>     of backends changes. Since backend selection is done using select(), 
>>>     any change in the number of backends can break existing TCP 
>>>     sessions. This problem is also relevant to the solution I proposed 
>>>     above; I will elaborate on it below.
>>>
>>> More details about my idea (I will attach some code below; this is not 
>>> production-ready, just a prototype for testing [1]):
>>> 1. A packet arrives at the gateway looking like this:
>>>     eth.dst == dgp_mac
>>>     eth.src == client_mac
>>>     ip.dst == lb.vip
>>>     ip.src == client_ip
>>>     tcp.dst == lb.port
>>>     tcp.src == client_port
>>> 2. We detect that the packet is addressed to the load balancer → perform 
>>>     a select over the backends.
>>> 3. We route traffic directly to the selected backend by changing eth.dst 
>>>     to the backend's MAC, while keeping `ip.dst == lb.vip`.
>> So what needs to be persisted is actually the MAC address of the backend
>> that was selected (through whatever hashing method).
>>
>>> 4. We pass the packet further down the processing pipeline. At this 
>>>     point, it looks like:
>>>     eth.dst == backend_mac
>>>     eth.src == (source port of the switch where the backend is connected)
>>>     ip.dst == lb.vip
>>>     ip.src == client_ip
>>>     tcp.dst == lb.port
>>>     tcp.src == client_port
>>> 5. The packet goes through the ingress pipeline of the switch where the 
>>>     backend port resides → then it is sent through the tunnel.
>> This is neat indeed because the LS doesn't care about the destination IP
>> but it cares about the destination MAC (for the L2 lookup stage)!
>>
>>> 6. The packet arrives at the node hosting the backend VM, where we 
>>>     perform egress load balancing and store conntrack state.
>>> 7. The server responds, and return traffic is SNATed at the ingress of 
>>>     the switch pipeline.
>>>
>>> This has already been implemented in code for testing and it's working. 
>>> So, storing conntrack state on the node where the virtual machine 
>>> resides helps address the first limitation of the use_stateless_nat 
>>> option.
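If I'm following steps 1-7 correctly, the forwarding path can be sketched in a few lines of Python. This is purely illustrative; all names, MACs, and addresses below are made up, and the real logic would of course live in logical flows, not code like this:

```python
# Toy model of the proposal: the gateway only rewrites eth.dst to the
# chosen backend's MAC and leaves ip.dst == lb.vip untouched; the actual
# DNAT (and conntrack commit) happens on the node hosting the backend VM.

import hashlib

# Hypothetical backend table: MAC -> (backend_ip, backend_port).
BACKENDS = {
    "00:00:00:00:00:01": ("10.0.0.10", 8080),
    "00:00:00:00:00:02": ("10.0.0.11", 8080),
}

def five_tuple_hash(pkt):
    """Stable hash over the 5-tuple, standing in for OVS group selection."""
    key = f"{pkt['ip.src']}|{pkt['ip.dst']}|{pkt['tcp.src']}|{pkt['tcp.dst']}|tcp"
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def gw_pipeline(pkt):
    """Steps 2-3: select a backend and rewrite only the destination MAC."""
    macs = sorted(BACKENDS)
    pkt["eth.dst"] = macs[five_tuple_hash(pkt) % len(macs)]
    return pkt  # ip.dst is still the VIP; the LS forwards on eth.dst alone.

def backend_node_pipeline(pkt):
    """Step 6: on the hypervisor hosting the VM, do the actual DNAT
    (in the real datapath this is where the connection is committed)."""
    ip, port = BACKENDS[pkt["eth.dst"]]
    pkt["ip.dst"], pkt["tcp.dst"] = ip, port
    return pkt

pkt = {"eth.dst": "dgp_mac", "ip.src": "198.51.100.7", "ip.dst": "203.0.113.1",
       "tcp.src": 40001, "tcp.dst": 80}
pkt = backend_node_pipeline(gw_pipeline(pkt))
```

The nice property, as noted below, is that the logical switch only needs the destination MAC for its L2 lookup, so carrying the VIP as ip.dst through the fabric costs nothing.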
>>>
>>> Regarding the second issue: we need to ensure session persistence when 
>>> the number of backends changes. In general, as I understand it, 
>>> stateless load balancing in such systems is typically based on 
>>> consistent hashing. However, consistent hashing alone is not 
>>> sufficient, since it does not preserve 100% of connections. I see the 
>>> solution as some kind of additional layer on top of consistent hashing.
>>>
>>> First, about consistent hashing in OVS (correct me if I'm wrong):
>>> Currently, of the two hashing methods in OVS, only `hash` provides 
>>> consistency, since it is based on rendezvous hashing and relies on the 
>>> bucket id. For this to work correctly, bucket_ids must be preserved 
>>> when the number of backends changes. However, OVN recreates the 
>>> OpenFlow group every time the number of backends changes: we create a 
>>> bundle whose first part is an ADD group message, followed by 
>>> INSERT_BUCKET/REMOVE_BUCKET messages. If we could rewrite this part to 
>>> support granular insertion/removal of backends in the group, using 
>>> INSERT_BUCKET/REMOVE_BUCKET without recreating the group, we could 
>>> make backend selection consistent for the hash method. I wasn't able 
>>> to fully determine whether there are any limitations to this approach, 
>>> but I did manually test session preservation with ovs-ofctl while 
>>> inserting and removing buckets: when removing buckets from a group, 
>>> sessions associated with the other buckets were preserved.
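As a sanity check on the bucket_id argument, here's a toy model of rendezvous hashing. This is only an illustration of the idea, not the actual ofproto code, but it shows why stable bucket_ids matter: removing a bucket remaps only the flows that were on that bucket.

```python
# Rough model of selection_method=hash: each bucket is scored by hashing
# the flow hash together with its bucket_id (rendezvous / highest-random-
# weight hashing), and the highest score wins.  A bucket's score does not
# depend on which other buckets exist, so removing bucket 3 can only
# remap flows whose winner *was* bucket 3.

import hashlib

def score(flow_hash, bucket_id):
    data = f"{flow_hash}:{bucket_id}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def select_bucket(flow_hash, bucket_ids):
    return max(bucket_ids, key=lambda b: score(flow_hash, b))

flows = list(range(1000))
before = {f: select_bucket(f, [1, 2, 3, 4]) for f in flows}
after = {f: select_bucket(f, [1, 2, 4]) for f in flows}  # bucket 3 removed

moved = [f for f in flows if before[f] != after[f]]
# Only flows that were on the removed bucket get remapped.
assert all(before[f] == 3 for f in moved)
```

If OVN instead recreates the group and bucket_ids get reassigned, the scores change for every bucket and far more sessions can move, which matches what you observed with ovs-ofctl.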
>>>
>>> At the same time, I understand the downsides of using hash: it is 
>>> expensive, since we install dp_flows that match the full 5-tuple of a 
>>> connection, which leads to a large number of datapath flows. 
>>> Additionally, there is an upcall for every SYN packet. Because of 
>>> this, I'm not sure how feasible it is, but it might be worth thinking 
>>> about ways to make dp_hash consistent. I don't yet have a concrete 
>>> proposal here, but I'd really appreciate any ideas or suggestions in 
>>> this direction.
>>>
>>> For the additional layer on top of consistent hashing, I've come up 
>>> with two potential approaches, and I'm not yet sure which one would be 
>>> better:
>>> 1) Using the learn action
>>> 2) Hash-based sticky sessions in OVS
>>>
>> I'd say we can avoid the need for consistent hashing if we add a third
>> alternative here:
>>
>> In step "3" of your idea above, on the GW node, after we've selected the
>> backend (MAC) somehow (e.g., we could still just use dp_hash), we commit
>> that session to conntrack and store the MAC address as metadata (e.g., in
>> the ct_label, like we do for ecmp-symmetric-reply routes).
>>
>> Subsequent packets on the same session don't actually need to be hashed;
>> we'll get the MAC to use as the destination from the conntrack state, so
>> any changes to the set of backends won't be problematic.  We would have
>> to flush conntrack for backends that get removed (we do that for regular
>> load balancers too).
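To make this concrete, the behavior I have in mind can be modeled like this. All names here are hypothetical; it's a Python stand-in for what would really be ct_commit plus a MAC stored in ct_label in the logical flows:

```python
# Toy model of the conntrack-based idea: the first packet of a session
# picks a backend MAC (any hash works, consistency not required) and
# "commits" it, like storing the MAC in ct_label; established packets
# just look it up, so backend-set changes don't disturb them.

import hashlib

BACKEND_MACS = ["00:00:00:00:00:01", "00:00:00:00:00:02", "00:00:00:00:00:03"]
ct = {}  # 5-tuple -> backend MAC (stand-in for the ct_label)

def pick_mac(five_tuple, macs):
    h = int.from_bytes(hashlib.sha256(repr(five_tuple).encode()).digest()[:8],
                       "big")
    return macs[h % len(macs)]

def gw_forward(five_tuple, macs):
    # New session: select + commit.  Established: reuse the committed MAC.
    if five_tuple not in ct:
        ct[five_tuple] = pick_mac(five_tuple, macs)
    return ct[five_tuple]

def flush_backend(mac):
    # Equivalent of ct_flush when a backend is removed.
    for k in [k for k, v in ct.items() if v == mac]:
        del ct[k]

conn = ("198.51.100.7", 40001, "203.0.113.1", 80, "tcp")
first = gw_forward(conn, BACKEND_MACS)
# Even if the backend set shrinks, the established session keeps its MAC:
later = gw_forward(conn, BACKEND_MACS[:1])
assert first == later
```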
>>
>> I know this might sound counterintuitive because your proposal was to
>> make the load balancing "stateless" on the GW node and I'm actually
>> suggesting the GW node processing rely on conntrack (stateful).
>>
>> But..
>>
>> You mentioned that one of your goals is:
>>
>>> One of the key limitations of this solution is that gateway nodes cannot be 
>>> horizontally scaled,
>>> since conntrack state is stored on a single node. Achieving ability to 
>>> horizontally scale gateway
>>> nodes is actually one of our goals.
>> With my proposal I think you'll achieve that.  We'd have to be careful
>> with the case when a DGP moves to a different chassis (HA failover) but
>> for that we could combine this solution with using a consistent hash on
>> all chassis for the backend selection (something you mention you're
>> looking into anyway).
>>
>> Now, if your goal is to avoid conntrack completely on the gateway
>> chassis, my proposal breaks that.  Then we could implement a solution
>> similar to what you suggested with the learn action, but that might be
>> heavier on the datapath than using conntrack.  I don't have numbers to
>> back this up but maybe we can find a way to benchmark it.
>>
>>> With an additional layer, we can account for hash rebuilds and the 
>>> inaccuracies introduced by consistent hashing. It's not entirely clear 
>>> how to handle these cases, considering that after some time we need to 
>>> remove flows when using `learn`, or clean up connections in the hash. 
>>> We also need to manage connection removal when the number of backends 
>>> changes.
>> Or when backends go away, which makes it more complex to manage from
>> ovn-controller if we use learn flows, I suspect (if we use conntrack we
>> have the infra to do that already; we use it for regular load balancers
>> that have ct_flush=true).
>>
>>> By using such a two-stage approach, ideally, we would always preserve 
>>> the session for a given connection, losing connections only in certain 
>>> cases. For example, suppose we have two GW nodes:
>>> * The first SYN packet arrives at the first GW: we record this 
>>>    connection in our additional layer (using learn or the OVS hash) and 
>>>    continue routing all packets for this connection based on it.
>>> * If a packet from this same connection arrives at the second GW, the 
>>>    additional layer there (learn or the OVS hash) will not have the 
>>>    session. If the number of backends has changed since the backend 
>>>    was selected for this connection on the first GW, there is a chance 
>>>    the packet will not hit the same backend and the session will be 
>>>    lost.
>>>
>>> The downsides I see for the first solution are:
>>> 1. More complex handling of cleanup for old connections and removal of 
>>> hashes for expired backends.
>>>
>>> The downsides of the learn action are:
>>> 1. High load on OVS at a high connection rate.
>>> 2. If I understand correctly, this also results in a large number of 
>>> dp_flows that track each individual 5-tuple hash.
>>>
>>> I would be glad to hear your thoughts on a possible implementation of 
>>> dp_hash consistency, as well as feedback on the overall architecture. 
>>> These are all the ideas I've been able to develop so far. I would 
>>> greatly appreciate any criticism, suggestions, or alternative 
>>> approaches. It would also be interesting to know whether anyone else 
>>> is interested in this new optional mode for load balancers =)
>>>
>>> [1] https://github.com/Sashhkaa/ovn/commits/deferred-lb-dnat/ - Here's the 
>>> code, just in case
>>>
>> Hope my thoughts above make some sense and that I don't just create
>> confusion. :)
>>

Regards,
Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
