On 3/26/26 10:35 PM, Rukomoinikova Aleksandra wrote:
> Hi,
> 

Hi Aleksandra,

> I’d like to bring up an idea for discussion regarding the implementation
> of stateless load balancing support. I would really appreciate your
> feedback. I’m also open to alternative approaches or ideas that I may
> not have considered. Thank you in advance for your time and input!
> 

Thanks for researching this, it's very nice work!

> So, we would like to move towards stateless traffic load balancing.
> Here is how it would differ from the current approach:
> At the moment, when we have a load balancer on a router with DGP ports,
> conntrack state is stored directly on the gateway that is currently
> hosting the DGP. So, we select a backend for the first packet of a
> connection, and after that, based on the existing conntrack entry, we
> no longer perform backend selection. Instead, we rely entirely on the
> stored conntrack record for subsequent packets.
> 
> One of the key limitations of this solution is that gateway nodes
> cannot be horizontally scaled, since conntrack state is stored on a
> single node. Achieving the ability to horizontally scale gateway nodes
> is actually one of our goals.
> 
> Here are a few possible approaches I see:
> 1) The idea of synchronizing conntrack state already exists in the community, 
> but it seems rather
>    outdated and not very promising.

Yeah, this on its own has a lot of potential scalability issues so we
never really pursued it I guess.

> 2) Avoid storing conntrack state on the GW node and instead perform
>    stateless load balancing for every packet in a connection, while
>    keeping conntrack state on the node where the virtual machine
>    resides - this is the approach I am currently leaning toward.

Maybe we _also_ need to store conntrack state on the GW node, I'll
detail below.

> 
> Now in OVN we have the use_stateless_nat option for LBs, but it comes
> with several limitations:
> 1) It works by selecting a backend in a stateless manner for every
>    packet, while for return traffic it performs 1:1 SNAT. So, for
>    traffic from a backend (ip.src && src.port), the source IP is
>    rewritten to lb.vip. This only works correctly if a backend belongs
>    to a single lb.vip. Otherwise, there is a risk that return traffic
>    will be SNATed to the wrong lb.vip.
> 2) There is also an issue with preserving TCP sessions when the number
>    of backends changes. Since backend selection is done using select(),
>    any change in the number of backends can break existing TCP
>    sessions. This problem is also relevant to the solution I proposed
>    above - I will elaborate on it below.
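To make the first limitation concrete, here is a tiny Python sketch
(purely illustrative, not OVN code; all names are made up) of why a
stateless 1:1 reverse SNAT needs each backend to belong to exactly one
VIP:

```python
# With a stateless reverse SNAT, return traffic from a backend is
# rewritten to "the" VIP that backend belongs to.  If one backend sits
# behind two VIPs, that reverse mapping is ambiguous.

def build_reverse_snat_map(lbs):
    """Map backend -> VIP for the stateless reverse SNAT rewrite."""
    rev = {}
    for vip, backends in lbs.items():
        for backend in backends:
            if backend in rev and rev[backend] != vip:
                raise ValueError(
                    f"backend {backend} serves both {rev[backend]} and "
                    f"{vip}: return traffic may be SNATed to the wrong VIP")
            rev[backend] = vip
    return rev

# One VIP per backend: the reverse rewrite is unambiguous.
ok = build_reverse_snat_map({"10.0.0.10:80": ["192.168.0.2:8080"]})

# The same backend behind two VIPs: the stateless rewrite cannot know
# which VIP the client originally targeted.
try:
    build_reverse_snat_map({
        "10.0.0.10:80": ["192.168.0.2:8080"],
        "10.0.0.11:80": ["192.168.0.2:8080"],
    })
except ValueError as e:
    print(e)
```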
> 
> More details about my idea (I will attach some code below - this is
> not production-ready, just a prototype for testing [1]):
> 1. A packet arrives at the gateway looking like this:
>    eth.dst == dgp_mac
>    eth.src == client_mac
>    ip.dst == lb.vip
>    ip.src == client_ip
>    tcp.dst == lb.port
>    tcp.src == client_port
> 2. We detect that the packet is addressed to the load balancer →
>    perform select over the backends.
> 3. We route traffic directly to the selected backend by changing
>    eth.dst to the backend’s MAC, while keeping `ip.dst == lb.vip`.

So what needs to be persisted is actually the MAC address of the backend
that was selected (through whatever hashing method).

> 4. We pass the packet further down the processing pipeline. At this
>    point, it looks like:
>    eth.dst == backend_mac
>    eth.src == (source port of the switch where the backend is connected)
>    ip.dst == lb.vip
>    ip.src == client_ip
>    tcp.dst == lb.port
>    tcp.src == client_port
> 5. The packet goes through the ingress pipeline of the switch where
>    the backend port resides → then it is sent through the tunnel.

This is neat indeed because the LS doesn't care about the destination IP
but it cares about the destination MAC (for the L2 lookup stage)!

> 6. The packet arrives at the node hosting the backend VM, where we
>    perform egress load balancing and store conntrack state.
> 7. The server responds, and the return traffic is SNATed at the
>    ingress of the switch pipeline.
> 
> This has already been implemented in code for testing and it’s
> working. So, storing conntrack state on the node where the virtual
> machine resides helps address the first limitation of the
> use_stateless_nat option.
> 
> Regarding the second issue: we need to ensure session persistence when
> the number of backends changes. In general, as I understand it,
> stateless load balancing in such systems is typically based on
> consistent hashing. However, consistent hashing alone is not
> sufficient, since it does not preserve 100% of connections. I see the
> solution as some kind of additional layer on top of consistent
> hashing.
> 
> First, about consistent hashing in OVS (correct me if I’m wrong):
> Currently, out of the two hashing methods in OVS, only `hash` provides
> consistency, since it is based on rendezvous hashing and relies on the
> bucket id. For this to work correctly, bucket_ids must be preserved
> when the number of backends changes. However, OVN recreates the
> OpenFlow group every time the number of backends changes. At the
> moment, when the number of backends changes, we create a bundle whose
> first piece is an ADD group message, followed by
> INSERT_BUCKET/REMOVE_BUCKET messages. If we could rewrite this part to
> support granular insertion/removal of backends in the group - using
> INSERT_BUCKET/REMOVE_BUCKET without recreating the group - we could
> make backend selection consistent for the hash method. I wasn’t able
> to fully determine whether there are any limitations to this approach,
> but I did manually test session preservation using ovs-ofctl while
> inserting/removing buckets. When removing buckets from a group,
> sessions associated with other buckets were preserved.
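For reference, the bucket-id-preserving behavior can be modeled with a
minimal rendezvous (HRW) hashing sketch (illustrative Python, not the
OVS implementation): removing one bucket only remaps the flows that had
selected that bucket, as long as the surviving bucket ids are
unchanged.

```python
import hashlib

def hrw_select(flow_key, bucket_ids):
    """Pick the bucket with the highest hash score for this flow."""
    def score(bucket_id):
        h = hashlib.sha256(f"{flow_key}:{bucket_id}".encode()).digest()
        return int.from_bytes(h[:8], "big")
    return max(bucket_ids, key=score)

buckets = [1, 2, 3, 4]
flows = [f"10.0.0.{i}:4{i:04d}" for i in range(200)]
before = {f: hrw_select(f, buckets) for f in flows}

# Remove bucket 3 without renumbering the others (the
# INSERT_BUCKET/REMOVE_BUCKET case): only flows that mapped to
# bucket 3 move; everything else keeps its backend.
after = {f: hrw_select(f, [1, 2, 4]) for f in flows}
moved = [f for f in flows if before[f] != after[f]]
assert all(before[f] == 3 for f in moved)
```

This is exactly the property that breaks when the whole group is
recreated and bucket ids are reassigned.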
> 
> At the same time, I understand the downsides of using hash: it is
> expensive, since we install dp_flows that match the full 5-tuple of a
> connection, which leads to a large number of datapath flows.
> Additionally, there is an upcall for every SYN packet. Because of
> this, I’m not sure how feasible it is, but it might be worth thinking
> about ways to make dp_hash consistent. I don’t yet have a concrete
> proposal here, but I’d really appreciate any ideas or suggestions in
> this direction.
> 
> As for the additional layer on top of consistent hashing, I’ve come up
> with two potential approaches, and I’m not yet sure which one would be
> better:
> 1) Using the learn action
> 2) Hash-based sticky sessions in OVS
> 

I'd say we can avoid the need for consistent hashing if we add a third
alternative here:

In step "3" in your idea above, on the GW node, after we've selected
the backend (MAC) somehow (e.g., we could still just use dp_hash), we
commit that session to conntrack and store the MAC address as metadata
(e.g. in the ct_label, like we do for ecmp-symmetric-reply routes).

Subsequent packets on the same session don't actually need to be
hashed: we'll get the MAC to be used as destination from the conntrack
state, so any changes to the set of backends won't be problematic.  We
would have to flush conntrack for backends that get removed (we do that
for regular load balancers too).
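As a toy model of what I mean (illustrative Python, not ovn-controller
code; names are made up): hash only the first packet of a session,
remember the chosen MAC as conntrack metadata, and serve later packets
from that state.

```python
conntrack = {}  # 5-tuple -> backend MAC (the "ct_label" metadata)

def gw_forward(five_tuple, backends):
    """First packet: select a backend and commit it to conntrack.
    Later packets: reuse the committed MAC, no hashing involved."""
    mac = conntrack.get(five_tuple)
    if mac is None:
        mac = backends[hash(five_tuple) % len(backends)]
        conntrack[five_tuple] = mac
    return mac

def flush_backend(mac):
    """Mirror of the ct_flush handling when a backend is removed."""
    for k in [k for k, v in conntrack.items() if v == mac]:
        del conntrack[k]

backends = ["mac-a", "mac-b", "mac-c"]
sess = ("172.16.0.5", "10.0.0.10", 40000, 80, "tcp")
first = gw_forward(sess, backends)

backends.append("mac-d")                    # backend set changes...
assert gw_forward(sess, backends) == first  # ...established session sticks
```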

I know this might sound counterintuitive because your proposal was to
make the load balancing "stateless" on the GW node and I'm actually
suggesting that the GW node processing rely on conntrack (stateful).

But..

You mentioned that one of your goals is:

> One of the key limitations of this solution is that gateway nodes
> cannot be horizontally scaled, since conntrack state is stored on a
> single node. Achieving the ability to horizontally scale gateway nodes
> is actually one of our goals.

With my proposal I think you'll achieve that.  We'd have to be careful
with the case when a DGP moves to a different chassis (HA failover) but
for that we could combine this solution with using a consistent hash on
all chassis for the backend selection (something you mention you're
looking into anyway).

Now, if your goal is to avoid conntrack completely on the gateway
chassis, my proposal breaks that.  Then we could implement a solution
similar to the one you suggested with the learn action, though that
might be heavier on the datapath than using conntrack.  I don't have
numbers to back this up but maybe we can find a way to benchmark it.

> With an additional layer, we can account for hash rebuilds and
> inaccuracies introduced by consistent hashing. It’s not entirely clear
> how to handle these cases, considering that after some time we need to
> remove flows when using `learn`, or clean up connections in the hash.
> We also need to manage connection removal when the number of backends
> changes.

Or when backends go away.. which makes it more complex to manage from
ovn-controller if we use learn flows, I suspect (if we use conntrack we
have the infra to do that already, we use it for regular load balancers
that have ct_flush=true).

> 
> By using such a two-stage approach, ideally, we would always preserve
> the session for a given connection, losing connections only in certain
> cases. For example, suppose we have two GW nodes:
> * The first SYN packet arrives at the first GW - we record this
>   connection in our additional layer (using learn or the OVS hash) and
>   continue routing all packets for this connection based on it.
> * If a packet from this same connection arrives at the second GW, the
>   additional layer there (learn or the OVS hash) will not have the
>   session. If the number of backends has changed since the backend was
>   selected for this connection on the first GW, there is a chance it
>   will not hit the same backend and the session will be lost.
> 
> The downsides I see for the first solution are:
> 1. More complex handling of cleanup for old connections and removal of
>    hashes for expired backends.
> 
> The downsides of the learn action are:
> 1. High load on OVS at a high connection rate.
> 2. If I understand correctly, this also results in a large number of
>    dp_flows that track each individual 5-tuple hash.
> 
> I would be glad to hear your thoughts on a possible implementation of
> dp_hash consistency, as well as feedback on the overall architecture.
> These are all the ideas I’ve been able to develop so far. I would
> greatly appreciate any criticism, suggestions, or alternative
> approaches. It would also be interesting to know whether anyone else
> is interested in this new optional mode for load balancers =)
> 
> [1] https://github.com/Sashhkaa/ovn/commits/deferred-lb-dnat/ - Here's the 
> code, just in case
> 

Hope my thoughts above make some sense and that I don't just create
confusion. :)

Regards,
Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev