On 3/27/26 3:08 PM, Rukomoinikova Aleksandra wrote:
> On 27.03.2026 12:58, Dumitru Ceara wrote:
>> On 3/26/26 10:35 PM, Rukomoinikova Aleksandra wrote:
>>> Hi,
>>>
>> Hi Aleksandra,
>
> Hi Dumitru! Thank you for your answer!
>
Hi Aleksandra,

> The main reason I didn't consider conntrack on the gw node is that
> I'm afraid return traffic might go through a different gateway. For
> example, the SYN could go through one gateway, but the SYN+ACK from
> the server might come back through another. In that case, the next
> client packet would be dropped at the first gateway because conntrack
> would consider it invalid.
>

Hmm, I hadn't imagined this scenario. Wouldn't this kind of asymmetric
forwarding cause issues anyway? If it's a DGP, wouldn't reply traffic
from the backend come via the gateway anyway? Or do you want to do
direct client return on the backend side of things?

> Or do you mean using it not as a "real" conntrack in the usual sense,
> but just as a convenient base to extract packet metadata from?
> In that case, it is indeed convenient for removing inactive backends,
> but we would need to clean up conntrack entries ourselves after some
> time, because we can't rely on the conntrack timers anymore since
> states won't be updated properly. Updating conntrack entries from OVN
> also seems rather crazy.
>
>>
>>> I'd like to bring up an idea for discussion regarding the
>>> implementation of stateless load balancing support. I would really
>>> appreciate your feedback. I'm also open to alternative approaches
>>> or ideas that I may not have considered. Thank you in advance for
>>> your time and input!
>>>
>> Thanks for researching this, it's very nice work!
>>
>>> So, we would like to move towards stateless traffic load balancing.
>>> Here is how it would differ from the current approach:
>>> At the moment, when we have a load balancer on a router with DGP
>>> ports, conntrack state is stored directly on the gateway that
>>> currently hosts the DGP. So, we select a backend for the first
>>> packet of a connection, and after that, based on the existing
>>> conntrack entry, we no longer perform backend selection. Instead,
>>> we rely entirely on the stored conntrack record for subsequent
>>> packets.
>>>
>>> One of the key limitations of this solution is that gateway nodes
>>> cannot be horizontally scaled, since conntrack state is stored on a
>>> single node. Achieving the ability to horizontally scale gateway
>>> nodes is actually one of our goals.
>>>
>>> Here are a few possible approaches I see:
>>> 1) The idea of synchronizing conntrack state already exists in the
>>> community, but it seems rather outdated and not very promising.
>> Yeah, this on its own has a lot of potential scalability issues, so
>> I guess we never really pursued it.
>>
>>> 2) Avoid storing conntrack state on the GW node and instead perform
>>> stateless load balancing for every packet in a connection, while
>>> keeping conntrack state on the node where the virtual machine
>>> resides. This is the approach I am currently leaning toward.
>> Maybe we _also_ need to store conntrack state on the GW node, I'll
>> detail below.
>>
>>> Now in OVN we have the use_stateless_nat option for load balancers,
>>> but it comes with several limitations:
>>> 1) It works by selecting a backend in a stateless manner for every
>>> packet, while for return traffic it performs 1:1 SNAT. So, for
>>> traffic from the backend (ip.src && src.port), the source IP is
>>> rewritten to lb.vip. This only works correctly if a backend belongs
>>> to a single lb.vip. Otherwise, there is a risk that return traffic
>>> will be SNATed to the wrong lb.vip.
>>> 2) There is also an issue with preserving TCP sessions when the
>>> number of backends changes. Since backend selection is done using
>>> select(), any change in the number of backends can break existing
>>> TCP sessions. This problem is also relevant to the solution I
>>> proposed above; I will elaborate on it below.
>>>
>>> More details about my idea (I will attach some code below; this is
>>> not production-ready, just a prototype for testing [1]):
>>> 1. A packet arrives at the gateway looking like this:
>>>    eth.dst == dgp_mac
>>>    eth.src == client_mac
>>>    ip.dst == lb.vip
>>>    ip.src == client_ip
>>>    tcp.dst == lb.port
>>>    tcp.src == client_port
>>> 2. We detect that the packet is addressed to the load balancer →
>>> perform a select over the backends.
>>> 3. We route traffic directly to the selected backend by changing
>>> eth.dst to the backend's MAC, while keeping `ip.dst == lb.vip`.
>> So what needs to be persisted is actually the MAC address of the
>> backend that was selected (through whatever hashing method).
>>
>>> 4. We pass the packet further down the processing pipeline. At this
>>> point, it looks like:
>>>    eth.dst == backend_mac
>>>    eth.src == (source port of the switch where the backend is
>>>    connected)
>>>    ip.dst == lb.vip
>>>    ip.src == client_ip
>>>    tcp.dst == lb.port
>>>    tcp.src == client_port
>>> 5. The packet goes through the ingress pipeline of the switch where
>>> the backend port resides → then it is sent through the tunnel.
>> This is neat indeed because the LS doesn't care about the
>> destination IP but it cares about the destination MAC (for the L2
>> lookup stage)!
>>
>>> 6. The packet arrives at the node hosting the backend VM, where we
>>> perform egress load balancing and store conntrack state.
>>> 7. The server responds, and return traffic is SNATed at the ingress
>>> of the switch pipeline.
>>>
>>> This has already been implemented in code for testing and it's
>>> working. Storing conntrack state on the node where the virtual
>>> machine resides helps address the first limitation of the
>>> use_stateless_nat option.
>>>
>>> Regarding the second issue: we need to ensure session persistence
>>> when the number of backends changes. In general, as I understand
>>> it, stateless load balancing in such systems is typically based on
>>> consistent hashing. However, consistent hashing alone is not
>>> sufficient, since it does not preserve 100% of connections. I see
>>> the solution as some kind of additional layer on top of consistent
>>> hashing.
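[Editor's note: the per-packet flow in steps 1-7 can be sketched in a few lines. This is a toy model, not OVN code: the Packet layout, backend MACs, and the SHA-256-based selection are made-up illustrations of "hash the 5-tuple, rewrite only eth.dst, keep ip.dst == lb.vip".]

```python
# Toy sketch of the stateless selection in steps 1-4: hash the 5-tuple,
# pick a backend, and rewrite only the destination MAC. All names here
# (Packet, BACKENDS, select_backend) are illustrative, not OVN APIs.
import hashlib
from dataclasses import dataclass

@dataclass
class Packet:
    eth_dst: str
    eth_src: str
    ip_dst: str   # stays lb.vip end to end
    ip_src: str
    tcp_dst: int
    tcp_src: int

# Hypothetical backend set: the MAC addresses of the backend VMs.
BACKENDS = ["00:00:00:00:00:01", "00:00:00:00:00:02", "00:00:00:00:00:03"]

def select_backend(pkt: Packet) -> str:
    """Deterministically map the connection 5-tuple to a backend MAC."""
    key = f"{pkt.ip_src}:{pkt.tcp_src}:{pkt.ip_dst}:{pkt.tcp_dst}:tcp"
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return BACKENDS[h % len(BACKENDS)]

def gw_process(pkt: Packet) -> Packet:
    # Step 3: change eth.dst to the backend's MAC, keep ip.dst == lb.vip.
    pkt.eth_dst = select_backend(pkt)
    return pkt

pkt = Packet("dgp_mac", "client_mac", "10.0.0.100", "192.0.2.7", 80, 34567)
out = gw_process(pkt)
assert out.ip_dst == "10.0.0.100"   # VIP preserved through the gateway
assert out.eth_dst in BACKENDS      # only the destination MAC is rewritten
```

Because the selection is a pure function of the 5-tuple, every packet of a given connection lands on the same backend without any state on the gateway, which is exactly why a backend-set change (the modulo changing) can break established sessions.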
>>>
>>> First, about consistent hashing in OVS (correct me if I'm wrong):
>>> Currently, of the two hashing methods in OVS, only `hash` provides
>>> consistency, since it is based on rendezvous hashing and relies on
>>> the bucket id. For this to work correctly, bucket_ids must be
>>> preserved when the number of backends changes. However, OVN
>>> recreates the OpenFlow group every time the number of backends
>>> changes. At the moment, when the number of backends changes, we
>>> create a bundle whose first piece is an ADD group message, followed
>>> by INSERT_BUCKET/REMOVE messages. If we could rewrite this part to
>>> support granular insertion/removal of backends in the group (by
>>> using INSERT_BUCKET/REMOVE without recreating the group), we could
>>> make backend selection consistent for the hash method. I wasn't
>>> able to fully determine whether there are any limitations to this
>>> approach, but I did manually test session preservation using
>>> ovs-ofctl while inserting/removing buckets. When removing buckets
>>> from a group, sessions associated with the other buckets were
>>> preserved.
>>>
>>> At the same time, I understand the downsides of using hash: it is
>>> expensive, since we install dp_flows that match the full 5-tuple of
>>> a connection, which leads to a large number of datapath flows.
>>> Additionally, there is an upcall for every SYN packet. Because of
>>> this, I'm not sure how feasible it is, but it might be worth
>>> thinking about ways to make dp_hash consistent. I don't yet have a
>>> concrete proposal here, but I'd really appreciate any ideas or
>>> suggestions in this direction.
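[Editor's note: the rendezvous-hashing property relied on above (stable bucket ids mean that removing one bucket only remaps the sessions that used it) can be demonstrated with a few lines of Python. The bucket ids, flow keys, and SHA-256 weighting below are made up for illustration; this is not the OVS implementation.]

```python
# Toy rendezvous (highest-random-weight) hashing. Each flow picks the
# bucket with the highest per-(bucket, flow) weight, so removing one
# bucket only remaps the flows that had selected that bucket; all other
# flows keep their selection. This models why the OVS `hash` selection
# method stays consistent as long as bucket ids are preserved.
import hashlib

def weight(bucket_id: int, flow_key: str) -> int:
    data = f"{bucket_id}:{flow_key}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def pick(buckets, flow_key):
    return max(buckets, key=lambda b: weight(b, flow_key))

buckets = [1, 2, 3, 4]          # stable bucket ids (made up)
flows = [f"client-{i}" for i in range(1000)]
before = {f: pick(buckets, f) for f in flows}

# Remove bucket 3, the way granular REMOVE_BUCKET would: the remaining
# bucket ids are untouched, no group recreation / renumbering.
after = {f: pick([1, 2, 4], f) for f in flows}

moved = [f for f in flows if before[f] != after[f]]
assert 0 < len(moved) < len(flows)
# Only flows that were on the removed bucket get remapped:
assert all(before[f] == 3 for f in moved)
```

If the group were recreated with renumbered buckets instead, the (bucket_id, flow) weights would all change and far more sessions would move, which matches the observed breakage when OVN rebuilds the group.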
>>>
>>> About the additional layer on top of consistent hashing, I've come
>>> up with two potential approaches, and I'm not yet sure which one
>>> would be better:
>>> 1) Using the learn action
>>> 2) Hash-based sticky sessions in OVS
>>>
>> I'd say we can avoid the need for consistent hashing if we add a
>> third alternative here:
>>
>> In step "3" in your idea above, on the GW node, after we've selected
>> the backend (MAC) somehow (e.g., we could still just use dp_hash),
>> we commit that session to conntrack and store the MAC address as
>> metadata (e.g. in the ct_label, like we do for ecmp-symmetric-reply
>> routes).
>>
>> Subsequent packets on the same session don't actually need to be
>> hashed; we'll get the MAC to be used as the destination from the
>> conntrack state, so any changes to the set of backends won't be
>> problematic. We would have to flush conntrack for backends that get
>> removed (we do that for regular load balancers too).
>>
>> I know this might sound counterintuitive because your proposal was
>> to make the load balancing "stateless" on the GW node and I'm
>> actually suggesting that the GW node processing rely on conntrack
>> (stateful).
>>
>> But..
>>
>> You mentioned that one of your goals is:
>>
>>> One of the key limitations of this solution is that gateway nodes
>>> cannot be horizontally scaled, since conntrack state is stored on a
>>> single node. Achieving the ability to horizontally scale gateway
>>> nodes is actually one of our goals.
>>
>> With my proposal I think you'll achieve that. We'd have to be
>> careful with the case when a DGP moves to a different chassis (HA
>> failover), but for that we could combine this solution with using a
>> consistent hash on all chassis for the backend selection (something
>> you mention you're looking into anyway).
>>
>> Now, if your goal is to avoid conntrack completely on the gateway
>> chassis, my proposal breaks that.
>> Then we could implement a similar solution as you suggested with the
>> learn action, though that might be heavier on the datapath than
>> using conntrack. I don't have numbers to back this up, but maybe we
>> can find a way to benchmark it.
>>
>>> With an additional layer, we can account for hash rebuilds and the
>>> inaccuracies introduced by consistent hashing. It's not entirely
>>> clear how to handle these cases, considering that after some time
>>> we need to remove flows when using `learn`, or clean up connections
>>> in the hash. We also need to manage connection removal when the
>>> number of backends changes.
>> Or when backends go away.. which makes it more complex to manage
>> from ovn-controller if we use learn flows, I suspect (if we use
>> conntrack we have the infra to do that already; we use it for
>> regular load balancers that have ct_flush=true).
>>
>>> By using such a two-stage approach, ideally, we would always
>>> preserve the session for a given connection, losing connections
>>> only in certain cases. For example, suppose we have two GW nodes:
>>> * The first SYN packet arrives at the first gw: we record this
>>> connection in our additional layer (using learn or the OVS hash)
>>> and continue routing all packets for this connection based on it.
>>> * If a packet from this same connection arrives at the second GW,
>>> the additional layer there (learn or the OVS hash) will not have
>>> the session. If the number of backends has changed since the
>>> backend was selected for this connection on the first GW, there is
>>> a chance it will not hit the same backend and the session will be
>>> lost.
>>>
>>> The downsides I see for the first solution are:
>>> 1. More complex handling of cleanup for old connections and removal
>>> of hashes for expired backends.
>>>
>>> The downsides of the learn action are:
>>> 1. High load on OVS at a high connection rate.
>>> 2. If I understand correctly, this also results in a large number
>>> of dp_flows that track each individual 5-tuple hash.
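[Editor's note: the "commit the chosen MAC to conntrack" alternative discussed above (analogous to storing metadata in the ct_label for ecmp-symmetric-reply) can be modeled in a few lines. The dict standing in for conntrack and all names below are made up; this only illustrates the session-stickiness property, not the OVN datapath.]

```python
# Rough model of the stateful alternative: only the first packet of a
# session hashes over the backend set; the chosen MAC is committed
# alongside the session (here: a plain dict standing in for conntrack
# state such as the ct_label). Later packets reuse the stored MAC, so
# changes to the backend set do not rehash established sessions.
import hashlib

conntrack = {}  # 5-tuple -> backend MAC (illustrative stand-in)

def select(backends, tuple5):
    h = int.from_bytes(hashlib.sha256(repr(tuple5).encode()).digest()[:8], "big")
    return backends[h % len(backends)]

def gw_lookup(backends, tuple5):
    if tuple5 not in conntrack:                # new session: hash + commit
        conntrack[tuple5] = select(backends, tuple5)
    return conntrack[tuple5]                   # established: stored MAC wins

t = ("192.0.2.7", 34567, "10.0.0.100", 80, "tcp")
mac = gw_lookup(["mac-a", "mac-b", "mac-c"], t)

# The backend set changes; the established session keeps its backend:
assert gw_lookup(["mac-a", "mac-c"], t) == mac
```

The remaining management problem the thread mentions still applies: when a backend is removed for good, its committed sessions have to be flushed explicitly (as OVN already does for load balancers with ct_flush=true), otherwise the stored MAC points at a dead backend.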
>>>
>>> I would be glad to hear your thoughts on a possible implementation
>>> of dp_hash consistency, as well as feedback on the overall
>>> architecture. These are all the ideas I've been able to develop so
>>> far. I would greatly appreciate any criticism, suggestions, or
>>> alternative approaches. It would also be interesting to know
>>> whether anyone else is interested in this new optional mode for
>>> load balancers =)
>>>
>>> [1] https://github.com/Sashhkaa/ovn/commits/deferred-lb-dnat/ -
>>> Here's the code, just in case
>>>
>> Hope my thoughts above make some sense and that I don't just create
>> confusion. :)
>>
>> Regards,
>> Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
