On 3/30/26 5:38 PM, Rukomoinikova Aleksandra wrote:
> On 27.03.2026 19:21, Dumitru Ceara wrote:
>> On 3/27/26 3:08 PM, Rukomoinikova Aleksandra wrote:
>>> On 27.03.2026 12:58, Dumitru Ceara wrote:
>>>> On 3/26/26 10:35 PM, Rukomoinikova Aleksandra wrote:
>>>>> Hi,
>>>>>
>>>> Hi Aleksandra,
>>>
>>> Hi Dumitru! Thank you for your answer!
>>>
>> Hi Aleksandra,
>>
>>> The main reason I didn't consider conntrack on the gw node is that
>>> I'm afraid return traffic might go through a different gateway: the
>>> SYN could go through one gateway, but the SYN+ACK from the server
>>> might come back through another. In that case, the next client
>>> packet would be dropped at the first gateway, because conntrack
>>> there would consider it invalid.
>>>
>
> Hi Dumitru! Sorry for the late reply!
>
Hi Aleksandra,

> I think I didn't explain the target setup very well. By scaling I
> meant just adding more hosts as gateways.
> I have two levels of routers: the first level hosts the DGPs, and the
> second level has the load balancers attached.
> Packets are distributed from the fabric across the hosts that run the
> DGPs, and the actual load balancing happens on the second-level
> routers connected to the subnet switch.
> In this setup, the return path would use an ECMP route, which doesn't
> guarantee that traffic returns through the original GW.

Why not use ecmp-symmetric-reply for these ECMP routes to make the
traffic return on the same path?

> I understand NAT will not work in this kind of architecture. We're
> not using NAT right now in this setup for the LB since we want this
> for private networks, but I think if this approach works, NAT could
> be addressed the same way (i.e., keep the conntrack state on the node
> with the VM).
> I don't know if there might be any other issues with asymmetry here
> yet.
>
> I'll try to think about how to return the traffic to the original
> gateway — not sure yet what the best approach is, but I like your
> idea =) Hopefully something good will come out of it. Thanks!

Thanks, looking forward to the next steps!

Regards,
Dumitru

> It would be great if it works out. The issues we had with the learn
> action and the hash seem to be resolved now, but I'm still not quite
> sure how to approach the consistent hashing part.
>
>> Hmm, I didn't imagine this scenario. Wouldn't this kind of
>> asymmetric forwarding cause issues anyway? If it's a DGP, wouldn't
>> reply traffic from the backend come via the gateway anyway?
>>
>> Or do you want to do direct client return on the backend side of
>> things?
>>
>>> Or do you mean using it not as a "real" conntrack in the usual
>>> sense, but just as a convenient base to extract packet metadata
>>> from?
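To illustrate the return-path asymmetry discussed above, here is a
minimal Python toy model (not OVN code; the gateway names and tuples
are made up). A stateless per-flow hash computes the forward and
reverse directions independently, so the reply is free to enter a
different ECMP member; remembering the member per connection (roughly
the effect ecmp-symmetric-reply achieves via conntrack) pins the path:

```python
import hashlib

GATEWAYS = ["gw1", "gw2", "gw3"]  # illustrative ECMP members

def ecmp_pick(src, dst, sport, dport):
    """Stateless ECMP: hash the 5-tuple onto one member. The reverse
    direction hashes a different tuple, so the reply may pick
    another member."""
    key = f"{src}|{dst}|{sport}|{dport}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return GATEWAYS[h % len(GATEWAYS)]

fwd = ecmp_pick("10.0.0.1", "172.16.0.10", 40000, 80)
rev = ecmp_pick("172.16.0.10", "10.0.0.1", 80, 40000)
# fwd and rev are computed independently and may well differ.

# Symmetric-reply model: remember the member per connection, keyed on
# a direction-normalized 5-tuple (conntrack plays this role in OVN).
sessions = {}

def norm(src, dst, sport, dport):
    return tuple(sorted([(src, sport), (dst, dport)]))

def pick_symmetric(src, dst, sport, dport):
    k = norm(src, dst, sport, dport)
    if k not in sessions:
        sessions[k] = ecmp_pick(src, dst, sport, dport)
    return sessions[k]

# Both directions of the connection now resolve to the same member.
assert pick_symmetric("10.0.0.1", "172.16.0.10", 40000, 80) == \
       pick_symmetric("172.16.0.10", "10.0.0.1", 80, 40000)
```

Of course, the per-connection memory is exactly the state the proposal
tries to avoid keeping on the gateways, which is the tension being
discussed here.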
>>> In that case, it is indeed convenient for removing inactive
>>> backends, but we would need to clean up the conntrack entries
>>> ourselves after some time, because we can't rely on the conntrack
>>> timers anymore since the states won't be updated properly. Updating
>>> conntrack entries from OVN would probably also be crazy.
>>>
>>>>> I'd like to bring up an idea for discussion regarding the
>>>>> implementation of stateless load balancing support. I would
>>>>> really appreciate your feedback. I'm also open to alternative
>>>>> approaches or ideas that I may not have considered. Thank you in
>>>>> advance for your time and input!
>>>>>
>>>> Thanks for researching this, it's very nice work!
>>>>
>>>>> So, we would like to move towards stateless traffic load
>>>>> balancing. Here is how it would differ from the current approach:
>>>>> at the moment, when we have a load balancer on a router with DGP
>>>>> ports, conntrack state is stored directly on the gateway that is
>>>>> currently hosting the DGP. So, we select a backend for the first
>>>>> packet of a connection, and after that, based on the existing
>>>>> conntrack entry, we no longer perform backend selection. Instead,
>>>>> we rely entirely on the stored conntrack record for subsequent
>>>>> packets.
>>>>>
>>>>> One of the key limitations of this solution is that gateway nodes
>>>>> cannot be horizontally scaled, since conntrack state is stored on
>>>>> a single node. Achieving the ability to horizontally scale
>>>>> gateway nodes is actually one of our goals.
>>>>>
>>>>> Here are a few possible approaches I see:
>>>>> 1) The idea of synchronizing conntrack state already exists in
>>>>>    the community, but it seems rather outdated and not very
>>>>>    promising.
>>>> Yeah, this on its own has a lot of potential scalability issues so
>>>> we never really pursued it, I guess.
>>>>
>>>>> 2) Avoid storing conntrack state on the GW node and instead
>>>>>    perform stateless load balancing for every packet of the
>>>>>    connection, while keeping conntrack state on the node where
>>>>>    the virtual machine resides. This is the approach I am
>>>>>    currently leaning toward.
>>>> Maybe we _also_ need to store conntrack state on the GW node, I'll
>>>> detail below.
>>>>
>>>>> Now in OVN we have the use_stateless_nat option for LBs, but it
>>>>> comes with several limitations:
>>>>> 1) It works by selecting a backend in a stateless manner for
>>>>>    every packet, while for return traffic it performs a 1:1 SNAT.
>>>>>    So, for traffic from a backend (ip.src && src.port), the
>>>>>    source IP is rewritten to lb.vip. This only works correctly if
>>>>>    a backend belongs to a single lb.vip. Otherwise, there is a
>>>>>    risk that return traffic will be SNATed to the wrong lb.vip.
>>>>> 2) There is also an issue with preserving TCP sessions when the
>>>>>    number of backends changes. Since backend selection is done
>>>>>    using select(), any change in the number of backends can break
>>>>>    existing TCP sessions. This problem is also relevant to the
>>>>>    solution I proposed above — I will elaborate on it below.
>>>>>
>>>>> More details about my idea (I will attach some code below — this
>>>>> is not production-ready, just a prototype for testing [1]):
>>>>> 1. A packet arrives at the gateway like this one:
>>>>>    eth.dst == dgp_mac
>>>>>    eth.src == client_mac
>>>>>    ip.dst == lb.vip
>>>>>    ip.src == client_ip
>>>>>    tcp.dst == lb.port
>>>>>    tcp.src == client_port
>>>>> 2. We detect that the packet is addressed to the load balancer →
>>>>>    perform select over the backends.
>>>>> 3. We route traffic directly to the selected backend by changing
>>>>>    eth.dst to the backend's MAC, while keeping ip.dst == lb.vip.
>>>> So what needs to be persisted is actually the MAC address of the
>>>> backend that was selected (through whatever hashing method).
>>>>
>>>>> 4. We pass the packet further down the processing pipeline.
>>>>> At this point, it looks like:
>>>>>    eth.dst == backend_mac
>>>>>    eth.src == (source port of the switch where the backend is
>>>>>                connected)
>>>>>    ip.dst == lb.vip
>>>>>    ip.src == client_ip
>>>>>    tcp.dst == lb.port
>>>>>    tcp.src == client_port
>>>>> 5. The packet goes through the ingress pipeline of the switch
>>>>>    where the backend port resides → then it is sent through the
>>>>>    tunnel.
>>>> This is neat indeed because the LS doesn't care about the
>>>> destination IP but it cares about the destination MAC (for the L2
>>>> lookup stage)!
>>>>
>>>>> 6. The packet arrives at the node hosting the backend VM, where
>>>>>    we perform egress load balancing and store conntrack state.
>>>>> 7. The server responds, and the return traffic is SNATed at the
>>>>>    ingress of the switch pipeline.
>>>>>
>>>>> This has already been implemented in code for testing and it's
>>>>> working. Storing conntrack state on the node where the virtual
>>>>> machine resides helps address the first limitation of the
>>>>> use_stateless_nat option.
>>>>>
>>>>> Regarding the second issue: we need to ensure session persistence
>>>>> when the number of backends changes. In general, as I understand
>>>>> it, stateless load balancing in such systems is typically based
>>>>> on consistent hashing. However, consistent hashing alone is not
>>>>> sufficient, since it does not preserve 100% of connections. I see
>>>>> the solution as some kind of additional layer on top of
>>>>> consistent hashing.
>>>>>
>>>>> First, about consistent hashing in OVS. Correct me if I'm wrong:
>>>>> currently, of the two hashing methods in OVS, only `hash`
>>>>> provides consistency, since it is based on rendezvous hashing and
>>>>> relies on the bucket id. For this to work correctly, bucket_ids
>>>>> must be preserved when the number of backends changes. However,
>>>>> OVN recreates the OpenFlow group every time the number of
>>>>> backends changes.
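The gateway-side rewrite in steps 1-5 above can be condensed into a
toy sketch (plain Python, with a dict standing in for the headers; the
field names mirror the listing, everything else is illustrative):

```python
def gw_lb_rewrite(pkt, backend_mac, out_port_mac):
    """Step 3 on the gateway: steer the packet to the chosen backend
    purely at L2. Only the Ethernet header changes; ip.dst stays
    lb.vip, so no DNAT (and no conntrack commit) is needed on the
    gateway, and the switch forwards on eth.dst alone."""
    out = dict(pkt)
    out["eth.dst"] = backend_mac   # backend chosen via select()/hash
    out["eth.src"] = out_port_mac
    return out

# The step-1 packet, using the listing's symbolic values:
pkt = {
    "eth.dst": "dgp_mac", "eth.src": "client_mac",
    "ip.dst": "lb.vip", "ip.src": "client_ip",
    "tcp.dst": "lb.port", "tcp.src": "client_port",
}
fwd = gw_lb_rewrite(pkt, "backend_mac", "switch_port_mac")
# fwd now matches the step-4 listing: only the MAC fields changed.
```

The point the sketch makes is the one from the thread: everything the
gateway must "remember" about a connection collapses to the backend
MAC; the L3/L4 headers pass through untouched.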
>>>>> At the moment, when the number of backends changes, we create a
>>>>> bundle where the first piece is an ADD group message, and the
>>>>> subsequent messages are INSERT_BUCKET/REMOVE messages. If we
>>>>> could rewrite this part to support granular insertion/removal of
>>>>> backends in the group — by using INSERT_BUCKET/REMOVE without
>>>>> recreating the group — we could make backend selection consistent
>>>>> for the hash method. I wasn't able to fully determine whether
>>>>> there are any limitations to this approach, but I did manually
>>>>> test session preservation using ovs-ofctl while inserting and
>>>>> removing buckets. When removing buckets from a group, the
>>>>> sessions associated with the other buckets were preserved.
>>>>>
>>>>> At the same time, I understand the downsides of using hash: it is
>>>>> expensive, since we install dp_flows that match the full 5-tuple
>>>>> of the connection, which leads to a large number of datapath
>>>>> flows. Additionally, there is an upcall for every SYN packet.
>>>>> Because of this, I'm not sure how feasible it is, but it might be
>>>>> worth thinking about ways to make dp_hash consistent. I don't yet
>>>>> have a concrete proposal here, but I'd really appreciate any
>>>>> ideas or suggestions in this direction.
>>>>>
>>>>> As for the additional layer on top of consistent hashing, I've
>>>>> come up with two potential approaches, and I'm not yet sure which
>>>>> one would be better:
>>>>> 1) Using the learn action
>>>>> 2) Hash-based sticky sessions in OVS
>>>>>
>>>> I'd say we can avoid the need for consistent hashing if we add a
>>>> third alternative here:
>>>>
>>>> In step "3" in your idea above, on the GW node, after we selected
>>>> the backend (MAC) somehow (e.g., we could still just use dp-hash)
>>>> we commit that session to conntrack and store the MAC address as
>>>> metadata (e.g. in the ct_label like we do for ecmp-symmetric-reply
>>>> routes).
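The consistency claim about the `hash` method above can be checked
with a toy rendezvous (highest-random-weight) hashing model: as long
as the surviving bucket_ids stay stable, which is what granular
INSERT_BUCKET/REMOVE would preserve, removing one bucket remaps only
the flows that were on it. This is illustrative Python, not the OVS
implementation:

```python
import hashlib

def hrw_pick(flow, bucket_ids):
    """Rendezvous hashing: the flow goes to the bucket with the
    highest hash(flow, bucket_id) score, so each bucket's score is
    independent of which other buckets exist."""
    return max(bucket_ids,
               key=lambda b: hashlib.sha256(f"{flow}|{b}".encode()).digest())

flows = [f"flow-{i}" for i in range(1000)]
buckets = [1, 2, 3, 4]  # stand-ins for OpenFlow group bucket_ids

before = {f: hrw_pick(f, buckets) for f in flows}
# Remove bucket 3 *without* renumbering the survivors (the effect of
# a lone REMOVE-bucket message, as opposed to re-adding the whole
# group with fresh bucket ids).
after = {f: hrw_pick(f, [b for b in buckets if b != 3]) for f in flows}

moved = [f for f in flows if before[f] != after[f]]
# Only flows that were on the removed bucket get remapped; all other
# flows keep their bucket, because removing a non-winning bucket
# cannot change which surviving bucket has the highest score.
assert all(before[f] == 3 for f in moved)
```

Conversely, if the group is recreated and the bucket ids are assigned
afresh, the scores all change and the mapping is reshuffled, which is
the breakage described above when OVN re-adds the whole group.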
>>>>
>>>> Subsequent packets on the same session don't actually need to be
>>>> hashed; we'll get the MAC to be used as the destination from the
>>>> conntrack state, so any changes to the set of backends won't be
>>>> problematic. We would have to flush conntrack for backends that
>>>> get removed (we do that for regular load balancers too).
>>>>
>>>> I know this might sound counterintuitive because your proposal was
>>>> to make the load balancing "stateless" on the GW node and I'm
>>>> actually suggesting that the GW node processing rely on conntrack
>>>> (stateful).
>>>>
>>>> But..
>>>>
>>>> You mentioned that one of your goals is:
>>>>
>>>>> One of the key limitations of this solution is that gateway nodes
>>>>> cannot be horizontally scaled, since conntrack state is stored on
>>>>> a single node. Achieving the ability to horizontally scale
>>>>> gateway nodes is actually one of our goals.
>>>> With my proposal I think you'll achieve that. We'd have to be
>>>> careful with the case when a DGP moves to a different chassis (HA
>>>> failover) but for that we could combine this solution with using a
>>>> consistent hash on all chassis for the backend selection
>>>> (something you mention you're looking into anyway).
>>>>
>>>> Now, if your goal is to avoid conntrack completely on the gateway
>>>> chassis, my proposal breaks that. Then we could implement a
>>>> similar solution as you suggested with the learn action; that
>>>> might be heavier on the datapath than using conntrack. I don't
>>>> have numbers to back this up but maybe we can find a way to
>>>> benchmark it.
>>>>
>>>>> With an additional layer, we can account for hash rebuilds and
>>>>> the inaccuracies introduced by consistent hashing. It's not
>>>>> entirely clear how to handle these cases, considering that after
>>>>> some time we need to remove flows when using `learn`, or clean up
>>>>> connections in the hash. We also need to manage connection
>>>>> removal when the number of backends changes.
>>>> Or when backends go away.. which makes it more complex to manage
>>>> from ovn-controller if we use learn flows, I suspect (if we use
>>>> conntrack we have the infra to do that already; we use it for
>>>> regular load balancers that have ct_flush=true).
>>>>
>>>>> By using such a two-stage approach, ideally, we would always
>>>>> preserve the session for a given connection, losing connections
>>>>> only in certain cases. For example, suppose we have two GW nodes:
>>>>> * The first SYN packet arrives at the first GW — we record this
>>>>>   connection in our additional layer (using learn or the OVS
>>>>>   hash) and continue routing all packets for this connection
>>>>>   based on it.
>>>>> * If a packet from this same connection arrives at the second GW,
>>>>>   the additional layer there (learn or OVS hash) will not have
>>>>>   the session. If the number of backends has changed since the
>>>>>   backend was selected for this connection on the first GW, there
>>>>>   is a chance it will not hit the same backend and the session
>>>>>   will be lost.
>>>>>
>>>>> The downsides I see for the first solution are:
>>>>> 1. More complex handling of cleanup for old connections and
>>>>>    removal of hashes for expired backends.
>>>>>
>>>>> The downsides of the learn action are:
>>>>> 1. High load on OVS at a high connection rate.
>>>>> 2. If I understand correctly, this also results in a large number
>>>>>    of dp_flows that track each individual 5-tuple hash.
>>>>>
>>>>> I would be glad to hear your thoughts on a possible
>>>>> implementation of dp_hash consistency, as well as feedback on the
>>>>> overall architecture. These are all the ideas I've been able to
>>>>> develop so far. I would greatly appreciate any criticism,
>>>>> suggestions, or alternative approaches.
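The conntrack-based alternative sketched in this subthread (commit the
session on the GW, keep the chosen backend MAC in ct_label-style
metadata, and flush entries for removed backends, as with
ct_flush=true) could be modeled roughly like this; the dict is only a
stand-in for conntrack, and the tuples/MACs are illustrative:

```python
ct = {}  # 5-tuple -> backend MAC; stand-in for conntrack + ct_label

def pick_backend(tuple5, backends):
    # Any hash works for the *first* packet of a connection (e.g.
    # dp_hash); later packets never rehash, so the hash does not need
    # to be consistent across backend-set changes.
    return backends[hash(tuple5) % len(backends)]

def forward(tuple5, backends):
    """GW processing: ct.new -> select + commit; ct.est -> read the
    pinned MAC back from the 'ct_label'."""
    if tuple5 not in ct:
        ct[tuple5] = pick_backend(tuple5, backends)
    return ct[tuple5]

def remove_backend(mac):
    """Analogue of flushing conntrack entries for a removed backend,
    forcing affected connections to be re-balanced as new ones."""
    for k in [k for k, v in ct.items() if v == mac]:
        del ct[k]

t = ("client_ip", 40000, "lb.vip", 80, "tcp")
first = forward(t, ["mac1", "mac2", "mac3"])
# The backend set changes mid-connection; the established session
# still goes to the backend recorded at commit time.
assert forward(t, ["mac1", "mac2", "mac3", "mac4"]) == first
```

As noted in the thread, this keeps the GW stateful, but the state is
per-gateway and never needs to be synchronized, so it does not block
horizontal scaling; the two-GW failure case above (a mid-connection
packet landing on a GW with no entry) remains the hard part.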
>>>>> It would also be interesting to know whether anyone else is
>>>>> interested in this new optional mode for load balancers =)
>>>>>
>>>>> [1] https://github.com/Sashhkaa/ovn/commits/deferred-lb-dnat/ -
>>>>> Here's the code, just in case
>>>>>
>>>> Hope my thoughts above make some sense and that I don't just
>>>> create confusion. :)
>>>>
>> Regards,
>> Dumitru
>>
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
