On 2/11/25 18:50, Han Zhou wrote: > > > On Mon, Feb 10, 2025 at 2:29 AM Ilya Maximets <[email protected] > <mailto:[email protected]>> wrote: >> >> On 2/10/25 10:13, Dumitru Ceara wrote: >> > On 2/9/25 9:47 PM, Numan Siddique wrote: >> >> On Sat, Feb 8, 2025 at 2:34 AM Han Zhou <[email protected] >> >> <mailto:[email protected]>> wrote: >> >>> >> >>> >> >>> >> >>> On Fri, Feb 7, 2025 at 1:09 PM Ilya Maximets <[email protected] >> >>> <mailto:[email protected]>> wrote: >> >>>> >> >>>> On 2/7/25 21:45, Ilya Maximets wrote: >> >>>>> On 2/7/25 15:01, Ilya Maximets wrote: >> >>>>>> On 2/7/25 05:31, Numan Siddique wrote: >> >>>>>>> On Thu, Feb 6, 2025 at 3:57 AM Ilya Maximets <[email protected] >> >>>>>>> <mailto:[email protected]>> wrote: >> >>>>>>>> >> >>>>>>>> This patch set introduces ability to directly connect switches, >> >>>>>>>> including transit switches, in order to achieve higher total port >> >>>>>>>> count >> >>>>>>>> and locality of changes in L2 topologies spread across multiple >> >>>>>>>> availability zones. And while tailored for this use case, the >> >>>>>>>> changes >> >>>>>>>> do not impose any limitations and should allow for all kinds of >> >>>>>>>> other >> >>>>>>>> different topologies. >> >>>>>>>> >> >>>>>>>> Amount of the logic code changes is relatively small, most of the >> >>>>>>>> diff >> >>>>>>>> are new tests for the introduced functionality. >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> Version 2: >> >>>>>>>> >> >>>>>>>> * Rebased on top of latest changes on the main branch. >> >>>>>>>> * Improved validation of the peer in northd. [Mark] >> >>>>>>>> * Added a test for a switch port with an address set. [Mark] >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> Ilya Maximets (2): >> >>>>>>>> northd: Add support for spine-leaf logical switch topology. >> >>>>>>>> ic: Add support for spine-leaf topology for transit switches. >> >>>>>>> >> >>>>>>> Hi Ilya, >> >>>>>>> >> >>>>>>> Thanks for adding this feature. I applied both the patches to the >> >>>>>>> main. I had to do a minor rebase. >> >>>>>> >> >>>>>> Thanks, Numan and Mark! >> >>>>>> >> >>>>>>> >> >>>>>>> I see one small issue with the spine switch. Since all the spine >> >>>>>>> switch ports have unknown address, >> >>>>>>> the packet will be cloned to N - 1 logical switches if N switches >> >>>>>>> connect to it. >> >>>>>>> >> >>>>>>> I think we should enable "fdb learn" in the spine switch with the >> >>>>>>> below changes >> >>>>>>> >> >>>>>>> ----------------------------------------------- >> >>>>>>> diff --git a/northd/northd.c b/northd/northd.c >> >>>>>>> index 880112c3b9..e77936fbe9 100644 >> >>>>>>> --- a/northd/northd.c >> >>>>>>> +++ b/northd/northd.c >> >>>>>>> @@ -5682,6 +5682,7 @@ build_lswitch_learn_fdb_op( >> >>>>>>> ovs_assert(op->nbsp); >> >>>>>>> >> >>>>>>> if (!op->n_ps_addrs && op->has_unknown && >> >>>>>>> (!strcmp(op->nbsp->type, "") || >> >>>>>>> + !strcmp(op->nbsp->type, "switch") || >> >>>>>>> (lsp_is_localnet(op->nbsp) && >> >>>>>>> localnet_can_learn_mac(op->nbsp)))) { >> >>>>>>> ds_clear(match); >> >>>>>>> ds_clear(actions); >> >>>>>>> ----------------------------------------------- >> >>>>>>> >> >>>>>>> What do you think ? Any concerns or objections ? If not, can you >> >>>>>>> submit a follow up patch to enable this ? >> >>>>>> >> >>>>>> This seems reasonable. I missed the part that we enable FDB >> >>>>>> automatically >> >>>>>> for some ports with "unknown". Will look into that and submit a fix. >> >>>>> >> >>>>> So, I gave this some more thoughts and I'm not sure if we actually need >> >>>>> FDB for "switch" ports in a common case. In a non-IC setup OVN >> >>>>> actually >> >>>>> knows the destination, if it's part of OVN network. The whole >> >>>>> processing >> >>>>> of the spine switch will happen on the source node and only the egress >> >>>>> pipeline of the leaf switch will be executed on the destination node. >> >>>>> If we have actual VM ports with "unknown", then those will have the FDB >> >>>>> enabled, but we should not need to consult FDB for the spine switch >> >>>>> itself >> >>>>> otherwise. Does that make sense? >> >> >> >> You're right. But enabling mac learning for "switch" ports does have >> >> some benefits. >> >> By clone what I mean is that during the first packet upcall flow >> >> translation, vswitchd will >> >> clone the packet to all the leaf switch pipelines and yes, the final >> >> datapath flow will >> >> of course not clone the packets. So it's not a big deal. But I >> >> think it's better >> >> to enable fdb for switch ports. >> >> >> >> >> >>>> >> >>>> Thinking more about this again, there is still an issue if we have >> >>>> "unknown" >> >>>> ports in leaf switches. Since those switches are separate from the >> >>>> spine, >> >>>> will broadcast in the spine and then every leaf that has "unknown" will >> >>>> also >> >>>> broadcast within itself. And so, "unknown" VM ports will receive >> >>>> traffic >> >>>> destined to ports on other leaf switches. So, yes, we still need FDB >> >>>> for >> >>>> this case. >> >> >> >> Ack. >> >> >> >>>> >> >>>>> >> >>>>> OTOH, what we may actually need is FDB learning for remote ports in IC >> >>>>> setup. In this case, OVN doesn't actually know the whole topology and >> >>>>> can't figure out to which of the remote ports the packet should go, so >> >>>>> it >> >>>>> will broadcast. The problem here, however, is that FDB stages are in >> >>>>> the >> >>>>> ingress pipeline, and they will not be executed on the remote node that >> >>>>> actually needs them. So, we may need to create FDB stages in the >> >>>>> egress >> >>>>> pipeline for packets coming from remote ports. Seems like a generic >> >>>>> issue with "unknown" ports on a transit switch. >> >>>>> >> >>>>> Mark, Numan, Dumitru, what do you think? >> >>>> >> >>>> But this issue also still stands as we need a way to learn MAC addresses >> >>>> from packet arriving from a remote port, so we either need FDB in the >> >>>> egress >> >>>> pipeline of the transit switch for remote ports, or we need an external >> >>>> service like ovn-ic synchronizing FDB entries between AZs, which is not >> >>>> good. >> >>>> >> >>> >> >>> Hi Ilya, I agree with you that synchronizing FDB entries between AZs is >> >>> a really bad idea. FDB in the egress pipeline (and learning the mac for >> >>> inport of the spine switch) sounds reasonable. >> >> >> >> I agree. FDB in the egress pipeline sounds reasonable. >> >> >> > >> > +1 >> > >> > I think it's reasonable implementation for all "learning features" in IC >> > mode. We already do that (for the same reason) for IGMP/MLD: >> > >> > https://github.com/ovn-org/ovn/commit/c6b20c9940ed41f028347f559b21a25fb7625eb4 >> > >> > <https://github.com/ovn-org/ovn/commit/c6b20c9940ed41f028347f559b21a25fb7625eb4> >> > >> >> If a spine transit switch has N ports, then each packet (both request >> >> and reply) will get flooded to >> >> all the remote ports of the spine transit switch. I tested it and I >> >> can confirm. >> >> >> >> If we can't add FDB in the egress pipeline in a reasonable time for >> >> 25.03, I'd suggest documenting this >> >> flood behavior and mark this feature as experimental so that the users >> >> of 25.03 are well informed of this. >> >> >> > >> > Sounds like something we can add after branching (either learning in the >> > egress pipeline or marking the feature as experimental) indeed. >> >> Thanks, all! I think, the fixes should be relatively simple, I'll try >> to post some patches patches today or tomorrow. >> >> Best regards, Ilya Maximets. > > I've given this more thought and believe it's worth documenting the > implications of this approach in more detail. Instead of maintaining static > port-to-location mappings for all ports, we are dynamically learning only the > mappings needed for communication. In most cases, where only a small portion > of workloads interact, this significantly reduces the port-to-location (or > MAC-to-location) mappings that each node must maintain. However, a side > effect is the flooding of initial packets. > > While this trade-off is beneficial in many scenarios, it's important to > highlight that in the worst case—where every port communicates with every > other port—the MAC learning overhead would ultimately be comparable to that > of a single flat logical switch. > > To help users make informed decisions about this feature, we should document > the following considerations: > > 1. Flooding behavior – Either clarify that this approach is experimental due > to the potential for flooding every packet, or specify that only initial > packets are flooded when MAC learning is implemented.
I posted fixes here: https://patchwork.ozlabs.org/project/ovn/list/?series=444002&state=* And included a documentation update to highlight the broadcasting and learning behaviors, so users can make an informed decision while designing their networks. > > 2. Workload communication patterns – This approach assumes that only a > relatively small subset of workloads communicate with each other. If most > workloads interact with all others, the benefits of a spine-leaf topology > diminish, making it less effective at scale. There is still a control plane benefit of locality of changes, i.e. every node doesn't have to know all the ports, and a spine-leaf topology can host more ports than a single switch. But yes, there are trade-offs to be aware of. Please, take a look at the documentation updates I made in the patch set above. I didn't mention any specific scenarios as the doc is for a port type, but I hope I made it clear enough how a switch with such ports will behave, so it's not hidden from the potential users. Best regards, Ilya Maximets. _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
