Re: [ovs-dev] [PATCH ovn v2 0/2] Spine-Leaf Logical Switch Topology.

Ilya Maximets Wed, 12 Feb 2025 04:51:24 -0800

On 2/11/25 18:50, Han Zhou wrote:
> 
> 
> On Mon, Feb 10, 2025 at 2:29 AM Ilya Maximets <[email protected] 
> <mailto:[email protected]>> wrote:
>>
>> On 2/10/25 10:13, Dumitru Ceara wrote:
>> > On 2/9/25 9:47 PM, Numan Siddique wrote:
>> >> On Sat, Feb 8, 2025 at 2:34 AM Han Zhou <[email protected] 
>> >> <mailto:[email protected]>> wrote:
>> >>>
>> >>>
>> >>>
>> >>> On Fri, Feb 7, 2025 at 1:09 PM Ilya Maximets <[email protected] 
>> >>> <mailto:[email protected]>> wrote:
>> >>>>
>> >>>> On 2/7/25 21:45, Ilya Maximets wrote:
>> >>>>> On 2/7/25 15:01, Ilya Maximets wrote:
>> >>>>>> On 2/7/25 05:31, Numan Siddique wrote:
>> >>>>>>> On Thu, Feb 6, 2025 at 3:57 AM Ilya Maximets <[email protected] 
>> >>>>>>> <mailto:[email protected]>> wrote:
>> >>>>>>>>
>> >>>>>>>> This patch set introduces ability to directly connect switches,
>> >>>>>>>> including transit switches, in order to achieve higher total port 
>> >>>>>>>> count
>> >>>>>>>> and locality of changes in L2 topologies spread across multiple
>> >>>>>>>> availability zones.  And while tailored for this use case, the 
>> >>>>>>>> changes
>> >>>>>>>> do not impose any limitations and should allow for all kinds of 
>> >>>>>>>> other
>> >>>>>>>> different topologies.
>> >>>>>>>>
>> >>>>>>>> Amount of the logic code changes is relatively small, most of the 
>> >>>>>>>> diff
>> >>>>>>>> are new tests for the introduced functionality.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Version 2:
>> >>>>>>>>
>> >>>>>>>>   * Rebased on top of latest changes on the main branch.
>> >>>>>>>>   * Improved validation of the peer in northd. [Mark]
>> >>>>>>>>   * Added a test for a switch port with an address set. [Mark]
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Ilya Maximets (2):
>> >>>>>>>>   northd: Add support for spine-leaf logical switch topology.
>> >>>>>>>>   ic: Add support for spine-leaf topology for transit switches.
>> >>>>>>>
>> >>>>>>> Hi Ilya,
>> >>>>>>>
>> >>>>>>> Thanks for adding this feature.  I applied both the patches to the
>> >>>>>>> main.  I had to do a minor rebase.
>> >>>>>>
>> >>>>>> Thanks, Numan and Mark!
>> >>>>>>
>> >>>>>>>
>> >>>>>>> I see one small issue  with the spine switch.  Since all the spine
>> >>>>>>> switch ports have unknown address,
>> >>>>>>> the packet will be cloned to N - 1 logical switches if N switches 
>> >>>>>>> connect to it.
>> >>>>>>>
>> >>>>>>> I think we should enable "fdb learn" in the spine switch with the 
>> >>>>>>> below changes
>> >>>>>>>
>> >>>>>>> -----------------------------------------------
>> >>>>>>> diff --git a/northd/northd.c b/northd/northd.c
>> >>>>>>> index 880112c3b9..e77936fbe9 100644
>> >>>>>>> --- a/northd/northd.c
>> >>>>>>> +++ b/northd/northd.c
>> >>>>>>> @@ -5682,6 +5682,7 @@ build_lswitch_learn_fdb_op(
>> >>>>>>>      ovs_assert(op->nbsp);
>> >>>>>>>
>> >>>>>>>      if (!op->n_ps_addrs && op->has_unknown && 
>> >>>>>>> (!strcmp(op->nbsp->type, "") ||
>> >>>>>>> +        !strcmp(op->nbsp->type, "switch") ||
>> >>>>>>>          (lsp_is_localnet(op->nbsp) && 
>> >>>>>>> localnet_can_learn_mac(op->nbsp)))) {
>> >>>>>>>          ds_clear(match);
>> >>>>>>>          ds_clear(actions);
>> >>>>>>> -----------------------------------------------
>> >>>>>>>
>> >>>>>>> What do you think ?  Any concerns or objections ?   If not, can you
>> >>>>>>> submit a follow up patch to enable this ?
>> >>>>>>
>> >>>>>> This seems reasonable.  I missed the part that we enable FDB 
>> >>>>>> automatically
>> >>>>>> for some ports with "unknown".  Will look into that and submit a fix.
>> >>>>>
>> >>>>> So, I gave this some more thoughts and I'm not sure if we actually need
>> >>>>> FDB for "switch" ports in a common case.  In a non-IC setup OVN 
>> >>>>> actually
>> >>>>> knows the destination, if it's part of OVN network.  The whole 
>> >>>>> processing
>> >>>>> of the spine switch will happen on the source node and only the egress
>> >>>>> pipeline of the leaf switch will be executed on the destination node.
>> >>>>> If we have actual VM ports with "unknown", then those will have the FDB
>> >>>>> enabled, but we should not need to consult FDB for the spine switch 
>> >>>>> itself
>> >>>>> otherwise.  Does that make sense?
>> >>
>> >> You're right.  But enabling mac learning for "switch" ports does have
>> >> some benefits.
>> >>  By clone what I mean is that during the first packet upcall flow
>> >> translation, vswitchd will
>> >> clone the packet to all the leaf switch pipelines and yes, the final
>> >> datapath flow will
>> >> of course not clone the packets.    So it's not a big deal.  But I
>> >> think it's better
>> >> to enable fdb for switch ports.
>> >>
>> >>
>> >>>>
>> >>>> Thinking more about this again, there is still an issue if we have 
>> >>>> "unknown"
>> >>>> ports in leaf switches.  Since those switches are separate from the 
>> >>>> spine,
>> >>>> will broadcast in the spine and then every leaf that has "unknown" will 
>> >>>> also
>> >>>> broadcast within itself.  And so, "unknown"  VM ports will receive 
>> >>>> traffic
>> >>>> destined to ports on other leaf switches.  So, yes, we still need FDB 
>> >>>> for
>> >>>> this case.
>> >>
>> >> Ack.
>> >>
>> >>>>
>> >>>>>
>> >>>>> OTOH, what we may actually need is FDB learning for remote ports in IC
>> >>>>> setup.  In this case, OVN doesn't actually know the whole topology and
>> >>>>> can't figure out to which of the remote ports the packet should go, so 
>> >>>>> it
>> >>>>> will broadcast.  The problem here, however, is that FDB stages are in 
>> >>>>> the
>> >>>>> ingress pipeline, and they will not be executed on the remote node that
>> >>>>> actually needs them.  So, we may need to create FDB stages in the 
>> >>>>> egress
>> >>>>> pipeline for packets coming from remote ports.  Seems like a generic
>> >>>>> issue with "unknown" ports on a transit switch.
>> >>>>>
>> >>>>> Mark, Numan, Dumitru, what do you think?
>> >>>>
>> >>>> But this issue also still stands as we need a way to learn MAC addresses
>> >>>> from packet arriving from a remote port, so we either need FDB in the 
>> >>>> egress
>> >>>> pipeline of the transit switch for remote ports, or we need an external
>> >>>> service like ovn-ic synchronizing FDB entries between AZs, which is not 
>> >>>> good.
>> >>>>
>> >>>
>> >>> Hi Ilya, I agree with you that synchronizing FDB entries between AZs is 
>> >>> a really bad idea. FDB in the egress pipeline (and learning the mac for 
>> >>> inport of the spine switch) sounds reasonable.
>> >>
>> >> I agree.  FDB in the egress pipeline sounds reasonable.
>> >>
>> >
>> > +1
>> >
>> > I think it's reasonable implementation for all "learning features" in IC
>> > mode.  We already do that (for the same reason) for IGMP/MLD:
>> >
>> > https://github.com/ovn-org/ovn/commit/c6b20c9940ed41f028347f559b21a25fb7625eb4
>> >  
>> > <https://github.com/ovn-org/ovn/commit/c6b20c9940ed41f028347f559b21a25fb7625eb4>
>> >
>> >> If a spine transit switch has N ports,  then each packet (both request
>> >> and reply) will get flooded to
>> >> all the remote ports of the spine transit switch.  I tested it and I
>> >> can confirm.
>> >>
>> >> If we can't add FDB in the egress pipeline in a reasonable time for
>> >> 25.03,  I'd suggest documenting this
>> >> flood behavior and mark this feature as experimental so that the users
>> >> of 25.03 are well informed of this.
>> >>
>> >
>> > Sounds like something we can add after branching (either learning in the
>> > egress pipeline or marking the feature as experimental) indeed.
>>
>> Thanks, all!  I think, the fixes should be relatively simple, I'll try
>> to post some patches patches today or tomorrow.
>>
>> Best regards, Ilya Maximets.
> 
> I've given this more thought and believe it's worth documenting the 
> implications of this approach in more detail. Instead of maintaining static 
> port-to-location mappings for all ports, we are dynamically learning only the 
> mappings needed for communication. In most cases, where only a small portion 
> of workloads interact, this significantly reduces the port-to-location (or 
> MAC-to-location) mappings that each node must maintain. However, a side 
> effect is the flooding of initial packets.
> 
> While this trade-off is beneficial in many scenarios, it's important to 
> highlight that in the worst case—where every port communicates with every 
> other port—the MAC learning overhead would ultimately be comparable to that 
> of a single flat logical switch.
> 
> To help users make informed decisions about this feature, we should document 
> the following considerations:
> 
> 1. Flooding behavior – Either clarify that this approach is experimental due 
> to the potential for flooding every packet, or specify that only initial 
> packets are flooded when MAC learning is implemented.


I posted fixes here: 
https://patchwork.ozlabs.org/project/ovn/list/?series=444002&state=*
And included a documentation update to highlight the broadcasting and learning
behaviors, so users can make an informed decision while designing their 
networks.

> 
> 2. Workload communication patterns – This approach assumes that only a 
> relatively small subset of workloads communicate with each other. If most 
> workloads interact with all others, the benefits of a spine-leaf topology 
> diminish, making it less effective at scale.

There is still a control plane benefit of locality of changes, i.e. every node 
doesn't
have to know all the ports, and a spine-leaf topology can host more ports than 
a single
switch.  But yes, there are trade-offs to be aware of.  Please, take a look at 
the
documentation updates I made in the patch set above.  I didn't mention any 
specific
scenarios as the doc is for a port type, but I hope I made it clear enough how 
a switch
with such ports will behave, so it's not hidden from the potential users.

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH ovn v2 0/2] Spine-Leaf Logical Switch Topology.

Reply via email to