On Wed, Feb 12, 2025 at 4:51 AM Ilya Maximets <[email protected]> wrote:
>
> On 2/11/25 18:50, Han Zhou wrote:
> >
> >
> > On Mon, Feb 10, 2025 at 2:29 AM Ilya Maximets <[email protected]
<mailto:[email protected]>> wrote:
> >>
> >> On 2/10/25 10:13, Dumitru Ceara wrote:
> >> > On 2/9/25 9:47 PM, Numan Siddique wrote:
> >> >> On Sat, Feb 8, 2025 at 2:34 AM Han Zhou <[email protected] <mailto:
[email protected]>> wrote:
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Fri, Feb 7, 2025 at 1:09 PM Ilya Maximets <[email protected]
<mailto:[email protected]>> wrote:
> >> >>>>
> >> >>>> On 2/7/25 21:45, Ilya Maximets wrote:
> >> >>>>> On 2/7/25 15:01, Ilya Maximets wrote:
> >> >>>>>> On 2/7/25 05:31, Numan Siddique wrote:
> >> >>>>>>> On Thu, Feb 6, 2025 at 3:57 AM Ilya Maximets <
[email protected] <mailto:[email protected]>> wrote:
> >> >>>>>>>>
> >> >>>>>>>> This patch set introduces ability to directly connect
switches,
> >> >>>>>>>> including transit switches, in order to achieve higher total
port count
> >> >>>>>>>> and locality of changes in L2 topologies spread across
multiple
> >> >>>>>>>> availability zones.  And while tailored for this use case,
the changes
> >> >>>>>>>> do not impose any limitations and should allow for all kinds
of other
> >> >>>>>>>> different topologies.
> >> >>>>>>>>
> >> >>>>>>>> Amount of the logic code changes is relatively small, most of
the diff
> >> >>>>>>>> are new tests for the introduced functionality.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Version 2:
> >> >>>>>>>>
> >> >>>>>>>>   * Rebased on top of latest changes on the main branch.
> >> >>>>>>>>   * Improved validation of the peer in northd. [Mark]
> >> >>>>>>>>   * Added a test for a switch port with an address set. [Mark]
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Ilya Maximets (2):
> >> >>>>>>>>   northd: Add support for spine-leaf logical switch topology.
> >> >>>>>>>>   ic: Add support for spine-leaf topology for transit
switches.
> >> >>>>>>>
> >> >>>>>>> Hi Ilya,
> >> >>>>>>>
> >> >>>>>>> Thanks for adding this feature.  I applied both the patches to
the
> >> >>>>>>> main.  I had to do a minor rebase.
> >> >>>>>>
> >> >>>>>> Thanks, Numan and Mark!
> >> >>>>>>
> >> >>>>>>>
> >> >>>>>>> I see one small issue  with the spine switch.  Since all the
spine
> >> >>>>>>> switch ports have unknown address,
> >> >>>>>>> the packet will be cloned to N - 1 logical switches if N
switches connect to it.
> >> >>>>>>>
> >> >>>>>>> I think we should enable "fdb learn" in the spine switch with
the below changes
> >> >>>>>>>
> >> >>>>>>> -----------------------------------------------
> >> >>>>>>> diff --git a/northd/northd.c b/northd/northd.c
> >> >>>>>>> index 880112c3b9..e77936fbe9 100644
> >> >>>>>>> --- a/northd/northd.c
> >> >>>>>>> +++ b/northd/northd.c
> >> >>>>>>> @@ -5682,6 +5682,7 @@ build_lswitch_learn_fdb_op(
> >> >>>>>>>      ovs_assert(op->nbsp);
> >> >>>>>>>
> >> >>>>>>>      if (!op->n_ps_addrs && op->has_unknown &&
(!strcmp(op->nbsp->type, "") ||
> >> >>>>>>> +        !strcmp(op->nbsp->type, "switch") ||
> >> >>>>>>>          (lsp_is_localnet(op->nbsp) &&
localnet_can_learn_mac(op->nbsp)))) {
> >> >>>>>>>          ds_clear(match);
> >> >>>>>>>          ds_clear(actions);
> >> >>>>>>> -----------------------------------------------
> >> >>>>>>>
> >> >>>>>>> What do you think ?  Any concerns or objections ?   If not,
can you
> >> >>>>>>> submit a follow up patch to enable this ?
> >> >>>>>>
> >> >>>>>> This seems reasonable.  I missed the part that we enable FDB
automatically
> >> >>>>>> for some ports with "unknown".  Will look into that and submit
a fix.
> >> >>>>>
> >> >>>>> So, I gave this some more thoughts and I'm not sure if we
actually need
> >> >>>>> FDB for "switch" ports in a common case.  In a non-IC setup OVN
actually
> >> >>>>> knows the destination, if it's part of OVN network.  The whole
processing
> >> >>>>> of the spine switch will happen on the source node and only the
egress
> >> >>>>> pipeline of the leaf switch will be executed on the destination
node.
> >> >>>>> If we have actual VM ports with "unknown", then those will have
the FDB
> >> >>>>> enabled, but we should not need to consult FDB for the spine
switch itself
> >> >>>>> otherwise.  Does that make sense?
> >> >>
> >> >> You're right.  But enabling mac learning for "switch" ports does
have
> >> >> some benefits.
> >> >>  By clone what I mean is that during the first packet upcall flow
> >> >> translation, vswitchd will
> >> >> clone the packet to all the leaf switch pipelines and yes, the final
> >> >> datapath flow will
> >> >> of course not clone the packets.    So it's not a big deal.  But I
> >> >> think it's better
> >> >> to enable fdb for switch ports.
> >> >>
> >> >>
> >> >>>>
> >> >>>> Thinking more about this again, there is still an issue if we
have "unknown"
> >> >>>> ports in leaf switches.  Since those switches are separate from
the spine,
> >> >>>> will broadcast in the spine and then every leaf that has
"unknown" will also
> >> >>>> broadcast within itself.  And so, "unknown"  VM ports will
receive traffic
> >> >>>> destined to ports on other leaf switches.  So, yes, we still need
FDB for
> >> >>>> this case.
> >> >>
> >> >> Ack.
> >> >>
> >> >>>>
> >> >>>>>
> >> >>>>> OTOH, what we may actually need is FDB learning for remote ports
in IC
> >> >>>>> setup.  In this case, OVN doesn't actually know the whole
topology and
> >> >>>>> can't figure out to which of the remote ports the packet should
go, so it
> >> >>>>> will broadcast.  The problem here, however, is that FDB stages
are in the
> >> >>>>> ingress pipeline, and they will not be executed on the remote
node that
> >> >>>>> actually needs them.  So, we may need to create FDB stages in
the egress
> >> >>>>> pipeline for packets coming from remote ports.  Seems like a
generic
> >> >>>>> issue with "unknown" ports on a transit switch.
> >> >>>>>
> >> >>>>> Mark, Numan, Dumitru, what do you think?
> >> >>>>
> >> >>>> But this issue also still stands as we need a way to learn MAC
addresses
> >> >>>> from packet arriving from a remote port, so we either need FDB in
the egress
> >> >>>> pipeline of the transit switch for remote ports, or we need an
external
> >> >>>> service like ovn-ic synchronizing FDB entries between AZs, which
is not good.
> >> >>>>
> >> >>>
> >> >>> Hi Ilya, I agree with you that synchronizing FDB entries between
AZs is a really bad idea. FDB in the egress pipeline (and learning the mac
for inport of the spine switch) sounds reasonable.
> >> >>
> >> >> I agree.  FDB in the egress pipeline sounds reasonable.
> >> >>
> >> >
> >> > +1
> >> >
> >> > I think it's reasonable implementation for all "learning features"
in IC
> >> > mode.  We already do that (for the same reason) for IGMP/MLD:
> >> >
> >> >
https://github.com/ovn-org/ovn/commit/c6b20c9940ed41f028347f559b21a25fb7625eb4
<
https://github.com/ovn-org/ovn/commit/c6b20c9940ed41f028347f559b21a25fb7625eb4
>
> >> >
> >> >> If a spine transit switch has N ports,  then each packet (both
request
> >> >> and reply) will get flooded to
> >> >> all the remote ports of the spine transit switch.  I tested it and I
> >> >> can confirm.
> >> >>
> >> >> If we can't add FDB in the egress pipeline in a reasonable time for
> >> >> 25.03,  I'd suggest documenting this
> >> >> flood behavior and mark this feature as experimental so that the
users
> >> >> of 25.03 are well informed of this.
> >> >>
> >> >
> >> > Sounds like something we can add after branching (either learning in
the
> >> > egress pipeline or marking the feature as experimental) indeed.
> >>
> >> Thanks, all!  I think, the fixes should be relatively simple, I'll try
> >> to post some patches patches today or tomorrow.
> >>
> >> Best regards, Ilya Maximets.
> >
> > I've given this more thought and believe it's worth documenting the
implications of this approach in more detail. Instead of maintaining static
port-to-location mappings for all ports, we are dynamically learning only
the mappings needed for communication. In most cases, where only a small
portion of workloads interact, this significantly reduces the
port-to-location (or MAC-to-location) mappings that each node must
maintain. However, a side effect is the flooding of initial packets.
> >
> > While this trade-off is beneficial in many scenarios, it's important to
highlight that in the worst case—where every port communicates with every
other port—the MAC learning overhead would ultimately be comparable to that
of a single flat logical switch.
> >
> > To help users make informed decisions about this feature, we should
document the following considerations:
> >
> > 1. Flooding behavior – Either clarify that this approach is
experimental due to the potential for flooding every packet, or specify
that only initial packets are flooded when MAC learning is implemented.
>
> I posted fixes here:
https://patchwork.ozlabs.org/project/ovn/list/?series=444002&state=*
> And included a documentation update to highlight the broadcasting and
learning
> behaviors, so users can make an informed decision while designing their
networks.
>

Thanks Ilya.

> >
> > 2. Workload communication patterns – This approach assumes that only a
relatively small subset of workloads communicate with each other. If most
workloads interact with all others, the benefits of a spine-leaf topology
diminish, making it less effective at scale.
>
> There is still a control plane benefit of locality of changes, i.e. every
node doesn't
> have to know all the ports,

Although every node doesn't have to know all the port-bindings, they still
need to know every port's MAC's FDB information in the worst case. I agree
that per port-binding entries cost is higher than FDB entries, but both are
O(n) in the worst case, n = <total number of ports on the L2 network>. Of
course, this is only for the worst case, and hopefully most cases will
benefit from the locality.

> and a spine-leaf topology can host more ports than a single
> switch.

Yes, this is always true.

Thanks,
Han

> But yes, there are trade-offs to be aware of.  Please, take a look at the
> documentation updates I made in the patch set above.  I didn't mention
any specific
> scenarios as the doc is for a port type, but I hope I made it clear
enough how a switch
> with such ports will behave, so it's not hidden from the potential users.
>
> Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to