Resending from the email address I'm subscribed to...
> From: Pedro Marques <[email protected]>
> Subject: Re: draft-marques-l3vpn-mcast-edge-00
> Date: May 28, 2012 8:57:53 PM PDT
> To: Petr Lapukhov <[email protected]>
> Cc: Yiqun Cai <[email protected]>, "[email protected]"
> <[email protected]>, "[email protected]" <[email protected]>,
> "[email protected]" <[email protected]>
>
>
> On May 27, 2012, at 11:11 PM, Petr Lapukhov wrote:
>
>> Hi Pedro,
>
> Petr,
> Thank you for your comments. Answers inline.
>
>>
>> Thanks for an interesting read! However, I have some concerns regarding the
>> problem statement in the document:
>>
>>> For Clos topologies with multiple stages native multicast support
>>> within the switching infrastructure is both unnecessary and
>>> undesirable. By definition the Clos network has enough bandwidth to
>>> deliver a packet from any input port to any output port. Native
>>> multicast support would however make it such that the network would
>>> no longer be non-blocking. Bringing with it the need to devise
>>> congestion management procedures.
>>
>> Here they are:
>>
>> 1) Multicast routing over Clos topology could be non-blocking provided that
>> some criteria on Clos topology dimensions are met and multicast distribution
>> tree fan-outs are properly balanced at ingress and middle stages of the Clos
>> fabric.
>
> Multicast over a Clos topology creates congestion management issues. One way
> to address the problem, in large-scale Clos topologies, is to eliminate
> native multicast in the fabric. That approach is taken in several
> networks, including networks that are fully enclosed in a chassis or set of
> chassis.
>
>>
>> 2) Congestion management in Clos networks would be necessary in any case,
>> due to statistical multiplexing and possibility of (N -> 1) port traffic
>> flow.
>
> In practice, many networks run Clos topologies with no congestion
> management support. The assumption is that if hash-based load balancing of
> flows is "good enough" and the flows are small compared to link capacity,
> the fabric is effectively non-blocking. This allows one to build very
> large-scale Clos fabrics with off-the-shelf and/or heterogeneous
> components, where each switch works independently. Congestion management at
> large scale is a very thorny issue…
>
> I believe there are several efforts in the IEEE, under the umbrella of
> "data-center Ethernet", to bring global congestion notification/flow
> control into a heterogeneous environment. It is my understanding that a
> non-trivial number of networks prefer to operate with a simple hash-based
> mechanism.
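For illustration, the hash-based flow placement described above can be sketched as follows. This is a minimal, hypothetical example — the uplink names and the choice of hash are assumptions for the sketch, not taken from any implementation discussed in this thread:

```python
import hashlib

# Hypothetical uplinks from a leaf switch to the spine stage.
UPLINKS = ["spine-1", "spine-2", "spine-3", "spine-4"]

def select_uplink(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    """Pick an uplink by hashing the flow's 5-tuple: every packet of a
    flow takes the same path, while distinct flows spread across links."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(UPLINKS)
    return UPLINKS[index]

# The same flow always maps to the same uplink; no per-switch state
# or global coordination is needed, which is what lets each switch
# in the fabric work independently.
path = select_uplink("10.0.0.1", "10.0.1.2", 49152, 443)
assert path == select_uplink("10.0.0.1", "10.0.1.2", 49152, 443)
```

The "good enough" caveat above is visible in this sketch: if one flow is large relative to link capacity, hashing cannot split it, and the link it lands on can congest.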
>
>> 3) The "ingress unicast replication" in VPN forwarder creates the following
>> issues:
>>
>> 3.1) If done at software hypervisor level, it will most likely overload
>> physical uplink(s) on the server: N replicas sent as opposed to 1 in case of
>> native multicast
>
> This is the main rationale for this work. One could have started with just
> plain ingress replication, but in that case the ingress would have to
> replicate to the full membership of the group. With an edge replication
> tree, the number of copies sent by any given node is limited to N.
> As with any other network design, it is a question of trade-offs. The
> authors believe there is a non-trivial number of applications (e.g.
> discovery) where this is a useful approach.
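To make the trade-off concrete, here is a small sketch of the copy counts, where N is the per-node fan-out limit described above (the group size and fan-out figures are hypothetical, chosen only for illustration):

```python
import math

def ingress_copies(group_size):
    # Plain ingress replication: the source sends one copy to every
    # other member of the group.
    return group_size - 1

def tree_levels(group_size, fanout):
    # Edge replication tree: each node sends at most `fanout` copies,
    # so reaching the whole group takes ceil(log_fanout(size)) levels.
    return math.ceil(math.log(group_size, fanout))

# Example: a 1000-member group with a per-node fan-out of N = 10.
print(ingress_copies(1000))   # 999 copies from the ingress alone
print(tree_levels(1000, 10))  # 3 levels, at most 10 copies per node
```

The cost of bounding per-node replication is added delivery latency: copies traverse several levels of the tree instead of going directly from the ingress to each receiver.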
>
>> 3.2) If done at hardware switch level (edge of physical Clos topology), it
>> cannot leverage hardware capabilities for multicast replication, and thus
>> could be difficult to implement and will stress the switch internal fabric.
>
> Building hardware with no multicast support can also simplify the hardware
> design.
>
>>
>> 4) If L3 VPN spans WAN for Inter-DC communications, unicast replication
>> makes any WAN multicast optimization impossible, unless there is a
>> "translating" WAN gateway that will forward packets as native multicast.
>
> The document only covers intra-DC scenarios, as of now. For WAN traffic, we
> do assume that there are systems that support L3VPN multicast as defined
> currently.
>
>> 5) Optimizing overlay multicast distribution tree could be difficult, since
>> underlying network metrics may be hidden from VPN gateways.
>
> In several practical scenarios I am aware of, the intra-DC network has two
> cost points: same rack and different racks. Even in scenarios where there
> are multiple metrics, the BGP signaling gateway can be made aware of the
> physical topology of the network. My understanding is that the intra-DC
> network can be optimized.
>
>>
>> I'm reviewing the rest of the document, and hopefully can come up with more
>> comments later.
>
> Thank you very much for your attention.
>
>>
>> Best regards,
>>
>> Petr Lapukhov
>> Microsoft
>>
>