Pedro,

Thanks for your response! Firstly, I should have noted that I find the proposed 
multicast "emulation" to be very helpful in some realistic scenarios, 
specifically in cases where multicast cannot be easily turned on (e.g. due to 
the risk of software problems, since PIM-SM is not an easy protocol to implement). 

I still believe that optimizing multicast routing in "dense" networks is a 
separate interesting problem, one that does have "some" solutions, and I'm sure 
you are well aware of the research done in that field. Of course, in the real 
world of "custom" Clos fabrics, the use of regular PIM-SM with SPT trees has 
obvious optimality implications (e.g. load-sharing the SPTs, optimum fan-out 
distribution and so on).

For congestion management in Clos networks - we have observed very high 
buffer utilization watermarks on practically all of our "spine" switches 
(unicast traffic only), due to peak loads of various compute traffic. 
Furthermore, experiments have shown that even a very simple QoS policy with 
two xWRR queues differentiating bulk and query traffic results in significant 
performance improvements. However, I agree that adding multicast flows to the 
mix here may create interesting complications in the switch internal fabrics.
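For illustration, a minimal sketch of such a two-queue weighted round-robin 
policy (the weights and the class names are assumptions for the example, not 
the actual policy we deployed):

```python
from collections import deque

# Hypothetical weights: the query class gets more scheduling
# opportunities per round than the bulk class (illustrative values).
WEIGHTS = {"query": 3, "bulk": 1}

queues = {"query": deque(), "bulk": deque()}

def enqueue(klass, packet):
    """Classify a packet into its WRR queue."""
    queues[klass].append(packet)

def wrr_round():
    """One WRR round: dequeue up to `weight` packets per class."""
    sent = []
    for klass, weight in WEIGHTS.items():
        for _ in range(weight):
            if queues[klass]:
                sent.append(queues[klass].popleft())
    return sent

# Under contention, query traffic drains 3x faster than bulk.
for i in range(4):
    enqueue("bulk", f"bulk-{i}")
    enqueue("query", f"query-{i}")
print(wrr_round())  # ['query-0', 'query-1', 'query-2', 'bulk-0']
```

Even this crude differentiation keeps latency-sensitive query traffic from 
sitting behind bulk transfers in a shared buffer.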

My last question was whether there exists any analysis/research on the 
tradeoffs associated with edge replication using an "overlay" tree - e.g. 
having replicated packets cross the bisection twice as part of the constructed 
distribution tree.

Thank you,

Petr Lapukhov
Microsoft

-----Original Message-----
From: Pedro Marques [mailto:[email protected]] 
Sent: Monday, May 28, 2012 8:58 PM
To: Petr Lapukhov
Cc: Yiqun Cai; [email protected]; [email protected]; [email protected]
Subject: Re: draft-marques-l3vpn-mcast-edge-00


On May 27, 2012, at 11:11 PM, Petr Lapukhov wrote:

> Hi Pedro,

Petr,
Thank you for your comments. Answers inline.

> 
> Thanks for an interesting read! However, I have some concerns regarding the 
> problem statement in the document:
> 
>> For Clos topologies with multiple stages native multicast support 
>> within the switching infrastructure is both unnecessary and 
>> undesirable.  By definition the Clos network has enough bandwidth to 
>> deliver a packet from any input port to any output port.  Native 
>> multicast support would however make it such that the network would 
>> no longer be non-blocking.  Bringing with it the need to devise 
>> congestion management procedures.
> 
> Here they are:
> 
> 1) Multicast routing over Clos topology could be non-blocking provided that 
> some criteria on Clos topology dimensions are met and multicast distribution 
> tree fan-outs are properly balanced at ingress and middle stages of the Clos 
> fabric.

Multicast over a CLOS topology creates congestion management issues. One way to 
address the problem, in large scale CLOS topologies, is to eliminate native 
multicast in the fabric. That is an approach taken in several networks, 
including networks that are fully enclosed in a chassis or set of chassis.

> 
> 2) Congestion management in Clos networks would be necessary in any case, due 
> to statistical multiplexing and possibility of (N -> 1) port traffic flow.

In practice, many networks are running CLOS topologies with no congestion 
management support. The assumption is that if hash-based load balancing of 
flows is "good enough" and the flows are small compared to the link size, then 
the fabric is effectively non-blocking. This allows one to build very large 
scale CLOS fabrics with off-the-shelf and/or heterogeneous components, where 
each switch works independently. Congestion management at large scale is a very 
thorny issue...

I believe that there are several efforts in the IEEE under the umbrella of 
"data-center ethernet" to bring global congestion notification/flow control 
into a heterogeneous environment. It is my understanding that a non-trivial 
number of networks prefer to operate with simple hash-based mechanisms.
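The hash-based mechanism above can be sketched roughly as follows (a toy 
illustration; the 5-tuple fields and the CRC32 hash are assumptions, not any 
particular switch's implementation):

```python
import zlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    """Pick one of `num_paths` equal-cost uplinks by hashing the
    flow 5-tuple; every packet of a flow takes the same path."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % num_paths

# Packets of the same flow always map to the same uplink, while
# distinct flows spread (statistically) across the available paths.
p1 = ecmp_path("10.0.0.1", "10.0.1.5", 40000, 80, 6, 16)
p2 = ecmp_path("10.0.0.1", "10.0.1.5", 40000, 80, 6, 16)
assert p1 == p2  # deterministic per flow
```

The scheme is stateless and per-switch, which is exactly why it scales - but 
it only approximates non-blocking behavior when flows are numerous and small 
relative to link capacity.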

> 3) The "ingress unicast replication" in VPN forwarder creates the following 
> issues:
> 
> 3.1) If done at software hypervisor level, it will most likely 
> overload physical uplink(s) on the server: N replicas sent as opposed 
> to 1 in case of native multicast

This is the main rationale for this work. One could have started with just 
plain ingress replication, but in that case the ingress would have to replicate 
to the full membership of the group. With an edge replication tree, the number 
of copies made by any single node is limited to N.
As with any other network design, it is a question of trade-offs. The authors 
believe there is a non-trivial number of applications (e.g. discovery) where 
this is a useful approach.
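A rough way to see the trade-off numerically (a sketch under assumed 
parameters; the fan-out N and the group size are illustrative only):

```python
import math

def ingress_copies(receivers):
    """Plain ingress replication: the source emits one copy per receiver."""
    return receivers

def tree_fanout_and_depth(receivers, n):
    """N-ary edge replication tree: each node forwards at most N copies,
    at the cost of a tree depth that grows logarithmically in group size."""
    depth = math.ceil(math.log(receivers, n)) if receivers > 1 else 0
    return n, depth

# Illustrative numbers: a 1000-member group with fan-out N = 8.
print(ingress_copies(1000))            # 1000 copies at the source
print(tree_fanout_and_depth(1000, 8))  # at most 8 copies per node, 4 levels
```

The tree bounds the replication load on any one hypervisor or gateway, but 
each extra level is an extra traversal of the fabric - which ties back to the 
bisection-crossing question above.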

> 3.2) If done at hardware switch level (edge of physical Clos topology), it 
> cannot leverage hardware capabilities for multicast replication, and thus 
> could be difficult to implement and will stress the switch internal fabric.

Building hardware with no multicast support can also simplify the hardware 
design.

> 
> 4) If L3 VPN spans WAN for Inter-DC communications, unicast replication makes 
> any WAN multicast optimization impossible, unless there is a "translating" 
> WAN gateway that will forward packets as native multicast.

The document only covers intra-DC scenarios, as of now. For WAN traffic, we do 
assume that there are systems that support L3VPN multicast as defined currently.

> 5) Optimizing overlay multicast distribution tree could be difficult, since 
> underlying network metrics may be hidden from VPN gateways.

In several practical scenarios I am aware of, the intra-DC network has two 
cost points: same rack and different racks. Even in scenarios where there are 
multiple metrics, the BGP signaling gateway can be made aware of the physical 
topology of the network. My understanding is that the intra-DC network can be 
optimized.

> 
> I'm reviewing the rest of the document, and hopefully can come up with more 
> comments later.

Thank you very much for your attention.

> 
> Best regards,
> 
> Petr Lapukhov
> Microsoft
> 


