Hi, Dave,

        You are right that MPLS, PBB, or TRILL can use a shim header so that 
intermediate nodes need not be aware of the addresses in the inner payload. 
That reduces the FDB burden on intermediate nodes, but it does not help the 
edge nodes: those solutions do not solve the MAC address scalability issue of 
edge nodes, which include TOR switches and DC gateways.
        A host must consume at least one MAC entry in the FDB of an edge 
node; VLANs only divide those MAC addresses into many groups, they do not 
reduce the total consumption.
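        To make this concrete, a toy calculation (the numbers below are only 
an illustration):

    # Toy numbers (illustrative assumptions only): VLANs partition the
    # MAC space, but the edge FDB still holds one entry per active host.
    hosts = 1_000_000                 # hosts visible to one edge node
    vlans = 4094                      # all usable 12-bit VLAN IDs
    per_vlan = hosts // vlans         # ~244 MACs per VLAN group
    print(per_vlan * vlans)           # still ~1,000,000 entries in total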
        Some L2VPN solutions, such as PBB or VPLS, solve the MAC address 
scalability issue of the core network, but the MAC address scalability issue 
of edge nodes is out of their scope; for example, we have not seen a solution 
to the C-MAC scalability issue on a CE or DC gateway.
        In the cloud era, network virtualization breaks a subnet into many 
pieces, so edge nodes (e.g., DC gateways or TORs) will have to learn the MAC 
addresses of some or all of the hosts in other data centres. That is a fact.

Hewen

-----Original Message-----
From: David Allan I [mailto:david.i.al...@ericsson.com] 
Sent: Wednesday, October 09, 2013 10:01 PM
To: Zhenghewen; 'Thomas Narten'; Suresh Krishnan
Cc: 'Internet Area'
Subject: RE: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

My point was I would not have a design whereby every MAC for every vNIC in a 
cloud had to be in every FDB, which is what I believe you are describing.

I do not think I've seen mention of how SARP plays into an Ethernet active 
topology or VLANs. But if you ran a single instance of spanning tree and a 
single VLAN, and each VM had a sufficiently diverse set of peers that MAC table 
entries never aged out, you would end up with the scenario you described. I do 
not think that is a realistic scenario.

Once it is partitioned into multiple VLANs and uses MSTP or SPBV, the number 
of MACs in any individual FDB starts to divide; if you add MAC-in-MAC/SPBM it 
divides significantly further, by orders of magnitude. The set of MACs a TOR 
would see would be a fraction of the whole network even in extreme scenarios. 
It is not a simple linear sum of endpoints.
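As a rough sketch, with numbers that are purely my own assumptions:

    # Sketch of how partitioning divides FDB state (illustrative numbers).
    total_macs = 10_000_000
    macs_per_vlan = total_macs // 4000          # 2,500 MACs per VLAN

    # With MSTP/SPBV a TOR only learns MACs for VLANs it actually carries:
    fdb_tor = macs_per_vlan * 40                # 100,000 entries, not 10M

    # With MAC-in-MAC (SPBM) the core learns B-MACs, one per edge bridge,
    # instead of C-MACs -- orders of magnitude fewer:
    edge_bridges = 5_000
    print(fdb_tor, edge_bridges)                # 100000 TOR vs 5000 core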

Dave

-----Original Message-----
From: Zhenghewen [mailto:zhenghe...@huawei.com] 
Sent: Tuesday, October 08, 2013 6:11 PM
To: David Allan I; 'Thomas Narten'; Suresh Krishnan
Cc: 'Internet Area'
Subject: RE: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

Hi, Dave,
        
        I do not follow your point, and I think that perhaps you have not 
understood mine.
        My description shows that current FDB tables are in fact small, 
especially for a cloud data centre; it has nothing to do with VLANs. It is 
not one big subnet ("subnet" in my understanding means an IP subnet) of 10M 
MACs; it is about a larger layer-2 domain, for example, one operator 
providing a layer-2 service across data centres for tenants, where those 
tenants can own separate, overlapping IP address spaces.
        I do not see any other solution for this issue; could you provide 
some info? Thanks.

Hewen

-----Original Message-----
From: int-area-boun...@ietf.org [mailto:int-area-boun...@ietf.org] On Behalf Of 
David Allan I
Sent: Tuesday, October 08, 2013 10:12 PM
To: Zhenghewen; 'Thomas Narten'; Suresh Krishnan
Cc: 'Internet Area'
Subject: Re: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

I'm confused here: is the problem you are trying to solve one big subnet of 
10M MACs? Your message would seem to suggest this.

IMO the problem is a large number of VLANs with a much, much smaller number of 
MACs per VLAN, which makes the problem divisible and tractable with current 
technology. I consider that a solved problem.

Dave



-----Original Message-----
From: int-area-boun...@ietf.org [mailto:int-area-boun...@ietf.org] On Behalf Of 
Zhenghewen
Sent: Tuesday, October 08, 2013 5:08 AM
To: 'Thomas Narten'; Suresh Krishnan
Cc: 'Internet Area'
Subject: Re: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

Hi Thomas,

> 1) As DaveA points out, FDB tables are large on chips these days.

[Hewen] In the real world, FDB tables are actually small, especially for a 
large data centre in the cloud era.
        Engineers have at least two choices for implementing the FDB table: 
TCAM or DRAM. TCAM is very expensive and consumes more power (and thus runs 
hotter), which limits the feasible TCAM capacity on a board, so TCAM is not a 
good choice for a very large table. DRAM is more common and cheap, but a DRAM 
chip is limited in both capacity and speed; the largest single DRAM chips 
widely available on the market today hold about 4 Gb.
        Now, let's assume the packet-lookup chipset uses 128-bit search/hash 
items, and that one MAC lookup requires at least 3 items; a 4 Gb DRAM chip 
can then support up to roughly 10M MAC entries.
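        The arithmetic, spelled out:

    # 4 Gb chip, 128-bit search/hash items, >= 3 items per MAC lookup:
    chip_bits = 4 * 2**30                       # 4 Gb DRAM
    entry_bits = 3 * 128                        # bits per MAC entry
    print(chip_bits // entry_bits)              # ~11.2M, i.e. roughly 10M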
        In the real world, one operator has 80 data centres within a diameter 
of 1,000 km; each data centre has on average 1,500 racks, each rack on 
average 20 servers, and each server 20 VMs. That means up to 48 million MAC 
entries. Even if only 20% of the VMs generate traffic outside their local 
data centre, that is about 10M MAC entries.
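        The same numbers as a quick check:

    # 80 DCs x 1,500 racks x 20 servers x 20 VMs:
    vms = 80 * 1500 * 20 * 20                   # 48,000,000 VMs (MACs)
    print(vms, int(vms * 0.20))                 # 48M total, ~9.6M remote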
        It seems that a single DRAM chip would be enough, but there are some 
other obvious limitations:
        First, is 20% enough? And can the whole capacity of the DRAM chip 
really be devoted to MAC entries (nothing for the L3 FIB, nothing for ACLs)?
        Second, each DRAM needs at least one SerDes link at 12.5 Gbps to the 
packet-lookup chipset. If we want a 125 Gbps line rate, we must place 10 
copies of the DRAM chip, each holding the 10M MAC entries; if we want a 500 
Gbps line rate, that means 40 DRAM chips for the same 10M MAC entries. That 
is not feasible for engineers, especially considering the available board 
space, power consumption, and number of SerDes links.
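        Spelled out:

    # One SerDes link per DRAM at 12.5 Gbps; the table is replicated
    # across DRAM copies to reach the required lookup bandwidth:
    serdes_gbps = 12.5
    for line_rate_gbps in (125, 500):
        print(line_rate_gbps, int(line_rate_gbps / serdes_gbps))
    # -> a 125 Gbps line rate needs 10 DRAM copies; 500 Gbps needs 40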
        So, as we can see, current FDB tables are small, especially for a 
cloud data centre. A high-speed switch or router usually has only a limited 
table size, for example 128K or 512K MAC entries.

        Perhaps growth in DRAM capacity or speed will solve this issue, but 
according to RFC 4984, "Historically, DRAM capacity grows at about 4x every 
3.3 years.  This translates to 2.4x every 2 years, so DRAM capacity actually 
grows faster than Moore's law would suggest.  DRAM speed, however, only grows 
about 10% per year, or 1.2x every 2 years."  It seems we still need some 
years before a single high-speed DRAM chip can support 50 million MAC entries.
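        A rough extrapolation of those rates (my own, and it assumes the 
lookup bandwidth must also grow about 5x):

    import math
    # RFC 4984 rates: capacity 4x per 3.3 years; speed 1.2x per 2 years.
    # Years until a single chip grows 5x, from ~10M to 50M entries:
    print(round(math.log(5, 4) * 3.3, 1))       # ~3.8 years (capacity)
    print(round(math.log(5, 1.2) * 2, 1))       # ~17.7 years (speed)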
        Solutions for this issue will be valuable until the day that 
high-speed, high-capacity DRAM chips arrive.

Best regards,
Hewen

-----Original Message-----
From: int-area-boun...@ietf.org [mailto:int-area-boun...@ietf.org] On Behalf Of 
Thomas Narten
Sent: Saturday, September 28, 2013 5:28 AM
To: Suresh Krishnan
Cc: Internet Area
Subject: Re: [Int-area] Call for adoption of draft-nachum-sarp-06.txt

Given the flurry of mail, I went back and reviewed my notes on this document.

I believe I understand the problem (reduce size of FDB), but am skeptical that 
solving it via the approach in this document is worthwhile.

1) As DaveA points out, FDB tables are large on chips these days. However, at 
least initially, SARP would be implemented in the service processor (consuming 
CPU cycles). It would be years (if ever) before silicon would implement this. 
But in parallel, silicon will just get better and have bigger FDB tables... 
Moreover, the SARP motivation section specifically cites a shortage of 
processor cycles as a problem ... but doing SARP in software will increase the 
demands on the CPU... It's unclear to me that the increased CPU burden SARP 
implies would be offset by the reduced amount of ARP processing the solution 
hopes will result... I.e., at least in the short term, it's unclear that SARP 
would actually help in practice.

2) Doing L2 NAT at line rates (an absolute requirement for edge
devices) will only happen if this is done in silicon. I don't see that 
happening unless there is strong support from vendors/operators/chip makers... 
Software-based solutions simply will not have sufficient performance. I think 
the IEEE would be a better place to have the discussion about what L2 chip 
makers are willing to implement...

3) Only works for IP. Doesn't work for non-IP L2. Doesn't that only solve part 
of the problem?

4) High availability is complex (but presumably also necessary). It smells a 
lot like multi-chassis LAG (with the need to synchronize state between 
tightly-coupled peers). Although IEEE is working in this general area now, 
there are tons of vendor-specific solutions for doing this at L2. Do we really 
want to tackle standardizing this in the IETF?  Isn't the relevant expertise 
for this general area in IEEE?

5) This solution touches on both L2 (NATing L2 addresses) and L3 (ARPing for 
IP addresses). We would absolutely need to coordinate with the IEEE before 
deciding to take on this work.

6) ARP caching will be a tradeoff. You want to cache responses for better 
performance, but long cache timeouts will result in black holes after VMs move. 
There is no great answer here. I expect long timeouts to be unacceptable 
operationally, which means that the benefits of caching will be limited (and 
these benefits are the key value proposition of this approach). It is really 
hard to estimate whether the benefits will be sufficient in practice. 
Gratuitous ARPs can help, but they are not reliable, so their help has 
limitations...
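To make the tradeoff concrete, a toy model (entirely my own illustration, not 
anything from the draft):

    import time

    # Toy model of the caching tradeoff: a proxy caches IP->MAC bindings
    # with a TTL; after a VM moves, a stale entry black-holes traffic
    # until it expires or a gratuitous ARP happens to update it.
    class ArpCache:
        def __init__(self, ttl_seconds):
            self.ttl = ttl_seconds
            self.entries = {}               # ip -> (mac, expiry time)

        def learn(self, ip, mac):
            self.entries[ip] = (mac, time.time() + self.ttl)

        def lookup(self, ip):
            hit = self.entries.get(ip)
            if hit and hit[1] > time.time():
                return hit[0]               # possibly stale after a move
            return None                     # miss: proxy must re-ARP

    # Long TTL: few ARPs relayed, but up to ttl_seconds of black-holing
    # after a move. Short TTL: fast convergence, little caching benefit.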

7) I haven't fully worked this out, but I wonder if loops can form between 
proxies. There is the notion that when a VM moves, proxies will need to update 
their tables. But careful analysis will be needed to be sure that one can't 
ever have loops where proxies end up pointing at each other. And since the 
packets are L2 (with no TTL), such loops would be disastrous. (Note point 6: 
we are using heuristics (like gratuitous ARP) to get tables to converge after 
a change. Heuristics tend to have transient inconsistencies, i.e., possibly 
leading to loops.)

Again, overall, I understand the generic problem and that it would be nice to 
have a solution. However, I don't see a simple solution here. I see a fair 
amount of complexity, and I'm skeptical that it's worth it (e.g., when the next 
gen of silicon will just have a larger FDB).

What I'd really like to see (before having the IETF commit to this
work) is:

1) Operators who are feeling the pain described in the document stepping 
forward and saying they think the proposed solution is something they would be 
willing to deploy, and is better than other approaches they have in their 
toolkit.

2) Vendors (including silicon vendors) saying they see the need for this and 
think they would implement it.

Thomas

_______________________________________________
Int-area mailing list
Int-area@ietf.org
https://www.ietf.org/mailman/listinfo/int-area

