Hi Jeff,

There are 2 scenarios:

1. VTEP - VTEP
2. VM - VM (this scenario does not have the issues of the VTEP-VTEP algorithm)

Now for 1.

If we didn’t have a controller I would have agreed with your point.
In a datacenter network designed for east-west traffic, with a leaf-spine or 
leaf, spine and super-spine architecture, all the paths can be achieved. Anyway, 
the controller knows all the paths even in a complex network, and if not all 
paths can be achieved a newer algorithm can be exercised. If sending packets 
across the full sport (UDP source port) range also doesn’t give all the paths, 
then there’s definitely something wrong with the box’s ECMP.

In this solution every original packet is sent back to the controller, so the 
controller and the ops guys even know how efficiently hashing or load balancing 
is working on each switch in the network. The controller also learns from this 
distribution and can change the algorithm by sending different VM-VM traffic to 
cover all the paths.
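
As a rough illustration of what the controller could do with the punted 
copies, the sketch below counts how often each ECMP link was used per switch 
and lists links that no probe reached. The report format (switch, link, sport) 
and all names are assumptions made for this example, not something the draft 
defines.

```python
from collections import Counter, defaultdict

def summarize(reports, expected_links):
    """Summarize probe coverage per switch.

    reports: iterable of (switch, link, sport) tuples, one per punted copy
             (hypothetical report format).
    expected_links: dict mapping switch -> set of ECMP links the controller
                    believes exist on that switch.
    """
    per_switch = defaultdict(Counter)
    for switch, link, _sport in reports:
        per_switch[switch][link] += 1

    summary = {}
    for switch, links in expected_links.items():
        counts = per_switch[switch]
        summary[switch] = {
            "counts": dict(counts),                  # load distribution seen
            "missing": sorted(links - set(counts)),  # links no probe reached
        }
    return summary

reports = [
    ("spine1", "eth1", 49152),
    ("spine1", "eth1", 49153),
    ("spine1", "eth2", 49154),
    ("leaf1", "up1", 49152),
]
expected = {"spine1": {"eth1", "eth2", "eth3"}, "leaf1": {"up1", "up2"}}
summary = summarize(reports, expected)
print(summary)
```

The "missing" list is what would drive the next round of probes with a 
different set of source ports.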

If all else fails we can work on a solution like the one in George Swallow’s 
MPLS draft, learning the sport by querying each node, but that’s outside the 
scope of this document.

Thanks,
Deepak

From: Jeff Tantsura <[email protected]>
Date: Wednesday, November 4, 2015 at 6:16 PM
To: dekumar <[email protected]>, Haoweiguo <[email protected]>, Sam Aldrin 
<[email protected]>
Cc: Shahram Davari <[email protected]>, "[email protected]" <[email protected]>, 
Dacheng Zhang <[email protected]>
Subject: Re: [nvo3] draft-pang-nvo3-vxlan-path-detection-01

Hi,

Going back to a more fundamental problem: unpredictability.
It is not uncommon to have 64-128 way ECMP between source and destination 
VTEPs across an IP fabric.
In order to exercise every possible path, enough entropy (in this case, the UDP 
source port) has to be created by the sending VTEP. While that is possible, 
hashing is local to the devices applying it, so there’s a high probability that 
some paths would never be exercised.
Exposing hashing semantics to a controller in a multivendor environment sounds 
a bit unrealistic.
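
To make the coverage question concrete, here is a minimal simulation sketch. 
It models a two-tier fabric in which every device picks a next hop from its own 
seeded hash of the outer flow fields, then counts the distinct paths reached by 
a sweep of UDP source ports. The hash (CRC32 with a per-device seed), the 
fan-out values, the addresses, and all function names are assumptions for 
illustration only; real ASIC hash functions are vendor specific.

```python
import zlib

def pick_next_hop(seed: int, src_ip: str, dst_ip: str, sport: int, fanout: int) -> int:
    """Hypothetical per-device ECMP choice: CRC32 over a seed plus outer flow fields."""
    key = f"{seed}|{src_ip}|{dst_ip}|{sport}|4789".encode()
    return zlib.crc32(key) % fanout

def paths_covered(sports, leaf_fanout=8, spine_fanout=8):
    """Return the set of (leaf uplink, spine downlink) pairs hit by the sweep."""
    covered = set()
    for sport in sports:
        up = pick_next_hop(1, "10.0.0.1", "10.0.0.2", sport, leaf_fanout)
        # Each spine hashes with its own seed, independently of the leaf.
        down = pick_next_hop(100 + up, "10.0.0.1", "10.0.0.2", sport, spine_fanout)
        covered.add((up, down))
    return covered

total = 8 * 8
few = paths_covered(range(49152, 49152 + 32))   # 32 probes for 64 paths
full = paths_covered(range(49152, 65536))       # the whole dynamic port range
print(f"32 ports: {len(few)}/{total} paths; full range: {len(full)}/{total} paths")
```

A sweep with fewer probes than paths can never cover the fabric; whether the 
full sweep does depends on how the per-device hashes interact, which the sender 
cannot know without learning each device's hashing behavior.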

Cheers,
Jeff

From: nvo3 <[email protected]> on behalf of "Deepak Kumar (dekumar)" 
<[email protected]>
Date: Thursday, November 5, 2015 at 09:43
To: Haoweiguo <[email protected]>, Sam Aldrin <[email protected]>
Cc: Shahram Davari <[email protected]>, "[email protected]" <[email protected]>, 
Dacheng Zhang <[email protected]>
Subject: Re: [nvo3] draft-pang-nvo3-vxlan-path-detection-01

Hi,

I double checked with all of my hardware teams. Reserved bits are usually 
ignored on reception, so there’s no side effect on VXLAN hardware; I believe TCP 
was also extended to support ECN this way. Also, we are keeping this within the 
network, so the server/VM won’t see this traffic.

Now when new encapsulations that are hardware friendly arrive (Geneve/GUE/GPE; 
we will respin the ASIC if required), we will have an even more powerful 
hardware OAM solution, as we will be able to generate OAM packets from the 
ingress pipeline with the OAM bit set, so we verify the actual datapath with QoS 
at the initiating switch.

A hardware-based OAM solution for hardware VTEPs is necessary, as TTL or other 
existing solutions can’t address all the problems.

1.
As I said, I like TTL expiry, but it doesn’t verify the exact datapath that real 
VXLAN traffic travels on the leaf, so it’s more of a better-than-nothing 
solution.
Based on the few implementations I know from our ASICs, the TTL stage of the 
pipeline occurs before the VXLAN path, so it doesn’t verify the exact datapath, 
VNI to VLAN/BD mapping, etc.
When a packet is received on the core-facing interface from the core side, the 
platform adds its own header and decrements the TTL, and that causes an 
exception, so the complete packet is sent to the software slow path.
When a packet is forwarded in hardware, the path is core port <--> core -- 
VTEP -- bridge -- edge port.
So the steps are:
1. Add the platform-specific header.
2. Check how to forward the packet and whether it’s for us (the IP matches the 
VTEP).
3. Decrement the TTL.
4. Check for a TTL exception; on exception, punt to the software slow path.
5. As it matches the peer VTEP IP, decap the packet and use the VNI to add the 
VLAN header to the inner packet.
6. Bridge the packet towards the host on that VLAN.
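
The ordering in the steps above can be sketched as a small model. It only 
mirrors the stage order described in this mail (TTL check ahead of decap); the 
packet fields, stage names, and return strings are hypothetical, and real 
pipelines differ per ASIC.

```python
def receive_pipeline(pkt, local_vtep_ip, vni_to_vlan):
    """Return (result, stages) for a packet arriving on a core-facing port.

    pkt is a dict with 'outer_dst_ip', 'outer_ttl' and 'vni' (assumed layout).
    """
    stages = ["add_platform_header", "forwarding_lookup"]
    for_us = pkt["outer_dst_ip"] == local_vtep_ip

    stages.append("decrement_ttl")
    ttl = pkt["outer_ttl"] - 1

    stages.append("ttl_exception_check")
    if ttl <= 0:
        # The probe leaves the hardware path here, before any VXLAN processing.
        return "punt_to_software", stages

    if not for_us:
        return "route_in_underlay", stages

    stages.append("decap_and_vni_to_vlan")
    vlan = vni_to_vlan.get(pkt["vni"])
    if vlan is None:
        return "drop_unknown_vni", stages

    stages.append("bridge_to_edge_port")
    return f"bridged_on_vlan_{vlan}", stages

broken_mapping = {}  # VNI 5000 is missing: a fault in the VXLAN stage
probe = {"outer_dst_ip": "10.0.0.2", "outer_ttl": 1, "vni": 5000}
data = {"outer_dst_ip": "10.0.0.2", "outer_ttl": 64, "vni": 5000}

probe_result, probe_stages = receive_pipeline(probe, "10.0.0.2", broken_mapping)
data_result, data_stages = receive_pipeline(data, "10.0.0.2", broken_mapping)
print(probe_result, probe_stages)
print(data_result, data_stages)
```

In this model the TTL-expiry probe is punted without ever touching the VNI to 
VLAN lookup, so the broken mapping that drops the data packet goes unnoticed by 
the probe.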

Newer hardware even has resilient ECMP, where software and hardware ECMP are not 
the same, so with the TTL expiry method it will be a little hard to get the 
egress interface even in the underlay.

2.
Also, leafs are deployed in a resilient manner, so traffic flow is not just VTEP 
to VTEP on the same path.
-------
L3 core
-------
  |
L1—L2   —> Both leafs acting as Virtual Vtep IP.
\  /
 \/
 VM

What happens now is that traffic to the VM can be hashed by ECMP to either L1 or 
L2. If the link between the VM and L1 goes down for some reason, this 
information is local and not known to the core, so packets will still be 
delivered to both L1 and L2, but traffic arriving at L1 reaches the VM over the 
link connecting L1 -> L2 -> VM.

3. Even on the initiator leaf, the VM path can’t be traced without having the 
VM’s traffic profile.
For example, if a VM does a ping between 2 endpoints, it doesn’t cover the same 
path as real traffic, as the hardware at the initiator does L3 hashing for it 
instead of the L4 hashing used for UDP/TCP flows.

VM hashing is required to get the sport, and after that, if the packet is 
delivered from software, we need to find the right egress interface. That 
requires the outer header, and in this scenario, due to the tunnel, this hashing 
is different and implementation specific, but usually different from the L3 
hashing in the core.
In the case of VXLAN GPE, we won’t have to do all these software tricks, as we 
can insert the packet right from the ingress pipeline and it will cover the same 
path as data, so we can move the bit position for GPE, Geneve, and GUE as 
appropriate.
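
A minimal sketch of the ping versus real-traffic point: the outer UDP source 
port is derived from a hash of inner packet fields, and a flow without L4 ports 
falls back to an L3-only hash. The port range is the dynamic/private range that 
RFC 7348 recommends; the CRC32 hash, the field selection, and the function name 
are assumptions, since the actual hash is implementation specific.

```python
import zlib

PORT_MIN, PORT_MAX = 49152, 65535  # dynamic/private range recommended by RFC 7348

def outer_sport(src_ip, dst_ip, proto, sport=None, dport=None):
    """Hypothetical VTEP entropy function mapping inner fields to an outer sport.

    Flows without L4 ports (for example an ICMP ping) use an L3-only key.
    """
    if sport is None or dport is None:
        key = f"{src_ip}|{dst_ip}|{proto}"
    else:
        key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}"
    return PORT_MIN + zlib.crc32(key.encode()) % (PORT_MAX - PORT_MIN + 1)

ping = outer_sport("192.0.2.10", "192.0.2.20", "icmp")
web = outer_sport("192.0.2.10", "192.0.2.20", "tcp", 33000, 443)
print("ping sport:", ping, "tcp sport:", web)
```

Because the two flows hash over different fields, they generally get different 
outer source ports, so a ping between the same two VMs need not follow the path 
taken by the TCP flow.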


Thanks,
Deepak

From: Haoweiguo <[email protected]>
Date: Wednesday, November 4, 2015 at 4:31 AM
To: dekumar <[email protected]>, Sam Aldrin <[email protected]>
Cc: Shahram Davari <[email protected]>, "[email protected]" <[email protected]>, 
Dacheng Zhang <[email protected]>
Subject: RE: [nvo3] draft-pang-nvo3-vxlan-path-detection-01


Hi Sam,

The extra bit in the VXLAN reserved field has no side effect on the regular 
VXLAN forwarding process. The hardware requirements for intermediate nodes are 
also low: the intermediate nodes only need to punt data packets carrying the OAM 
flag to the control plane using a regular ACL, and most current commercial 
chipsets can support this behavior.

Thanks,

weiguo

________________________________
From: Deepak Kumar (dekumar) [[email protected]]
Sent: Wednesday, November 04, 2015 14:51
To: Sam Aldrin
Cc: Shahram Davari; [email protected]; Dacheng Zhang
Subject: Re: [nvo3] draft-pang-nvo3-vxlan-path-detection-01

Hi Sam,

The VXLAN field that is used is a reserved field, so existing ASIC-based 
hardware won't set it on transmit, but receiving a packet with a reserved bit 
set has no side effect.
If the hardware is programmable there is no issue even on transmit.

Can you give me an example of any ASIC implementation that will have a problem? 
We can add text telling users to be careful before turning on the solution.

We can even call this an extension of VXLAN with a PD bit.

Thanks,
Deepak

Sent from my iPhone

On Nov 4, 2015, at 3:09 PM, Sam Aldrin <[email protected]> wrote:

Hi Deepak,

Aren’t you changing the packet format by introducing a PD flag bit in the 
reserved field, i.e. changing RFC 7348?
If so, how can you claim to be informational? Is it because the RFC is 
informational?

For example, VXLAN-GPE is on the standards track, although it is now in expired 
state.

Irrespective of technical differences, if a specific format is being changed, 
it will impact existing and future deployments as well, informational or not.
Being informational does not avoid that.

-sam
On Nov 3, 2015, at 8:03 PM, Deepak Kumar (dekumar) <[email protected]> wrote:

Hi Sam,

This is a good discussion. We are bringing this draft as an informational draft 
for a narrow scenario that applies to some operators but not to others.

The TTL solution is too slow at scale. Instead of arguing we can give data on 
how much time it takes; for some operators that amount of time is okay, but 
others will want it to complete quickly. As this is an informational solution, 
it is brought to the working group as a hardware-driven, controller-controlled 
scenario, and we can make its language "may" and "should" so that all the issues 
it may cause for software VTEPs can be fixed.

Why can't software-based and hardware-based solutions co-exist when an 
informational draft won't force everyone to implement it?

Thanks
Deepak

Sent from my iPhone

On Nov 4, 2015, at 12:41 PM, Sam Aldrin <[email protected]> wrote:

Hi Deepak,

What you are describing is a very narrow scenario, which has its own pitfalls.
My comments are inline.
On Nov 3, 2015, at 7:10 PM, Deepak Kumar (dekumar) <[email protected]> wrote:

Hi Shahram/Sam,

This solution is hardware centric with a controller, and a policy needs to be 
created on each hop.
This solution is not applicable to all scenarios.

Policy example:
Match: peer VTEP IP == destination IP of packet, destination port == 4789, PD 
bit set; action: punt and drop.
Match: peer VTEP IP != destination IP of packet, destination port == 4789, PD 
bit set; action: punt and forward.
If you want to employ a policy for every VTEP and on every device in the 
network, that is, IMO, a bad design to start with.
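
For readers following the policy example, the two rules can be sketched as a 
classifier. The 8-byte header layout and the I flag (0x08) come from RFC 7348; 
the PD bit position used here (0x01 in the flags byte) is a placeholder chosen 
for this example, and the function names are hypothetical.

```python
import struct

VXLAN_PORT = 4789
I_FLAG = 0x08   # "VNI valid" flag defined by RFC 7348
PD_FLAG = 0x01  # placeholder position for the PD bit (assumption for this sketch)

def build_vxlan_header(vni: int, pd: bool = False) -> bytes:
    """Pack flags (1 byte), 3 reserved bytes, then VNI (24 bits) + 1 reserved byte."""
    flags = I_FLAG | (PD_FLAG if pd else 0)
    return struct.pack("!B3xI", flags, vni << 8)

def classify(dst_ip: str, udp_dport: int, hdr: bytes, peer_vtep_ip: str) -> str:
    """Return 'punt_and_drop', 'punt_and_forward' or 'forward' for one packet."""
    if udp_dport != VXLAN_PORT or len(hdr) < 8:
        return "forward"                 # not VXLAN: policy does not apply
    if not hdr[0] & PD_FLAG:
        return "forward"                 # ordinary VXLAN data packet
    if dst_ip == peer_vtep_ip:
        return "punt_and_drop"           # terminating VTEP: copy to controller, stop
    return "punt_and_forward"            # transit hop: copy to controller, keep going

probe = build_vxlan_header(5000, pd=True)
print(classify("10.0.0.2", VXLAN_PORT, probe, "10.0.0.2"))
print(classify("10.0.0.2", VXLAN_PORT, probe, "10.0.0.7"))
```

The drop rule at the terminating VTEP is what keeps probe packets from leaking 
towards the VM.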

Now drop takes care of the leak scenario from leafs.

The controller eats up the packet, so there is no issue of a loop.
Also, in the network the packet travels as a data packet with max TTL as per the 
VXLAN rules, so I'm not sure where the loop is.
You mean there cannot be loops in the n/w, just because TTL is used? (loop life 
is dependent on TTL)

If a loop is there, OAM and data will both suffer.
Yes, both will suffer. You use OAM to detect whether the data plane has a 
problem or not. With this, it will compound the problem.

A loop with the controller can be avoided, but that's outside the scope.

Alibaba is also an operator and is using this in the data center for cloud 
services.

I agree TTL expiry will also work, but that's a software solution and a 
separate draft, not the intention of this draft.
If you already have a solution, why invent a new one? Are you saying the 
controller is not efficient and cannot perform OAM efficiently with the 
existing TTL mechanism? :D


On the concern about policy application: the controller will apply the policy, 
and if the network is not hardware OAM capable it won't initiate this and will 
use the software OAM method.
Well, you have the answer right there.
In other words, if a device cannot support your proposed solution, you will 
revert to the TTL solution. Why don’t you just use that solution instead?

We evaluated multiple ASICs and found that the solution can be done on multiple 
Broadcom and custom ASICs, and the Alibaba network is running on 2 different 
Broadcom ASICs.
And your point being? :D

-sam

Thanks
Deepak

Sent from my iPhone

On Nov 4, 2015, at 11:29 AM, Sam Aldrin <[email protected]> wrote:

I expressed the same concern at the last IETF meeting as Shahram raised here.
I haven’t gotten an explanation yet.

If the TTL expiry mechanism is used, then IP TTL handling will have to be 
redefined in order to make a copy and forward to the next hop.
But if L3 devices have to read into the VXLAN header to determine whether the 
OAM bit is set, they need to implement DPI for that.

Secondly, imagine when there exists a loop. In fact, they do exist even in 
controller based networks.

Speaking as an operator, as mentioned yesterday, this will cause packet storm 
and unintended consequences.

Why are we solving the problem when it doesn’t exist?

-sam

On Nov 3, 2015, at 6:02 PM, Shahram Davari <[email protected]> wrote:

I think your assumption is broken. But you have an alternative method and that 
is using TTL expiry.

Thx
SD

From: Dacheng Zhang [mailto:[email protected]]
Sent: Tuesday, November 03, 2015 5:53 PM
To: Shahram Davari; [email protected]
Subject: Re: [nvo3] draft-pang-nvo3-vxlan-path-detection-01

This draft actually proposes a mechanism where the intermediate nodes are 
required to recognize the VXLAN OAM packets. If this assumption is broken, the 
solutions proposed in this draft may not be effective.

Cheers

Dacheng

From: nvo3 <[email protected]> on behalf of Shahram Davari <[email protected]>
Date: Wednesday, November 4, 2015 at 9:33 AM
To: "[email protected]" <[email protected]>
Subject: [nvo3] draft-pang-nvo3-vxlan-path-detection-01

Hi,

This draft needs to address how intermediate L3 routers are going to see these 
VXLAN OAM packets, since L3 routers just do L3 routing and don’t look at the 
payload to see that it is VXLAN and then that these are PD OAM packets. The only 
option I can think of is TTL expiry; otherwise it won’t work the way it is 
defined now.

Thx
Shahram
_______________________________________________
nvo3 mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/nvo3