Hi Kireeti,

Your response looks good - we appear to be thinking along similar lines.
I'll add a few comments and wait for the next version of the draft ...

> What I take from your suggestion is to move the lNVE to rNVE part to a
> different draft; I buy that, especially if there are other mechanisms for
> doing this that can plug into this server2nve signaling, so that one can
> mix-and-match server2nve signaling and lNVE2rNVE signaling.  Does that
> seem reasonable?

Yes, that's *exactly* what I had in mind.

> This step should also include an optional authorization check, as network
> policy may limit which NVEs are allowed to participate in which VNs.
>
> Okay.  Authorization locally, or from the provisioning system?  (Or either?)

Either, both and more :-).  An important case is that this may be from the
"network" (e.g., via the control protocol) if the network imposes scope
boundaries on virtual networks.  For example, suppose that a tenant has not
paid the service provider for data-center-wide scope (or multiple-data-center
scope) of her virtual networks - in this case the service provider will impose
restrictions on where her VNs are usable, independent of what she may be
able to convince the provisioning system to ask for.

> Perhaps the VM is paused on S, started on S'; if that's successful, the VM
> is destroyed on S, otherwise the migration is aborted and the VM is
> continued on S.

Something like that - that summary is good enough for this draft, and the
important points (on which I believe we agree) are:
- There is a "paused" or "suspended" state of a VM in addition to the
powered-on and powered-off states.
- The concurrency (lack-of-ordering) requirements on the network-visible
events, which we agree on and which you're going to cover in a separate
message.
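For concreteness, here's a rough Python sketch of those states and transitions (the names and the allowed-transition rules are mine for illustration, not from the draft):

```python
# Illustrative sketch only: the "paused" state sits alongside powered-on
# and powered-off, and migration uses pause -> resume-on-S'.
from enum import Enum

class VMState(Enum):
    POWERED_OFF = "powered-off"
    POWERED_ON = "powered-on"
    PAUSED = "paused"

# Hypothetical allowed transitions; a paused VM can resume (possibly on a
# different server) or be destroyed if the migration completes or aborts.
ALLOWED = {
    VMState.POWERED_OFF: {VMState.POWERED_ON},
    VMState.POWERED_ON: {VMState.PAUSED, VMState.POWERED_OFF},
    VMState.PAUSED: {VMState.POWERED_ON, VMState.POWERED_OFF},
}

def can_transition(src: VMState, dst: VMState) -> bool:
    return dst in ALLOWED[src]
```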

I have a related consideration based on thinking about this further.  The
network SHOULD NOT rely on dissociate messages always being sent - a server
crash at the wrong point during a VM migration may cause a dissociate to be
missed (e.g., the VM made it to S', but S crashed before sending the
dissociate).  More importantly, not relying on the dissociate messages (in
particular, not having the inter-NVE control protocol rely on them) helps if
one wants to mix hypervisors that support the attach/detach protocol with
(existing) ones that don't.  For existing hypervisors, under suitable
restrictions and assuming some advance configuration, "associate" can be
inferred from a gratuitous ARP or RARP, but nothing is sent for dissociate.
The inference of "associate" won't be possible if things have not been set
up to enable the gratuitous ARP or RARP.
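To illustrate the asymmetry, a toy sketch of an NVE inferring "associate" from a gratuitous ARP (the frame representation and all names here are invented for illustration) - note that there is no frame from which to infer "dissociate":

```python
# Toy model: a gratuitous ARP carries matching sender and target protocol
# addresses; an NVE can treat one as an implicit "associate".
def is_gratuitous_arp(frame: dict) -> bool:
    return frame.get("op") in ("request", "reply") and \
           frame.get("sender_ip") == frame.get("target_ip")

class InferringNVE:
    def __init__(self):
        self.assoc = {}  # (vnid, mac) -> port

    def on_frame(self, vnid, port, frame):
        # Only "associate" can be inferred; nothing on the wire signals
        # "dissociate", so stale entries must age out or be withdrawn
        # via the control protocol instead.
        if is_gratuitous_arp(frame):
            self.assoc[(vnid, frame["sender_mac"])] = port
```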

Thanks,
--David

From: [email protected] [mailto:[email protected]] On Behalf Of Kireeti 
Kompella
Sent: Saturday, July 14, 2012 8:16 PM
To: Black, David
Cc: [email protected]
Subject: Re: [nvo3] Comments on draft-kompella-nvo3-server2nve

Hi David,

Thanks for your detailed comments!  More inline.
On Fri, Jul 13, 2012 at 1:01 PM, <[email protected]> wrote:
Authors (Kireeti, Yakov and Thomas),

This is a good draft - it looks like a good foundation to focus discussion
around what the server-to-NVE (attach/detach) protocol needs to do.  I
like a lot of the contents - I have a few high level comments and some
more detailed feedback.

Thanks!

(1) This draft starts out dealing with the attach/detach (server-to-NVE)
protocol and then includes some material on the control protocol for
distributing and managing mapping information on the NVEs.  I suggest
focusing the draft on the attach/detach protocol, removing control
protocol discussion (e.g., Section 3), and minimizing assumptions about
the control protocol (see detailed comments for where I think assumptions
could be minimized).  The result should be more general and more useful.

About the control plane: it really concerns me that the control plane
discussion has not happened so far (not really).  ARP doesn't scale; neither
does flooding.  The goal here is to signal networking parameters: from
server (vswitch) to local NVE to remote NVEs to remote servers.  Fine, call
the local NVE to remote NVEs part "control plane" -- but that's a critical
part of the picture.

What I take from your suggestion is to move the lNVE to rNVE part to a
different draft; I buy that, especially if there are other mechanisms for
doing this that can plug into this server2nve signaling, so that one can
mix-and-match server2nve signaling and lNVE2rNVE signaling.  Does that seem
reasonable?

(2) Section 2.2.3 on detach is trying to cover at least a couple of use
cases - VM live migration and VM removal (e.g., power-off) - that probably
want to be separated.  The current text really doesn't get the live
migration case right, D.4 comes before D.3 for the power-off case, and I
think things get more complex when the live migration detach functionality
is corrected.

More on this below.

(3) Section 2.2.4 appears to assume a specific order of events between
the two servers involved in VM migration.  As those servers are operating
concurrently, that's not a robust assumption, and the NVE functionality
should be specified to not depend on the order of events.

Ordering assumptions weren't intended, so we'll tweak the wording to remove
any such implications.

--- Detailed comments by section ---

A pre-disassociate operation is defined in section 2.2.1 but not used
in the rest of the draft.  Is it actually needed?

Good catch!  I'd put that in early, worked out the rest of the details,
couldn't figure out a use for it, but forgot to remove it.  I'll remove it.

-- Section 2.2.2

   A.1:  Validate the authentication (if present).  If not, inform the
         provisioning system, log the error, and stop processing the
         associate message.

This step should also include an optional authorization check, as network
policy may limit which NVEs are allowed to participate in which VNs.

Okay.  Authorization locally, or from the provisioning system?  (Or either?)

   A.3:  If the VID in the associate message is non-zero, look up <VNID,
         P>.  If the result is zero, or equal to VID, all's well.
         Otherwise, respond to S with an error, and stop processing the
         associate message.

Why is a zero VID lookup result ok for a non-zero VID in the associate
message?

It just means there's no mapping yet.  With respect to the refcounting
suggested below, that's a good place to set the count to 1; otherwise,
increment it.
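A rough sketch of that refcounting for A.3 (illustrative names only, not draft text):

```python
# Refcounted <VNID, P> -> VID table: absent means "no mapping yet" (create
# with refcount 1); a matching VID means another VM on the same port is
# joining (increment); a conflicting VID is an error back to S.
class VidTable:
    def __init__(self):
        self.table = {}  # (vnid, port) -> [vid, refcount]

    def associate(self, vnid, port, vid):
        entry = self.table.get((vnid, port))
        if entry is None:           # no mapping yet: refcount starts at 1
            self.table[(vnid, port)] = [vid, 1]
            return True
        if entry[0] == vid:         # same VID: bump the refcount
            entry[1] += 1
            return True
        return False                # conflicting VID: error, stop processing
```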

Should the NVE copy the VID from the associate message to the
<VNID,P> entry before responding?

Good point.  Will fix.

   A.5:  Communicate with each rNVE device to advertise the VM's
         addresses, and also to get the addresses of other VMs in the
         DCVPN.  Populate the table with the VM's addresses and
         addresses learned from each rNVE.

This assumes that the control protocol does active propagation of all
address info, and assumes that no other addresses for the VN are present
in the NVE.  Neither of those are good general assumptions, IMHO, and
in particular, lazy evaluation is possible (e.g., load address mappings
on demand to reduce the amount of invalidation traffic caused by
each mapping change).

I'm leery of on-demand/cache-based address mappings and lazy evaluation
(love it in general, but not for address mappings).  However, you're right:
there may be cases where this is a valid approach.

I'd suggest rephrasing to something like:

   A.5:  Use the overlay control protocol to inform the network of the
         VM's addresses and the VM's association with this NVE.

Something like that.  Will work on text.

-- Section 2.2.3

   D.1:  Validate the authentication (if present).  If not, inform the
         provisioning system, log the error, and stop processing the
         associate message.

Like A.1, this should include an optional authorization check, as some
<VNID,P> -> VID mappings may be statically configured and hence not
permit removal.

Okay, will copy wording from there once we've agreed on it.

   D.2:  If the hold time is non-zero, point the VM's addresses in the
         VNID table to the new location of the VM, if known, or to
         "discard", and start a timer for the period of the hold time.
         If hold time is zero, immediately perform step D.4, then go to
         D.3.

This is where the power-off and migration cases start to interact -
Hold time would be zero for power-off, non-zero for detach.  For migration,
this change potentially races with a change to the VM's addresses received
via the control protocol, so the VM's address may already point somewhere
else if the control protocol did its update before the dissociate (in
which case nothing should be done to those addresses).

Definitely worth looking at again, especially with respect to your comments
about the order for migration.

With regard to the race condition, I'll send a separate email on that.
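In the meantime, a toy sketch of D.2 with a guard for that race (all names are invented; the real rule belongs in that separate email):

```python
DISCARD = object()  # stand-in for the "discard" forwarding action

def on_dissociate(vnid_table, addrs, hold_time, new_loc, this_nve):
    for a in addrs:
        # Race guard: a control-protocol update may already have repointed
        # this address to the VM's new location; if so, leave it alone.
        if vnid_table.get(a) != this_nve:
            continue
        if hold_time == 0:
            del vnid_table[a]       # perform D.4 immediately, then D.3
        else:
            # Point at the new location if known, else discard, and
            # (not shown) start the hold timer.
            vnid_table[a] = new_loc if new_loc is not None else DISCARD
```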

   D.3:  Set the VID for <VNID, P> as unassigned.  Respond to S saying
         that the operation was successful.

If there are multiple VMs using the VNID on that port, this
"pulls the rug" out from under the others by disabling their forwarding.
This <VNID,P> -> VID mapping needs a reference count of some form, and
corresponding changes would be needed to A.2 and A.3.  Not using a
reference count may be ok under the assumption that the NVE does not
share ports among VMs (or VSIs/vNICs), but that may not be a good
assumption for an external NVE (e.g., in a ToR switch).

Good point!  I'll go with refcounting.
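A companion sketch of the refcounted D.3 (again, names are illustrative):

```python
# D.3 with refcounting: only unassign the VID for <VNID, P> when the last
# VM using it on that port goes away.
class VidTable:
    def __init__(self):
        self.table = {}  # (vnid, port) -> [vid, refcount]

    def dissociate(self, vnid, port):
        entry = self.table.get((vnid, port))
        if entry is None:
            return False                 # nothing to remove
        entry[1] -= 1
        if entry[1] == 0:                # last user: unassign the VID
            del self.table[(vnid, port)]
        return True
```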

   D.4:  When the hold timer expires, delete the VM's addresses from the
         VNID table.  Delete any VM-specific network policies associated
         with any of the VM addresses.  If the VNID table is empty after
         deleting the VM's addresses, optionally delete the table and
         any network policies for the VNID.

Well, that's the right thing to do in the power-off case, but not
when the VM has moved and there are other VMs on this NVE (possibly even
the same port) that still need to communicate with the moved VM.  Also,
the power-off case needs to include (at least optionally) informing the
control protocol of the withdrawal of the VM's addresses.

See separate email.

As noted in (2) above, I think it would be clearer if there were separate
versions of 2.2.3 for the migration departure and power-down use cases.

Perhaps.  Let's get the semantics right first, then see if there are common
elements or not.

-- Section 2.2.4

   M.3:  S then gets a request to terminate the VM on S.

   M.4:  Finally, S' gets a request to start up the VM on S'.

Not exactly ;-).

Terminating the VM on S (and destroying its state) before confirming
its startup on S' risks losing the VM entirely if something goes wrong
on S'.

Interesting point.  However, if the VM starts on S' without first being
stopped on S, then (for some time) both S and S' are running, and I'd think
that the results would be unpredictable, especially if the VM is just about
to engage in some I/O.  However, I'll bow to those who've implemented VM
migration and know what they're doing.  Perhaps the VM is paused on S,
started on S'; if that's successful, the VM is destroyed on S, otherwise the
migration is aborted and the VM is continued on S.  I'd like to know, as
this affects the "tentative address changes" you talk about below, and
dealing with migration abort.

This level of detail isn't necessary - from the point of view
of the network:
- Startup on S' generates an associate request to the NVE for S'.
- The dissociate request from S to its NVE may occur before or after
  that S' associate request.
- The dissociate request from S to its NVE may occur before or after
  control protocol propagation of the results of the S' associate
  request to the NVE for S.
The server-to-NVE functionality should be specified to operate properly
independent of the order of these events.

Agreed.  Separate email thread to work this out.
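One way to make those orders commute is a "newer information wins" rule; a toy Python sketch (the sequence numbers are just a stand-in for whatever ordering information the control protocol actually provides):

```python
# Each event carries an owner (the NVE now hosting the address, or None
# for a withdrawal) and a sequence number; stale events are ignored, so
# the two arrival orders converge to the same mapping.
def apply_event(table, addr, owner, seq):
    cur = table.get(addr)
    if cur is None or seq > cur[1]:   # newer information wins
        table[addr] = (owner, seq)

t1, t2 = {}, {}
apply_event(t1, "m1", "nve-S2", 2)    # S' associate update arrives first
apply_event(t1, "m1", None, 1)        # stale dissociate arrives second: ignored
apply_event(t2, "m1", None, 1)        # dissociate arrives first
apply_event(t2, "m1", "nve-S2", 2)    # associate update arrives second
```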

   PA.5:  Communicate with each rNVE device to advertise the VM's
      addresses but as non-preferred destinations(*).  Also get the
      addresses of other VMs in the DCVPN.  Populate the table with the
      VM's addresses and addresses learned from each rNVE.

That assumes aggressive push of the new address information by the
control protocol directly to the rNVEs - while a control protocol
may choose to do that, it's not strictly necessary and the interaction
may not be directly between the lNVE and the rNVEs.  Generalizing in
a fashion similar to A.5, I'd suggest something like:

   PA.5:  The overlay control protocol may be used to inform the
      network of the forthcoming change to the VM's addresses
      that will occur when the VM is associated with this NVE.

Okay, something like that.

If this is done, withdrawal of the tentative address changes
needs to be discussed, as VM migrations can abort for a variety
of reasons (e.g., S' may crash during the copy).  This PA.5
step can be skipped for a control protocol that only does on-demand
provisioning of the address mapping information.

Interesting thought.  Will follow up once we get the migration "right" (for
some value of right).

-- Section 3

This appears to be entirely about the control protocol and (IMHO)
doesn't fit well with the rest of the draft.

Will discuss putting this in a separate draft with co-authors.

Thanks again for the detailed comments!
Kireeti.


Thanks,
--David
----------------------------------------------------
David L. Black, Distinguished Engineer
EMC Corporation, 176 South St., Hopkinton, MA  01748
+1 (508) 293-7953             FAX: +1 (508) 293-7786
[email protected]<mailto:[email protected]>        Mobile: +1 (978) 
394-7754<tel:%2B1%20%28978%29%20394-7754>
----------------------------------------------------

_______________________________________________
nvo3 mailing list
[email protected]<mailto:[email protected]>
https://www.ietf.org/mailman/listinfo/nvo3



--
Kireeti