Hi David,

Thanks for your detailed comments! More inline.
On Fri, Jul 13, 2012 at 1:01 PM, <[email protected]> wrote:

> Authors (Kireeti, Yakov and Thomas),
>
> This is a good draft - it looks like a good foundation to focus discussion
> around what the server-to-NVE (attach/detach) protocol needs to do. I
> like a lot of the contents - I have a few high level comments and some
> more detailed feedback.

Thanks!

> (1) This draft starts out dealing with the attach/detach (server-to-NVE)
> protocol and then includes some material on the control protocol for
> distributing and managing mapping information on the NVEs. I suggest
> focusing the draft on the attach/detach protocol, removing control
> protocol discussion (e.g., Section 3), and minimizing assumptions about
> the control protocol (see detailed comments for where I think assumptions
> could be minimized). The result should be more general and more useful.

About the control plane: it really concerns me that the control plane
discussion hasn't happened so far (not really). ARP doesn't scale;
neither does flooding. The goal here is to signal networking parameters:
from server (vswitch) to local NVE, to remote NVEs, to remote servers.
Fine, call the local-NVE-to-remote-NVEs part the "control plane" -- but
that's a critical part of the picture.

What I take from your suggestion is to move the lNVE-to-rNVE part to a
different draft; I buy that, especially if there are other mechanisms
for doing this that can plug in to this server-to-NVE signaling, so that
one can mix and match server-to-NVE signaling and lNVE-to-rNVE
signaling. Does that seem reasonable?

> (2) Section 2.2.3 on detach is trying to cover at least a couple of use
> cases, VM live migration and VM removal (e.g., power-off), that probably
> want to be separated. The current text really doesn't get the live
> migration case right, D.4 comes before D.3 for the power-off case, and I
> think things get more complex when the live migration detach
> functionality is corrected. More on this below.
> (3) Section 2.2.4 appears to assume a specific order of events between
> the two servers involved in VM migration. As those servers are operating
> concurrently, that's not a robust assumption, and the NVE functionality
> should be specified to not depend on the order of events.

Ordering assumptions weren't intended, so we'll tweak the wording to
remove any such implications.

> --- Detailed comments by section ---
>
> A pre-disassociate operation is defined in section 2.2.1 but not used
> in the rest of the draft. Is it actually needed?

Good catch! I'd put that in early, worked out the rest of the details,
couldn't figure out a use for it, but forgot to remove it. I'll remove
it.

> -- Section 2.2.2
>
> A.1: Validate the authentication (if present). If not, inform the
>      provisioning system, log the error, and stop processing the
>      associate message.
>
> This step should also include an optional authorization check, as network
> policy may limit which NVEs are allowed to participate in which VNs.

Okay. Authorization locally, or from the provisioning system? (Or
either?)

> A.3: If the VID in the associate message is non-zero, look up <VNID,
>      P>. If the result is zero, or equal to VID, all's well.
>      Otherwise, respond to S with an error, and stop processing the
>      associate message.
>
> Why is a zero VID lookup result ok for a non-zero VID in the associate
> message?

A zero result just means there's no mapping yet. With respect to the
refcounting suggested below, this is a good place to set the refcount to
1; otherwise, increment it.

> Should the NVE copy the VID from the associate message to the
> <VNID,P> entry before responding?

Good point. Will fix.

> A.5: Communicate with each rNVE device to advertise the VM's
>      addresses, and also to get the addresses of other VMs in the
>      DCVPN. Populate the table with the VM's addresses and
>      addresses learned from each rNVE.
> This assumes that the control protocol does active propagation of all
> address info, and assumes that no other addresses for the VN are present
> in the NVE. Neither of those are good general assumptions, IMHO, and
> in particular, lazy evaluation is possible (e.g., load address mappings
> on demand to reduce the amount of invalidation traffic caused by
> each mapping change).

I'm leery of on-demand/cache-based address mappings and lazy evaluation
(love it in general, but not for address mappings). However, you're
right: there may be cases where this is a valid approach.

> I'd suggest rephrasing to something like:
>
> A.5: Use the overlay control protocol to inform the network of the
>      VM's addresses and the VM's association with this NVE.

Something like that. Will work on text.

> -- Section 2.2.3
>
> D.1: Validate the authentication (if present). If not, inform the
>      provisioning system, log the error, and stop processing the
>      associate message.
>
> Like A.1, this should include an optional authorization check, as some
> <VNID,P> -> VID mappings may be statically configured and hence not
> permit removal.

Okay, will copy the wording from A.1 once we've agreed on it.

> D.2: If the hold time is non-zero, point the VM's addresses in the
>      VNID table to the new location of the VM, if known, or to
>      "discard", and start a timer for the period of the hold time.
>      If hold time is zero, immediately perform step D.4, then go to
>      D.3.
>
> This is where the power-off and migration cases start to interact -
> hold time would be zero for power-off, non-zero for migration. For
> migration, this change potentially races with a change to the VM's
> addresses received via the control protocol, so the VM's address may
> already point somewhere else if the control protocol did its update
> before the dissociate (in which case nothing should be done to those
> addresses).

Definitely worth looking at again, especially with respect to your
comments about the order for migration.
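To pin down the D.2 semantics while we sort out the migration ordering, here's a rough sketch of the hold-time handling as I currently read it (Python, purely illustrative; the class and parameter names are mine, not the draft's, and the control-protocol race above isn't handled here):

```python
import threading

DISCARD = object()  # sentinel: traffic to this address is dropped

class VnidTable:
    """Per-VNID forwarding state on the local NVE (illustrative only)."""

    def __init__(self):
        self.addr_map = {}  # VM address -> forwarding destination
        self.timers = {}    # per-detach hold timers

    def dissociate(self, vm_addrs, hold_time, new_location=None):
        """D.2: redirect-or-discard during the hold time; D.4 on expiry.
        A hold time of zero is the power-off case: expire immediately."""
        if hold_time == 0:
            self._expire(vm_addrs)
            return
        for addr in vm_addrs:
            # Point at the VM's new location if known, else discard.
            self.addr_map[addr] = new_location if new_location else DISCARD
        timer = threading.Timer(hold_time, self._expire, args=(vm_addrs,))
        self.timers[tuple(vm_addrs)] = timer
        timer.start()

    def _expire(self, vm_addrs):
        # D.4: delete the VM's addresses (deleting any VM-specific
        # network policies is omitted from this sketch).
        for addr in vm_addrs:
            self.addr_map.pop(addr, None)
```

For power-off (hold time zero) the addresses go away immediately; for migration they're redirected or discarded until the timer fires.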
With regard to the race condition, I'll send a separate email on that.

> D.3: Set the VID for <VNID, P> as unassigned. Respond to S saying
>      that the operation was successful.
>
> If there are multiple VMs using the VNID on that port, this
> "pulls the rug" out from under the others by disabling their forwarding.
> This <VNID,P> -> VID mapping needs a reference count of some form, and
> corresponding changes would be needed to A.2 and A.3. Not using a
> reference count may be ok under the assumption that the NVE does not
> share ports among VMs (or VSIs/vNICs), but that may not be a good
> assumption for an external NVE (e.g., in a ToR switch).

Good point! I'll go with refcounting.

> D.4: When the hold timer expires, delete the VM's addresses from the
>      VNID table. Delete any VM-specific network policies associated
>      with any of the VM addresses. If the VNID table is empty after
>      deleting the VM's addresses, optionally delete the table and
>      any network policies for the VNID.
>
> Well, that's the right thing to do in the power-off case, but not
> when the VM has moved and there are other VMs on this NVE (possibly even
> the same port) that still need to communicate with the moved VM. Also,
> the power-off case needs to include (at least optionally) informing the
> control protocol of the withdrawal of the VM's addresses.

See separate email.

> As noted in (2) above, I think it would be clearer if there were separate
> versions of 2.2.3 for the migration departure and power-down use cases.

Perhaps. Let's get the semantics right first, then see whether there are
common elements or not.

> -- Section 2.2.4
>
> M.3: S then gets a request to terminate the VM on S.
>
> M.4: Finally, S' gets a request to start up the VM on S'.
>
> Not exactly ;-).
>
> Terminating the VM on S (and destroying its state) before confirming
> its startup on S' risks losing the VM entirely if something goes wrong
> on S'.

Interesting point.
However, if the VM starts on S' without first being stopped on S, then
(for some time) both S and S' are running the VM, and I'd think the
results would be unpredictable, especially if the VM is just about to
engage in some I/O. That said, I'll bow to those who've implemented VM
migration and know what they're doing. Perhaps the VM is paused on S and
started on S'; if that's successful, the VM is destroyed on S;
otherwise, the migration is aborted and the VM is resumed on S. I'd like
to know, as this affects the "tentative address changes" you talk about
below, and how migration abort is handled.

> This level of detail isn't necessary - from the point of view
> of the network:
>
> - Startup on S' generates an associate request to the NVE for S'.
> - The dissociate request from S to its NVE may occur before or after
>   that S' associate request.
> - The dissociate request from S to its NVE may occur before or after
>   control protocol propagation of the results of the S' associate
>   request to the NVE for S.
>
> The server-to-NVE functionality should be specified to operate properly
> independent of the order of these events.

Agreed. Separate email thread to work this out.

> PA.5: Communicate with each rNVE device to advertise the VM's
>       addresses but as non-preferred destinations(*). Also get the
>       addresses of other VMs in the DCVPN. Populate the table with the
>       VM's addresses and addresses learned from each rNVE.
>
> That assumes aggressive push of the new address information by the
> control protocol directly to the rNVEs - while a control protocol
> may choose to do that, it's not strictly necessary and the interaction
> may not be directly between the lNVE and the rNVEs. Generalizing in
> a fashion similar to A.5, I'd suggest something like:
>
> PA.5: The overlay control protocol may be used to inform the
>       network of the forthcoming change to the VM's addresses
>       that will occur when the VM is associated with this NVE.

Okay, something like that.
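On order-independence: one way to convince ourselves it's achievable is to make the NVE's address-table updates "latest version wins". A toy sketch (the version number is a hypothetical control-plane mechanism, not something the draft defines, and the names are mine):

```python
class AddrTable:
    """Order-independent address updates (toy sketch). Each update
    carries a version number -- a hypothetical control-plane mechanism,
    not something the draft defines -- so a late-arriving dissociate
    can't clobber a newer associate for the same address."""

    def __init__(self):
        self.entries = {}  # vm_addr -> (version, location or None)

    def apply(self, vm_addr, version, location):
        """location=None means the address is withdrawn."""
        current = self.entries.get(vm_addr)
        if current is not None and current[0] >= version:
            return False  # stale out-of-order update: ignore it
        self.entries[vm_addr] = (version, location)
        return True

    def lookup(self, vm_addr):
        entry = self.entries.get(vm_addr)
        return entry[1] if entry else None
```

With something like this, the associate from S' and the dissociate from S can arrive at an NVE in either order and the table converges to the same state.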
> If this is done, withdrawal of the tentative address changes
> needs to be discussed, as VM migrations can abort for a variety
> of reasons (e.g., S' may crash during the copy). This PA.5
> step can be skipped for a control protocol that only does on-demand
> provisioning of the address mapping information.

Interesting thought. Will follow up once we get the migration "right"
(for some value of right).

> -- Section 3
>
> This appears to be entirely about the control protocol and (IMHO)
> doesn't fit well with the rest of the draft.

Will discuss putting this in a separate draft with my co-authors.

Thanks again for the detailed comments!

Kireeti.

> Thanks,
> --David
>
> ----------------------------------------------------
> David L. Black, Distinguished Engineer
> EMC Corporation, 176 South St., Hopkinton, MA 01748
> +1 (508) 293-7953        FAX: +1 (508) 293-7786
> [email protected]        Mobile: +1 (978) 394-7754
> ----------------------------------------------------
>
> _______________________________________________
> nvo3 mailing list
> [email protected]
> https://www.ietf.org/mailman/listinfo/nvo3

--
Kireeti
_______________________________________________
nvo3 mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/nvo3
