Just because millions of applications misuse a simplistic protocol in a way it was never designed to handle doesn’t make it a good idea. Not to mention the total lack of security.
@Xiaohu: how would you distinguish a gratuitous ARP sent from the hypervisor to indicate a VM move from a gratuitous ARP sent by a VM with a misconfigured IP address, or a malicious gratuitous ARP sent by an intruder (physical or virtual)? Unless you can totally control the VM attachment point (= the hypervisor switch, unless you're using something like 802.1BR) you cannot trust ARP ... but then if you do control the hypervisor switch, you don't need ARP.

Ivan

From: [email protected] [mailto:[email protected]] On Behalf Of Linda Dunbar
Sent: Monday, July 23, 2012 5:59 PM
To: Xuxiaohu; [email protected]; [email protected]
Cc: [email protected]
Subject: Re: [nvo3] Comments on draft-kompella-nvo3-server2nve

Millions of applications already deployed use ARP to signal their presence. The widely deployed vMotion makes VMs in a new location send an ARP (RARP) to inform the network of their new location. It doesn't hurt to utilize the messages that applications already send.

My two cents,
Linda Dunbar

From: [email protected] [mailto:[email protected]] On Behalf Of Xuxiaohu
Sent: Wednesday, July 18, 2012 9:48 PM
To: [email protected]; [email protected]
Cc: [email protected]
Subject: Re: [nvo3] Comments on draft-kompella-nvo3-server2nve

Does that mean ARP could also be considered as an option for signaling the VM attachment/detachment event? For example, a gratuitous ARP packet can be inferred as an attachment event by the NVE which receives such a packet via the NVE-TES interface. Meanwhile, for those L2VPN (e.g., VPLS) or L3VPN overlay approaches which only allow one next hop to be available for a given MAC route or host route in the forwarding table, a gratuitous ARP packet received from a remote NVE could be inferred as a detachment event by the NVE to which the ARP-sending VM was previously attached.
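The inference described above could be sketched roughly as follows. This is purely illustrative (class and method names are invented, and the draft defines no such mechanism); note also Ivan's objection that a gratuitous ARP alone cannot be authenticated:

```python
# Hypothetical sketch: an NVE treating gratuitous ARPs as attach/detach
# hints. All names are illustrative, not from the draft.

class Nve:
    def __init__(self, name):
        self.name = name
        self.local_vms = {}  # mac -> port, for VMs seen on the NVE-TES interface

    def on_local_garp(self, mac, port):
        # Gratuitous ARP arriving on a local (NVE-TES) interface:
        # inferred as an attachment event.
        self.local_vms[mac] = port
        return ("attach", mac)

    def on_remote_garp(self, mac, remote_nve):
        # Gratuitous ARP learned from a remote NVE: if only one next hop
        # may exist per MAC/host route (VPLS- or L3VPN-style), a VM we
        # thought was local must have moved -- inferred as detachment.
        if mac in self.local_vms:
            del self.local_vms[mac]
            return ("detach", mac, remote_nve)
        return None
```

The sketch deliberately ignores the trust problem Ivan raises: nothing here distinguishes a hypervisor-generated gratuitous ARP from a misconfigured or malicious one.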
Moreover, in case a gratuitous ARP packet triggers the NVE which received that packet via the NVE-TES interface to generate a MAC route or a host route for the ARP-sending VM, the NVE to which that VM was previously attached could, upon receiving that route, also infer it as a detachment event for that VM.

Best regards,
Xiaohu

<skipped>

I have a related consideration based on thinking about this further. The network SHOULD NOT rely on dissociate messages always being sent - a server crash at the wrong point during a VM migration may cause a dissociate to be missed (e.g., the VM made it to S', but S crashed before sending the dissociate). More importantly, not relying on the dissociate messages (in particular, not having the inter-NVE control protocol rely on them) helps if one wants to mix hypervisors that support the attach/detach protocol with (existing) ones that don't. For existing hypervisors, under suitable restrictions and assuming some advance configuration, "associate" can be inferred from a gratuitous ARP or RARP, but nothing is sent for dissociate. The inference of "associate" won't be possible if things have not been set up to enable the gratuitous ARP or RARP.

Thanks,
--David

From: [email protected] [mailto:[email protected]] On Behalf Of Kireeti Kompella
Sent: Saturday, July 14, 2012 8:16 PM
To: Black, David
Cc: [email protected]
Subject: Re: [nvo3] Comments on draft-kompella-nvo3-server2nve

Hi David,

Thanks for your detailed comments! More inline.

On Fri, Jul 13, 2012 at 1:01 PM, <[email protected]> wrote:

Authors (Kireeti, Yakov and Thomas),

This is a good draft - it looks like a good foundation to focus discussion around what the server-to-NVE (attach/detach) protocol needs to do. I like a lot of the contents - I have a few high-level comments and some more detailed feedback.

Thanks!
(1) This draft starts out dealing with the attach/detach (server-to-NVE) protocol and then includes some material on the control protocol for distributing and managing mapping information on the NVEs. I suggest focusing the draft on the attach/detach protocol, removing control protocol discussion (e.g., Section 3), and minimizing assumptions about the control protocol (see detailed comments for where I think assumptions could be minimized). The result should be more general and more useful.

About the control plane: it really concerns me that the control plane discussion has not happened so far (not really). ARP doesn't scale; neither does flooding. The goal here is to signal networking parameters: from server (vswitch) to local NVE to remote NVEs to remote servers. Fine, call the local-NVE-to-remote-NVEs part the "control plane" -- but that's a critical part of the picture. What I take from your suggestion is to move the lNVE-to-rNVE part to a different draft; I buy that, especially if there are other mechanisms for doing this that can plug in to this server2nve signaling, so that one can mix and match server2nve signaling and lNVE2rNVE signaling. Does that seem reasonable?

(2) Section 2.2.3 on detach is trying to cover at least a couple of use cases, VM live migration and VM removal (e.g., power-off), that probably want to be separated. The current text really doesn't get the live migration case right, D.4 comes before D.3 for the power-off case, and I think things get more complex when the live migration detach functionality is corrected. More on this below.

(3) Section 2.2.4 appears to assume a specific order of events between the two servers involved in VM migration. As those servers are operating concurrently, that's not a robust assumption, and the NVE functionality should be specified to not depend on the order of events.

Ordering assumptions weren't intended, so we'll tweak the wording to remove any such implications.
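One conventional way to get the order-independence asked for in (3) is to sequence address advertisements, so that processing events in either order converges to the same mapping. A minimal sketch, with invented names and a simple monotone sequence number standing in for whatever the control protocol would actually carry:

```python
# Hypothetical sketch of order-independent mapping updates: keep only the
# highest-sequence advertisement per address, so a late dissociate from S
# cannot undo a newer associate from S'. Names are illustrative only.

class AddressMap:
    def __init__(self):
        self.best = {}  # address -> (seq, nve or None)

    def update(self, addr, seq, nve):
        # Ignore stale events (reordered or replayed advertisements).
        cur = self.best.get(addr)
        if cur is None or seq > cur[0]:
            self.best[addr] = (seq, nve)

# Associate at S' (seq 2) and dissociate at S (seq 1), applied in both orders:
events = [("vm1", 2, "NVE-S'"), ("vm1", 1, None)]
m1, m2 = AddressMap(), AddressMap()
for e in events:
    m1.update(*e)
for e in reversed(events):
    m2.update(*e)
assert m1.best == m2.best == {"vm1": (2, "NVE-S'")}
```

The point is only that some tie-breaking rule is needed; the actual mechanism belongs to the inter-NVE control protocol, not the server-to-NVE signaling.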
--- Detailed comments by section ---

A pre-disassociate operation is defined in section 2.2.1 but not used in the rest of the draft. Is it actually needed?

Good catch! I'd put that in early, worked out the rest of the details, couldn't figure out a use for it, but forgot to remove it. I'll remove it.

-- Section 2.2.2

A.1: Validate the authentication (if present). If not, inform the provisioning system, log the error, and stop processing the associate message.

This step should also include an optional authorization check, as network policy may limit which NVEs are allowed to participate in which VNs.

Okay. Authorization locally, or from the provisioning system? (Or either?)

A.3: If the VID in the associate message is non-zero, look up <VNID, P>. If the result is zero, or equal to VID, all's well. Otherwise, respond to S with an error, and stop processing the associate message.

Why is a zero VID lookup result ok for a non-zero VID in the associate message?

It just means there is no mapping yet. With respect to the refcounting suggested below, this is a good place to set the count to 1; otherwise increment it.

Should the NVE copy the VID from the associate message to the <VNID,P> entry before responding?

Good point. Will fix.

A.5: Communicate with each rNVE device to advertise the VM's addresses, and also to get the addresses of other VMs in the DCVPN. Populate the table with the VM's addresses and addresses learned from each rNVE.

This assumes that the control protocol does active propagation of all address info, and assumes that no other addresses for the VN are present in the NVE. Neither of those is a good general assumption, IMHO, and in particular, lazy evaluation is possible (e.g., load address mappings on demand to reduce the amount of invalidation traffic caused by each mapping change).

I'm leery of on-demand/cache-based address mappings and lazy evaluation (love it in general, but not for address mappings). However, you're right: there may be cases where this is a valid approach.
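Pulling the A.1-A.5 discussion together, the associate path with the suggested additions (an optional authorization check in A.1 and a reference count on the <VNID,P> -> VID mapping in A.3) could look something like the sketch below. All class and method names are invented for illustration; the draft specifies only the abstract steps:

```python
# Hypothetical sketch of associate processing per steps A.1-A.5, folding in
# the suggested authorization check (A.1) and <VNID,P> refcount (A.3).

class AssocError(Exception):
    pass

class Nve:
    def __init__(self, authorized_vnids):
        self.authorized_vnids = authorized_vnids
        self.vid_map = {}      # (vnid, port) -> [vid, refcount]
        self.vnid_tables = {}  # vnid -> {address: nexthop}

    def associate(self, vnid, port, vid, addresses, auth_ok=True):
        # A.1: authentication, plus the suggested authorization check.
        if not auth_ok or vnid not in self.authorized_vnids:
            raise AssocError("authentication/authorization failed")
        # A.3: zero lookup result just means no mapping yet (refcount -> 1);
        # a matching VID increments; a conflicting VID is an error.
        entry = self.vid_map.get((vnid, port))
        if vid != 0:
            if entry is None:
                self.vid_map[(vnid, port)] = [vid, 1]
            elif entry[0] in (0, vid):
                entry[0] = vid
                entry[1] += 1
            else:
                raise AssocError("VID conflict on <VNID,P>")
        # A.5 (local part): populate the VNID table with the VM's addresses.
        table = self.vnid_tables.setdefault(vnid, {})
        for addr in addresses:
            table[addr] = port
        # A.5 (network part): informing the overlay control protocol of the
        # VM's addresses and association is elided here.
```

Steps A.2 and A.4 are omitted; the refcount is the piece that D.3 below depends on.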
I'd suggest rephrasing to something like:

A.5: Use the overlay control protocol to inform the network of the VM's addresses and the VM's association with this NVE.

Something like that. Will work on text.

-- Section 2.2.3

D.1: Validate the authentication (if present). If not, inform the provisioning system, log the error, and stop processing the associate message.

Like A.1, this should include an optional authorization check, as some <VNID,P> -> VID mappings may be statically configured and hence not permit removal.

Okay, will copy wording from there once we've agreed on it.

D.2: If the hold time is non-zero, point the VM's addresses in the VNID table to the new location of the VM, if known, or to "discard", and start a timer for the period of the hold time. If hold time is zero, immediately perform step D.4, then go to D.3.

This is where the power-off and migration cases start to interact - the hold time would be zero for power-off, non-zero for migration. For migration, this change potentially races with a change to the VM's addresses received via the control protocol, so the VM's addresses may already point somewhere else if the control protocol did its update before the dissociate (in which case nothing should be done to those addresses).

Definitely worth looking at again, especially with respect to your comments about the order for migration. With regard to the race condition, I'll send a separate email on that.

D.3: Set the VID for <VNID, P> as unassigned. Respond to S saying that the operation was successful.

If there are multiple VMs using the VNID on that port, this "pulls the rug" out from under the others by disabling their forwarding. This <VNID,P> -> VID mapping needs a reference count of some form, and corresponding changes would be needed to A.2 and A.3. Not using a reference count may be ok under the assumption that the NVE does not share ports among VMs (or VSIs/vNICs), but that may not be a good assumption for an external NVE (e.g., in a ToR switch).
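For symmetry with the associate sketch, the dissociate path per D.2-D.4 with the refcount fix for D.3 might look as follows. Again, all names are invented; a real NVE would arm an actual timer for the hold time, whereas here expiry is invoked explicitly:

```python
# Hypothetical sketch of dissociate processing per D.2-D.4, with a refcount
# on <VNID,P> -> VID so that removing one VM does not disable forwarding for
# other VMs sharing the port (the D.3 concern).

DISCARD = object()  # stands in for a "discard" next hop

class Nve:
    def __init__(self):
        self.vid_map = {}      # (vnid, port) -> [vid, refcount]
        self.vnid_tables = {}  # vnid -> {address: nexthop}

    def dissociate(self, vnid, port, addresses, hold_time, new_loc=None):
        table = self.vnid_tables[vnid]
        if hold_time > 0:
            # D.2: redirect the VM's addresses to its new location (if
            # known) or to "discard" for the duration of the hold time.
            for addr in addresses:
                table[addr] = new_loc if new_loc is not None else DISCARD
        else:
            # Hold time zero (power-off): perform D.4 immediately.
            self.hold_expired(vnid, addresses)
        # D.3 with refcount: only the last VM on <VNID,P> unassigns the VID.
        entry = self.vid_map[(vnid, port)]
        entry[1] -= 1
        if entry[1] == 0:
            del self.vid_map[(vnid, port)]

    def hold_expired(self, vnid, addresses):
        # D.4: delete the VM's addresses; optionally delete an empty table.
        table = self.vnid_tables[vnid]
        for addr in addresses:
            table.pop(addr, None)
        if not table:
            del self.vnid_tables[vnid]
```

As the thread notes, this is only right for power-off; the migration case additionally has to tolerate the control protocol having already repointed the addresses before the dissociate arrives.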
Good point! I'll go with refcounting.

D.4: When the hold timer expires, delete the VM's addresses from the VNID table. Delete any VM-specific network policies associated with any of the VM's addresses. If the VNID table is empty after deleting the VM's addresses, optionally delete the table and any network policies for the VNID.

Well, that's the right thing to do in the power-off case, but not when the VM has moved and there are other VMs on this NVE (possibly even the same port) that still need to communicate with the moved VM. Also, the power-off case needs to include (at least optionally) informing the control protocol of the withdrawal of the VM's addresses.

See separate email.

As noted in (2) above, I think it would be clearer if there were separate versions of 2.2.3 for the migration-departure and power-down use cases.

Perhaps. Let's get the semantics right first, then see if there are common elements or not.

-- Section 2.2.4

M.3: S then gets a request to terminate the VM on S.
M.4: Finally, S' gets a request to start up the VM on S'.

Not exactly ;-). Terminating the VM on S (and destroying its state) before confirming its startup on S' risks losing the VM entirely if something goes wrong on S'.

Interesting point. However, if the VM starts on S' without first being stopped on S, then (for some time) both S and S' are running it, and I'd think the results would be unpredictable, especially if the VM is just about to engage in some I/O. However, I'll bow to those who've implemented VM migration and know what they're doing. Perhaps the VM is paused on S and started on S'; if that's successful, the VM is destroyed on S, otherwise the migration is aborted and the VM is continued on S. I'd like to know, as this affects the "tentative address changes" you talk about below, and dealing with migration abort.

This level of detail isn't necessary - from the point of view of the network:

- Startup on S' generates an associate request to the NVE for S'.
- The dissociate request from S to its NVE may occur before or after that S' associate request.
- The dissociate request from S to its NVE may occur before or after control protocol propagation of the results of the S' associate request to the NVE for S.

The server-to-NVE functionality should be specified to operate properly independent of the order of these events.

Agreed. Separate email thread to work this out.

PA.5: Communicate with each rNVE device to advertise the VM's addresses, but as non-preferred destinations(*). Also get the addresses of other VMs in the DCVPN. Populate the table with the VM's addresses and addresses learned from each rNVE.

That assumes aggressive push of the new address information by the control protocol directly to the rNVEs - while a control protocol may choose to do that, it's not strictly necessary, and the interaction may not be directly between the lNVE and the rNVEs. Generalizing in a fashion similar to A.5, I'd suggest something like:

PA.5: The overlay control protocol may be used to inform the network of the forthcoming change to the VM's addresses that will occur when the VM is associated with this NVE.

Okay, something like that.

If this is done, withdrawal of the tentative address changes needs to be discussed, as VM migrations can abort for a variety of reasons (e.g., S' may crash during the copy). This PA.5 step can be skipped for a control protocol that only does on-demand provisioning of the address mapping information.

Interesting thought. Will follow up once we get the migration "right" (for some value of right).

-- Section 3

This appears to be entirely about the control protocol and (IMHO) doesn't fit well with the rest of the draft.

Will discuss putting this in a separate draft with co-authors.

Thanks again for the detailed comments!
Kireeti.

Thanks,
--David

----------------------------------------------------
David L. Black, Distinguished Engineer
EMC Corporation, 176 South St., Hopkinton, MA 01748
+1 (508) 293-7953  FAX: +1 (508) 293-7786
[email protected]  Mobile: +1 (978) 394-7754
----------------------------------------------------

_______________________________________________
nvo3 mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/nvo3

--
Kireeti
