andrijapanicsb commented on issue #3721: network: cleanup dhcp/dns entries 
while remove a nic from vm
URL: https://github.com/apache/cloudstack/pull/3721#issuecomment-569343955
 
 
   Alright, here we go (apologies for the long one...):
   
   Working fine and in general LGTM, but...
   
   The "garbage" is removed fine when detaching a VM from a network while:
   - VM running, VR running
   - VM stopped, VR running
   
   But some edge cases are there, and I'm not sure if it is possible to address 
them at all (the same or worse happens when a VM is expunged while the VR is 
stopped - see the end of this comment)
   
   **Issues when VM stopped, VR stopped, VM detached from the network. VR 
started - garbage is left inside the VR.**
   Steps to reproduce:
   - Have a VM attached to an additional (shared) network. 
   - Start VM - DHCP/DNS stuff provisioned fine.
   - Stop VM (nothing deleted from VR - fine!), stop the VR.
   - Detach the VM from that additional network while its VR is stopped - the 
VR can't be contacted so the "garbage" can't be removed.
   
   Now do a) or b):
   - a) Start VR - all the "garbage" is there (recreate.systemvm.enabled=false 
- default behaviour)
   - a) Start the VM, attach the VM again to the same network - the VM's record 
is added/updated in /etc/dhcphosts.txt, while the record in /etc/hosts is 
deleted (and no new one is added!); the old/garbage lease is also deleted from 
/var/lib/misc/dnsmasq.leases
   - b) Start VR - all the "garbage" is there (recreate.systemvm.enabled=false)
   - b) Attach the VM again to the same network, Start the VM - duplicate VM 
rows (different IP/MAC) in all 3 files (new records provisioned, old one not 
removed). DNS resolution is broken due to duplicate records in /etc/hosts.
   - b) Detach the network from the VM again - only the new records are cleaned 
up.
   
   I'm not sure how/if this can be fixed.
   The proper workaround is to restart the network with the cleanup.
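
   To spot the duplicate rows from case b) before restarting the network, a 
small check along these lines could be run against a copy of the VR's 
/etc/hosts. This is only a sketch: the function name and the sample 
hostnames/IPs are made up for illustration and are not part of this PR.

```python
from collections import defaultdict


def find_duplicate_hosts(hosts_text):
    """Return {hostname: [ip, ...]} for names mapped to more than one IP."""
    ips_by_name = defaultdict(list)
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            if ip not in ips_by_name[name]:
                ips_by_name[name].append(ip)
    return {name: ips for name, ips in ips_by_name.items() if len(ips) > 1}


# Hypothetical /etc/hosts content after case b): old and new record for vm1.
sample = """\
127.0.0.1 localhost
10.1.1.15 vm1
10.1.1.27 vm1
10.1.1.30 vm2
"""

print(find_duplicate_hosts(sample))  # {'vm1': ['10.1.1.15', '10.1.1.27']}
```

   The same idea would apply to /etc/dhcphosts.txt and the dnsmasq lease file, 
though their column layouts differ and would need their own parsing.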
   
   **Additionally, if VM is expunged while the VR is stopped, later starting 
the VR (recreate.systemvm.enabled=false) will result in garbage being left and 
never removed.**
   
   I believe both issues (detaching a NIC or expunging a VM while the VR is 
stopped) are somewhat edge cases, but they can happen in e.g. the following 
scenario:
   - Single VM in the network.
   - Delete the VM, withOUT expunging it (the default behaviour for a regular 
user).
   - With the default values of 1 day for expunge.delay and expunge.interval, 
the VR will be stopped after (network.gc.wait/network.gc.interval) 600-1200 
seconds, so when the VM is expunged later, after 86400+ seconds, the VR is 
already down. This leaves garbage in the VR if it's started again, e.g. 2 days 
after the VM was deleted. This can happen in some test scenarios and other very 
small environments.
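
   The timing above can be put into numbers. A rough sketch, using only the 
values quoted in this comment (the variable names are mine, and actual values 
depend on your global settings):

```python
# Values quoted in the comment above; check your own global settings.
EXPUNGE_DELAY = 86400        # expunge.delay in seconds (1 day)
GC_STOP_WINDOW = (600, 1200)  # VR stopped 600-1200 s after the VM is deleted

# Earliest moment the deleted VM can be expunged, relative to deletion time.
earliest_expunge = EXPUNGE_DELAY

# Latest moment the network GC stops the VR, relative to deletion time.
latest_vr_stop = GC_STOP_WINDOW[1]

# The VR has already been down this long when the expunge runs, so the
# DHCP/DNS cleanup cannot reach it.
gap_seconds = earliest_expunge - latest_vr_stop
vr_down_at_expunge = latest_vr_stop < earliest_expunge

print(gap_seconds)         # 85200
print(vr_down_at_expunge)  # True
```

   So even in the best case the VR has been stopped for over 23 hours by the 
time the expunge tries to clean up its entries.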
   
   @weizhouapache @rhtyd @PaulAngus @wido @nvazquez @DaanHoogland @onitake 
@GabrielBrascher @nathanejohnson @kiwiflyer (pinging you on the below **only** 
- no need to read above unless you are interested in that specifically):
   I'm wondering if it would make sense to make 
"recreate.systemvm.enabled=**TRUE**" the default value in 4.14 and onwards, 
since:
   - it doesn't leave behind the garbage that the VR's old ROOT disk had, since 
a new ROOT disk is created
     - no orphaned DHCP/DNS configuration data (issues explained in this 
comment)
     - no potential log garbage (not that much of the issue afaik)
     - /var/cache/cloud/processed files can be huge on old VRs
   - reusing the existing system VM (the current default) doesn't bring any 
usable (day-to-day) benefit for the cloud operator/normal user
     - CPVM and SSVM are not restarted daily, and the cloud operator can wait 2 
minutes more to configure CPVM/SSVM from scratch vs. booting an 
existing/configured VM
     - VRs are also not restarted daily. A VR is stopped either manually (you 
planned some actions, so you can wait 2 more minutes?) or automatically by 
network.gc when there are no Running VMs in the network; in that case, too, one 
can wait 2 more minutes for a clean VR when starting the first VM in that 
existing network.
   
   Having "recreate.systemvm.enabled=**FALSE**" by default (current behaviour):
   - keeps all the garbage as explained above
   
   Opinions?

