phsm commented on issue #8055: URL: https://github.com/apache/cloudstack/issues/8055#issuecomment-1863135079
@weizhouapache An update on this issue:

This bug only occurs on VPCs on large platforms "with history": platforms that have a lot of both live and removed VMs, NICs, etc.

The mechanics of the process:

1. The VPC restart is initiated.
2. The first VR is destroyed.
3. The new VR is created and remains in the "Starting" state for 30+ minutes.
4. The hypervisor node fencing mechanism detects that the VM instance doesn't exist in libvirt.
5. The node notifies the management server about it.
6. The management server abandons the start and puts the VR into the Stopped state.

With the trusty s_logger.debug() I was able to narrow down the VR start procedure to the specific method that introduces such a big delay. It is this method: [getRouterHealthChecksConfig](https://github.com/apache/cloudstack/blob/33e2a4dd6635798f98d4726406ed1af4c00a4cc5/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java#L1780)

More specifically, it's this call inside the method that introduces the delay: [userVmJoinDao.search(scvm, null)](https://github.com/apache/cloudstack/blob/33e2a4dd6635798f98d4726406ed1af4c00a4cc5/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java#L1796C14-L1796C14)

It is executed in a loop for each virtual router NIC. Each call to search() is expensive, so getRouterHealthChecksConfig() takes 10-20 seconds to complete.

But there is more: getRouterHealthChecksConfig() is executed for each VPC tier as part of the [createMonitorServiceCommand()](https://github.com/apache/cloudstack/blob/33e2a4dd6635798f98d4726406ed1af4c00a4cc5/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java#L1640) call during the VR startup process. createMonitorServiceCommand() is in turn part of the [finalizeMonitorService()](https://github.com/apache/cloudstack/blob/33e2a4dd6635798f98d4726406ed1af4c00a4cc5/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java#L2366) method.
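To make the call pattern above concrete, here is a minimal, self-contained Java sketch. It is not CloudStack code: every name in it (`HealthCheckCallPattern`, `expensiveSearch`, `buildHealthChecksConfig`, the tier and NIC names) is a hypothetical stand-in, and the counts are made up for illustration. It only counts how often the expensive search runs when the config is rebuilt for every tier, compared with building it once per router and reusing it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HealthCheckCallPattern {
    // Counts invocations of the (hypothetical) expensive search.
    static int searchCalls = 0;

    // Stand-in for the expensive userVmJoinDao.search(scvm, null) call.
    static List<String> expensiveSearch(String nic) {
        searchCalls++;
        List<String> vms = new ArrayList<>();
        vms.add("vm-on-" + nic);
        return vms;
    }

    // Stand-in for getRouterHealthChecksConfig(): one search per router NIC.
    static List<String> buildHealthChecksConfig(List<String> routerNics) {
        List<String> config = new ArrayList<>();
        for (String nic : routerNics) {
            config.addAll(expensiveSearch(nic));
        }
        return config;
    }

    // Pattern described above: the config is rebuilt for every tier.
    static int searchesWhenRebuiltPerTier(List<String> tiers, List<String> routerNics) {
        searchCalls = 0;
        List<String> commands = new ArrayList<>();
        for (String tier : tiers) {
            List<String> config = buildHealthChecksConfig(routerNics);
            commands.add(tier + ":" + config.size());
        }
        return searchCalls;
    }

    // Alternative: build the config once per router and reuse it for each tier.
    static int searchesWhenBuiltOnce(List<String> tiers, List<String> routerNics) {
        searchCalls = 0;
        List<String> config = buildHealthChecksConfig(routerNics);
        List<String> commands = new ArrayList<>();
        for (String tier : tiers) {
            commands.add(tier + ":" + config.size());
        }
        return searchCalls;
    }

    public static void main(String[] args) {
        List<String> tiers = Arrays.asList("tier1", "tier2", "tier3");
        List<String> nics = Arrays.asList("eth0", "eth1", "eth2", "eth3");
        System.out.println(searchesWhenRebuiltPerTier(tiers, nics)); // 3 tiers * 4 NICs = 12
        System.out.println(searchesWhenBuiltOnce(tiers, nics));      // 4 NICs = 4
    }
}
```

In this toy model the number of searches grows as tiers × NICs in the first variant and stays at the NIC count in the second. Reusing the result is only valid if the config really is identical for every tier of the same router, which would need to be confirmed against the actual code.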
And the finalizeMonitorService() method is executed twice here: [1](https://github.com/apache/cloudstack/blob/33e2a4dd6635798f98d4726406ed1af4c00a4cc5/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java#L486), [2](https://github.com/apache/cloudstack/blob/33e2a4dd6635798f98d4726406ed1af4c00a4cc5/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java#L498)

So eventually getRouterHealthChecksConfig() is executed more than a **hundred times** during the VR startup, and each time it adds its 10-20 seconds to the process.

**What do I propose**

It seems that getRouterHealthChecksConfig() returns the same result per VR object. That means it can be executed once at the beginning of VR initialisation instead of deep inside the foreach loops.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
