timmyzhu opened a new issue, #13497:
URL: https://github.com/apache/cloudstack/issues/13497

   ### problem
   
   Symptom: My hosts are getting stuck in the Alert state with CloudStack 
4.22.1.0. Restarting the agents, rebooting the hosts, and even reinstalling and 
re-adding the hosts do not fix the issue.
   
   Cause: When the management server sends a ReadyCommand to the agent, the 
agent takes an excessively long time to respond, so the management server tries 
to reinitialize the agent and eventually just kills the connection. The agent 
can otherwise communicate with the management server perfectly fine, so this is 
not a network or SSL issue: the SSL handshake succeeds and the logs show the two 
sides communicating.
   
   Root cause: The ReadyCommand handling was modified in 4.22.1.0 in a way that 
can make it excessively slow. The change comes from #12970 in the 
detectVddkLibDir() function, which is called even if we do not use any instance 
conversion or VDDK. The function executes a shell command defined in 
VDDK_AUTODETECT_PATH_CMD, which performs a Linux find search over the entire 
host OS. This should never be on the critical path or on anything that needs to 
complete quickly. We have large, mounted network filesystems on our hosts, so 
searching the entire filesystem takes minutes and leads to the timeouts and the 
corresponding Alert state.
   
   ### versions
   
   CloudStack version: 4.22.1.0
   Hypervisor: KVM
   Storage: NFS mounted filesystems
   
   ### The steps to reproduce the bug
   
   1. Have a complex host OS filesystem with many directories, some of which 
may be network mounted. Basically any setup where doing a search of the entire 
filesystem from the root directory takes more than a few minutes.
   2. Restart an agent on a host.
   3. The management server shows the host in the Alert state after it has been 
in the Connecting state for a couple of minutes.
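
One rough way to gauge whether a host matches step 1 is to time a recursive search for the VDDK library. This is only a sketch: the function name time_vddk_search is hypothetical, and the libvixDiskLib.so* pattern is an assumption based on the library's file name, not the actual contents of VDDK_AUTODETECT_PATH_CMD.

```shell
#!/bin/bash
# Hypothetical helper: time a recursive search for the VDDK library under a root.
# The name pattern is an assumption, not the real VDDK_AUTODETECT_PATH_CMD.
time_vddk_search() {
    local root="${1:-/}"
    local start end
    start=$(date +%s)
    # Matches go to stdout; permission errors are discarded.
    find "$root" -type f -name 'libvixDiskLib.so*' 2>/dev/null
    end=$(date +%s)
    # The elapsed time goes to stderr so it does not mix with the matches.
    echo "search under $root took $((end - start))s" >&2
}
```

If `time_vddk_search /` takes minutes on a host, that host is likely to hit the timeout described above.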
   
   ### What to do about it?
   
   Workaround: Until a fix can be implemented, my current workaround is to 
define a dummy VDDK directory on each host and provide this directory in the 
agent.properties file under vddk.lib.dir. This avoids the expensive search, 
which allows my hosts to finish the ReadyCommand quickly and enter the Up 
state. Here's an example script that performs the workaround:
   ```shell
   #!/bin/bash
   # Create a dummy VDDK directory so the agent skips the filesystem search.
   
   sudo mkdir -p /workaround/vmware-vix-disklib-distrib/lib64
   sudo touch /workaround/vmware-vix-disklib-distrib/lib64/libvixDiskLib.so
   # Append the setting only if agent.properties does not already mention it.
   if ! sudo grep -q "vddk.lib.dir" /etc/cloudstack/agent/agent.properties; then
       echo "vddk.lib.dir=/workaround/vmware-vix-disklib-distrib" \
           | sudo tee -a /etc/cloudstack/agent/agent.properties
   fi
   ```
   
   Fix: I don't know what the desired long-term fix is, but it should 
definitely not involve recursively searching the entire root filesystem when 
trying to connect a host to the management server. Removing the library 
directory auto-detection may be the easiest fix since users could just specify 
the library path if they choose to enable the optional vddk feature. Another 
possibility is to ensure the optional features are enabled before trying to 
search for libraries. The hostSupportsVddk() function calls 
hostSupportsInstanceConversion() at the end, but that check could run earlier, 
before the expensive detectVddkLibDir() function is called. However, 
changes like this may be hiding the true issue of performing an expensive 
filesystem search in the critical path of connecting hosts. If there's a faster 
way of finding the library, that would be an ideal solution, but that may not 
be possible without knowing where it's installed. Restricting the search to 
well-known library installation locations may be one way to reduce the search 
time. Lastly, it would 
be good if the command had a timeout specified rather than the default timeout, 
which is 1 hour. I saw some other places use the timeout specified in the 
agent.properties file, but that didn't apply to this command. Users may 
struggle to find detailed timeout configurations, so this wouldn't be a great 
fix, but at least it would allow the timeout to be user-controllable.
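
A minimal sketch of what a bounded search could look like, combining the restricted-locations and timeout ideas above. Everything here is an assumption for illustration: the function name find_vddk_lib_dir, the depth limit, and the libvixDiskLib.so* pattern are hypothetical, and this is not how CloudStack currently implements detection. It relies on GNU find and the coreutils timeout command.

```shell
#!/bin/bash
# Hypothetical bounded search. Usage: find_vddk_lib_dir <seconds> <dir>...
# The depth limit and file name pattern are illustrative assumptions.
find_vddk_lib_dir() {
    local limit_secs="$1"; shift
    local root lib
    for root in "$@"; do
        [ -d "$root" ] || continue
        # -xdev stops find from descending into other filesystems mounted
        # below the root; timeout aborts the search if it runs too long.
        lib=$(timeout "$limit_secs" find "$root" -xdev -maxdepth 4 -type f \
                  -name 'libvixDiskLib.so*' -print -quit 2>/dev/null)
        if [ -n "$lib" ]; then
            # Print the distribution directory (the parent of lib64).
            dirname "$(dirname "$lib")"
            return 0
        fi
    done
    return 1
}
```

For example, `find_vddk_lib_dir 10 /opt /usr/local` (directories chosen only for illustration) spends at most about 10 seconds per directory and returns a non-zero status when nothing is found, at which point the agent could simply require vddk.lib.dir to be set explicitly.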

