weizhouapache commented on PR #10515:
URL: https://github.com/apache/cloudstack/pull/10515#issuecomment-2709713489

   > LGTM. Verified the issue manually by executing the following steps:
   > 
   > 1. Create a CloudStack environment with 2 hosts and no NFS primary storage.
   > 2. On one of the KVM hosts, configure and enable HA.
   > 3. Add a firewall rule that drops packets on port 8250:
   > 
   > iptables -I OUTPUT -p tcp -m tcp --dport 8250 -j DROP
   > 
   > 4. Check the management server logs
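
For step 4, a minimal Python sketch (not part of CloudStack) that lists, for one `logid`, which investigators gave up and which one reached a verdict. The function name `investigator_outcomes` is hypothetical, and the sketch assumes each log entry sits on a single line, as it would in the server's log file rather than in the wrapped form shown in this message. The two sample lines are taken from the "after fix" excerpt in this comment.

```python
import re

# Matches the two outcome messages logged per investigator.
OUTCOME = re.compile(r"(\w+Investigator) (unable to determine|was able to determine)")


def investigator_outcomes(lines, logid):
    """Return (investigator, decided) pairs for one logid, in log order."""
    tag = f"(logid:{logid})"
    outcomes = []
    for line in lines:
        if tag not in line:
            continue
        match = OUTCOME.search(line)
        if match:
            decided = match.group(2) == "was able to determine"
            outcomes.append((match.group(1), decided))
    return outcomes


sample = [
    '2025-03-06 13:10:39,494 DEBUG [c.c.h.HighAvailabilityManagerImpl] '
    '(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) KVMInvestigator unable to '
    'determine the state of the host.  Moving on.',
    '2025-03-06 13:10:39,512 DEBUG [c.c.h.HighAvailabilityManagerImpl] '
    '(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) PingInvestigator was able to '
    'determine host Host {"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2",'
    '"type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"} is in Disconnected',
]

for name, decided in investigator_outcomes(sample, "b39c7f05"):
    print(name, "decided" if decided else "moved on")
```

To run it against a real log, pass an open file object as `lines`; the path used later in this comment is `/var/log/cloudstack/management/management-server.log`.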
   > 
   > Before the fix:
   > 
   > CloudStack doesn't pick up the HypervInvestigator, VMwareInvestigator, or PingInvestigator.
   > 
   > ```
   > 2025-03-06 13:36:30,022 INFO  [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Investigating why host Host 
{"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"}
 has disconnected with event PingTimeout
   > 2025-03-06 13:36:30,023 DEBUG [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) checking if agent (Host 
{"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"})
 is alive
   > 2025-03-06 13:36:30,025 DEBUG [c.c.a.t.Request] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: 
Sending  { Cmd , MgmtId: 32986892337576, via: 
1(ref-trl-8094-k-mol8-kiran-chavala-kvm1), Ver: v1, Flags: 100011, 
[{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}]
 }
   > 2025-03-06 13:37:10,041 DEBUG [c.c.a.m.AgentAttache] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: 
Waiting some more time because this is the current command
   > 2025-03-06 13:37:10,041 DEBUG [c.c.a.m.AgentAttache] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: 
Waiting some more time because this is the current command
   > 2025-03-06 13:37:10,042 WARN  [c.c.a.m.AgentAttache] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: 
Timed out on Seq 1-8864491441548689460:  { Cmd , MgmtId: 32986892337576, via: 
1(ref-trl-8094-k-mol8-kiran-chavala-kvm1), Ver: v1, Flags: 100011, 
[{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}]
 }
   > 2025-03-06 13:37:10,047 DEBUG [c.c.a.m.AgentAttache] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: 
Cancelling.
   > 2025-03-06 13:37:10,047 WARN  [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Operation timed out: Commands 
8864491441548689460 to Host 1 timed out after 100
   > 2025-03-06 13:37:10,067 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) SimpleInvestigator unable to 
determine the state of the host.  Moving on.
   > 2025-03-06 13:37:10,067 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) XenServerInvestigator unable 
to determine the state of the host.  Moving on.
   > 2025-03-06 13:37:10,083 WARN  [c.c.h.KVMInvestigator] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Agent investigation was 
requested on host Host 
{"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"},
 but host does not support investigation because it has no NFS storage. 
Skipping investigation.
   > 2025-03-06 13:37:10,083 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) KVMInvestigator was able to 
determine host Host 
{"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"}
 is in Disconnected
   > 2025-03-06 13:37:10,083 INFO  [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) The agent from host Host 
{"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"}
 state determined is Disconnected
   > 2025-03-06 13:37:10,083 WARN  [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Agent is disconnected but the 
host is still up: Host 
{"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"}
 state: Enabled
   > ```
   > 
   > After the fix:
   > 
   > CloudStack picks up the HypervInvestigator, VMwareInvestigator, and PingInvestigator.
   > 
   > ```
   > [root@ol8 ~]# cat /var/log/cloudstack/management/management-server.log | grep -i "logid:b39c7f05"
   > 2025-03-06 13:08:59,485 INFO  [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Investigating why host Host 
{"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"}
 has disconnected with event PingTimeout
   > 2025-03-06 13:08:59,485 DEBUG [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) checking if agent (Host 
{"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"})
 is alive
   > 2025-03-06 13:08:59,487 DEBUG [c.c.a.t.Request] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: 
Sending  { Cmd , MgmtId: 32987949302884, via: 
2(ref-trl-8087-k-mol8-kiran-chavala-kvm2), Ver: v1, Flags: 100011, 
[{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}]
 }
   > 2025-03-06 13:09:49,487 DEBUG [c.c.a.m.AgentAttache] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: 
Waiting some more time because this is the current command
   > 2025-03-06 13:10:39,487 DEBUG [c.c.a.m.AgentAttache] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: 
Waiting some more time because this is the current command
   > 2025-03-06 13:10:39,488 WARN  [c.c.a.m.AgentAttache] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: 
Timed out on Seq 2-5748563449361727501:  { Cmd , MgmtId: 32987949302884, via: 
2(ref-trl-8087-k-mol8-kiran-chavala-kvm2), Ver: v1, Flags: 100011, 
[{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}]
 }
   > 2025-03-06 13:10:39,488 DEBUG [c.c.a.m.AgentAttache] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: 
Cancelling.
   > 2025-03-06 13:10:39,489 WARN  [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Operation timed out: Commands 
5748563449361727501 to Host 2 timed out after 100
   > 2025-03-06 13:10:39,491 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) SimpleInvestigator unable to 
determine the state of the host.  Moving on.
   > 2025-03-06 13:10:39,491 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) XenServerInvestigator unable to 
determine the state of the host.  Moving on.
   > 2025-03-06 13:10:39,494 WARN  [c.c.h.KVMInvestigator] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Agent investigation was 
requested on host Host 
{"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"},
 but host does not support investigation because it has no NFS storage. 
Skipping investigation.
   > 2025-03-06 13:10:39,494 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) KVMInvestigator unable to 
determine the state of the host.  Moving on.
   > 2025-03-06 13:10:39,494 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) HypervInvestigator unable to 
determine the state of the host.  Moving on.
   > 2025-03-06 13:10:39,494 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) VMwareInvestigator unable to 
determine the state of the host.  Moving on.
   > 2025-03-06 13:10:39,495 DEBUG [c.c.h.UserVmDomRInvestigator] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) checking if agent (Host 
{"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"})
 is alive
   > 2025-03-06 13:10:39,496 DEBUG [c.c.h.UserVmDomRInvestigator] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) sending ping from (Host 
{"id":1,"name":"ol8.localdomain","type":"Routing","uuid":"c0fd498b-e0ff-433c-a68d-698a982a5f6f"})
 to agent's host ip address (10.0.35.136)
   > 2025-03-06 13:10:39,497 DEBUG [c.c.a.t.Request] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 1-728457239727181052: 
Sending  { Cmd , MgmtId: 32987949302884, via: 1(ol8.localdomain), Ver: v1, 
Flags: 100011, 
[{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"10.0.35.136","wait":"20","bypassHostMaintenance":"false"}}]
 }
   > 2025-03-06 13:10:39,511 DEBUG [c.c.a.t.Request] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 1-728457239727181052: 
Received:  { Ans: , MgmtId: 32987949302884, via: 1(ol8.localdomain), Ver: v1, 
Flags: 10, { Answer } }
   > 2025-03-06 13:10:39,512 DEBUG [c.c.h.AbstractInvestigatorImpl] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) host (10.0.35.136) has been 
successfully pinged, returning that host is up
   > 2025-03-06 13:10:39,512 DEBUG [c.c.h.UserVmDomRInvestigator] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) ping from (Host 
{"id":1,"name":"ol8.localdomain","type":"Routing","uuid":"c0fd498b-e0ff-433c-a68d-698a982a5f6f"})
 to agent's host ip address (10.0.35.136) successful, returning that agent is 
disconnected
   > 2025-03-06 13:10:39,512 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) PingInvestigator was able to 
determine host Host 
{"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"}
 is in Disconnected
   > ```
   
   Great, thanks @kiranchavala for testing!
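
The difference between the two log excerpts can be modeled with a small Python sketch. It is a simplified illustration of the behavior visible in the logs, not CloudStack's actual code: the function names, the `host` dictionary and its `has_nfs` and `pingable` keys are all hypothetical, and returning `None` for a failed ping is an assumption the logs do not cover. Each investigator either returns a status or `None` for "unable to determine", and the chain stops at the first status.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Status(Enum):
    UP = "Up"
    DISCONNECTED = "Disconnected"


# A check returns a Status when it can decide, or None when it cannot.
Check = Callable[[dict], Optional[Status]]


def kvm_before_fix(host: dict) -> Optional[Status]:
    # Modeled on the "before" log: with no NFS storage the investigation is
    # skipped, yet a Disconnected verdict is still reported.
    if not host["has_nfs"]:
        return Status.DISCONNECTED
    return None  # the NFS-based check is out of scope for this sketch


def kvm_after_fix(host: dict) -> Optional[Status]:
    # Modeled on the "after" log: with no NFS storage the investigator reports
    # that it cannot decide. The NFS-based check is out of scope here, so this
    # sketch always returns None.
    return None


def ping(host: dict) -> Optional[Status]:
    # Modeled on the "after" log: a successful ping from a peer host means the
    # agent is disconnected. None on a failed ping is an assumption.
    return Status.DISCONNECTED if host["pingable"] else None


def investigate(host: dict, chain: List[Tuple[str, Check]]):
    """Try each investigator in order and stop at the first verdict."""
    for name, check in chain:
        status = check(host)
        if status is not None:
            return name, status
    return None


host = {"has_nfs": False, "pingable": True}
before = [("KVMInvestigator", kvm_before_fix), ("PingInvestigator", ping)]
after = [("KVMInvestigator", kvm_after_fix), ("PingInvestigator", ping)]
print(investigate(host, before))
print(investigate(host, after))
```

With the "before" chain the verdict comes from KVMInvestigator and the ping check is never reached. With the "after" chain KVMInvestigator passes, and the verdict comes from PingInvestigator, matching the second excerpt.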


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.