weizhouapache commented on PR #10515:
URL: https://github.com/apache/cloudstack/pull/10515#issuecomment-2709713489
> LGTM. Verified the issue manually by executing the following steps:
>
> 1. Create a CloudStack environment with 2 hosts and no NFS primary storage.
> 2. On one of the KVM hosts, configure and enable HA.
> 3. Add a firewall rule that drops packets on port 8250:
>
> iptables -I OUTPUT -p tcp -m tcp --dport 8250 -j DROP
>
> 4. Check the management server logs
>
> Before fix:
>
> CloudStack doesn't pick up the HypervInvestigator, VMwareInvestigator, or PingInvestigator.
>
> ```
> 2025-03-06 13:36:30,022 INFO [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Investigating why host Host
{"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"}
has disconnected with event PingTimeout
> 2025-03-06 13:36:30,023 DEBUG [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) checking if agent (Host
{"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"})
is alive
> 2025-03-06 13:36:30,025 DEBUG [c.c.a.t.Request]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460:
Sending { Cmd , MgmtId: 32986892337576, via:
1(ref-trl-8094-k-mol8-kiran-chavala-kvm1), Ver: v1, Flags: 100011,
[{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}]
}
> 2025-03-06 13:37:10,041 DEBUG [c.c.a.m.AgentAttache]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460:
Waiting some more time because this is the current command
> 2025-03-06 13:37:10,041 DEBUG [c.c.a.m.AgentAttache]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460:
Waiting some more time because this is the current command
> 2025-03-06 13:37:10,042 WARN [c.c.a.m.AgentAttache]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460:
Timed out on Seq 1-8864491441548689460: { Cmd , MgmtId: 32986892337576, via:
1(ref-trl-8094-k-mol8-kiran-chavala-kvm1), Ver: v1, Flags: 100011,
[{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}]
}
> 2025-03-06 13:37:10,047 DEBUG [c.c.a.m.AgentAttache]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460:
Cancelling.
> 2025-03-06 13:37:10,047 WARN [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Operation timed out: Commands
8864491441548689460 to Host 1 timed out after 100
> 2025-03-06 13:37:10,067 DEBUG [c.c.h.HighAvailabilityManagerImpl]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) SimpleInvestigator unable to
determine the state of the host. Moving on.
> 2025-03-06 13:37:10,067 DEBUG [c.c.h.HighAvailabilityManagerImpl]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) XenServerInvestigator unable
to determine the state of the host. Moving on.
> 2025-03-06 13:37:10,083 WARN [c.c.h.KVMInvestigator]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Agent investigation was
requested on host Host
{"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"},
but host does not support investigation because it has no NFS storage.
Skipping investigation.
> 2025-03-06 13:37:10,083 DEBUG [c.c.h.HighAvailabilityManagerImpl]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) KVMInvestigator was able to
determine host Host
{"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"}
is in Disconnected
> 2025-03-06 13:37:10,083 INFO [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) The agent from host Host
{"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"}
state determined is Disconnected
> 2025-03-06 13:37:10,083 WARN [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Agent is disconnected but the
host is still up: Host
{"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"}
state: Enabled
> ```
>
> After fix:
>
> CloudStack picks up the HypervInvestigator, VMwareInvestigator, and PingInvestigator.
>
> ```
> [root@ol8 ~]# cat /var/log/cloudstack/management/management-server.log | grep -i "logid:b39c7f05"
> 2025-03-06 13:08:59,485 INFO [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Investigating why host Host
{"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"}
has disconnected with event PingTimeout
> 2025-03-06 13:08:59,485 DEBUG [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) checking if agent (Host
{"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"})
is alive
> 2025-03-06 13:08:59,487 DEBUG [c.c.a.t.Request]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501:
Sending { Cmd , MgmtId: 32987949302884, via:
2(ref-trl-8087-k-mol8-kiran-chavala-kvm2), Ver: v1, Flags: 100011,
[{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}]
}
> 2025-03-06 13:09:49,487 DEBUG [c.c.a.m.AgentAttache]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501:
Waiting some more time because this is the current command
> 2025-03-06 13:10:39,487 DEBUG [c.c.a.m.AgentAttache]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501:
Waiting some more time because this is the current command
> 2025-03-06 13:10:39,488 WARN [c.c.a.m.AgentAttache]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501:
Timed out on Seq 2-5748563449361727501: { Cmd , MgmtId: 32987949302884, via:
2(ref-trl-8087-k-mol8-kiran-chavala-kvm2), Ver: v1, Flags: 100011,
[{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}]
}
> 2025-03-06 13:10:39,488 DEBUG [c.c.a.m.AgentAttache]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501:
Cancelling.
> 2025-03-06 13:10:39,489 WARN [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Operation timed out: Commands
5748563449361727501 to Host 2 timed out after 100
> 2025-03-06 13:10:39,491 DEBUG [c.c.h.HighAvailabilityManagerImpl]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) SimpleInvestigator unable to
determine the state of the host. Moving on.
> 2025-03-06 13:10:39,491 DEBUG [c.c.h.HighAvailabilityManagerImpl]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) XenServerInvestigator unable to
determine the state of the host. Moving on.
> 2025-03-06 13:10:39,494 WARN [c.c.h.KVMInvestigator]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Agent investigation was
requested on host Host
{"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"},
but host does not support investigation because it has no NFS storage.
Skipping investigation.
> 2025-03-06 13:10:39,494 DEBUG [c.c.h.HighAvailabilityManagerImpl]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) KVMInvestigator unable to
determine the state of the host. Moving on.
> 2025-03-06 13:10:39,494 DEBUG [c.c.h.HighAvailabilityManagerImpl]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) HypervInvestigator unable to
determine the state of the host. Moving on.
> 2025-03-06 13:10:39,494 DEBUG [c.c.h.HighAvailabilityManagerImpl]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) VMwareInvestigator unable to
determine the state of the host. Moving on.
> 2025-03-06 13:10:39,495 DEBUG [c.c.h.UserVmDomRInvestigator]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) checking if agent (Host
{"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"})
is alive
> 2025-03-06 13:10:39,496 DEBUG [c.c.h.UserVmDomRInvestigator]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) sending ping from (Host
{"id":1,"name":"ol8.localdomain","type":"Routing","uuid":"c0fd498b-e0ff-433c-a68d-698a982a5f6f"})
to agent's host ip address (10.0.35.136)
> 2025-03-06 13:10:39,497 DEBUG [c.c.a.t.Request]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 1-728457239727181052:
Sending { Cmd , MgmtId: 32987949302884, via: 1(ol8.localdomain), Ver: v1,
Flags: 100011,
[{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"10.0.35.136","wait":"20","bypassHostMaintenance":"false"}}]
}
> 2025-03-06 13:10:39,511 DEBUG [c.c.a.t.Request]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 1-728457239727181052:
Received: { Ans: , MgmtId: 32987949302884, via: 1(ol8.localdomain), Ver: v1,
Flags: 10, { Answer } }
> 2025-03-06 13:10:39,512 DEBUG [c.c.h.AbstractInvestigatorImpl]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) host (10.0.35.136) has been
successfully pinged, returning that host is up
> 2025-03-06 13:10:39,512 DEBUG [c.c.h.UserVmDomRInvestigator]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) ping from (Host
{"id":1,"name":"ol8.localdomain","type":"Routing","uuid":"c0fd498b-e0ff-433c-a68d-698a982a5f6f"})
to agent's host ip address (10.0.35.136) successful, returning that agent is
disconnected
> 2025-03-06 13:10:39,512 DEBUG [c.c.h.HighAvailabilityManagerImpl]
(AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) PingInvestigator was able to
determine host Host
{"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"}
is in Disconnected
> ```
Great, thanks @kiranchavala for testing!
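
For anyone reading the two logs side by side, here is a minimal Java sketch of the behaviour they show: investigators are asked in order, and the first one that returns a verdict ends the chain. This is not the CloudStack code. `InvestigatorChainSketch`, `HostState`, `Investigator`, `stub`, and `firstVerdict` are hypothetical names made up for illustration, and the stubbed verdicts are simply copied from the log output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Illustrative sketch only: every type here is hypothetical, not a CloudStack class.
public class InvestigatorChainSketch {

    enum HostState { Up, Disconnected, Down }

    // An empty Optional stands for "unable to determine the state of the host".
    interface Investigator {
        String name();
        Optional<HostState> investigate(long hostId);
    }

    // Builds a stub investigator that always returns the given verdict (null = no verdict).
    static Investigator stub(String name, HostState verdict) {
        return new Investigator() {
            @Override public String name() { return name; }
            @Override public Optional<HostState> investigate(long hostId) {
                return Optional.ofNullable(verdict);
            }
        };
    }

    // Asks each investigator in order and stops at the first one that returns a verdict.
    static Optional<HostState> firstVerdict(List<Investigator> chain, long hostId, List<String> trace) {
        for (Investigator inv : chain) {
            Optional<HostState> verdict = inv.investigate(hostId);
            if (verdict.isPresent()) {
                trace.add(inv.name() + " was able to determine host is in " + verdict.get());
                return verdict;
            }
            trace.add(inv.name() + " unable to determine the state of the host. Moving on.");
        }
        return Optional.empty();
    }

    // Mirrors the "Before fix" log: KVMInvestigator returns a verdict, so the chain ends there.
    static List<Investigator> beforeFix() {
        return List.of(
            stub("SimpleInvestigator", null),
            stub("XenServerInvestigator", null),
            stub("KVMInvestigator", HostState.Disconnected));
    }

    // Mirrors the "After fix" log: KVMInvestigator returns no verdict, so later investigators run.
    static List<Investigator> afterFix() {
        return List.of(
            stub("SimpleInvestigator", null),
            stub("XenServerInvestigator", null),
            stub("KVMInvestigator", null),
            stub("HypervInvestigator", null),
            stub("VMwareInvestigator", null),
            stub("PingInvestigator", HostState.Disconnected));
    }

    public static void main(String[] args) {
        for (List<Investigator> chain : List.of(beforeFix(), afterFix())) {
            List<String> trace = new ArrayList<>();
            firstVerdict(chain, 1L, trace);
            trace.forEach(System.out::println);
            System.out.println("--");
        }
    }
}
```

In both runs the final state is Disconnected. The difference is which investigator produced it: a KVM host without NFS storage that was skipped, or an actual ping from another host.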