Re: [I] Host HA not working even after configured oob-management [cloudstack]
rohityadavcloud commented on issue #7543: URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-2162354341

There is a known issue with the ipmitool version shipped on EL8 and EL9, so it is worth checking whether OOBM is working in the first place.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@cloudstack.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
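A quick way to sanity-check OOBM outside of CloudStack is to run the same style of ipmitool command the driver logs and look at what comes back. A minimal sketch (the host and credentials are placeholders, and the parsing is illustrative only, not CloudStack's actual driver code):

```python
import subprocess

def parse_power_status(output: str) -> str:
    """Map 'ipmitool chassis power status' output to a coarse state."""
    text = output.strip().lower()
    if "chassis power is on" in text:
        return "On"
    if "chassis power is off" in text:
        return "Off"
    return "Unknown"  # auth failures, unreachable BMC, garbage output

def check_oobm(host: str, user: str, password: str) -> str:
    """Run a command like the one seen in the CloudStack logs (sketch)."""
    cmd = ["/usr/bin/ipmitool", "-I", "lanplus", "-R", "1", "-H", host,
           "-p", "623", "-U", user, "-P", password,
           "chassis", "power", "status"]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        return parse_power_status(proc.stdout)
    except (subprocess.TimeoutExpired, OSError):
        return "Unknown"

# The parser can be exercised without a BMC:
print(parse_power_status("Chassis Power is on"))          # On
print(parse_power_status("Get Auth Capabilities error"))  # Unknown
```

If the manual command already fails with an auth or timeout error, the problem is below CloudStack (BMC, network, or the distro's ipmitool build) rather than in Host HA itself.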
spdinis commented on issue #7543: URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-2090174568

So after a lot of testing I think this is expected behaviour given the design. I have the issue regardless of distro; I'm using Ubuntu 22.04, for example.

Long story short, the Host HA mechanism won't work at all when the server is powered off or the IPMI state is unknown, which happens when the server has no power at all. The odd thing is that better coordination between Host HA and VM HA would overcome the problem.

I ended up ignoring Host HA altogether and kept using VM HA, which works. It has a caveat: it takes around 15 minutes to trigger, but it eventually does, after breaching the acceptable threshold for a lost NFS heartbeat. I have no idea where to manipulate that timer; I've been trying to find it, but I have bigger fish to fry at this point, so I'm accepting that in the rare circumstance when a host is powered off or loses environmental power it will take +/- 15 minutes for the VMs to bounce elsewhere.

So the workaround for me is simply to disable Host HA at the cluster/zone/host level and be patient when an HA event is required. There are enough things in place to make the mechanism robust; it's just that the coordination between the two mechanisms requires some work, and a feature request needs to be raised. I don't consider this a bug, rather a design flaw.
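For anyone hunting for the timers involved: CloudStack exposes a number of HA-related global settings. The names below are from 4.19-era deployments and should be verified against your version; it is not established in this thread which of them drives the ~15-minute VM HA delay described above, so treat this purely as a starting point for investigation:

```
# Host HA (KVM) provider settings — verify exact names/defaults in your version
kvm.ha.activity.check.interval
kvm.ha.activity.check.timeout
kvm.ha.activity.check.max.attempts
kvm.ha.activity.check.failure.ratio
kvm.ha.degraded.max.period
kvm.ha.fence.timeout
kvm.ha.recover.timeout
kvm.ha.recover.wait.period

# Host disconnect/alert detection
ping.interval
ping.timeout
```

These can be inspected in the UI under Global Settings or via the `listConfigurations` API.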
rohityadavcloud commented on issue #7543: URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-2084996982

Isn't this documented? For EL8/9, ipmitool has known issues on those distros.
spdinis commented on issue #7543: URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-2081223258

After hours of testing I came up with several different issues. So yes, #8918 is definitely a thing and confirmed. I tried both with IPMI and Redfish, same result: if I, for example, power the server up and send it into the BIOS, ACS immediately picks up the power-on state, fences the host, and the VMs immediately bounce to another host. I'm just not sure whether the fencing is a reset or a power off, because it seems that, in my case, the default fencing action is power off. Either way, PowerEdge servers don't support reset or power off when the server is already off. Later I will test this with some HP DL380 G10s and see how that goes.

This is something that definitely makes little sense: if ACS already knows the server is powered off, why try to power it off again instead of simply putting it in maintenance so the VMs can bounce?

Now the other issue, which is related but ends up defeating the purpose, is that if there is a power outage in the DC, or an accidental cable pull or anything like that, you are basically doomed. What happens is that the host goes to unknown, and then you get these errors:

> 2024-04-27 19:48:13,626 DEBUG [o.a.c.u.p.ProcessRunner] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Process standard error output command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status]: [Get Auth Capabilities error
> 2024-04-27 19:48:13,626 DEBUG [o.a.c.o.d.i.IpmitoolOutOfBandManagementDriver] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) The command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status] failed and got the result []. Error: [Get Auth Capabilities error

It seems the NFS mechanism counts for nothing here: in the cluster there is consensus that the host is dead, and yet nothing can happen.

The other thing I have noticed is that if you have more than one NFS primary storage, while all of them will have the heartbeat files and the cluster members reach consensus, the host doesn't even move to fencing, but rather to degraded. Here is the collection of logs from everything I could find in that case:

> 2024-04-27 19:36:16,119 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-2:ctx-9b0931e0) (logid:d80ab369) KVMInvestigator was able to determine host 38 is in Disconnected
> 2024-04-27 19:42:41,901 DEBUG [o.a.c.k.h.KVMHostActivityChecker] (pool-1-thread-12:null) (logid:82c6107b) Investigating Host {"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"} via neighbouring Host {"id":37,"name":"ukslo-csl-kvm01.slo.gt-t.net","type":"Routing","uuid":"b1be18a1-6097-427c-bb04-3278ae5a1a33"}.
> 2024-04-27 19:42:42,138 DEBUG [o.a.c.k.h.KVMHostActivityChecker] (pool-1-thread-12:null) (logid:82c6107b) Neighbouring Host {"id":37,"name":"ukslo-csl-kvm01.slo.gt-t.net","type":"Routing","uuid":"b1be18a1-6097-427c-bb04-3278ae5a1a33"} returned status [Down] for the investigated Host {"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}.
> 2024-04-27 19:42:42,138 DEBUG [o.a.c.k.h.KVMHostActivityChecker] (pool-1-thread-12:null) (logid:82c6107b) Investigating Host {"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"} via neighbouring Host {"id":39,"name":"ukslo-csl-kvm03.slo.gt-t.net","type":"Routing","uuid":"316b97d6-ffca-44c5-874f-e5967b6035e3"}.
> 2024-04-27 19:42:42,488 DEBUG [o.a.c.k.h.KVMHostActivityChecker] (pool-1-thread-12:null) (logid:82c6107b) Neighbouring Host {"id":39,"name":"ukslo-csl-kvm03.slo.gt-t.net","type":"Routing","uuid":"316b97d6-ffca-44c5-874f-e5967b6035e3"} returned status [Down] for the investigated Host {"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}.
> 2024-04-27 19:48:11,604 DEBUG [o.a.c.u.p.ProcessRunner] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Preparing command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status] to execute.
> 2024-04-27 19:48:11,605 DEBUG [o.a.c.u.p.ProcessRunner] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Submitting command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status].
> 2024-04-27 19:48:11,605 DEBUG [o.a.c.u.p.ProcessRunner] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Waiting for a response from command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status]. Defined timeout: [60].
> 2024-04-27 19:48:13,626 DEBUG [o.a.c.u.p.ProcessRunner] (pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Process standard output for command
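To make the design flaw being described concrete, here is a toy model — emphatically not CloudStack's actual code — of the decision the logs suggest: fencing only proceeds when OOBM can positively confirm a power state, so a BMC with no power at all (state Unknown) blocks the whole flow, even when every neighbouring host agrees the investigated host is down:

```python
from enum import Enum

class PowerState(Enum):
    ON = "On"
    OFF = "Off"
    UNKNOWN = "Unknown"  # BMC unreachable, e.g. total loss of power

def can_fence(power_state: PowerState, peers_agree_down: bool) -> bool:
    """Toy decision rule matching the behaviour reported in this thread:
    even with cluster consensus that the host is dead, fencing cannot be
    confirmed unless OOBM returns a definite power state."""
    if not peers_agree_down:
        return False
    return power_state is not PowerState.UNKNOWN

print(can_fence(PowerState.UNKNOWN, True))  # False: host stuck, VMs not restarted
```

Under this model the NFS-heartbeat consensus is necessary but not sufficient, which is exactly the "doomed during a DC power outage" scenario above.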
spdinis commented on issue #7543: URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-2077714357

Hi,

We are preparing a transition from VMware to KVM and we are struggling to get HA to work, with the same symptoms. We are using CloudStack 4.19.0. We have a few test clusters that have an NFS mount for heartbeat; we will be using a shared mountpoint, but for this case we tried plain NFS and it made no difference.

We have out-of-band management enabled, and when we power off the physical host via iDRAC, the host moves to fencing after a while and stays in that status, while all VMs that were running on it keep saying Running. Once we manually declare the host degraded, the VMs jump straight to another host.

The agent log on one of the surviving nodes shows it detects that the host is down, as slavkap showed:

> 2024-04-25 15:34:11,406 WARN [kvm.resource.KVMHAChecker] (pool-636-thread-1:null) (logid:29cd972b) All checks with KVMHAChecker for host IP [10.250.9.154] in pools [e355eb3b-58eb-3ce2-890a-6a7b7263d896, b7f5ff6f-7233-3d79-aa1a-1c1bc233e1c8] considered it as dead. It may cause a shutdown of the host.

So I presume the issue is related to the transition to another state after fencing. We will perform some additional tests, using Redfish for example, or by trying to force some non-power-off failure, to see if it is an issue with the agent detecting that the IPMI is actually off and assuming it was a voluntary action.
slavkap commented on issue #7543: URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-1968618029

Hi @yashi4engg, before CloudStack 4.19, KVM host HA required NFS primary storage. I was able to reproduce your problem by removing the NFS primary storage from my dev environment: the host HA state never becomes Available.

![image](https://github.com/apache/cloudstack/assets/51903378/287041a0-f77f-4bb9-9814-9becb4823586)

Can you share whether you have similar messages in your `agent.log` file?

> 2024-02-28 11:35:39,920 DEBUG [kvm.resource.KVMHAChecker] (pool-624-thread-1:null) (logid:5815f313) Checking heart beat with KVMHAChecker for host IP [10.2.26.1] in pools []
> 2024-02-28 11:35:39,921 WARN [kvm.resource.KVMHAChecker] (pool-624-thread-1:null) (logid:5815f313) All checks with KVMHAChecker for host IP [10.2.26.1] in pools [] considered it as dead. It may cause a shutdown of the host.
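For context on what the KVMHAChecker messages above are checking: each KVM host periodically touches a heartbeat file on every NFS primary storage pool, and neighbours consider it dead when the heartbeat is stale on all pools (hence `pools []` meaning "nothing to check"). A rough sketch of that idea — the `hb-<ip>` file name is an assumption for illustration, not the exact on-disk layout:

```python
import os
import pathlib
import tempfile
import time

def heartbeat_is_fresh(path: str, max_age_s: int = 60) -> bool:
    """True if the heartbeat file was touched within max_age_s seconds."""
    try:
        return (time.time() - os.path.getmtime(path)) <= max_age_s
    except OSError:  # missing file / unreachable storage counts as stale
        return False

def host_alive(pool_paths, host_ip, max_age_s=60):
    """A host counts as alive if ANY pool still has a fresh heartbeat;
    with an empty pool list the check can never succeed."""
    return any(heartbeat_is_fresh(os.path.join(p, f"hb-{host_ip}"), max_age_s)
               for p in pool_paths)

# Demo against a throwaway directory standing in for an NFS pool:
with tempfile.TemporaryDirectory() as pool:
    pathlib.Path(pool, "hb-10.2.26.1").touch()
    print(host_alive([pool], "10.2.26.1"))  # True
    print(host_alive([pool], "10.9.9.9"))   # False
```

This also shows why a setup with no NFS primary storage (or a stale mount) makes the checker declare every host dead, as in the quoted log.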
yashi4engg commented on issue #7543: URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-1942220766

@DaanHoogland we tested it with different storage backends: in one setup we have Ceph as primary storage and NFS as secondary storage, and in another setup we have OCFS2 as primary storage and NFS as secondary storage, but it is still not working in either setup. For the OS we tested both OEL8 and OEL9; neither works.