Re: [I] Host HA not working even after configured oob-management [cloudstack]

2024-06-12 Thread via GitHub


rohityadavcloud commented on issue #7543:
URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-2162354341

   There is a known issue with the ipmitool version shipped on EL8 and EL9; it's
worth checking whether OOBM is working in the first place.
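
   A quick way to verify that is to run, from the management server, the same
ipmitool invocation CloudStack's OOBM driver logs (a minimal sketch; the BMC
address and credentials are placeholders):

   # Same invocation as in the driver logs further down this thread.
   ipmitool -I lanplus -R 1 -v -H <bmc-address> -p 623 \
       -U <user> -P <password> chassis power status
   # Healthy output is "Chassis Power is on" (or "... is off"); also record
   # the installed version when checking the EL8/EL9 issue:
   ipmitool -V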


Re: [I] Host HA not working even after configured oob-management [cloudstack]

2024-05-02 Thread via GitHub


spdinis commented on issue #7543:
URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-2090174568

   So after a lot of testing, I think this is expected behavior due to the
design.
   
   I have the issue regardless of the distro; I'm using Ubuntu 22.04, for
example.
   
   Long story short, the Host HA mechanism won't work at all when the server is
powered off or the IPMI state is unknown, which is what happens when the server
has no power at all.
   
   The odd thing is that better coordination between Host HA and VM HA would
overcome the problem. I ended up ignoring Host HA altogether and kept using VM
HA, which works. The caveat is that it takes around 15 minutes to trigger, but
it eventually does, once the acceptable threshold for a lost NFS heartbeat is
breached.
   
   I have no idea where to manipulate that timer; I've been trying to find it,
but I have bigger fish to fry at this point, so I'm accepting that in the rare
circumstance where a host is powered off or loses environmental power, it will
take +/- 15 minutes for the VMs to bounce elsewhere.
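
   For anyone hunting for that timer, a starting point (a sketch, assuming
CloudMonkey is configured against the management server; which combination of
settings produces the ~15-minute delay here is my assumption to verify) is the
HA-related global settings:

   # ping.interval/ping.timeout govern how quickly the management server
   # declares an agent dead; the kvm.ha.* settings govern Host HA checks.
   cmk list configurations name=ping.interval
   cmk list configurations name=ping.timeout
   cmk list configurations keyword=kvm.ha
   # Updating a value (some settings need a management-server restart):
   cmk update configuration name=ping.interval value=30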
   
   So the workaround for me is simply to disable Host HA at the
cluster/zone/host level and be patient when an HA event is required.
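
   If you want to script that workaround, the HA framework exposes
enable/disable APIs per level; a minimal sketch with CloudMonkey (the UUIDs
are placeholders):

   # Disable Host HA at host, cluster, or zone level.
   cmk disableHAForHost hostid=<host-uuid>
   cmk disableHAForCluster clusterid=<cluster-uuid>
   cmk disableHAForZone zoneid=<zone-uuid>
   # VM HA (driven by the service offering's HA flag) is separate and stays on.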
   
   There are enough things in place to make the mechanism robust; it's just
that the coordination between the two mechanisms requires some work, and a
feature request needs to be raised for that. I don't consider this a bug,
rather a design flaw.


Re: [I] Host HA not working even after configured oob-management [cloudstack]

2024-04-30 Thread via GitHub


rohityadavcloud commented on issue #7543:
URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-2084996982

   Isn't this documented? ipmitool has known issues on the EL8/9 distros.


Re: [I] Host HA not working even after configured oob-management [cloudstack]

2024-04-27 Thread via GitHub


spdinis commented on issue #7543:
URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-2081223258

   After hours of testing, I came across several different issues.
   
   So yes, #8918 is definitely a thing and confirmed. I tried both IPMI and
Redfish with the same result: if I, for example, power the server up and send
it into the BIOS, ACS immediately picks up the power-on state and fences the
host, and the VMs immediately bounce to another host.
   I'm just not sure whether the fence is a reset or a power off, because it
seems that in my case the default fencing action is power off. Either way,
PowerEdge servers don't support reset or power off when the server is already
off.
   Later I will test this with some HP DL380 G10s and see how that goes.
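
   For reference, the same power-state check can be reproduced over Redfish; a
minimal sketch against an iDRAC (the System.Embedded.1 system ID is
Dell-specific, and the address/credentials are placeholders):

   # Other vendors list a different system ID under /redfish/v1/Systems.
   curl -sk -u <user>:<password> \
       https://<idrac-address>/redfish/v1/Systems/System.Embedded.1 \
       | grep -o '"PowerState" *: *"[^"]*"'
   # Prints "PowerState":"On" or ..."Off"; when the BMC itself has lost
   # power, the request simply times out, matching the Unknown state below.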
   
   This really makes little sense: if ACS already knows the server is powered
off, why try to power it off again instead of simply putting it into
maintenance so the VMs can bounce?
   
   Now the other issue, which is kind of related but ends up defeating the
purpose, is that if there is a power outage in the DC, an accidental cable
pull, or anything like that, you are basically doomed. What happens is that
the server's power state goes to Unknown, and then you get these errors:
   2024-04-27 19:48:13,626 DEBUG [o.a.c.u.p.ProcessRunner] 
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Process standard error output 
command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U 
[redacted] -P [redacted] chassis power status]: [Get Auth Capabilities error
   2024-04-27 19:48:13,626 DEBUG 
[o.a.c.o.d.i.IpmitoolOutOfBandManagementDriver] (pool-5-thread-28:ctx-7a0961d6) 
(logid:6bf324cc) The command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 
10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status] failed 
and got the result []. Error: [Get Auth Capabilities error
   
   It seems the NFS heartbeat mechanism counts for nothing here: within the
cluster there is consensus that the host is dead, and yet nothing happens.
   
   The other thing I have noticed is that with more than one NFS primary
storage, even though all of them have the heartbeat files and the cluster
members reach consensus, the host doesn't even move to Fencing, but rather to
Degraded.
   
   Here is the collection of logs from everything I could find in that case:
   
   2024-04-27 19:36:16,119 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(AgentTaskPool-2:ctx-9b0931e0) (logid:d80ab369) KVMInvestigator was able to 
determine host 38 is in Disconnected
   
   2024-04-27 19:42:41,901 DEBUG [o.a.c.k.h.KVMHostActivityChecker] 
(pool-1-thread-12:null) (logid:82c6107b) Investigating Host 
{"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}
 via neighbouring Host 
{"id":37,"name":"ukslo-csl-kvm01.slo.gt-t.net","type":"Routing","uuid":"b1be18a1-6097-427c-bb04-3278ae5a1a33"}.
   2024-04-27 19:42:42,138 DEBUG [o.a.c.k.h.KVMHostActivityChecker] 
(pool-1-thread-12:null) (logid:82c6107b) Neighbouring Host 
{"id":37,"name":"ukslo-csl-kvm01.slo.gt-t.net","type":"Routing","uuid":"b1be18a1-6097-427c-bb04-3278ae5a1a33"}
 returned status [Down] for the investigated Host 
{"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}.
   2024-04-27 19:42:42,138 DEBUG [o.a.c.k.h.KVMHostActivityChecker] 
(pool-1-thread-12:null) (logid:82c6107b) Investigating Host 
{"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}
 via neighbouring Host 
{"id":39,"name":"ukslo-csl-kvm03.slo.gt-t.net","type":"Routing","uuid":"316b97d6-ffca-44c5-874f-e5967b6035e3"}.
   2024-04-27 19:42:42,488 DEBUG [o.a.c.k.h.KVMHostActivityChecker] 
(pool-1-thread-12:null) (logid:82c6107b) Neighbouring Host 
{"id":39,"name":"ukslo-csl-kvm03.slo.gt-t.net","type":"Routing","uuid":"316b97d6-ffca-44c5-874f-e5967b6035e3"}
 returned status [Down] for the investigated Host 
{"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}.
   
   2024-04-27 19:48:11,604 DEBUG [o.a.c.u.p.ProcessRunner] 
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Preparing command 
[/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P 
[redacted] chassis power status] to execute.
   2024-04-27 19:48:11,605 DEBUG [o.a.c.u.p.ProcessRunner] 
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Submitting command 
[/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P 
[redacted] chassis power status].
   2024-04-27 19:48:11,605 DEBUG [o.a.c.u.p.ProcessRunner] 
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Waiting for a response from 
command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U 
[redacted] -P [redacted] chassis power status]. Defined timeout: [60].
   2024-04-27 19:48:13,626 DEBUG [o.a.c.u.p.ProcessRunner] 
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Process standard output for 
command 

Re: [I] Host HA not working even after configured oob-management [cloudstack]

2024-04-25 Thread via GitHub


spdinis commented on issue #7543:
URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-2077714357

   Hi,
   
   We are preparing a transition from VMware to KVM, and we are struggling to
get HA to work, with the same symptoms.
   
   We are using CloudStack 4.19.0. We have a few test clusters that have an
NFS mount for the heartbeat, and we will be using shared mount points; for
this case we also tried with plain NFS, and it made no difference.
   
   We have out-of-band management enabled, and when we power off the physical
host via iDRAC, the host moves to Fencing after a while and stays in that
state, while all the VMs that were running on it keep showing as Running.
   
   Once we manually declare the host as degraded, the VMs jump straight to
another host.
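
   (That manual step maps to the declareHostAsDegraded API; a minimal sketch
with CloudMonkey, where the UUID is a placeholder:)

   # Mark a dead host as Degraded so VM HA takes over; as far as I can tell
   # the host must be in Disconnected or Alert state for this to be accepted.
   cmk declareHostAsDegraded id=<host-uuid>
   # Undo once the host is healthy again:
   cmk cancelHostAsDegraded id=<host-uuid>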
   
   The agent log on one of the surviving nodes shows that it detects the host
is down, as slavkap showed:
   
   2024-04-25 15:34:11,406 WARN  [kvm.resource.KVMHAChecker] 
(pool-636-thread-1:null) (logid:29cd972b) All checks with KVMHAChecker for host 
IP [10.250.9.154] in pools [e355eb3b-58eb-3ce2-890a-6a7b7263d896, 
b7f5ff6f-7233-3d79-aa1a-1c1bc233e1c8] considered it as dead. It may cause a 
shutdown of the host.
   
   So I presume the issue is related to the transition to another state after
Fencing. We will perform some additional tests, for example using Redfish, or
by trying to force some non-power-off failure, to see whether the issue is the
agent detecting that the IPMI is actually off and assuming it was a voluntary
action.


Re: [I] Host HA not working even after configured oob-management [cloudstack]

2024-02-28 Thread via GitHub


slavkap commented on issue #7543:
URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-1968618029

   Hi @yashi4engg, before CS version 4.19, KVM host HA requires NFS primary
storage.
   I was able to reproduce your problem when I removed the NFS primary storage
from my dev environment: the host HA state never becomes Available.
   
![image](https://github.com/apache/cloudstack/assets/51903378/287041a0-f77f-4bb9-9814-9becb4823586)
   Can you share whether you have similar messages in your `agent.log` file:
   
   > 2024-02-28 11:35:39,920 DEBUG [kvm.resource.KVMHAChecker] 
(pool-624-thread-1:null) (logid:5815f313) Checking heart beat with KVMHAChecker 
for host IP [10.2.26.1] in pools []
   > 2024-02-28 11:35:39,921 WARN  [kvm.resource.KVMHAChecker] 
(pool-624-thread-1:null) (logid:5815f313) All checks with KVMHAChecker for host 
IP [10.2.26.1] in pools [] considered it as dead. It may cause a shutdown of 
the host.
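
   A quick way to check on the host (assuming the default agent log path):

   grep KVMHAChecker /var/log/cloudstack/agent/agent.log | tail -n 20
   # An empty pools list ("in pools []") means no NFS heartbeat storage is
   # configured, which is exactly the symptom above.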
   
   
   


Re: [I] Host HA not working even after configured oob-management [cloudstack]

2024-02-13 Thread via GitHub


yashi4engg commented on issue #7543:
URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-1942220766

   @DaanHoogland - we tested it with different storage backends: in one setup
we have Ceph primary storage and NFS secondary storage, and in another setup
we have OCFS2 primary storage and NFS secondary storage. It is still not
working in either setup.
   
   As for the OS, we tested with both OEL8 and OEL9; neither is working.

