spdinis commented on issue #7543:
URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-2081223258
After hours of testing I came up with several different issues.
So yes, #8918 is definitely a thing and confirmed. I tried both IPMI and Redfish, with the same result: if I, for example, power the server up and send it into the BIOS, ACS immediately picks up the power-on state and fences the host, and the VMs immediately bounce to another host.
I am just not sure whether it is a reset or a power off, because it seems that, in my case, the default fencing action is power off. Either way, PowerEdge servers don't support reset or power off when the server is already off.
Later I will test this with some HP DL380 Gen10s and see how that goes.
This really makes no sense: if ACS already knows the server is powered off, why try to power it off again instead of simply putting it in maintenance to bounce the VMs?
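For reference, the behaviour is easy to see by hand against the BMC; a rough sketch (the IP and credentials are placeholders, and the exact error text will vary by model/firmware):
# Status works and reports the chassis as powered off
/usr/bin/ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> chassis power status
# A fence attempt boils down to one of these, and on these PowerEdge BMCs both
# simply fail while the chassis is already off
/usr/bin/ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> chassis power off
/usr/bin/ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> chassis power reset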
Now the other issue, which is kind of related but ends up defeating the purpose, is that if there is a power outage in the DC, an accidental cable pull, or anything like that, you are basically doomed. What happens is that the server goes to Unknown, and then you get these errors:
2024-04-27 19:48:13,626 DEBUG [o.a.c.u.p.ProcessRunner]
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Process standard error output
command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U
[redacted] -P [redacted] chassis power status]: [Get Auth Capabilities error
2024-04-27 19:48:13,626 DEBUG
[o.a.c.o.d.i.IpmitoolOutOfBandManagementDriver] (pool-5-thread-28:ctx-7a0961d6)
(logid:6bf324cc) The command [/usr/bin/ipmitool -I lanplus -R 1 -v -H
10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status] failed
and got the result []. Error: [Get Auth Capabilities error
It seems the NFS heartbeat mechanism counts for nothing here: within the cluster there is consensus that the host is dead, and yet nothing happens.
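As a rough sanity check (the mount points and KVMHA layout below are just how it looks on my pools, so treat the paths as assumptions), this is what I look at from a healthy host to confirm the heartbeat data actually exists on the NFS primary storage:
# Each NFS primary storage pool is mounted under /mnt/<pool-uuid> on the KVM hosts
ls -l /mnt/*/KVMHA/
# One hb-<host-ip> file per host; the one for the dead host stops being updated,
# which is how the neighbours agree that it is down
cat /mnt/*/KVMHA/hb-*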
The other thing I have noticed is that if you have more than one NFS primary storage, even though all of them have the heartbeat files and the cluster members reach consensus, the host doesn't even move to Fencing; it goes to Degraded instead.
Here is the collection of logs from everything I could find in that case:
2024-04-27 19:36:16,119 DEBUG [c.c.h.HighAvailabilityManagerImpl]
(AgentTaskPool-2:ctx-9b0931e0) (logid:d80ab369) KVMInvestigator was able to
determine host 38 is in Disconnected
2024-04-27 19:42:41,901 DEBUG [o.a.c.k.h.KVMHostActivityChecker]
(pool-1-thread-12:null) (logid:82c6107b) Investigating Host
{"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}
via neighbouring Host
{"id":37,"name":"ukslo-csl-kvm01.slo.gt-t.net","type":"Routing","uuid":"b1be18a1-6097-427c-bb04-3278ae5a1a33"}.
2024-04-27 19:42:42,138 DEBUG [o.a.c.k.h.KVMHostActivityChecker]
(pool-1-thread-12:null) (logid:82c6107b) Neighbouring Host
{"id":37,"name":"ukslo-csl-kvm01.slo.gt-t.net","type":"Routing","uuid":"b1be18a1-6097-427c-bb04-3278ae5a1a33"}
returned status [Down] for the investigated Host
{"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}.
2024-04-27 19:42:42,138 DEBUG [o.a.c.k.h.KVMHostActivityChecker]
(pool-1-thread-12:null) (logid:82c6107b) Investigating Host
{"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}
via neighbouring Host
{"id":39,"name":"ukslo-csl-kvm03.slo.gt-t.net","type":"Routing","uuid":"316b97d6-ffca-44c5-874f-e5967b6035e3"}.
2024-04-27 19:42:42,488 DEBUG [o.a.c.k.h.KVMHostActivityChecker]
(pool-1-thread-12:null) (logid:82c6107b) Neighbouring Host
{"id":39,"name":"ukslo-csl-kvm03.slo.gt-t.net","type":"Routing","uuid":"316b97d6-ffca-44c5-874f-e5967b6035e3"}
returned status [Down] for the investigated Host
{"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}.
2024-04-27 19:48:11,604 DEBUG [o.a.c.u.p.ProcessRunner]
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Preparing command
[/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P
[redacted] chassis power status] to execute.
2024-04-27 19:48:11,605 DEBUG [o.a.c.u.p.ProcessRunner]
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Submitting command
[/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P
[redacted] chassis power status].
2024-04-27 19:48:11,605 DEBUG [o.a.c.u.p.ProcessRunner]
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Waiting for a response from
command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U
[redacted] -P [redacted] chassis power status]. Defined timeout: [60].
2024-04-27 19:48:13,626 DEBUG [o.a.c.u.p.ProcessRunner]
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Process standard output for
command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U
[redacted] -P [redacted] chassis power status]: [].
2024-04-27 19:48:13,626 DEBUG [o.a.c.u.p.ProcessRunner]
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Process standard error output
command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U
[redacted] -P [redacted] chassis power status]: [Get Auth Capabilities error
2024-04-27 19:48:13,626 DEBUG
[o.a.c.o.d.i.IpmitoolOutOfBandManagementDriver] (pool-5-thread-28:ctx-7a0961d6)
(logid:6bf324cc) The command [/usr/bin/ipmitool -I lanplus -R 1 -v -H
10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status] failed
and got the result []. Error: [Get Auth Capabilities error
2024-04-27 19:50:23,841 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-2:ctx-148f4c19) (logid:5038a4c2) Transitioned host
HA state from:Degraded to:Suspect due to event:PeriodicRecheckResourceActivity
for the host id:38
2024-04-27 19:50:27,951 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-6:ctx-e859e219) (logid:ab18a786) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:50:28,194 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-1:null)
(logid:82c6107b) Transitioned host HA state from:Checking to:Suspect due to
event:TooFewActivityCheckSamples for the host id:38
2024-04-27 19:51:00,659 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-2:ctx-5cee651c) (logid:51a40d65) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:51:00,905 DEBUG [o.a.c.h.HAManagerImpl]
(pool-2-thread-25:null) (logid:7e1ca968) Transitioned host HA state
from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host
id:38
2024-04-27 19:51:33,394 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-5:ctx-2fce5985) (logid:3fde664f) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:51:33,634 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-5:null)
(logid:66de58d8) Transitioned host HA state from:Checking to:Suspect due to
event:TooFewActivityCheckSamples for the host id:38
2024-04-27 19:52:06,140 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-6:ctx-dde87f00) (logid:10de929d) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:52:06,382 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-7:null)
(logid:dd936499) Transitioned host HA state from:Checking to:Suspect due to
event:TooFewActivityCheckSamples for the host id:38
2024-04-27 19:54:17,004 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-2:ctx-173a7199) (logid:bee908f9) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:54:17,246 DEBUG [o.a.c.h.HAManagerImpl]
(pool-2-thread-19:null) (logid:3df2c5a8) Transitioned host HA state
from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host
id:38
2024-04-27 19:54:49,699 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-3:ctx-658f80b7) (logid:2ec0cbbe) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:54:49,949 DEBUG [o.a.c.h.HAManagerImpl]
(pool-2-thread-15:null) (logid:47634402) Transitioned host HA state
from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host
id:38
2024-04-27 19:55:22,483 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-4:ctx-948895aa) (logid:a1bb5643) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:55:22,723 DEBUG [o.a.c.h.HAManagerImpl]
(pool-2-thread-21:null) (logid:432c0642) Transitioned host HA state
from:Checking to:Degraded due to event:ActivityCheckFailureUnderThresholdRatio
for the host id:38
2024-04-27 20:00:24,278 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-3:ctx-c4b31d57) (logid:bd7094fb) Transitioned host
HA state from:Degraded to:Suspect due to event:PeriodicRecheckResourceActivity
for the host id:38
You can see that the host loops between Degraded, Suspect, and Checking, but never moves into Fencing.
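For context, these transitions are governed by the kvm.ha.* global settings (activity check interval, max attempts, failure ratio, degraded period, and so on); listing them is a one-liner with CloudMonkey, though the exact set of settings depends on the version:
# Dump the KVM HA framework settings to see which thresholds the
# Suspect/Checking/Degraded loop is working against
cmk list configurations keyword=kvm.ha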
I will still do some more testing around this now that I have a better grip on what is happening; at some point I basically threw all the toys out of the pram and tried so many things that I need to do some more segmented investigation into the details.
I will play around with the fencing option that exists in the global settings to fence the host when only one witness is lost.
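If it turns out to be a simple global setting, the change itself is just an updateConfiguration call; the setting name below is a placeholder until I confirm which one it actually is:
# <setting-name> is a placeholder for whichever HA/fencing global setting applies;
# the value is only an example
cmk update configuration name=<setting-name> value=1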
I have a call with ShapeBlue on Monday and I will review these findings with them as well.
But one thing is for sure: this mechanism needs some improvement, especially coming from VMware, where HA just works when it has to and deals very well with host isolation. Currently I don't think we will get HA on an environmental power loss, given that the mechanism relies on understanding the OOB status, and if the OOB chip is unresponsive it seems to have no action other than power off when it decides to fence.