spdinis commented on issue #7543:
URL: https://github.com/apache/cloudstack/issues/7543#issuecomment-2081223258
After hours of testing I came up with several different issues.
So yes, #8918 is definitely a thing and confirmed. I tried both IPMI and Redfish, with the same result: if I, for example, power the server up and send it into the BIOS, ACS immediately picks up the power-on state and fences the host, and the VMs immediately bounce to another host.
I am just not sure whether it is a reset or a power off, because it seems that, in my case, the default fencing action is power off. Either way, PowerEdge servers don't support reset or power off when the server is already off.
Later I will test this with some HP DL380 Gen10s and see how that goes.
This really makes no sense: if ACS already knows the server is powered off, why try to power it off again instead of simply putting it in maintenance to bounce the VMs?
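For reference, the behaviour is easy to see by hand against the BMC; a rough sketch (the IP and credentials are placeholders, and the exact error text will vary by model/firmware):
# Status works and reports the chassis as powered off
/usr/bin/ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> chassis power status
# A fence attempt boils down to one of these, and on these PowerEdge BMCs both
# simply fail while the chassis is already off
/usr/bin/ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> chassis power off
/usr/bin/ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> chassis power reset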
Now the other issue, which is kind of related but ends up defeating the purpose, is that if there is a power outage in the DC, an accidental cable pull, or anything like that, you are basically doomed. What happens is that the server goes to Unknown, and then you get these errors:
2024-04-27 19:48:13,626 DEBUG [o.a.c.u.p.ProcessRunner]
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Process standard error output
command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U
[redacted] -P [redacted] chassis power status]: [Get Auth Capabilities error
2024-04-27 19:48:13,626 DEBUG
[o.a.c.o.d.i.IpmitoolOutOfBandManagementDriver] (pool-5-thread-28:ctx-7a0961d6)
(logid:6bf324cc) The command [/usr/bin/ipmitool -I lanplus -R 1 -v -H
10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status] failed
and got the result []. Error: [Get Auth Capabilities error
It seems the NFS heartbeat mechanism counts for nothing here: within the cluster there is consensus that the host is dead, and yet nothing happens.
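As a rough sanity check (the mount points and KVMHA layout below are just how it looks on my pools, so treat the paths as assumptions), this is what I look at from a healthy host to confirm the heartbeat data actually exists on the NFS primary storage:
# Each NFS primary storage pool is mounted under /mnt/<pool-uuid> on the KVM hosts
ls -l /mnt/*/KVMHA/
# One hb-<host-ip> file per host; the one for the dead host stops being updated,
# which is how the neighbours agree that it is down
cat /mnt/*/KVMHA/hb-*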
The other thing I have noticed is that if you have more than one NFS primary storage, even though all of them have the heartbeat files and the cluster members reach consensus, the host doesn't even move to Fencing; it goes to Degraded instead.
Here is the collection of logs from everything I could find in that case:
2024-04-27 19:36:16,119 DEBUG [c.c.h.HighAvailabilityManagerImpl]
(AgentTaskPool-2:ctx-9b0931e0) (logid:d80ab369) KVMInvestigator was able to
determine host 38 is in Disconnected
2024-04-27 19:42:41,901 DEBUG [o.a.c.k.h.KVMHostActivityChecker]
(pool-1-thread-12:null) (logid:82c6107b) Investigating Host
{"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}
via neighbouring Host
{"id":37,"name":"ukslo-csl-kvm01.slo.gt-t.net","type":"Routing","uuid":"b1be18a1-6097-427c-bb04-3278ae5a1a33"}.
2024-04-27 19:42:42,138 DEBUG [o.a.c.k.h.KVMHostActivityChecker]
(pool-1-thread-12:null) (logid:82c6107b) Neighbouring Host
{"id":37,"name":"ukslo-csl-kvm01.slo.gt-t.net","type":"Routing","uuid":"b1be18a1-6097-427c-bb04-3278ae5a1a33"}
returned status [Down] for the investigated Host
{"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}.
2024-04-27 19:42:42,138 DEBUG [o.a.c.k.h.KVMHostActivityChecker]
(pool-1-thread-12:null) (logid:82c6107b) Investigating Host
{"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}
via neighbouring Host
{"id":39,"name":"ukslo-csl-kvm03.slo.gt-t.net","type":"Routing","uuid":"316b97d6-ffca-44c5-874f-e5967b6035e3"}.
2024-04-27 19:42:42,488 DEBUG [o.a.c.k.h.KVMHostActivityChecker]
(pool-1-thread-12:null) (logid:82c6107b) Neighbouring Host
{"id":39,"name":"ukslo-csl-kvm03.slo.gt-t.net","type":"Routing","uuid":"316b97d6-ffca-44c5-874f-e5967b6035e3"}
returned status [Down] for the investigated Host
{"id":38,"name":"ukslo-csl-kvm02.slo.gt-t.net","type":"Routing","uuid":"99c85875-54a2-4ade-bf36-0083711c3c34"}.
2024-04-27 19:48:11,604 DEBUG [o.a.c.u.p.ProcessRunner]
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Preparing command
[/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P
[redacted] chassis power status] to execute.
2024-04-27 19:48:11,605 DEBUG [o.a.c.u.p.ProcessRunner]
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Submitting command
[/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U [redacted] -P
[redacted] chassis power status].
2024-04-27 19:48:11,605 DEBUG [o.a.c.u.p.ProcessRunner]
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Waiting for a response from
command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U
[redacted] -P [redacted] chassis power status]. Defined timeout: [60].
2024-04-27 19:48:13,626 DEBUG [o.a.c.u.p.ProcessRunner]
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Process standard output for
command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U
[redacted] -P [redacted] chassis power status]: [].
2024-04-27 19:48:13,626 DEBUG [o.a.c.u.p.ProcessRunner]
(pool-5-thread-28:ctx-7a0961d6) (logid:6bf324cc) Process standard error output
command [/usr/bin/ipmitool -I lanplus -R 1 -v -H 10.250.33.134 -p 623 -U
[redacted] -P [redacted] chassis power status]: [Get Auth Capabilities error
2024-04-27 19:48:13,626 DEBUG
[o.a.c.o.d.i.IpmitoolOutOfBandManagementDriver] (pool-5-thread-28:ctx-7a0961d6)
(logid:6bf324cc) The command [/usr/bin/ipmitool -I lanplus -R 1 -v -H
10.250.33.134 -p 623 -U [redacted] -P [redacted] chassis power status] failed
and got the result []. Error: [Get Auth Capabilities error
2024-04-27 19:50:23,841 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-2:ctx-148f4c19) (logid:5038a4c2) Transitioned host
HA state from:Degraded to:Suspect due to event:PeriodicRecheckResourceActivity
for the host id:38
2024-04-27 19:50:27,951 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-6:ctx-e859e219) (logid:ab18a786) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:50:28,194 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-1:null)
(logid:82c6107b) Transitioned host HA state from:Checking to:Suspect due to
event:TooFewActivityCheckSamples for the host id:38
2024-04-27 19:51:00,659 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-2:ctx-5cee651c) (logid:51a40d65) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:51:00,905 DEBUG [o.a.c.h.HAManagerImpl]
(pool-2-thread-25:null) (logid:7e1ca968) Transitioned host HA state
from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host
id:38
2024-04-27 19:51:33,394 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-5:ctx-2fce5985) (logid:3fde664f) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:51:33,634 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-5:null)
(logid:66de58d8) Transitioned host HA state from:Checking to:Suspect due to
event:TooFewActivityCheckSamples for the host id:38
2024-04-27 19:52:06,140 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-6:ctx-dde87f00) (logid:10de929d) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:52:06,382 DEBUG [o.a.c.h.HAManagerImpl] (pool-2-thread-7:null)
(logid:dd936499) Transitioned host HA state from:Checking to:Suspect due to
event:TooFewActivityCheckSamples for the host id:38
2024-04-27 19:54:17,004 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-2:ctx-173a7199) (logid:bee908f9) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:54:17,246 DEBUG [o.a.c.h.HAManagerImpl]
(pool-2-thread-19:null) (logid:3df2c5a8) Transitioned host HA state
from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host
id:38
2024-04-27 19:54:49,699 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-3:ctx-658f80b7) (logid:2ec0cbbe) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:54:49,949 DEBUG [o.a.c.h.HAManagerImpl]
(pool-2-thread-15:null) (logid:47634402) Transitioned host HA state
from:Checking to:Suspect due to event:TooFewActivityCheckSamples for the host
id:38
2024-04-27 19:55:22,483 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-4:ctx-948895aa) (logid:a1bb5643) Transitioned host
HA state from:Suspect to:Checking due to event:PerformActivityCheck for the
host id:38
2024-04-27 19:55:22,723 DEBUG [o.a.c.h.HAManagerImpl]
(pool-2-thread-21:null) (logid:432c0642) Transitioned host HA state
from:Checking to:Degraded due to event:ActivityCheckFailureUnderThresholdRatio
for the host id:38
2024-04-27 20:00:24,278 DEBUG [o.a.c.h.HAManagerImpl]
(BackgroundTaskPollManager-3:ctx-c4b31d57) (logid:bd7094fb) Transitioned host
HA state from:Degraded to:Suspect due to event:PeriodicRecheckResourceActivity
for the host id:38
You can see that the host loops between Degraded, Suspect, and Checking, but never moves into Fencing.
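For context, these transitions are governed by the kvm.ha.* global settings (activity check interval, max attempts, failure ratio, degraded period, and so on); listing them is a one-liner with CloudMonkey, though the exact set of settings depends on the version:
# Dump the KVM HA framework settings to see which thresholds the
# Suspect/Checking/Degraded loop is working against
cmk list configurations keyword=kvm.ha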
I will still do some more testing around this now that I have a better grip on what is happening; at some point I basically threw all the toys out of the pram and tried so many things that I need to do some more segmented investigation into the details.
I will play around with the fencing option that exists in the global settings to fence the host when only one witness is lost.
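If it turns out to be a simple global setting, the change itself is just an updateConfiguration call; the setting name below is a placeholder until I confirm which one it actually is:
# <setting-name> is a placeholder for whichever HA/fencing global setting applies;
# the value is only an example
cmk update configuration name=<setting-name> value=1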
I have a call with ShapeBlue on Monday and I will review these findings with them as well.
But one thing is for sure: this mechanism needs some improvement, especially coming from VMware, where HA just works when it has to and deals very well with host isolation. Currently I don't think we will get HA on an environmental power loss, given that the mechanism relies on understanding the OOB status, and if the OOB chip is unresponsive it seems to have no action other than power off when it decides to fence.