gusmef opened a new issue, #13510:
URL: https://github.com/apache/cloudstack/issues/13510

   ### problem
   
   We're trying VM HA on cloudstack. We disabled HOST HA (see issue 
[13371](https://github.com/apache/cloudstack/issues/13371)) and then simulated 
a crash on an host with a "fake" kernel panic. The investigators correctly 
found the host dead and put it in alert/disconnected state. 
   ```text
   2026-06-29 10:01:03,855 DEBUG [o.a.c.k.h.KVMHostActivityChecker] 
(AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) Checking Host 
{"id":76,"name":"fakehost1.ha","type":"Routing","uuid":"ad118d97-03e2-4555-9d70-595864e87
   1fd"} status...
   2026-06-29 10:01:20,849 INFO  [c.c.a.m.ClusteredAgentManagerImpl] 
(AgentTaskPool-4:[ctx-dd435ced]) (logid:69b8079e) Investigating why host Host 
{"id":76,"name":"fakehost1.ha","type":"Routing","uuid":"ad118d97-03e2-4555
   -9d70-595864e871fd"} has disconnected with event
   2026-06-29 10:01:23,858 DEBUG [o.a.c.k.h.KVMHostActivityChecker] 
(AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) Setting Host 
{"id":76,"name":"fakehost1.ha","type":"Routing","uuid":"ad118d97-03e2-4555-9d70-595864e871
   fd"} to "Disconnected" status.
   2026-06-29 10:01:23,860 DEBUG [o.a.c.k.h.KVMHostActivityChecker] 
(AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) Investigating Host 
{"id":76,"name":"fakehost1.ha","type":"Routing","uuid":"ad118d97-03e2-4555-9d70-5958
   64e871fd"} via neighboring Host 
{"id":75,"name":"fakehost2.ha","type":"Routing","uuid":"ffea4f8e-0d91-4a22-b600-aea1891be419"}.
   2026-06-29 10:01:23,918 DEBUG [o.a.c.k.h.KVMHostActivityChecker] 
(AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) Host 
{"id":76,"name":"fakehost1.ha","type":"Routing","uuid":"ad118d97-03e2-4555-9d70-595864e871fd"}
 is 
   not active according to neighbor Host 
{"id":75,"name":"fakehost2.ha","type":"Routing","uuid":"ffea4f8e-0d91-4a22-b600-aea1891be419"},
 details: Heart is not beating.
   2026-06-29 10:01:23,918 DEBUG [o.a.c.k.h.KVMHostActivityChecker] 
(AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) HA: HOST is ineligible legacy 
state Disconnected for host Host {"id":76,"name":"fakehost1.ha","type":"R
   outing","uuid":"ad118d97-03e2-4555-9d70-595864e871fd"}
   2026-06-29 10:01:23,918 INFO  [c.c.a.m.ClusteredAgentManagerImpl] 
(AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) The agent from host Host 
{"id":76,"name":"fakehost1.ha","type":"Routing","uuid":"ad118d97-03e2-4555-9d
   70-595864e871fd"} state determined is Disconnected
   2026-06-29 10:01:23,918 WARN  [c.c.a.m.ClusteredAgentManagerImpl] 
(AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) Agent is disconnected but the 
host is still up: Host {"id":76,"name":"fakehost1.ha","type":"Routing","
   uuid":"ad118d97-03e2-4555-9d70-595864e871fd"} state: Enabled
   ```
   
   The management servers then tried to start the vms on another host
   
   ```text
   2026-06-29 10:01:54,745 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl$5] 
(Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200]) (logid:28a7d0f4) 
Executing AsyncJob 
{"accountId":1,"cmd":"com.cloud.vm.VmWorkStart","cmdInfo":"rO0ABXNyABhjb2
   
0uY2xvdWQudm0uVm1Xb3JrU3RhcnR9cMGsvxz73gIAC0oABGRjSWRMAAZhdm9pZHN0ADBMY29tL2Nsb3VkL2RlcGxveS9EZXBsb3ltZW50UGxhbm5lciRFeGNsdWRlTGlzdDtMAAljbHVzdGVySWR0ABBMamF2YS9sYW5nL0xvbmc7TAAGaG9zdElkcQB-AAJMAAtqb3VybmFsTmFtZXQAEkxqYXZhL2xhbmcvU3R
   
yaW5nO0wAEXBoeXNpY2FsTmV0d29ya0lkcQB-AAJMAAdwbGFubmVycQB-AANMAAVwb2RJZHEAfgACTAAGcG9vbElkcQB-AAJMAAlyYXdQYXJhbXN0AA9MamF2YS91dGlsL01hcDtMAA1yZXNlcnZhdGlvbklkcQB-AAN4cgATY29tLmNsb3VkLnZtLlZtV29ya5-ZtlbwJWdrAgAESgAJYWNjb3VudElkSgAGdXNl
   
cklkSgAEdm1JZEwAC2hhbmRsZXJOYW1lcQB-AAN4cAAAAAAAAAABAAAAAAAAAAEAAAAAAAAFy3QAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAAAAAAAAAHBwcHBwcHBwc3IAEWphdmEudXRpbC5IYXNoTWFwBQfawcMWYNEDAAJGAApsb2FkRmFjdG9ySQAJdGhyZXNob2xkeHA_QAAAAAAADHcIAAAAEAAAA
   
AF0AAtIYU9wZXJhdGlvbnQAP3JPMEFCWE55QUJGcVlYWmhMbXhoYm1jdVFtOXZiR1ZoYnMwZ2NvRFZuUHJ1QWdBQldnQUZkbUZzZFdWNGNBRXhw","cmdVersion":0,"completeMsid":null,"created":"Mon
 Jun 29 10:01:52 CEST 2026","id":21200,"initMsid":90520741747699,"insta
   
nceId":null,"instanceType":null,"lastPolled":null,"lastUpdated":null,"processStatus":0,"removed":null,"result":null,"resultCode":0,"status":"IN_PROGRESS","userId":1,"uuid":"319b5aa3-a42e-42d6-a538-c9c172529d70"}
   2026-06-29 10:01:54,788 INFO  [c.c.a.m.a.i.FirstFitRoutingAllocator] 
(Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98, 
FirstFitRoutingAllocator]) (logid:28a7d0f4)  Guest VM is requested with 
Custom[UEFI] Boot Typ
   e false
   2026-06-29 10:01:54,831 DEBUG [o.a.c.e.o.NetworkOrchestrator] 
(Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) 
(logid:28a7d0f4) Network Network {"id": 347, "name": "DR-FE_11NET", "uuid": 
"48655ada-11ad-4101-abc
   b-1531f6575018", "networkofferingid": 32} is already implemented
   2026-06-29 10:01:54,842 DEBUG [o.a.c.e.o.NetworkOrchestrator] 
(Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) 
(logid:28a7d0f4) Changing active number of NICs for Network ID=Network {"id": 
347, "name": "DR-FE_1
   1NET", "uuid": "48655ada-11ad-4101-abcb-1531f6575018", "networkofferingid": 
32} on 1
   2026-06-29 10:01:54,849 DEBUG [o.a.c.e.o.VolumeOrchestrator] 
(Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) 
(logid:28a7d0f4) No need to recreate the volume 
[{"name":"ROOT-1483","uuid":"37824f75-50a3-4dea-9062
   -f0ee8cde341c"}] since it already has an assigned pool: 
[0f61b84c-4c67-3856-8e16-eb920a49c200]. Adding disk to the VM.
   .....
   2026-06-29 10:01:55,976 DEBUG [o.a.c.e.o.NetworkOrchestrator] 
(Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) 
(logid:28a7d0f4) The nic Nic 
{"broadcastUri":"vlan:\/\/20","deviceId":0,"iPv4Address":null,"id":264
   
6,"instanceId":1483,"reservationId":"22066ce0-4ddd-4d9a-9fa3-c1d7c611ba2d","uuid":"9a4c66dc-e3f9-484f-a1af-2817ebfba998"}
 on NicProfile 
{"broadcastUri":null,"deviceId":0,"iPv4Address":null,"id":2646,"reservationId":"22066ce0-4ddd-4d9
   
a-9fa3-c1d7c611ba2d","uuid":"9a4c66dc-e3f9-484f-a1af-2817ebfba998","vmId":1483} 
was released according to VM instance 
{"id":1483,"instanceName":"i-2-1483-VM","state":"Starting","type":"User","uuid":"1f7afd4f-398b-4789-ae14-73d28903f0
   36"} by guru com.cloud.network.guru.ExternalGuestNetworkGuru@601e3c8a, now 
updating record.
   2026-06-29 10:01:55,977 DEBUG [o.a.c.e.o.NetworkOrchestrator] 
(Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) 
(logid:28a7d0f4) Changing active number of NICs for Network ID=Network {"id": 
347, "name": "DR-FE_1
   1NET", "uuid": "48655ada-11ad-4101-abcb-1531f6575018", "networkofferingid": 
32} on -1
   2026-06-29 10:01:56,038 WARN  [c.c.d.DeploymentPlanningManagerImpl] 
(Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) 
(logid:28a7d0f4) The last host [Host {"id":76,"name":"fakehost1.ha","type":"Ro
   uting","uuid":"ad118d97-03e2-4555-9d70-595864e871fd"}] of VM [VM instance 
{"id":1483,"instanceName":"i-2-1483-VM","state":"Starting","type":"User","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"}]
 is in the avoid set. Skipping this and
    trying other available hosts.
   ....
   2026-06-29 10:01:56,189 INFO  [o.a.c.s.a.ZoneWideStoragePoolAllocator] 
(Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) 
(logid:28a7d0f4) Using volume allocation algorithm firstfit to reorder pools.
   2026-06-29 10:01:56,221 ERROR [c.c.v.ClusteredVirtualMachineManagerImpl] 
(Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) 
(logid:28a7d0f4) Unable to orchestrate start VM instance 
{"id":1483,"instanceName":"i-2-
   
1483-VM","state":"Stopped","type":"User","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"}
 due to [Unable to create a deployment for VM instance 
{"id":1483,"instanceName":"i-2-1483-VM","state":"Starting","type":"User","uuid":"1f7afd4f-3
   98b-4789-ae14-73d28903f036"} after 3 attempts Last known error: Unable to 
start VM on Host 
{"id":75,"name":"fakehost2.ha","type":"Routing","uuid":"ffea4f8e-0d91-4a22-b600-aea1891be419"}
 due to internal error: QEMU unex
   pectedly closed the monitor (vm='i-2-1483-VM'): 2026-06-29T08:01:55.363939Z 
qemu-kvm: -blockdev 
{"node-name":"libvirt-2-format","read-only":false,"discard":"unmap","cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"lib
   virt-2-storage","backing":null}: Failed to get "write" lock
   2026-06-29 10:01:56,249 WARN  [c.c.h.HighAvailabilityManagerExtImpl] 
(HA-Worker-1:[ctx-9f19d45c, work-17]) (logid:9228a43f) Encountered unhandled 
exception during HA process, reschedule work 
HAWork[17-HA-1483-Stopped-Scheduled] 
com.cloud.utils.exception.CloudRuntimeException: Unable to orchestrate the 
start of VM instance 
{"instanceName":"i-2-1483-VM","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"}.
           at 
com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:6012)
           at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
           at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.base/java.lang.reflect.Method.invoke(Method.java:569)
           at 
com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandlerProxy.java:102)
           at 
com.cloud.vm.VirtualMachineManagerImpl.handleVmWorkJob(VirtualMachineManagerImpl.java:6133)
           at 
com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:99)
           at 
org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:698)
           at 
org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
           at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
           at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
           at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
           at 
org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
           at 
org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:646)
           at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
           at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
           at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
           at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
           at java.base/java.lang.Thread.run(Thread.java:840)
   
   2026-06-29 10:01:56,250 WARN  [c.c.h.HighAvailabilityManagerExtImpl] 
(HA-Worker-1:[ctx-9f19d45c, work-17]) (logid:9228a43f) Rescheduling work 
HAWork[17-HA-1483-Stopped-Scheduled] to try again at 
2026-06-29T10:02:26.432+0200. Finished attempt 1/5 times.
   ```
   The job fails (expectedly i would say) because of the nfs write lock held 
from the dead host.
   But when retrying the vm restart we get
   ```
   2026-06-29 10:01:52,843 INFO  [c.c.h.HighAvailabilityManagerExtImpl] 
(HA-Worker-1:[ctx-9f19d45c, work-17]) (logid:9228a43f) Processing work 
HAWork[17-HA-1483-Stopped-Scheduled]
   2026-06-29 10:01:52,845 INFO  [c.c.h.HighAvailabilityManagerExtImpl] 
(HA-Worker-1:[ctx-9f19d45c, work-17]) (logid:9228a43f) HA on VM instance 
{"id":1483,"instanceName":"i-2-1483-VM","state":"Stopped","type":"User","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"}
   2026-06-29 10:01:52,860 WARN  [o.a.c.f.j.AsyncJobExecutionContext] 
(HA-Worker-1:[ctx-9f19d45c, work-17, ctx-b9bac272]) (logid:9228a43f) Job is 
executed without a context, setup psudo job for the executing thread
   2026-06-29 10:01:52,871 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] 
(HA-Worker-1:[ctx-9f19d45c, work-17, ctx-b9bac272]) (logid:9228a43f) Sync 
job-21200 execution on object VmWorkJobQueue.1483
   2026-06-29 10:01:56,249 WARN  [c.c.h.HighAvailabilityManagerExtImpl] 
(HA-Worker-1:[ctx-9f19d45c, work-17]) (logid:9228a43f) Encountered unhandled 
exception during HA process, reschedule work 
HAWork[17-HA-1483-Stopped-Scheduled] 
com.cloud.utils.exception.CloudRuntimeException: Unable to orchestrate the 
start of VM instance 
{"instanceName":"i-2-1483-VM","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"}.
   2026-06-29 10:01:56,250 WARN  [c.c.h.HighAvailabilityManagerExtImpl] 
(HA-Worker-1:[ctx-9f19d45c, work-17]) (logid:9228a43f) Rescheduling work 
HAWork[17-HA-1483-Stopped-Scheduled] to try again at 
2026-06-29T10:02:26.432+0200. Finished attempt 1/5 times.
   2026-06-29 10:02:54,848 INFO  [c.c.h.HighAvailabilityManagerExtImpl] 
(HA-Worker-0:[ctx-fce5735e, work-17]) (logid:f2990b6c) Processing work 
HAWork[17-HA-1483-Stopped-Scheduled]
   2026-06-29 10:02:54,852 INFO  [c.c.h.HighAvailabilityManagerExtImpl] 
(HA-Worker-0:[ctx-fce5735e, work-17]) (logid:f2990b6c) HA on VM instance 
{"id":1483,"instanceName":"i-2-1483-VM","state":"Stopped","type":"User","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"}
   2026-06-29 10:02:54,852 INFO  [c.c.h.HighAvailabilityManagerExtImpl] 
(HA-Worker-0:[ctx-fce5735e, work-17]) (logid:f2990b6c) VM VM instance 
{"id":1483,"instanceName":"i-2-1483-VM","state":"Stopped","type":"User","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"}
 has been changed.  Current State = Stopped Previous State = Stopped last 
updated = 20 previous updated = 17
   2026-06-29 10:02:54,852 INFO  [c.c.h.HighAvailabilityManagerExtImpl] 
(HA-Worker-0:[ctx-fce5735e, work-17]) (logid:f2990b6c) Completed work 
HAWork[17-HA-1483-Stopped-Scheduled]. Took 2/5 attempts.
   ```
   The restart process didn't even try a second time because of the mismatch in 
the update counter between the vm and the work (20 vs 17). The final state of 
the vm (without manual intervention) is Stopped.
   
   
   ### versions
   
   Cloudstack version: 4.22.1.0, with management server running on ubuntu22 and 
agent running on oracle linux 9
   Shared nfsv4 (4.2) storage 
   
   ### The steps to reproduce the bug
   
   1. Disable HOST HA
   2. Enable VM HA on some vm
   3. Shutdown forcefully the host with aforementioned vm and check for nfs 
locks
   
   
   ### What to do about it?
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to