nvazquez opened a new issue, #12045:
URL: https://github.com/apache/cloudstack/issues/12045
### problem
I found a corner case for hosts entering maintenance mode: CloudStack selects
other hosts that are already in maintenance mode as destination hosts for the
VM migrations away from the host, which causes the VMs to be stopped.
I hit the following scenario on a 4.20.1 environment with 4 KVM hosts in a
single cluster:
- Host 1: Enabled, VMs running
- Host 2: In maintenance mode
- Host 3: Enabled, VMs running
- Host 4: In maintenance mode
- Triggered maintenance mode on host 1
- Host 1 ended up in maintenance mode as expected, but one of its previously
running VMs had been stopped. I was able to start it manually on host 3 after
noticing it.
### Example
The VM with id = 9 was running on host 1 and got stopped after CloudStack
attempted to migrate it to host 2 (which was in maintenance mode).
I triggered prepare for maintenance on host 1 and observed that the VM
migration was scheduled:
````
2025-11-11 13:30:03,940 DEBUG [c.c.a.ApiServlet]
(qtp638169719-14:[ctx-b5ffe9bb, ctx-dd7a138a]) (logid:71054204) ===END===
10.0.3.251 -- GET id=16b3b0e8-8aaf-4e71-9ce0-29c252b
15aa5&command=prepareHostForMaintenance&response=json&sessionkey=kFIXNlVUZjUW_e0ucKeG6uV1Wso
2025-11-11 13:30:03,941 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl$5]
(API-Job-Executor-8:[ctx-950aa71a, job-124]) (logid:9199a267) Executing
AsyncJob {"accountId":2,"cmd":"org.apac
he.cloudstack.api.command.admin.host.PrepareForMaintenanceCmd","cmdInfo":"{\"response\":\"json\",\"ctxUserId\":\"2\",\"sessionkey\":\"kFIXNlVUZjUW_e0ucKeG6uV1Wso\",\"httpmethod\
":\"GET\",\"ctxStartEventId\":\"271\",\"id\":\"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5\",\"ctxDetails\":\"{\\\"interface
com.cloud.host.Host\\\":\\\"16b3b0e8-8aaf-4e71-9ce0-29c252b
15aa5\\\"}\",\"ctxAccountId\":\"2\",\"uuid\":\"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5\",\"cmdEventType\":\"MAINT.PREPARE\"}","cmdVersion":0,"completeMsid":null,"created":null,"id"
:124,"initMsid":32989174039467,"instanceId":1,"instanceType":"Host","lastPolled":null,"lastUpdated":null,"processStatus":0,"removed":null,"result":null,"resultCode":0,"status":"
IN_PROGRESS","userId":2,"uuid":"9199a267-9780-4407-adf1-e816aa608ae4"}
2025-11-11 13:30:03,952 INFO [c.c.r.ResourceManagerImpl]
(API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267)
Maintenance: attempting maintenance of host
Host
{"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
....
2025-11-11 13:30:04,101 INFO [c.c.r.ResourceManagerImpl]
(API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267)
Maintenance: scheduling migration of VM VM instance
{"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}
from host Host
{"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
2025-11-11 13:30:04,104 WARN [o.a.c.f.j.AsyncJobExecutionContext]
(HA-Worker-4:[ctx-fdd5cf9e, work-18]) (logid:616dae9c) Job is executed without
a context, setup psudo job for the executing thread
2025-11-11 13:30:04,104 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
(HA-Worker-3:[ctx-7427107b, work-17]) (logid:77866f07) Sync job-129 execution
on object VmWorkJobQueue.5
2025-11-11 13:30:04,123 INFO [c.c.h.HighAvailabilityManagerExtImpl]
(API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267)
Scheduled migration work of VM VM instance
{"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}
from host Host
{"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
with HAWork HAWork[19-Migration-9-Running-Scheduled]
2025-11-11 13:30:04,124 DEBUG [c.c.h.HighAvailabilityManagerExtImpl]
(API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267)
Wakeup workers HA
2025-11-11 13:30:04,127 INFO [c.c.h.HighAvailabilityManagerExtImpl]
(HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Processing work
HAWork[19-Migration-9-Running-Scheduled]
2025-11-11 13:30:04,129 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
(HA-Worker-4:[ctx-fdd5cf9e, work-18]) (logid:616dae9c) Sync job-130 execution
on object VmWorkJobQueue.8
2025-11-11 13:30:04,129 INFO [c.c.h.HighAvailabilityManagerExtImpl]
(HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Migration attempt: for
VM VM instance
{"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}from
host Host
{"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}.
Starting attempt: 1/5 times.
2025-11-11 13:30:04,130 INFO [c.c.h.HighAvailabilityManagerExtImpl]
(HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Migration attempt: for
VM b39f1e40-9576-4553-92b3-6e3e84d372ebfrom host id 1. Starting attempt: 1/5
times.
````
CloudStack then wrongly selected host 2 (which was in maintenance mode) as the
destination host for the VM with id = 9, which caused the VM to be stopped:
````
2025-11-11 13:30:05,730 DEBUG [c.c.d.DeploymentPlanningManagerImpl]
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b])
(logid:aee8f755) Last host [Host {"id":2
,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"}]
of VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":
"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] is
UP and has enough capacity. Checking for suitable pools for this host under
zone [Zone {"id": "1", "na
me": "ref-trl-10084-k-Mr8-nicolas-vazquez", "uuid":
"c6bb7e3f-388d-4318-8403-04d7f70d3862"}], pod [HostPod
{"id":1,"name":"Pod1","uuid":"46052b42-8089-4b24-bd46-ca9fb938964f"}]
and cluster [Cluster {id: "1", name: "p1-c1", uuid:
"2e2b4642-b4a5-46c1-ba26-ab54880b9b53"}].
2025-11-11 13:30:05,732 DEBUG [c.c.d.DeploymentPlanningManagerImpl]
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b])
(logid:aee8f755) Checking suitable pools
for volume [Volume
{"id":10,"instanceId":9,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"},
ROOT] of VM [VM instance {"id":9,"instanceName":"
i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}].
2025-11-11 13:30:05,733 DEBUG [c.c.d.DeploymentPlanningManagerImpl]
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b])
(logid:aee8f755) Volume [Volume {"id":10
,"instanceId":9,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"}]
of VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type"
:"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] has pool
[StoragePool
{"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm-pri1","poolType":"NetworkFilesystem","uuid"
:"162a7420-8b58-3701-bf2e-e27b2bf328ed"}] already specified. Checking if
this pool can be reused.
2025-11-11 13:30:05,734 DEBUG [c.c.n.NetworkModelImpl]
(Work-Job-Executor-11:[ctx-237107a3, job-119/job-130, ctx-406ba4b3])
(logid:506b69a7) Service SecurityGroup is not support
ed in the network Network {"id": 204, "name": "Net1", "uuid":
"52cde633-2f3b-4305-825d-0dc1abca1493", "networkofferingid": 10}
2025-11-11 13:30:05,734 DEBUG [c.c.d.DeploymentPlanningManagerImpl]
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b])
(logid:aee8f755) Pool [StoragePool {"id"
:1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm-pri1","poolType":"NetworkFilesystem","uuid":"162a7420-8b58-3701-bf2e-e27b2bf328ed"}]
of volume [Volume {"id":10,"instanceId":9
,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"}]
used by VM [VM instance
{"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","u
uid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] fits the specified plan. No
need to reallocate a pool for this volume.
2025-11-11 13:30:05,735 DEBUG [c.c.d.DeploymentPlanningManagerImpl]
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b])
(logid:aee8f755) Trying to find a poteni
al host and associated storage pools from the suitable host/pool lists for
this VM
....
2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl]
(Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533])
(logid:12e4154a) Host has enough CPU and RAM avail
able
2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl]
(Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533])
(logid:12e4154a) STATS: Can alloc CPU from host: H
ost
{"id":3,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm3","type":"Routing","uuid":"95e251a6-5ffb-436d-b287-df2a77b80165"},
used: 2500, reserved: 0, actual total: 6000, total
with overprovisioning: 12000; requested cpu: 500, alloc_from_last_host?:
false, considerReservedCapacity?: true
2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl]
(Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533])
(logid:12e4154a) STATS: Can alloc MEM from host: H
ost
{"id":3,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm3","type":"Routing","uuid":"95e251a6-5ffb-436d-b287-df2a77b80165"},
used: (2.50 GB) 2684354560, reserved: (0 bytes) 0,
total: (6.59 GB) 7072067584; requested mem: (512.00 MB) 536870912,
alloc_from_last_host?: false, considerReservedCapacity?: true
2025-11-11 13:30:05,796 WARN [c.c.v.ClusteredVirtualMachineManagerImpl]
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b])
(logid:aee8f755) Unable to migrate
VM instance
{"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}
to Host {"id":2,"name":"ref-trl-10084-k-Mr8-nicolas-
vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"}
due to [Resource [Host:2] is unreachable: Host 2: Unable to send class
com.cloud.agent.api.PrepareF
orMigrationCommand because agent ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2 is
in maintenance mode] com.cloud.exception.AgentUnavailableException: Resource
[Host:2] is unreachable
: Host 2: Unable to send class
com.cloud.agent.api.PrepareForMigrationCommand because agent
ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2 is in maintenance mode
at
com.cloud.agent.manager.AgentAttache.checkAvailability(AgentAttache.java:190)
at
com.cloud.agent.manager.ClusteredAgentAttache.checkAvailability(ClusteredAgentAttache.java:80)
at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:358)
at
com.cloud.agent.manager.ClusteredAgentAttache.send(ClusteredAgentAttache.java:131)
at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:404)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:570)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:414)
at
com.cloud.vm.VirtualMachineManagerImpl.migrate(VirtualMachineManagerImpl.java:2798)
at
com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:3497)
at
com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:5522)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at
com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandlerProxy.java:102)
at
com.cloud.vm.VirtualMachineManagerImpl.handleVmWorkJob(VirtualMachineManagerImpl.java:5610)
at
com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:99)
at
org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:652)
at
org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at
org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
at
org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:600)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
......
2025-11-11 13:30:05,809 DEBUG [c.c.c.CapacityManagerImpl]
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b])
(logid:aee8f755) VM instance {"id":9,"instanceName
":"i-2-9-VM","state":"Stopping","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}
state transited from [Running] to [Stopping] with event [StopRequested]. VM's
origin
al host: Host
{"id":2,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"},
new host: Host {"id":1,"name":"ref-trl-1
0084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"},
host before state transition: Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-
vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
2025-11-11 13:30:05,810 DEBUG [c.c.v.UserVmManagerImpl]
(Work-Job-Executor-12:[c
````
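The first planner entry in the log above reports host 2 as the VM's last host and describes it as "UP and has enough capacity", which suggests that this shortcut looked at the agent status and capacity but not at whether the host was in maintenance. The following is only a minimal sketch of the kind of guard that would reject host 2 in that situation. The `AgentStatus` and `ResourceState` enums, the `Host` class and `canReuseLastHost` are hypothetical names made up for illustration; they are not CloudStack's own types or methods.

```java
public class LastHostCheck {
    // Hypothetical stand-ins for illustration; these are not CloudStack's own types.
    enum AgentStatus { UP, DOWN }
    enum ResourceState { ENABLED, PREPARE_FOR_MAINTENANCE, MAINTENANCE }

    // Hypothetical minimal host model: an id plus the two states of interest.
    static final class Host {
        final long id;
        final AgentStatus status;
        final ResourceState resourceState;

        Host(long id, AgentStatus status, ResourceState resourceState) {
            this.id = id;
            this.status = status;
            this.resourceState = resourceState;
        }
    }

    // Reuse the last host only when the agent is up AND the host is enabled.
    // Checking the agent status alone would accept a host in maintenance.
    static boolean canReuseLastHost(Host lastHost) {
        return lastHost != null
                && lastHost.status == AgentStatus.UP
                && lastHost.resourceState == ResourceState.ENABLED;
    }

    public static void main(String[] args) {
        // Assumed values modelled on the scenario: host 2 is reported as UP
        // in the log but is in maintenance mode; host 3 is enabled.
        Host host2 = new Host(2, AgentStatus.UP, ResourceState.MAINTENANCE);
        Host host3 = new Host(3, AgentStatus.UP, ResourceState.ENABLED);
        System.out.println(canReuseLastHost(host2)); // prints false
        System.out.println(canReuseLastHost(host3)); // prints true
    }
}
```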
### versions
CloudStack version 4.20.1
Tested on a single cluster with 4 KVM hosts and cluster-wide NFS storage
### The steps to reproduce the bug
Described above
### What to do about it?
The migrate-away logic must not consider hosts in maintenance mode (or in any
other invalid state) as destination hosts.
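As an illustration of the expected behaviour, here is a minimal sketch that models the four hosts from the scenario and filters the candidate destinations for a migration away from host 1. The `State` enum, the `Host` class and `eligibleDestinations` are hypothetical names used only for this example, not CloudStack code; the point is simply that hosts 2 and 4 should never appear in the candidate list.

```java
import java.util.List;
import java.util.stream.Collectors;

public class MigrateAwayFilter {
    // Hypothetical stand-in for a host's state; not CloudStack's own enum.
    enum State { ENABLED, MAINTENANCE }

    // Hypothetical minimal host model for illustration only.
    static final class Host {
        final long id;
        final State state;

        Host(long id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    // Returns the ids of hosts that may receive VMs migrated away from sourceId:
    // every host other than the source that is currently enabled.
    static List<Long> eligibleDestinations(List<Host> cluster, long sourceId) {
        return cluster.stream()
                .filter(h -> h.id != sourceId)
                .filter(h -> h.state == State.ENABLED)
                .map(h -> h.id)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The cluster as described in the scenario above.
        List<Host> cluster = List.of(
                new Host(1, State.ENABLED),
                new Host(2, State.MAINTENANCE),
                new Host(3, State.ENABLED),
                new Host(4, State.MAINTENANCE));
        System.out.println(eligibleDestinations(cluster, 1)); // prints [3]
    }
}
```

With this filter, host 3 is the only valid destination, which matches the host on which the stopped VM could be started manually.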