nvazquez opened a new issue, #12045:
URL: https://github.com/apache/cloudstack/issues/12045

   ### problem
   
   Have found a corner case for hosts entering maintenance mode, in which 
CloudStack selects other hosts already in maintenance mode as destination hosts 
for the VM migrations away from the host, causing VMs to be stopped.
   
   Had the following scenario on 4.20.1 environment with 4xKVM hosts on a 
single cluster:
   - Host 1: Enabled, VMs running
   - Host 2: On maintenance mode
   - Host 3: Enabled, VMs running
   - Host 4: On maintenance mode
   
   - Triggered maintenance mode on host 1
   - Observed the end result was host 1 into maintenance mode, however observed 
one of the previously running VMs has been stopped. Was able to manually start 
it on host 3 after detecting it.
   
   ### Example
   VM with id = 9 was running on host 1, got stopped after attempting to 
migrate to host 2 (which was in maintenance mode):
   
   Triggered prepare for maintenance mode on host 1 and observed the VM 
migration is scheduled:
   ````
   2025-11-11 13:30:03,940 DEBUG [c.c.a.ApiServlet] 
(qtp638169719-14:[ctx-b5ffe9bb, ctx-dd7a138a]) (logid:71054204) ===END===  
10.0.3.251 -- GET  id=16b3b0e8-8aaf-4e71-9ce0-29c252b
   
15aa5&command=prepareHostForMaintenance&response=json&sessionkey=kFIXNlVUZjUW_e0ucKeG6uV1Wso
   2025-11-11 13:30:03,941 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl$5] 
(API-Job-Executor-8:[ctx-950aa71a, job-124]) (logid:9199a267) Executing 
AsyncJob {"accountId":2,"cmd":"org.apac
   
he.cloudstack.api.command.admin.host.PrepareForMaintenanceCmd","cmdInfo":"{\"response\":\"json\",\"ctxUserId\":\"2\",\"sessionkey\":\"kFIXNlVUZjUW_e0ucKeG6uV1Wso\",\"httpmethod\
   
":\"GET\",\"ctxStartEventId\":\"271\",\"id\":\"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5\",\"ctxDetails\":\"{\\\"interface
 com.cloud.host.Host\\\":\\\"16b3b0e8-8aaf-4e71-9ce0-29c252b
   
15aa5\\\"}\",\"ctxAccountId\":\"2\",\"uuid\":\"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5\",\"cmdEventType\":\"MAINT.PREPARE\"}","cmdVersion":0,"completeMsid":null,"created":null,"id"
   
:124,"initMsid":32989174039467,"instanceId":1,"instanceType":"Host","lastPolled":null,"lastUpdated":null,"processStatus":0,"removed":null,"result":null,"resultCode":0,"status":"
   IN_PROGRESS","userId":2,"uuid":"9199a267-9780-4407-adf1-e816aa608ae4"}
   2025-11-11 13:30:03,952 INFO  [c.c.r.ResourceManagerImpl] 
(API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) 
Maintenance: attempting maintenance of host
    Host 
{"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
   ....
   2025-11-11 13:30:04,101 INFO  [c.c.r.ResourceManagerImpl] 
(API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) 
Maintenance: scheduling migration of VM VM instance 
{"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}
 from host Host 
{"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
   2025-11-11 13:30:04,104 WARN  [o.a.c.f.j.AsyncJobExecutionContext] 
(HA-Worker-4:[ctx-fdd5cf9e, work-18]) (logid:616dae9c) Job is executed without 
a context, setup psudo job for the executing thread
   2025-11-11 13:30:04,104 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] 
(HA-Worker-3:[ctx-7427107b, work-17]) (logid:77866f07) Sync job-129 execution 
on object VmWorkJobQueue.5
   2025-11-11 13:30:04,123 INFO  [c.c.h.HighAvailabilityManagerExtImpl] 
(API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) 
Scheduled migration work of VM VM instance 
{"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}
 from host Host 
{"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
 with HAWork HAWork[19-Migration-9-Running-Scheduled]
   2025-11-11 13:30:04,124 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] 
(API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) 
Wakeup workers HA
   2025-11-11 13:30:04,127 INFO  [c.c.h.HighAvailabilityManagerExtImpl] 
(HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Processing work 
HAWork[19-Migration-9-Running-Scheduled]
   2025-11-11 13:30:04,129 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] 
(HA-Worker-4:[ctx-fdd5cf9e, work-18]) (logid:616dae9c) Sync job-130 execution 
on object VmWorkJobQueue.8
   2025-11-11 13:30:04,129 INFO  [c.c.h.HighAvailabilityManagerExtImpl] 
(HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Migration attempt: for 
VM VM instance 
{"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}from
 host Host 
{"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}.
 Starting attempt: 1/5 times.
   2025-11-11 13:30:04,130 INFO  [c.c.h.HighAvailabilityManagerExtImpl] 
(HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Migration attempt: for 
VM b39f1e40-9576-4553-92b3-6e3e84d372ebfrom host id 1. Starting attempt: 1/5 
times.
   ````
   
   Observed CloudStack wrongly selected host 2 as a destination host for the VM 
with id = 9 (this host was in maintenance mode) -> That caused the VM to get 
stopped
   ````
   2025-11-11 13:30:05,730 DEBUG [c.c.d.DeploymentPlanningManagerImpl] 
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) 
(logid:aee8f755) Last host [Host {"id":2
   
,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"}]
 of VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":
   "Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] is 
UP and has enough capacity. Checking for suitable pools for this host under 
zone [Zone {"id": "1", "na
   me": "ref-trl-10084-k-Mr8-nicolas-vazquez", "uuid": 
"c6bb7e3f-388d-4318-8403-04d7f70d3862"}], pod [HostPod 
{"id":1,"name":"Pod1","uuid":"46052b42-8089-4b24-bd46-ca9fb938964f"}] 
   and cluster [Cluster {id: "1", name: "p1-c1", uuid: 
"2e2b4642-b4a5-46c1-ba26-ab54880b9b53"}].
   2025-11-11 13:30:05,732 DEBUG [c.c.d.DeploymentPlanningManagerImpl] 
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) 
(logid:aee8f755) Checking suitable pools
    for volume [Volume 
{"id":10,"instanceId":9,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"},
 ROOT] of VM [VM instance {"id":9,"instanceName":"
   
i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}].
   2025-11-11 13:30:05,733 DEBUG [c.c.d.DeploymentPlanningManagerImpl] 
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) 
(logid:aee8f755) Volume [Volume {"id":10
   
,"instanceId":9,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"}]
 of VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type"
   :"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] has pool 
[StoragePool 
{"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm-pri1","poolType":"NetworkFilesystem","uuid"
   :"162a7420-8b58-3701-bf2e-e27b2bf328ed"}] already specified. Checking if 
this pool can be reused.
   2025-11-11 13:30:05,734 DEBUG [c.c.n.NetworkModelImpl] 
(Work-Job-Executor-11:[ctx-237107a3, job-119/job-130, ctx-406ba4b3]) 
(logid:506b69a7) Service SecurityGroup is not support
   ed in the network Network {"id": 204, "name": "Net1", "uuid": 
"52cde633-2f3b-4305-825d-0dc1abca1493", "networkofferingid": 10}
   2025-11-11 13:30:05,734 DEBUG [c.c.d.DeploymentPlanningManagerImpl] 
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) 
(logid:aee8f755) Pool [StoragePool {"id"
   
:1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm-pri1","poolType":"NetworkFilesystem","uuid":"162a7420-8b58-3701-bf2e-e27b2bf328ed"}]
 of volume [Volume {"id":10,"instanceId":9
   
,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"}]
 used by VM [VM instance 
{"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","u
   uid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] fits the specified plan. No 
need to reallocate a pool for this volume.
   2025-11-11 13:30:05,735 DEBUG [c.c.d.DeploymentPlanningManagerImpl] 
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) 
(logid:aee8f755) Trying to find a poteni
   al host and associated storage pools from the suitable host/pool lists for 
this VM
   ....
   2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl] 
(Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533]) 
(logid:12e4154a) Host has enough CPU and RAM avail
   able
   2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl] 
(Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533]) 
(logid:12e4154a) STATS: Can alloc CPU from host: H
   ost 
{"id":3,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm3","type":"Routing","uuid":"95e251a6-5ffb-436d-b287-df2a77b80165"},
 used: 2500, reserved: 0, actual total: 6000, total
    with overprovisioning: 12000; requested cpu: 500, alloc_from_last_host?: 
false, considerReservedCapacity?: true
   2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl] 
(Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533]) 
(logid:12e4154a) STATS: Can alloc MEM from host: H
   ost 
{"id":3,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm3","type":"Routing","uuid":"95e251a6-5ffb-436d-b287-df2a77b80165"},
 used: (2.50 GB) 2684354560, reserved: (0 bytes) 0,
    total: (6.59 GB) 7072067584; requested mem: (512.00 MB) 536870912, 
alloc_from_last_host?: false, considerReservedCapacity?: true
   2025-11-11 13:30:05,796 WARN  [c.c.v.ClusteredVirtualMachineManagerImpl] 
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) 
(logid:aee8f755) Unable to migrate 
   VM instance 
{"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}
 to Host {"id":2,"name":"ref-trl-10084-k-Mr8-nicolas-
   
vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"} 
due to [Resource [Host:2] is unreachable: Host 2: Unable to send class 
com.cloud.agent.api.PrepareF
   orMigrationCommand because agent ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2 is 
in maintenance mode] com.cloud.exception.AgentUnavailableException: Resource 
[Host:2] is unreachable
   : Host 2: Unable to send class 
com.cloud.agent.api.PrepareForMigrationCommand because agent 
ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2 is in maintenance mode
           at 
com.cloud.agent.manager.AgentAttache.checkAvailability(AgentAttache.java:190)
           at 
com.cloud.agent.manager.ClusteredAgentAttache.checkAvailability(ClusteredAgentAttache.java:80)
           at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:358)
           at 
com.cloud.agent.manager.ClusteredAgentAttache.send(ClusteredAgentAttache.java:131)
           at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:404)
           at 
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:570)
           at 
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:414)
           at 
com.cloud.vm.VirtualMachineManagerImpl.migrate(VirtualMachineManagerImpl.java:2798)
           at 
com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:3497)
           at 
com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:5522)
           at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.base/java.lang.reflect.Method.invoke(Method.java:566)
           at 
com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandlerProxy.java:102)
           at 
com.cloud.vm.VirtualMachineManagerImpl.handleVmWorkJob(VirtualMachineManagerImpl.java:5610)
           at 
com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:99)
           at 
org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:652)
           at 
org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
           at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
           at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
           at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
           at 
org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
           at 
org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:600)
           at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
           at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
           at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
           at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
           at java.base/java.lang.Thread.run(Thread.java:829)
   ......
   2025-11-11 13:30:05,809 DEBUG [c.c.c.CapacityManagerImpl] 
(Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) 
(logid:aee8f755) VM instance {"id":9,"instanceName
   
":"i-2-9-VM","state":"Stopping","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}
 state transited from [Running] to [Stopping] with event [StopRequested]. VM's 
origin
   al host: Host 
{"id":2,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"},
 new host: Host {"id":1,"name":"ref-trl-1
   
0084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"},
 host before state transition: Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-
   vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
   2025-11-11 13:30:05,810 DEBUG [c.c.v.UserVmManagerImpl] 
(Work-Job-Executor-12:[c
   ````
   
   ### versions
   
   CloudStack version 4.20.1
   Tested with single cluster with 4 KVM hosts + NFS cluster-wide storage
   
   ### The steps to reproduce the bug
   
   Described above
   
   ### What to do about it?
   
   The migrate away logic must not consider hosts in maintenance (or other 
invalid state) as destination hosts


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to