tatay188 opened a new issue, #12216:
URL: https://github.com/apache/cloudstack/issues/12216

   ### problem
   
   Unable to create VMs; getting a 504 error.
   
   The server is recognized as GPU enabled.
   Using a regular template with Ubuntu 22.04.
   
   Using a service offering we created; it is the same as the one for all our other servers but includes a GPU. HA is enabled, and I also ticked the GPU display (video) option just to test.
   The agent logs show:
   2025-12-09 20:33:14,351 INFO  [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-5:[]) (logid:3e3d3f80) Trying to fetch storage pool e76f8956-1a81-3e97-aff6-8dc3f199a48a from libvirt
   2025-12-09 20:34:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
   [the same ERROR line repeats once per minute through 2025-12-09 20:43:01,633]
   2025-12-09 20:43:05,155 INFO  [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-3:[]) (logid:7f598e7f) Trying to fetch storage pool e76f8956-1a81-3e97-aff6-8dc3f199a48a from libvirt
   
   
   On the server side there are no errors or disconnections.
   The storage ID for this VM shows: 30aea531-8b82-478f-85db-e9991bf193f5
   
   I am able to reach the primary storage from the GPU Host.
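   
   For reference, this is roughly how I checked that the pool is visible to libvirt on the GPU host; a minimal sketch using the libvirt Python bindings (the pool UUID is the one from the agent log above):
   
   ```python
   # Sketch: check that the primary storage pool is defined and active in
   # libvirt on the GPU host (pool UUID taken from the agent log above).
   import libvirt
   
   POOL_UUID = "e76f8956-1a81-3e97-aff6-8dc3f199a48a"
   
   conn = libvirt.open("qemu:///system")
   try:
       pool = conn.storagePoolLookupByUUIDString(POOL_UUID)
       pool.refresh(0)
       state, capacity, allocation, available = pool.info()
       print("pool:", pool.name(), "active:", bool(pool.isActive()))
       print("capacity:", capacity, "allocated:", allocation, "available:", available)
   finally:
       conn.close()
   ```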
   
   Apart from this, after 45+ minutes the system is still spinning on creating the VM ("Launch Instance in progress").
   
   <img width="1182" height="163" alt="Image" src="https://github.com/user-attachments/assets/8d2a9c9d-ca5c-47a0-9789-1b6353045639" />
   
   
   Logs from the management server:
   
   2025-12-09 20:46:49,858 INFO  [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-cc70e228]) (logid:c4510d7b) No inactive management server node found
   
   2025-12-09 20:46:49,858 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-cc70e228]) (logid:c4510d7b) Peer scan is finished. profiler: Done. Duration: 4ms , profilerQueryActiveList: Done. Duration: 1ms, , profilerSyncClusterInfo: Done. Duration: 1ms, profilerInvalidatedNodeList: Done. Duration: 0ms, profilerRemovedList: Done. Duration: 0ms,, profilerNewList: Done. Duration: 0ms, profilerInactiveList: Done. Duration: 1ms
   
   2025-12-09 20:46:51,322 DEBUG [o.a.c.h.H.HAManagerBgPollTask] (BackgroundTaskPollManager-4:[ctx-1aad6e7c]) (logid:829826e7) HA health check task is running...
   
   2025-12-09 20:46:51,358 INFO  [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-c1ff2c4b]) (logid:07a5943d) No inactive management server node found
   
   2025-12-09 20:46:51,358 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-c1ff2c4b]) (logid:07a5943d) Peer scan is finished. profiler: Done. Duration: 4ms , profilerQueryActiveList: Done. Duration: 1ms, , profilerSyncClusterInfo: Done. Duration: 1ms, profilerInvalidatedNodeList: Done. Duration: 0ms, profilerRemovedList: Done. Duration: 0ms,, profilerNewList: Done. Duration: 0ms, profilerInactiveList: Done. Duration: 1ms
   
   2025-12-09 20:46:51,678 INFO  [c.c.a.m.A.MonitorTask] (AgentMonitor-1:[ctx-fe31f2a4]) (logid:825f839a) Found the following agents behind on ping: [75]
   
   2025-12-09 20:46:51,683 DEBUG [c.c.a.m.A.MonitorTask] (AgentMonitor-1:[ctx-fe31f2a4]) (logid:825f839a) Ping timeout for agent Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}, do investigation
   
   2025-12-09 20:46:51,685 INFO  [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Investigating why host Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"} has disconnected with event
   
   2025-12-09 20:46:51,687 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Checking if agent (Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}) is alive
   
   2025-12-09 20:46:51,689 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Wait time setting on com.cloud.agent.api.CheckHealthCommand is 50 seconds
   
   2025-12-09 20:46:51,690 DEBUG [c.c.a.m.ClusteredAgentAttache] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Routed from 250977680725600
   
   2025-12-09 20:46:51,690 DEBUG [c.c.a.t.Request] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Sending  { Cmd , MgmtId: 250977680725600, via: 75(ggpu), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
   
   2025-12-09 20:46:51,733 DEBUG [c.c.a.t.Request] (AgentManager-Handler-11:[]) (logid:) Seq 75-1207246175112003675: Processing:  { Ans: , MgmtId: 250977680725600, via: 75, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckHealthAnswer":{"result":"true","details":"resource is alive","wait":"0","bypassHostMaintenance":"false"}}] }
   
   2025-12-09 20:46:51,734 DEBUG [c.c.a.t.Request] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Received:  { Ans: , MgmtId: 250977680725600, via: 75(ggpu), Ver: v1, Flags: 10, { CheckHealthAnswer } }
   
   2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Details from executing class com.cloud.agent.api.CheckHealthCommand: resource is alive
   
   2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Agent (Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}) responded to checkHealthCommand, reporting that agent is Up
   
   2025-12-09 20:46:51,734 INFO  [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) The agent from host Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"} state determined is Up
   
   2025-12-09 20:46:51,734 INFO  [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Agent is determined to be up and running
   
   2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) [Resource state = Enabled, Agent event = , Host = Ping]
   
   2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) ===START===  SOMEIPADDRESS  -- GET  jobId=22d89170-20e6-4151-a809-552938d734e9&command=queryAsyncJobResult&response=json&
   
   2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) Two factor authentication is already verified for the user 2, so skipping
   
   2025-12-09 20:46:52,134 DEBUG [c.c.a.ApiServer] (qtp1438988851-251223:[ctx-4df48857, ctx-caa819c6]) (logid:9ab16f87) CIDRs from which account 'Account [{"accountName":"admin","id":2,"uuid":"45a1be9e-2c67-11f0-a2e6-9ee6a2dce283"}]' is allowed to perform API calls:
   
   I noticed that the virtual router for the isolated network is on another server; I do not have any server tags at the moment.
   
   
   Final error:
   Unable to orchestrate the start of VM instance {"instanceName":"i-2-223-VM","uuid":"a12748a3-7519-4732-8445-05dfa96046b7"}.
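   
   If it helps with triage, the full error detail for the failed job can presumably be pulled with queryAsyncJobResult. Below is a rough sketch of a signed API call; the endpoint, API key and secret key are placeholders (not our real values), and the jobId is the one that appears in the ApiServlet log above:
   
   ```python
   # Sketch: query the async job result to get the full error text.
   # ENDPOINT, API_KEY and SECRET_KEY are placeholders.
   import base64, hashlib, hmac, urllib.parse
   import requests
   
   ENDPOINT = "https://cloudstack.example.com/client/api"  # placeholder
   API_KEY = "YOUR_API_KEY"        # placeholder
   SECRET_KEY = "YOUR_SECRET_KEY"  # placeholder
   
   params = {
       "command": "queryAsyncJobResult",
       "jobid": "22d89170-20e6-4151-a809-552938d734e9",  # jobId from the ApiServlet log
       "response": "json",
       "apikey": API_KEY,
   }
   
   # CloudStack signs requests with HMAC-SHA1 over the sorted, URL-encoded,
   # lower-cased query string, base64-encoded.
   query = "&".join(
       f"{k}={urllib.parse.quote(str(v), safe='')}" for k, v in sorted(params.items())
   )
   digest = hmac.new(SECRET_KEY.encode(), query.lower().encode(), hashlib.sha1).digest()
   params["signature"] = base64.b64encode(digest).decode()
   
   resp = requests.get(ENDPOINT, params=params, timeout=30)
   print(resp.json().get("queryasyncjobresultresponse", {}))
   ```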
   
   
   ### versions
   
   The versions of ACS, hypervisors, storage, network, etc.:
   - ACS 4.22.0
   - KVM for the GPU host and the other hosts
   - Ceph RBD primary storage
   - NFS secondary storage
   - VXLAN, configured the same as on the other servers
   - Ubuntu 22.04 as the hypervisor OS
   - Ubuntu 22.04 template (same template used for other VMs)
   
   The GPU is recognized by the system with no problems.
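   
   Since primary storage is Ceph RBD, I also sanity-checked that the GPU host can reach the cluster directly; a rough sketch using the python-rados / python-rbd bindings (the pool name and ceph.conf path below are placeholders, not our actual values):
   
   ```python
   # Sketch: confirm the GPU host can reach the Ceph cluster and list RBD
   # images in the primary storage pool. Pool name and conf path are placeholders.
   import rados
   import rbd
   
   CEPH_CONF = "/etc/ceph/ceph.conf"  # placeholder path
   POOL_NAME = "cloudstack"           # placeholder pool name
   
   cluster = rados.Rados(conffile=CEPH_CONF)
   cluster.connect()
   try:
       ioctx = cluster.open_ioctx(POOL_NAME)
       try:
           images = rbd.RBD().list(ioctx)
           print(len(images), "RBD images visible in pool", POOL_NAME)
       finally:
           ioctx.close()
   finally:
       cluster.shutdown()
   ```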
   
   ### The steps to reproduce the bug
   
   1. Use a GPU service offering with HA enabled and GPU display set to true (we have disabled OOB management).
   2. Add a simple VM on an isolated network, using a GPU offering with 1 GPU.
   3. Everything starts OK: the VR is created automatically on a regular CPU-only server, storage is created, and IP addresses are allocated.
   4. Instance creation fails after 35+ minutes.
   
   Please guide us on the proper settings.
   
   Thank you
   
   
   
   
   ### What to do about it?
   
   _No response_

