tuanhoangth1603 opened a new issue, #12096:
URL: https://github.com/apache/cloudstack/issues/12096

   ### problem
   
   After upgrading the KVM agent on a compute node from CloudStack 4.20 to 
4.22, the agent fails to recreate or connect to the existing RBD storage pool. 
The agent log shows a LibvirtException during pool initialization that asks 
whether the RBD pool exists, although the pool is present on the Ceph 
cluster. This prevents the host from fully reconnecting and handling VM 
operations (e.g., volume attach/detach).
   The issue appears to be tied to changes in libvirt (8.0+) or the Ceph client 
libraries after the upgrade: IoCTX creation seems to fail because of a mismatch 
in the temporary libvirt secret or other cached state. Notably, a full reboot 
of the compute node resolves the issue immediately, allowing clean recreation 
of the pool and secret. However, this introduces unwanted downtime for running 
VMs on that node, which is unacceptable in production.
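
Because the suspicion above is stale secret or pool state inside libvirt, a low-risk first step is to look at what libvirt currently has defined on the affected host. The following is only a sketch: it assumes stock `virsh` is installed, and `libvirt_checks` and the `RUN` switch are hypothetical names introduced for illustration. Both listed commands are read-only.

```shell
#!/usr/bin/env bash
# Sketch: read-only libvirt checks for the affected compute node.
# By default the commands are only printed; set RUN=1 to execute them.
# (libvirt_checks and RUN are illustrative names, not CloudStack tooling.)

libvirt_checks() {
  local cmds=(
    "virsh secret-list"                 # secrets libvirt currently knows about
    "virsh pool-list --all --details"   # all storage pools, active or not
  )
  local c
  for c in "${cmds[@]}"; do
    if [ "${RUN:-0}" = "1" ]; then
      echo "+ $c"
      # Word splitting of $c is intentional: each entry is a full command line.
      $c || echo "command failed: $c" >&2
    else
      echo "$c"
    fi
  done
}

libvirt_checks
```

Run with `RUN=1` on the compute node and compare the output against a node that has been rebooted; a difference in the listed secrets or pools would support the cached-state theory.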
   
   ### versions
   
   Environment
   
   CloudStack version: Management server upgraded to 4.22.0 (from 4.20.0)
   Agent version: KVM agent upgraded from 4.20.0 to 4.22.0 on compute nodes
   Hypervisor: KVM 
   Primary Storage: Ceph RBD (pool name: cloudstack-zone1; Ceph version: 14)
   OS on compute nodes: Ubuntu 20.04
   
   
   ### The steps to reproduce the bug
   
   1. Upgrade the management server to 4.22.
   2. Upgrade the KVM agent to 4.22.
   3. Check agent.log, which shows: Failed to create RBD storage pool: 
org.libvirt.LibvirtException: failed to create the RBD IoCTX. Does the pool 
'cloudstack-zone1' exist? 
   
   I also ran these commands on the Ceph cluster, but the error persists:
   ```
   # ceph config set mon auth_expose_insecure_global_id_reclaim false
   # ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false
   # ceph config set mon auth_allow_insecure_global_id_reclaim false
   ```
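
To rule out the Ceph side, it may help to confirm that the pool is listed and that the three options were actually stored. This is a hedged sketch, assuming it is run where the `ceph` CLI has a keyring allowed to list pools and read monitor config; `pool_in_list` is a hypothetical helper, and the pool name is taken from the error message.

```shell
#!/usr/bin/env bash
# Reads pool names on stdin (one per line, as printed by `ceph osd pool ls`)
# and succeeds only if the given name matches a whole line exactly.
pool_in_list() {
  grep -Fxq -- "$1"
}

POOL="cloudstack-zone1"   # pool name from the agent error message

if command -v ceph >/dev/null 2>&1; then
  if ceph osd pool ls | pool_in_list "$POOL"; then
    echo "pool $POOL is listed by the cluster"
  else
    echo "pool $POOL is NOT listed (or the cluster was unreachable)" >&2
  fi
  # Read back the three options to confirm the values that are in effect.
  for opt in auth_expose_insecure_global_id_reclaim \
             mon_warn_on_insecure_global_id_reclaim_allowed \
             auth_allow_insecure_global_id_reclaim; do
    printf '%s = %s\n' "$opt" "$(ceph config get mon "$opt")"
  done
else
  echo "ceph CLI not found; skipping live checks" >&2
fi
```

The exact-line match matters: a plain substring search would also accept a pool such as `cloudstack-zone10`.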
   
   **Expected Behavior**
   The agent should successfully redefine the RBD storage pool using the 
existing Ceph configuration (monitors, secrets) without failure, allowing 
seamless host reconnection post-upgrade.
   
   **Actual Behavior**
   Agent logs show repeated failures to create the RBD IoCTX, followed by 
cleanup of the libvirt secret. Host status remains "Disconnected" or "Alert" in 
the UI until manual intervention. A full reboot of the compute node resolves 
the issue immediately, but that is a poor workaround because it interrupts 
running VMs.
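
To quantify the repeated failures described above, one could count the matching lines in the agent log. A minimal sketch: the log path is an assumption about a default package install, and `count_ioctx_failures` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Counts log lines containing the libvirt error text quoted in this issue.
count_ioctx_failures() {
  # grep -c prints 0 but exits non-zero when nothing matches; keep the 0.
  grep -c -F 'failed to create the RBD IoCTX' -- "$1" || true
}

# Assumed default agent log location; override with LOG=/path/to/agent.log.
LOG="${LOG:-/var/log/cloudstack/agent/agent.log}"

if [ -r "$LOG" ]; then
  echo "IoCTX failures in $LOG: $(count_ioctx_failures "$LOG")"
else
  echo "cannot read $LOG" >&2
fi
```

Comparing the count before and after an agent restart shows whether the failure recurs on every reconnect attempt.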
   
   ### What to do about it?
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

