jmsperu opened a new issue, #12679:
URL: https://github.com/apache/cloudstack/issues/12679

   ### problem
   
   `NASBackupProvider.syncBackupStorageStats()` crashes with a 
`NullPointerException` when 
`ResourceManager.findOneRandomRunningHostByHypervisor()` returns `null`. This 
happens when no KVM host in the zone has `status=Up` at the exact moment the 
`BackupSyncTask` runs (e.g., during management server startup, brief agent 
disconnections, or host state transitions).
   
     The NPE kills the entire `BackupSyncTask` background job every sync 
interval (default 300s), flooding the management server log with stack traces 
and preventing backup storage stats from being updated.
   
     ## Stack Trace
   
     ERROR [o.a.c.b.B.BackupSyncTask] Error trying to run backup-sync background task due to:
     [Cannot invoke "com.cloud.host.Host.getId()" because "host" is null].
     java.lang.NullPointerException: Cannot invoke "com.cloud.host.Host.getId()" because "host" is null
         at org.apache.cloudstack.backup.NASBackupProvider.syncBackupStorageStats(NASBackupProvider.java:544)
         at org.apache.cloudstack.backup.BackupManagerImpl$BackupSyncTask.runInContext(BackupManagerImpl.java:1947)
   
     ## Affected Code
   
     File: `plugins/backup/nas/src/main/java/org/apache/cloudstack/backup/NASBackupProvider.java`
   
     @Override
     public void syncBackupStorageStats(Long zoneId) {
         final List<BackupRepository> repositories = backupRepositoryDao.listByZoneAndProvider(zoneId, getName());
         final Host host = resourceManager.findOneRandomRunningHostByHypervisor(Hypervisor.HypervisorType.KVM, zoneId);
         // host can be null here, but no null check before using it:
         for (final BackupRepository repository : repositories) {
             ...
             answer = (BackupStorageStatsAnswer) agentManager.send(host.getId(), command); // NPE
             ...
         }
     }
   
     `findOneRandomRunningHostByHypervisor` in `ResourceManagerImpl` returns `null` when no matching host is found:
   
     if (CollectionUtils.isEmpty(hosts)) {
         return null;
     }
   
     The same pattern also exists in `deleteBackup()` (line ~450), where the host can be `null` when the VM has been removed and no running KVM host is available.
   
     ## Suggested Fix
   
     Skip the host lookup when the zone has no repositories. Otherwise, add a null check after `findOneRandomRunningHostByHypervisor`, log a warning, and return early:
   
     @Override
     public void syncBackupStorageStats(Long zoneId) {
         final List<BackupRepository> repositories = backupRepositoryDao.listByZoneAndProvider(zoneId, getName());
         if (repositories.isEmpty()) {
             return;
         }
         final Host host = resourceManager.findOneRandomRunningHostByHypervisor(Hypervisor.HypervisorType.KVM, zoneId);
         if (host == null) {
             logger.warn("Unable to find a running KVM host in zone {} to sync backup storage stats", zoneId);
             return;
         }
         for (final BackupRepository repository : repositories) {
             ...
         }
     }
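
     The guard can be exercised in isolation. The following is a minimal sketch, not CloudStack code: `SyncGuardSketch`, `HostFinder`, `StatsSender`, `Host`, and `Repository` are hypothetical stand-ins for the real types, and `System.err` stands in for the logger. It only illustrates that a `null` lookup result skips the cycle without contacting any agent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; none of these are CloudStack types.
final class SyncGuardSketch {
    record Host(long id) {}
    record Repository(String name) {}

    /** Stand-in for the host lookup; like the real call, it may return null. */
    interface HostFinder {
        Host findRunningKvmHost(long zoneId);
    }

    /** Stand-in for sending the stats command to the agent on a host. */
    interface StatsSender {
        void send(long hostId, Repository repository);
    }

    /** Returns how many repositories were queried; 0 means the cycle was skipped. */
    static int syncStats(long zoneId, List<Repository> repositories,
                         HostFinder finder, StatsSender sender) {
        if (repositories.isEmpty()) {
            return 0; // nothing to sync, so no host lookup is needed
        }
        final Host host = finder.findRunningKvmHost(zoneId);
        if (host == null) {
            // Skip this cycle instead of dereferencing a null host.
            System.err.println("Unable to find a running KVM host in zone " + zoneId
                    + " to sync backup storage stats");
            return 0;
        }
        for (final Repository repository : repositories) {
            sender.send(host.id(), repository);
        }
        return repositories.size();
    }

    public static void main(String[] args) {
        final List<Repository> repos = List.of(new Repository("nfs-backup"));
        final List<Long> contacted = new ArrayList<>();

        // No host is Up: the cycle is skipped and no command is sent.
        int skipped = syncStats(1L, repos, zone -> null, (hostId, repo) -> contacted.add(hostId));
        System.out.println(skipped + " repositories synced, hosts contacted: " + contacted);

        // A host is available: one command per repository is sent to it.
        int synced = syncStats(1L, repos, zone -> new Host(42L), (hostId, repo) -> contacted.add(hostId));
        System.out.println(synced + " repositories synced, hosts contacted: " + contacted);
    }
}
```

     Returning early rather than throwing keeps the periodic task alive, so the next sync interval can succeed once a host reports `Up` again.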
   
     And similarly for deleteBackup():
   
     Host host = vm != null ? getVMHypervisorHost(vm) :
         resourceManager.findOneRandomRunningHostByHypervisor(HypervisorType.KVM, Long.valueOf(backup.getZoneId()));
     if (host == null) {
         throw new CloudRuntimeException("Unable to find a running KVM host to process backup deletion");
     }
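
     For the deletion path, failing loudly seems more appropriate than returning silently, because a caller is waiting on the result. A minimal sketch of that choice follows, under the same caveat: `DeleteHostSketch`, `Vm`, `Host`, and `resolveHost` are hypothetical names, and `IllegalStateException` stands in for `CloudRuntimeException`.

```java
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-ins for illustration only; none of these are CloudStack types.
final class DeleteHostSketch {
    record Vm(long id) {}
    record Host(long id) {}

    /**
     * Picks the VM's host when the VM still exists, otherwise a random running host,
     * and rejects the operation when neither lookup yields a host.
     */
    static Host resolveHost(Vm vm, Function<Vm, Host> vmHostLookup, Supplier<Host> randomRunningHost) {
        final Host host = vm != null ? vmHostLookup.apply(vm) : randomRunningHost.get();
        if (host == null) {
            // IllegalStateException stands in for CloudRuntimeException here.
            throw new IllegalStateException("Unable to find a running KVM host to process backup deletion");
        }
        return host;
    }

    public static void main(String[] args) {
        // The VM still exists, so its own host is used.
        System.out.println(resolveHost(new Vm(7L), vm -> new Host(3L), () -> null));

        // The VM is gone and no host is Up: the deletion is rejected with a clear message.
        try {
            resolveHost(null, vm -> new Host(3L), () -> null);
        } catch (IllegalStateException e) {
            System.out.println("Deletion rejected: " + e.getMessage());
        }
    }
}
```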
   
     ## Environment
   
     - CloudStack version: 4.22.0.0
     - Hypervisor: KVM
     - Backup provider: NAS (NFS)
     - OS: Ubuntu 24.04, Java 21
   
     ## How to Reproduce
   
     1. Configure NAS backup provider with an NFS backup repository
     2. Assign backup offerings to VMs
     3. Restart cloudstack-management (or wait for a transient host disconnect)
     4. Observe `management-server.log`: the NPE fires every `backup.framework.sync.interval` seconds
   
     ## Impact
   
     - `BackupSyncTask` fails on every cycle for as long as no KVM host is `Up`, so backup storage capacity stats are not updated during that time
     - Log spam (one full stack trace per sync interval, every 5 minutes by default)
     - No data loss, but backup monitoring/reporting is degraded
   
   
   
   ### versions
   
   The versions of ACS, hypervisors, storage, network etc..
   
   ### The steps to reproduce the bug
   
   1.
   2.
   3.
   ...
   
   
   ### What to do about it?
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
