jmsperu opened a new pull request, #13076:
URL: https://github.com/apache/cloudstack/pull/13076

   ## Summary
   
   The LINSTOR plugin treats a successful HTTP response from 
`resourceDefinitionDelete` as proof the resource is gone and immediately drops 
the volume from CloudStack's accounting. In practice, LINSTOR can return 
success while the resource lingers in the DELETING state — for example when a 
DRBD peer is unreachable, quorum has been lost, or a satellite is down. The 
plugin has no retry,
no verification, and no sweeper. We've found hundreds of stuck DELETING 
resources accumulated over weeks because nothing surfaces the divergence 
between the CS view and the LINSTOR view.
   
   ## What this PR does
   
   Adds two helpers to `LinstorUtil`:
   
   - `isResourceDefinitionGone(api, rscName)` — existence check via 
`resourceDefinitionList`
   - `waitForResourceDefinitionDeleted(api, rscName, timeoutMillis)` — polls 
every 1s until the resource is gone OR timeout elapses; returns `true` on 
confirmed-gone, `false` on timeout
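
   A minimal sketch of the polling helper described above, with the LINSTOR 
existence check abstracted behind a `BooleanSupplier` so the timeout logic 
stands on its own (class and method bodies here are illustrative, not the PR's 
actual code; only the constant name and the poll-every-1s / return-on-timeout 
contract come from this description):

   ```java
   import java.util.function.BooleanSupplier;

   public final class DeleteVerifier {
       static final long DEFAULT_RD_DELETE_VERIFY_TIMEOUT_MILLIS = 30_000L;
       static final long POLL_INTERVAL_MILLIS = 1_000L;

       /**
        * Polls isGone every second until it reports true or the timeout
        * elapses. Returns true on confirmed-gone, false on timeout. A poll
        * that throws (e.g. controller transiently unreachable) is treated as
        * "not yet gone" and retried until the deadline.
        */
       public static boolean waitForResourceDefinitionDeleted(
               BooleanSupplier isGone, long timeoutMillis) {
           long deadline = System.currentTimeMillis() + timeoutMillis;
           while (System.currentTimeMillis() < deadline) {
               try {
                   if (isGone.getAsBoolean()) {
                       return true;   // confirmed gone: return immediately
                   }
               } catch (RuntimeException e) {
                   // transient poll failure: would be debug-logged, keep trying
               }
               try {
                   Thread.sleep(POLL_INTERVAL_MILLIS);
               } catch (InterruptedException e) {
                   Thread.currentThread().interrupt();
                   return false;
               }
           }
           return false;              // still present after timeout
       }
   }
   ```

   In the real helper the supplier would be `isResourceDefinitionGone(api, 
rscName)`, i.e. an existence check via `resourceDefinitionList`.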
   
   Calls `waitForResourceDefinitionDeleted` from both delete paths:
   
   - `LinstorPrimaryDataStoreDriverImpl.deleteResourceDefinition` (driver / 
management-server side)
   - `LinstorStorageAdaptor.deRefOrDeleteResource` (host / KVM agent side)
   
   Default timeout 30 seconds (`DEFAULT_RD_DELETE_VERIFY_TIMEOUT_MILLIS`).
   
   ## Behaviour
   
   - **Resource deletes within 30s** (the normal case) — no log change, no 
behaviour change.
   - **Resource still in DELETING after 30s** — emits a WARN naming the 
resource and pointing the operator at `linstor resource list`. Returns success 
to the caller (the CS-side accounting has already moved on; throwing here would 
create a different inconsistency).
   - **Controller transiently unreachable during poll** — debug-logs each 
failed poll, keeps trying until deadline.
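
   The three behaviours above amount to a call-site pattern along these lines 
(hypothetical names and log wording — the real call sites are 
`deleteResourceDefinition` and `deRefOrDeleteResource`): issue the delete, 
verify, and on timeout warn the operator but still report success.

   ```java
   import java.util.function.BooleanSupplier;
   import java.util.logging.Logger;

   public final class DeletePathSketch {
       private static final Logger LOG =
               Logger.getLogger("LinstorDeleteSketch");

       /**
        * Issues the delete, then polls until the resource is confirmed gone
        * or the deadline passes. On timeout it warns the operator (pointing
        * at `linstor resource list`) but still returns success, because the
        * CloudStack-side accounting has already moved on.
        */
       static boolean deleteAndVerify(String rscName, Runnable deleteCall,
                                      BooleanSupplier isGone,
                                      long timeoutMillis) {
           deleteCall.run();  // stands in for resourceDefinitionDelete
           long deadline = System.currentTimeMillis() + timeoutMillis;
           boolean gone = false;
           while (!gone && System.currentTimeMillis() < deadline) {
               try {
                   gone = isGone.getAsBoolean();
               } catch (RuntimeException e) {
                   // controller transiently unreachable: debug-log each
                   // failed poll, keep trying until the deadline
                   LOG.fine("poll failed for " + rscName + ": "
                           + e.getMessage());
               }
               if (!gone) {
                   try {
                       Thread.sleep(1_000L);
                   } catch (InterruptedException e) {
                       Thread.currentThread().interrupt();
                       break;
                   }
               }
           }
           if (!gone) {
               LOG.warning("LINSTOR resource definition '" + rscName
                       + "' still present after " + timeoutMillis
                       + " ms; check 'linstor resource list'");
           }
           return true;  // success either way; only the logging differs
       }
   }
   ```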
   
   ## What this does NOT do (deferred)
   
   - A Tier-2 sweeper that periodically re-attempts deletion of long-stuck 
resources is not in this PR. The Tier-1 change here is the minimum needed to 
make the divergence visible in operator logs; a sweeper can land separately if 
maintainers want it.
   
   ## Test plan
   
   - [ ] CI build + unit tests pass (no public API changes; only behaviour 
additions in private methods)
   - [ ] Manual: delete a volume on a healthy LINSTOR cluster — no new log 
lines, and the verification returns as soon as the first poll confirms the 
resource is gone, rather than waiting out the full 30s timeout
   - [ ] Manual: delete a volume on a cluster with a downed satellite — observe 
the new WARN line in the agent log and confirm the resource is still in 
`linstor resource list` until manually cleared


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to