btzq opened a new issue, #10019:
URL: https://github.com/apache/cloudstack/issues/10019
<!--
Verify first that your issue/request is not already reported on GitHub.
Also test if the latest release and main branch are affected too.
Always add information AFTER these HTML comments; there is no need to delete
the comments.
-->
##### ISSUE TYPE
<!-- Pick one below and delete the rest -->
* Feature Idea
##### COMPONENT NAME
<!--
Categorize the issue, e.g. API, VR, VPN, UI, etc.
-->
~~~
Host? Cluster? I'm not sure
~~~
##### CLOUDSTACK VERSION
<!--
New line separated list of affected versions, commit ID for issues on main
branch.
-->
~~~
NA
~~~
##### CONFIGURATION
<!--
Information about the configuration if relevant, e.g. basic network,
advanced networking, etc. N/A otherwise
-->
##### OS / ENVIRONMENT
<!--
Information about the environment if relevant, N/A otherwise
-->
##### SUMMARY
<!-- Explain the problem/feature briefly -->
### **Current Capability**
CloudStack currently offers a 'Maintenance' mode, which live-migrates all VMs
off a host and takes the host out of service so maintenance can be performed.
### **Proposed Feature: "Waiting for Maintenance" Mode**
The proposed "Waiting for Maintenance" Mode introduces a preparatory state
that addresses scenarios where live migration is impractical or impossible.
This feature would enable gradual decommissioning or maintenance while avoiding
service disruption.
### **General Idea of How It Might Work:**
**1. Operator Responsibilities:**
- Customer communication and notification are handled entirely by the cloud
operator, outside of CloudStack, to inform customers that they have a time
window in which to voluntarily restart their VMs before the cut-off date.
**2. CloudStack Responsibilities:**
- Block the creation of new VMs on hosts/clusters marked as 'Waiting for
Maintenance'.
- Ensure restarted VMs are relocated to clusters with matching host tags (a
rough sketch of this placement check is included below).
_This is similar to how AWS handles scheduled maintenance:
https://aws.amazon.com/maintenance-help/_
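
To make the two CloudStack responsibilities above concrete, here is a minimal,
self-contained Java sketch. It is not CloudStack's actual allocator code: the
`WAITING_FOR_MAINTENANCE` state, the `Host` record, and the `eligibleHosts()`
helper are hypothetical names used only to illustrate how placement could skip
hosts in the new state while still honouring host tags for restarted VMs.

~~~java
// Minimal, self-contained sketch, NOT actual CloudStack allocator code.
// WAITING_FOR_MAINTENANCE, Host and eligibleHosts() are hypothetical names
// used only to illustrate the proposed behaviour.
import java.util.List;

public class WaitingForMaintenanceSketch {

    // Hypothetical host states: ENABLED and MAINTENANCE roughly mirror what
    // CloudStack has today; WAITING_FOR_MAINTENANCE is the new state proposed here.
    enum HostState { ENABLED, WAITING_FOR_MAINTENANCE, MAINTENANCE }

    record Host(String name, HostState state, String hostTag) {}

    // Placement check: only ENABLED hosts whose host tag matches the required
    // tag are candidates for new (or customer-restarted) VMs, so hosts being
    // drained never receive fresh deployments.
    static List<Host> eligibleHosts(List<Host> hosts, String requiredTag) {
        return hosts.stream()
                .filter(h -> h.state() == HostState.ENABLED)
                .filter(h -> requiredTag == null || requiredTag.equals(h.hostTag()))
                .toList();
    }

    public static void main(String[] args) {
        List<Host> hosts = List.of(
                new Host("old-node-1", HostState.WAITING_FOR_MAINTENANCE, "legacy"),
                new Host("old-node-2", HostState.WAITING_FOR_MAINTENANCE, "legacy"),
                new Host("new-node-1", HostState.ENABLED, "modern"));

        // A brand-new deployment with no tag constraint still avoids the drained hosts.
        System.out.println(eligibleHosts(hosts, null)
                .stream().map(Host::name).toList());     // prints [new-node-1]

        // A VM the customer restarted, whose offering carries the "modern" host tag,
        // is placed on the new cluster rather than back onto the old one.
        System.out.println(eligibleHosts(hosts, "modern")
                .stream().map(Host::name).toList());     // prints [new-node-1]
    }
}
~~~

In a real implementation the same check would presumably live in the existing
host allocators and capacity checks, so that both fresh deployments and
customer-initiated restarts are steered away from hosts flagged as 'Waiting
for Maintenance'.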
### **Use Cases**
**Scenario 1: Decommissioning an Old Compute Cluster**
Problem:
- Legacy clusters with outdated CPU architectures cannot live-migrate VMs due
to compatibility issues (e.g., VMs freezing during migration, causing
downtime).
- Existing VMs must be restarted in order to move to a new cluster with a
compatible architecture.
- The old cluster remains active, risking the placement of new VMs and
hindering decommissioning.
**Scenario 2: Maintenance of GPU Clusters with GPU Passthrough**
Problem:
- GPU passthrough prevents live migration, unlike vGPU setups that allow
seamless migration.
- Downtime-free maintenance is not feasible, requiring customer cooperation
to restart affected VMs.
##### STEPS TO REPRODUCE
<!--
For bugs, show exactly how to reproduce the problem, using a minimal
test-case. Use Screenshots if accurate.
For new features, show how the feature would be used.
-->
<!-- Paste example playbooks or commands between quotes below -->
~~~
NA
~~~
<!-- You can also paste gist.github.com links for larger files -->
##### EXPECTED RESULTS
<!-- What did you expect to happen when running the steps above? -->
~~~
Refer to the summary above.
~~~
##### ACTUAL RESULTS
<!-- What actually happened? -->
<!-- Paste verbatim command output between quotes below -->
~~~
CloudStack cannot facilitate smooth decommissioning of compute hosts where
live migration is not possible.
~~~