andrijapanicsb commented on issue #3610: [KVM] Rolling maintenance
URL: https://github.com/apache/cloudstack/pull/3610#issuecomment-598265174

**Test 1: Scenario where the PreMaintenance script "informs" CloudStack that the Maintenance stage should not be done**

Steps: On one of the hosts, make sure that the PreMaintenance.sh script exits with code 70, i.e.:

```bash
#!/bin/bash
echo "############# This is PreMaintenance script"
exit 70
```

Run the rolling maintenance API against that host and confirm that no Maintenance stage is executed.

Expected result: Confirm that the maintenance was skipped on the host:

```
(localcloud) SBCM5> > start rollingmaintenance hostids=0bffbfcb-fc6f-4c6d-9604-35e494439a33
{
  "rollingmaintenance": {
    "details": "OK",
    "hostsskipped": [
      {
        "hostid": "0bffbfcb-fc6f-4c6d-9604-35e494439a33",
        "hostname": "ref-trl-478-k-M7-apanic-kvm3",
        "reason": "Pre-maintenance stage set to avoid maintenance"
      }
    ],
    "hostsupdated": [],
    "success": true
  }
}
```

Confirm via rolling-maintenance.log on the KVM host that no script was run after the PreMaintenance.sh script:

```
root@ref-trl-478-k-M7-apanic-kvm3:~/scripts# grep -ir "INFO Executing script" /var/log/cloudstack/agent/rolling-maintenance.log
11:34:53,381 rolling-maintenance INFO Executing script: /root/scripts/PreFlight.sh for stage: PreFlight
11:35:03,662 rolling-maintenance INFO Executing script: /root/scripts/PreMaintenance.sh for stage: PreMaintenance
```

Status: Pass

**Test 2: Confirm that the "forced" parameter doesn't influence the scenario where the PreMaintenance script "informs" CloudStack that the Maintenance stage should not be done**

Steps: On one of the hosts, make sure that the PreMaintenance.sh script exits with code 70 (same script as in test 1). Run the rolling maintenance API against that host with the "forced=true" parameter and confirm that no Maintenance stage is executed.

Expected result:

```
(localcloud) SBCM5> > start rollingmaintenance hostids=0bffbfcb-fc6f-4c6d-9604-35e494439a33 forced=true
{
  "rollingmaintenance": {
    "details": "OK",
    "hostsskipped": [
      {
        "hostid": "0bffbfcb-fc6f-4c6d-9604-35e494439a33",
        "hostname": "ref-trl-478-k-M7-apanic-kvm3",
        "reason": "Pre-maintenance stage set to avoid maintenance"
      }
    ],
    "hostsupdated": [],
    "success": true
  }
}
```

Confirm via rolling-maintenance.log on the KVM host that no script was run after the PreMaintenance.sh script:

```
root@ref-trl-478-k-M7-apanic-kvm3:~/scripts# grep -ir "INFO Executing script" /var/log/cloudstack/agent/rolling-maintenance.log
11:42:02,602 rolling-maintenance INFO Executing script: /root/scripts/PreFlight.sh for stage: PreFlight
11:42:13,1 rolling-maintenance INFO Executing script: /root/scripts/PreMaintenance.sh for stage: PreMaintenance
```

Status: Pass
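For reference, the exit-code convention exercised in tests 1 and 2 (exit 70 skips the Maintenance stage, exit 0 proceeds) also allows conditional skips. A minimal sketch; the `virsh` condition is purely a hypothetical example, only the exit-code semantics come from the test runs above:

```bash
#!/bin/bash
# Hypothetical PreMaintenance.sh: skip the Maintenance stage only when it
# is not needed. Exit code 70 = "avoid maintenance" (see the skip reason
# reported in tests 1 and 2); exit 0 = proceed to the Maintenance stage.
running_vms=$(virsh list --name | sed '/^$/d' | wc -l)
if [ "$running_vms" -eq 0 ]; then
  echo "No running VMs on this host - skipping the Maintenance stage"
  exit 70
fi
echo "Host has $running_vms running VM(s) - proceeding with maintenance"
exit 0
```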
**Test 3: Confirm that a single stage failure DOES cause the rolling maintenance API call to abort for the rest of the cluster when the "forced" parameter is NOT specified**

Steps: On the first host in the cluster, make sure that the PreMaintenance.sh script exits with code 1, i.e.:

```bash
#!/bin/bash
echo "############# This is PreMaintenance script"
exit 1
```

Run the rolling maintenance API against the cluster and observe that, since there is a failure on the first host, no other hosts will be attempted for maintenance.

Expected result:

```
(localcloud) SBCM5> > start rollingmaintenance clusterids=333fc22b-7189-4f67-a691-d95051a1b0f5
{
  "rollingmaintenance": {
    "details": "Error starting rolling maintenance: Stage: PreMaintenance failed on host d55eb212-357a-41fa-9bd3-99b5492b94d2: ############################## This is PreMaintenance script\n",
    "hostsskipped": [],
    "hostsupdated": [],
    "success": false
  }
}
```

Status: Pass

**Test 4: Confirm that a single stage failure on one host does NOT cause the rolling maintenance API call to abort for the rest of the cluster when the "forced" parameter is specified**

Steps: On the first host in the cluster, make sure that the PreMaintenance.sh script exits with code 1 (same script as in test 3). Run the rolling maintenance API against the cluster and observe that, even though there is a failure on the first host, the other (2) hosts are processed for maintenance.

Expected result:

```
(localcloud) SBCM5> > start rollingmaintenance clusterids=333fc22b-7189-4f67-a691-d95051a1b0f5 forced=true
{
  "rollingmaintenance": {
    "details": "Error starting rolling maintenance: Maintenance state expected, but got ErrorInPrepareForMaintenance",
    "hostsskipped": [
      {
        "hostid": "d55eb212-357a-41fa-9bd3-99b5492b94d2",
        "hostname": "ref-trl-478-k-M7-apanic-kvm1",
        "reason": "Pre-maintenance script failed: ############################## This is PreMaintenance script\n"
      }
    ],
    "hostsupdated": [
      {
        "enddate": "2020-01-23'T'18:42:57+00:00",
        "hostid": "b0220ceb-c302-4102-b4eb-f12a72f9769b",
        "hostname": "ref-trl-478-k-M7-apanic-kvm2",
        "startdate": "2020-01-23'T'18:42:37+00:00"
      },
      {
        "enddate": "2020-01-23'T'18:44:17+00:00",
        "hostid": "b0220ceb-c302-4102-b4eb-f12a72f9769b",
        "hostname": "ref-trl-478-k-M7-apanic-kvm3",
        "startdate": "2020-01-23'T'18:43:10+00:00"
      }
    ],
    "success": false
  }
}
```

Status: Pass
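Tests 3 and 4 also show that a non-zero exit code fails the stage and that whatever the script prints to stdout is surfaced in the API response ("details" without forced, the per-host "reason" with forced=true). A sketch of a failing check that takes advantage of this; the storage-mount condition is a hypothetical example:

```bash
#!/bin/bash
# Hypothetical PreMaintenance.sh: fail the stage with a human-readable
# reason. A non-zero exit aborts the whole run unless forced=true is
# passed, in which case only this host is skipped (tests 3 and 4).
if ! mountpoint -q /mnt/primary-storage; then
  echo "Primary storage not mounted - refusing to enter maintenance"
  exit 1
fi
exit 0
```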
**Test 5: Confirm that the scripts are executed by the agent**

Steps: Make sure that agent.properties contains the setting that disables the service-mode executor:

```
rolling.maintenance.service.executor.disabled=true
```

Restart cloudstack-agent, start the rolling maintenance of a single host, and observe that agent.log on the KVM host does not mention scripts being executed by systemd (systemctl).

Expected result:

```
2020-03-11 16:59:18,670 INFO [resource.wrapper.LibvirtRollingMaintenanceCommandWrapper] (agentRequest-Handler-2:null) (logid:9c201ba1) Processing stage PreFlight
2020-03-11 16:59:18,670 INFO [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Executing stage: PreFlight script: /root/scripts/PreFlight
2020-03-11 16:59:18,671 DEBUG [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Executing: /root/scripts/PreFlight
2020-03-11 16:59:18,673 DEBUG [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Executing while with timeout : 1800000
2020-03-11 16:59:18,675 DEBUG [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Execution is successful.
2020-03-11 16:59:18,680 INFO [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Execution finished for stage: PreFlight script: /root/scripts/PreFlight : 0
```

Status: Pass

**Test 6: Confirm that the scripts are executed by the service executor**

Steps: Make sure that agent.properties contains the setting that enables the service-mode executor:

```
rolling.maintenance.service.executor.disabled=false
```

Restart cloudstack-agent, start the rolling maintenance of a single host, and observe that agent.log on the KVM host mentions systemd being invoked.

Expected result:

```
2020-03-11 17:04:03,168 INFO [resource.wrapper.LibvirtRollingMaintenanceCommandWrapper] (agentRequest-Handler-3:null) (logid:525cb47d) Processing stage PreFlight
2020-03-11 17:04:03,168 DEBUG [rolling.maintenance.RollingMaintenanceServiceExecutor] (agentRequest-Handler-3:null) (logid:525cb47d) Invoking rolling maintenance service for stage: PreFlight and file /root/scripts/PreFlight with action: start
2020-03-11 17:04:03,170 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) (logid:525cb47d) Executing: /bin/bash -c systemd-escape 'PreFlight,/root/scripts/PreFlight,1800,/root/scripts/rolling-maintenance-results,/root/scripts/rolling-maintenance-output'
2020-03-11 17:04:03,171 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) (logid:525cb47d) Executing while with timeout : 3600000
2020-03-11 17:04:03,176 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) (logid:525cb47d) Execution is successful.
2020-03-11 17:04:03,177 DEBUG [rolling.maintenance.RollingMaintenanceServiceExecutor] (agentRequest-Handler-3:null) (logid:525cb47d) Executing: /bin/systemctl start cloudstack-rolling-maintenance@PreFlight\x2c-root-scripts-PreFlight\x2c1800\x2c-root-scripts-rolling\x2dmaintenance\x2dresults\x2c-root-scripts-rolling\x2dmaintenance\x2doutput
```

Status: Pass
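The executor mode toggled in tests 5 and 6 is driven by a single agent-side property. A minimal sketch of switching between the two modes, assuming the default agent.properties location on the KVM host:

```bash
# Agent-mode executor: scripts run inside the agent process (test 5)
sed -i 's/^rolling.maintenance.service.executor.disabled=.*/rolling.maintenance.service.executor.disabled=true/' \
  /etc/cloudstack/agent/agent.properties
systemctl restart cloudstack-agent

# Service-mode executor: scripts run through the
# cloudstack-rolling-maintenance@<instance> systemd unit (test 6)
sed -i 's/^rolling.maintenance.service.executor.disabled=.*/rolling.maintenance.service.executor.disabled=false/' \
  /etc/cloudstack/agent/agent.properties
systemctl restart cloudstack-agent
```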
**Test 7: Confirm capacity checks are in place**

Steps: Out of 6 hosts, disable hosts 2, 3, 4 and 5, dedicate a single host (host6) to an account, and then execute the rolling maintenance against the remaining host (host1), which hosts VMs that do NOT belong to the account for which host6 was dedicated. With NO VMs on host1, host1 can be put into maintenance, as no capacity needs to be available elsewhere (no VM will be migrated away from host1).

Expected result:

```
(localcloud) SBCM5> > start rollingmaintenance hostids=86c0b59f-89de-40db-9b30-251f851e869f
{
  "rollingmaintenance": {
    "details": "OK",
    "hostsskipped": [],
    "hostsupdated": [
      {
        "enddate": "2020-03-11'T'17:21:38+00:00",
        "hostid": "86c0b59f-89de-40db-9b30-251f851e869f",
        "hostname": "ref-trl-711-k-M7-apanic-kvm1",
        "output": "null ",
        "startdate": "2020-03-11'T'17:21:07+00:00"
      }
    ],
    "success": true
  }
}
```

Status: Pass

**Test 8: Confirm capacity checks are in place**

Steps: Out of 6 hosts, disable hosts 2, 3, 4 and 5, dedicate a single host (host6) to an account, and then execute the rolling maintenance against the remaining host (host1), which hosts VMs that do NOT belong to the account for which host6 was dedicated. With at least 1 VM on host1, the rolling maintenance against it will fail, as there are no free hosts (non-disabled, non-dedicated) that can take the VMs from host1.

Expected result: Due to the nature of putting the host into Maintenance mode, after the first attempt fails, the management server will retry migrating the VMs away 5 times and will then fail permanently (give up), while the startRollingMaintenance API call keeps running until the timeout defined by "kvm.rolling.maintenance.wait.maintenance.timeout" (default: 1800 seconds) is reached, after which the API call fails as well. The failure to migrate VMs away can be observed in management-server.log:

```
2020-03-11 17:24:25,761 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) No suitable hosts found
2020-03-11 17:24:25,761 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) No suitable hosts found under this Cluster: 1
2020-03-11 17:24:25,761 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) Could not find suitable Deployment Destination for this VM under any clusters, returning.
2020-03-11 17:24:25,761 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) Searching resources only under specified Cluster: 1
2020-03-11 17:24:25,762 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) The specified cluster is in avoid set, returning.
2020-03-11 17:24:25,762 DEBUG [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) Unable to find destination for migrating the vm VM[User|i-2-27-VM]
```

Status: Pass
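If the 1800-second default is too short or too long for a given environment, the timeout named above should be adjustable like other settings; a sketch in the same CloudMonkey session style as the calls above, assuming kvm.rolling.maintenance.wait.maintenance.timeout is exposed as a regular global setting:

```
(localcloud) SBCM5> > update configuration name=kvm.rolling.maintenance.wait.maintenance.timeout value=3600
```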
**Test 9: The pre-flight scripts must be executed on each host before any maintenance actions**

Steps: Start the cluster-level rolling maintenance and "tail -f" /var/log/cloudstack/agent/rolling-maintenance.log on each KVM host simultaneously. Observe that the PreFlight stage/script is executed on all hosts in the cluster before the PreMaintenance stage/script is executed on the first host in the cluster.

Expected result: The log on each host shows the PreFlight script completing before any PreMaintenance activity starts:

```
17:05:56,710 rolling-maintenance INFO Successful execution of /root/scripts/PreFlight
```

Status: Pass

**Test 10: Failure of the PreFlight check on a host halts the API call when "forced" is not set**

Steps: On the first host in the cluster, make sure that the PreFlight script exits with code 1, i.e.:

```bash
#!/bin/bash
echo "############# This is PreFlight script"
exit 1
```

Run the rolling maintenance API against the cluster and observe that, since there is a failure on the first host, no other hosts will be attempted for maintenance.

Expected result:

```
(localcloud) SBCM5> > start rollingmaintenance clusterids=a0c249d2-e020-4f2b-ab9c-1e05bbe68b64
{
  "rollingmaintenance": {
    "details": "Error starting rolling maintenance: Stage: PreFlight failed on host 86c0b59f-89de-40db-9b30-251f851e869f: null",
    "hostsskipped": [],
    "hostsupdated": [],
    "success": false
  }
}
```

Status: Pass

**Test 11: In the absence of a Maintenance script on a host, that host will be skipped**

Steps: On a single host, make sure that there is no script named "Maintenance", "Maintenance.sh" or "Maintenance.py" present in the configured script folder, and execute the rolling maintenance call against this host and another one. The first host will be skipped and a proper message shown.

Expected result:

```
(localcloud) SBCM5> > start rollingmaintenance hostids=86c0b59f-89de-40db-9b30-251f851e869f,ef10dacd-ac4e-4ec0-bc8d-7fb5bb461c9d
{
  "rollingmaintenance": {
    "details": "OK",
    "hostsskipped": [
      {
        "hostid": "86c0b59f-89de-40db-9b30-251f851e869f",
        "hostname": "ref-trl-711-k-M7-apanic-kvm1",
        "reason": "There is no maintenance script on the host"
      }
    ],
    "hostsupdated": [
      {
        "enddate": "2020-03-11'T'17:58:58+00:00",
        "hostid": "ef10dacd-ac4e-4ec0-bc8d-7fb5bb461c9d",
        "hostname": "ref-trl-711-k-M7-apanic-kvm2",
        "output": "",
        "startdate": "2020-03-11'T'17:58:18+00:00"
      }
    ],
    "success": true
  }
}
```

Status: Pass
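A minimal PreFlight sketch consistent with tests 9 and 10: it runs on every host before any maintenance action, exit 0 lets the run continue, and a non-zero exit halts the whole call when "forced" is not set. Both checks are hypothetical examples:

```bash
#!/bin/bash
# Hypothetical PreFlight.sh: local health checks, executed on all hosts in
# the cluster before the first PreMaintenance stage starts (test 9).
# Any non-zero exit halts the whole API call when forced=false (test 10).
systemctl is-active --quiet libvirtd || { echo "libvirtd is not running"; exit 1; }
avail_gb=$(df -BG --output=avail /var | tail -1 | tr -dc '0-9')
[ "$avail_gb" -ge 5 ] || { echo "Less than 5 GB free on /var"; exit 1; }
echo "PreFlight checks passed"
exit 0
```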
**Test 12: Capacity checks are also done before putting a host into Maintenance**

Steps: Before executing rolling maintenance on host1, make sure to, out of 6 hosts, disable hosts 2, 3, 4 and 5, while host6 is NOT disabled and does have enough capacity for the VMs that exist on host1, so the capacity checks during the PreFlight stage will not fail. On host1, make sure the PreMaintenance script contains the equivalent of a "sleep 30" command, so the script takes at least 30 seconds to execute (the PreFlight capacity checks have completed by now, and host6 has enough capacity to take the VMs from host1), leaving the test operator enough time to disable host6 during those 30 seconds. Execute "tail -f /var/log/cloudstack/agent/rolling-maintenance.log" on host1, and when the line "Executing script: /root/scripts/PreMaintenance.sh for stage: PreMaintenance" appears (your script location and extension might differ), quickly disable host6 during those 30 seconds. Observe that the rolling maintenance call fails.

Expected result:

```
(localcloud) SBCM5> > start rollingmaintenance hostids=86c0b59f-89de-40db-9b30-251f851e869f
{
  "rollingmaintenance": {
    "details": "Error starting rolling maintenance: No host available in cluster a0c249d2-e020-4f2b-ab9c-1e05bbe68b64 (p1-c1) to support host 86c0b59f-89de-40db-9b30-251f851e869f (ref-trl-711-k-M7-apanic-kvm1) in maintenance",
    "hostsskipped": [],
    "hostsupdated": [],
    "success": false
  }
}
```

Status: Pass

**Test 13: When a stage does not contain a script for execution, it is skipped**

Steps: On a single host, make sure that there is no script named "PreMaintenance", "PreMaintenance.sh" or "PreMaintenance.py" present in the configured script folder, and execute the rolling maintenance call against this host. On the KVM host, observe that the lines in rolling-maintenance.log do not mention a PreMaintenance script, while all the other scripts/stages have run normally.

Expected result:

```
grep "Executing script" /var/log/cloudstack/agent/rolling-maintenance.log
18:42:37,527 rolling-maintenance INFO Executing script: /root/scripts/PreFlight for stage: PreFlight
18:43:47,961 rolling-maintenance INFO Executing script: /root/scripts/Maintenance for stage: Maintenance
18:43:58,133 rolling-maintenance INFO Executing script: /root/scripts/PostMaintenance.sh for stage: PostMaintenance
```

Status: Pass
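For reference, the stage/script matching exercised by tests 11 and 13 is purely name-based within the configured script folder; an illustrative, annotated layout (the listing itself is hypothetical, the recognized names and extensions come from the tests above):

```bash
root@kvm-host:~# ls /root/scripts/
PreFlight           # picked up for the PreFlight stage
Maintenance         # picked up for the Maintenance stage ("Maintenance",
                    # "Maintenance.sh" or "Maintenance.py" are recognized)
PostMaintenance.sh  # picked up for the PostMaintenance stage
# No PreMaintenance* script present, so that stage is simply skipped
# (test 13); a missing Maintenance* script skips the whole host (test 11).
```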
**Test 14: Execute rolling maintenance against the whole zone**

Steps: Ensure there are at least 2 clusters in a zone. Perform rolling maintenance of the whole zone. Observe that clusters are processed one after another: first all hosts from the first cluster, then all hosts from the second cluster. NOTE: for these tests, we removed the kvm4/kvm5/kvm6 hosts from the first cluster and added them to a new cluster (in the order kvm6/kvm5/kvm4). The expected order of clusters/hosts processed is:
- Cluster1 (p1-c1 in our case) → hosts kvm1/kvm2/kvm3
- Cluster2 (cluster2 in our case) → kvm6/kvm5/kvm4 (since the hosts within a cluster are processed in the order they appear in the DB)

Expected result: Observe the "startdate" reported by the API, which confirms all hosts across both clusters are processed in the expected order:

```
(localcloud) SBCM5> > start rollingmaintenance zoneids=ce831d12-c2df-4b11-bec9-684dcc292c18
{
  "rollingmaintenance": {
    "details": "OK",
    "hostsskipped": [],
    "hostsupdated": [
      { "enddate": "2020-03-11'T'20:06:09+00:00", "hostid": "86c0b59f-89de-40db-9b30-251f851e869f", "hostname": "ref-trl-711-k-M7-apanic-kvm1", "output": "", "startdate": "2020-03-11'T'20:05:28+00:00" },
      { "enddate": "2020-03-11'T'20:08:09+00:00", "hostid": "ef10dacd-ac4e-4ec0-bc8d-7fb5bb461c9d", "hostname": "ref-trl-711-k-M7-apanic-kvm2", "output": "", "startdate": "2020-03-11'T'20:06:29+00:00" },
      { "enddate": "2020-03-11'T'20:10:10+00:00", "hostid": "fcc8b96e-1c29-492e-a074-96babec70ecc", "hostname": "ref-trl-711-k-M7-apanic-kvm3", "output": "", "startdate": "2020-03-11'T'20:08:30+00:00" },
      { "enddate": "2020-03-11'T'20:12:11+00:00", "hostid": "4a732078-2f5d-4bf1-8425-2135004a6b1a", "hostname": "ref-trl-711-k-M7-apanic-kvm6", "output": "", "startdate": "2020-03-11'T'20:11:01+00:00" },
      { "enddate": "2020-03-11'T'20:13:12+00:00", "hostid": "8f27f11a-9c60-4c30-8622-0e1bce718adc", "hostname": "ref-trl-711-k-M7-apanic-kvm5", "output": "", "startdate": "2020-03-11'T'20:12:32+00:00" },
      { "enddate": "2020-03-11'T'20:14:13+00:00", "hostid": "adbbfc34-9369-4a15-93dc-7ed85756c24e", "hostname": "ref-trl-711-k-M7-apanic-kvm4", "output": "", "startdate": "2020-03-11'T'20:13:33+00:00" }
    ],
    "success": true
  }
}
```

Status: Pass

**Test 15: Execute rolling maintenance against hosts from different clusters/zones**

Steps: While having multiple zones, execute the rolling maintenance by specifying hosts from at least two different zones.

Expected result:

```
(localcloud) SBCM5> > start rollingmaintenance hostids=86c0b59f-89de-40db-9b30-251f851e869f,b0f54409-4874-4573-9c24-8efac5b07f6f
{
  "rollingmaintenance": {
    "details": "OK",
    "hostsskipped": [],
    "hostsupdated": [
      { "enddate": "2020-03-12'T'12:33:04+00:00", "hostid": "86c0b59f-89de-40db-9b30-251f851e869f", "hostname": "ref-trl-711-k-M7-apanic-kvm1", "output": "", "startdate": "2020-03-12'T'12:32:24+00:00" },
      { "enddate": "2020-03-12'T'12:35:15+00:00", "hostid": "b0f54409-4874-4573-9c24-8efac5b07f6f", "hostname": "ref-trl-714-k-M7-apanic-kvm1", "output": "", "startdate": "2020-03-12'T'12:33:35+00:00" }
    ],
    "success": true
  }
}
```

Status: Pass

**Test 16: Execute rolling maintenance against multiple zones**

Steps: Having multiple zones, execute the rolling maintenance by specifying at least 2 zones, and notice that first all hosts in one zone are processed (all hosts in a single cluster, then all hosts from the other cluster), and only then the hosts from the other zone.

Expected result:

```
(localcloud) SBCM5> > start rollingmaintenance zoneids=6f3c9827-6e99-4c63-b7d5-e8f427f6dcff,ce831d12-c2df-4b11-bec9-684dcc292c18
{
  "rollingmaintenance": {
    "details": "OK",
    "hostsskipped": [],
    "hostsupdated": [
      { "enddate": "2020-03-12'T'12:41:24+00:00", "hostid": "86c0b59f-89de-40db-9b30-251f851e869f", "hostname": "ref-trl-711-k-M7-apanic-kvm1", "output": "", "startdate": "2020-03-12'T'12:40:44+00:00" },
      { "enddate": "2020-03-12'T'12:43:25+00:00", "hostid": "ef10dacd-ac4e-4ec0-bc8d-7fb5bb461c9d", "hostname": "ref-trl-711-k-M7-apanic-kvm2", "output": "", "startdate": "2020-03-12'T'12:41:45+00:00" },
      { "enddate": "2020-03-12'T'12:45:26+00:00", "hostid": "fcc8b96e-1c29-492e-a074-96babec70ecc", "hostname": "ref-trl-711-k-M7-apanic-kvm3", "output": "", "startdate": "2020-03-12'T'12:43:46+00:00" },
      { "enddate": "2020-03-12'T'12:47:27+00:00", "hostid": "4a732078-2f5d-4bf1-8425-2135004a6b1a", "hostname": "ref-trl-711-k-M7-apanic-kvm6", "output": "", "startdate": "2020-03-12'T'12:46:17+00:00" },
      { "enddate": "2020-03-12'T'12:49:28+00:00", "hostid": "8f27f11a-9c60-4c30-8622-0e1bce718adc", "hostname": "ref-trl-711-k-M7-apanic-kvm5", "output": "", "startdate": "2020-03-12'T'12:47:48+00:00" },
      { "enddate": "2020-03-12'T'12:51:29+00:00", "hostid": "adbbfc34-9369-4a15-93dc-7ed85756c24e", "hostname": "ref-trl-711-k-M7-apanic-kvm4", "output": "", "startdate": "2020-03-12'T'12:49:48+00:00" },
      { "enddate": "2020-03-12'T'12:53:00+00:00", "hostid": "59159ade-f5c3-4606-9174-e501301f59d4", "hostname": "ref-trl-714-k-M7-apanic-kvm3", "output": "", "startdate": "2020-03-12'T'12:52:19+00:00" },
      { "enddate": "2020-03-12'T'12:54:00+00:00", "hostid": "b0f54409-4874-4573-9c24-8efac5b07f6f", "hostname": "ref-trl-714-k-M7-apanic-kvm1", "output": "", "startdate": "2020-03-12'T'12:53:20+00:00" },
      { "enddate": "2020-03-12'T'12:55:01+00:00", "hostid": "02228e26-a0d6-4607-824d-501ae5ac8dab", "hostname": "ref-trl-714-k-M7-apanic-kvm2", "output": "", "startdate": "2020-03-12'T'12:54:21+00:00" }
    ],
    "success": true
  }
}
```

Status: Pass
