anuragaw opened a new pull request #3575: Health check feature for virtual 
router
URL: https://github.com/apache/cloudstack/pull/3575
 
 
   We want to support more exhaustive health checks for VRs. This feature helps 
admins configure health checks and also expands their scope. There are two 
categories of health checks: basic and advanced (more expensive, so they should 
be run less frequently). The following checks have been added with a separate 
script:
   1. Services check (as per existing monitorServices.py) - basic check
   2. Disk space check against a threshold - basic check
   3. CPU usage check against a threshold - basic check
   4. Memory usage check against a threshold - basic check
   5. Router template and scripts version check - basic check
   6. Connectivity to the gateways from router - basic check
   7. DNS config match against MS - advanced check
   8. DHCP config match against MS - advanced check
   9. HA Proxy config match against MS (internal LB and public LB) - advanced 
check
   10. Port forwarding match against MS in iptables - advanced check
   
   The following global configs were added for configuring health checks:
   •    "router.health.checks.enabled" - If true, router health checks are 
allowed to be executed and read. If false, all scheduled checks and API calls 
for on demand checks are disabled. Default is true.
   •    "router.health.checks.basic.interval" - Interval in minutes at which 
basic router health checks are performed. If set to 0, no tests are scheduled. 
Default is 3 mins as per the existing monitor services.  
   •    "router.health.checks.advanced.interval" - Interval in minutes at which 
advanced router health checks are performed. If set to 0, no tests are 
scheduled. Default value is 10 minutes.
   •    "router.health.checks.config.refresh.interval" - Interval in minutes 
at which the router health checks config (scheduling intervals, excluded 
checks, etc.) is updated on virtual routers by the management server. This 
value should be sufficiently higher (for example 2x) than 
router.health.checks.basic.interval and router.health.checks.advanced.interval 
so that there is time for new results to be generated from the passed data. 
Default is 10 mins.
   •    "router.health.checks.results.fetch.interval" - Interval in minutes at 
which router health check results are fetched by the management server. On each 
result fetch, the management server evaluates the need to recreate the VR as 
per the configuration of router.health.checks.failures.to.recreate.vr. This 
value should be sufficiently higher (for example 2x) than 
router.health.checks.basic.interval and router.health.checks.advanced.interval 
so that there is time between new results generation and fetch.
   •    "router.health.checks.failures.to.recreate.vr" - Health check failures 
that should cause router recreation. If empty, recreation is not attempted for 
any health check failure. Possible values are comma separated script names from 
systemvm’s /root/health_scripts/ (namely cpu_usage_check.py, dhcp_check.py, 
disk_space_check.py, dns_check.py, gateways_check.py, haproxy_check.py, 
iptables_check.py, memory_usage_check.py, router_version_check.py), 
connectivity.test, or services (namely loadbalancing.service, 
webserver.service, dhcp.service).
   •    "router.health.checks.to.exclude" - Health checks that should be 
excluded when executing scheduled checks on the router. This can be a comma 
separated list of script names placed in the '/root/health_checks/' folder. 
Currently the following scripts are placed in the default systemvm template: 
cpu_usage_check.py, disk_space_check.py, gateways_check.py, iptables_check.py, 
router_version_check.py, dhcp_check.py, dns_check.py, haproxy_check.py, 
memory_usage_check.py.
   •    "router.health.checks.free.disk.space.threshold" - Free disk space 
threshold (in MB) on VR below which the check is considered a failure. Default 
is 100MB.  
   •    "router.health.checks.max.cpu.usage.threshold" - Max CPU Usage 
threshold as % above which check is considered a failure. 
   •    "router.health.checks.max.memory.usage.threshold" - Max Memory Usage 
threshold as % above which check is considered a failure. 
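
The interval guidance above can be sketched as a small sanity check. This is a hypothetical helper, not part of the PR: the function name `check_intervals` and the `factor` parameter are assumptions, and only the setting names come from the list above. It flags a config refresh or results fetch interval that is less than `factor` times the slower of the two check intervals.

```python
# Hypothetical helper for illustration only; not CloudStack code.
# Setting names are taken from the PR description, values are in minutes.
BASIC_KEY = "router.health.checks.basic.interval"
ADVANCED_KEY = "router.health.checks.advanced.interval"
REFRESH_KEY = "router.health.checks.config.refresh.interval"
FETCH_KEY = "router.health.checks.results.fetch.interval"


def check_intervals(settings, factor=2):
    """Return warnings for refresh/fetch intervals that look too short."""
    basic = int(settings.get(BASIC_KEY, 0))
    advanced = int(settings.get(ADVANCED_KEY, 0))
    # An interval of 0 means "not scheduled", so max() ignores it naturally.
    slowest = max(basic, advanced)
    warnings = []
    for key in (REFRESH_KEY, FETCH_KEY):
        if key not in settings:
            continue
        value = int(settings[key])
        if value < factor * slowest:
            warnings.append(
                "%s=%d is below %d x %d minutes" % (key, value, factor, slowest)
            )
    return warnings
```

Note that the defaults quoted above (advanced interval 10, config refresh 10) would not meet a strict 2x factor, which is why the factor is left adjustable in this sketch.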
   
   
   API Changes: 
   * listRouters and listInternalLoadBalancers now optionally take a flag 
includehealthcheckresults (default false) to fetch the last health check 
results for the router.
   * getRouterHealthCheckResults - a new API is added to fetch health check 
results with an optional flag performfreshchecks to execute checks on demand. 
This execution is only disabled if "router.health.checks.enabled" is false.  
performfreshchecks = true means all data from Management server is sent to the 
router and fresh checks are executed. If false, we retrieve the previously 
executed result from router itself.
   
   Additionally, the feature picks up any executable script in the 
/root/health_scripts/ directory and adds its result to the JSON output of the 
overall health checks. This allows custom checks to be put in, and custom 
systemvm templates can also support health checks.
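
As a rough illustration of what a custom check could look like, here is a minimal sketch of a free disk space check in Python 3. The contract it assumes (print one line of detail, report success with exit code 0 and failure with a non-zero code) is an assumption and is not stated in this description; the names `check_free_space` and `main` are hypothetical, and this is not the script shipped in the PR.

```python
import shutil


def check_free_space(path="/", threshold_mb=100):
    """Return (ok, message) for free space on `path` against a threshold in MB."""
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    if free_mb >= threshold_mb:
        return True, "Sufficient free space is %d MB" % free_mb
    # Free space below the threshold counts as a failure.
    return False, "Insufficient free space is %d MB" % free_mb


def main(argv):
    # Assumed contract: print a detail line and return 0 on success,
    # 1 on failure. The real scripts may use a different convention.
    ok, message = check_free_space()
    print(message)
    return 0 if ok else 1
```

To use this as a standalone script, one would end the file with `sys.exit(main(sys.argv))` (after `import sys`) and mark it executable.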
   
   The UI shows the router in an alert state if any health check fails.
   
   The health checks can be manually triggered using the new API added in this 
feature (both the CLI and the UI support this).
   
   ## Types of changes
   - [ ] Breaking change (fix or feature that would cause existing 
functionality to change)
   - [x] New feature (non-breaking change which adds functionality)
   - [ ] Bug fix (non-breaking change which fixes an issue)
   - [ ] Enhancement (improves an existing feature and functionality)
   - [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
   
   ## Screenshots (if appropriate):
   
   ## How Has This Been Tested?
   Integration tests, manually, CMK, UI
   
   ![Screenshot from 2019-12-16 
15-12-28](https://user-images.githubusercontent.com/43956255/70896650-1cef4180-2017-11ea-804e-140cf23d7d4d.png)
   ![Screenshot from 2019-12-16 
15-12-34](https://user-images.githubusercontent.com/43956255/70896651-1d87d800-2017-11ea-85a0-33ee21f41f3c.png)
   ![Screenshot from 2019-12-16 
15-12-44](https://user-images.githubusercontent.com/43956255/70896652-1d87d800-2017-11ea-85a2-ca0c77ff77e6.png)
   ![Screenshot from 2019-12-16 
15-12-55](https://user-images.githubusercontent.com/43956255/70896653-1d87d800-2017-11ea-9b7d-5bb72a26220d.png)
   ![Screenshot from 2019-12-16 
15-13-04](https://user-images.githubusercontent.com/43956255/70896655-1e206e80-2017-11ea-93a9-d3b5c8903961.png)
   
   API Changes - 
   New parameter added to list routers:
   ```
   (local) 🐵 > list routers includehealthcheckresults=true 
filter=id,healthchecksfailed,healthcheckresults
   {
     "count": 1,
     "router": [
       {
         "healthcheckresults": [
           {
             "checkname": "connectivity",
             "checktype": "basic",
             "details": "Successfully fetched data",
             "lastupdated": "2019-12-16T15:14:06+0530",
             "success": true
           },
           {
             "checkname": "cpu_usage_check.py",
             "checktype": "basic",
             "details": "CPU Usage within limits with current at 1.7%",
             "lastupdated": "2019-12-16T15:12:38+0530",
             "success": true
           },
           {
             "checkname": "dhcp.service",
             "checktype": "basic",
             "details": "service is running",
             "lastupdated": "2019-12-16T15:12:38+0530",
             "success": true
           },
           {
             "checkname": "dhcp_check.py",
             "checktype": "advance",
             "details": "All 1 VMs are present in dhcphosts.txt",
             "lastupdated": "2019-12-16T15:12:41+0530",
             "success": true
           },
           {
             "checkname": "disk_space_check.py",
             "checktype": "basic",
             "details": "Sufficient free space is 345 MB",
             "lastupdated": "2019-12-16T15:12:41+0530",
             "success": true
           },
           {
             "checkname": "dns_check.py",
             "checktype": "advance",
             "details": "All 1 VMs are present in /etc/hosts",
             "lastupdated": "2019-12-16T15:12:41+0530",
             "success": true
           },
           {
             "checkname": "gateways_check.py",
             "checktype": "basic",
             "details": "All 1 gateways are reachable via ping",
             "lastupdated": "2019-12-16T15:12:41+0530",
             "success": true
           },
           {
             "checkname": "haproxy_check.py",
             "checktype": "advance",
             "details": "No data provided to check, skipping",
             "lastupdated": "2019-12-16T15:12:41+0530",
             "success": true
           },
           {
             "checkname": "iptables_check.py",
             "checktype": "advance",
             "details": "No portforwarding rules provided to check, skipping",
             "lastupdated": "2019-12-16T15:12:41+0530",
             "success": true
           },
           {
             "checkname": "loadbalancing.service",
             "checktype": "basic",
             "details": "service is running",
             "lastupdated": "2019-12-16T15:12:38+0530",
             "success": true
           },
           {
             "checkname": "memory_usage_check.py",
             "checktype": "basic",
             "details": "Memory Usage within limits with current at 23.704%",
             "lastupdated": "2019-12-16T15:12:38+0530",
             "success": true
           },
           {
             "checkname": "router_version_check.py",
             "checktype": "basic",
             "details": "Template and scripts version match successful",
             "lastupdated": "2019-12-16T15:12:41+0530",
             "success": true
           },
           {
             "checkname": "ssh.service",
             "checktype": "basic",
             "details": "service is running",
             "lastupdated": "2019-12-16T15:12:38+0530",
             "success": true
           },
           {
             "checkname": "webserver.service",
             "checktype": "basic",
             "details": "service is running",
             "lastupdated": "2019-12-16T15:12:38+0530",
             "success": true
           }
         ],
         "healthchecksfailed": false,
         "id": "920452d6-7951-4425-ba2c-aecb2ddaaf6b"
       }
     ]
   }
   ```
   And a new API, getRouterHealthCheckResults, was added:
   ```
   (local) 🐵 > get routerhealthcheckresults 
routerid="920452d6-7951-4425-ba2c-aecb2ddaaf6b" performfreshchecks=true
   {
     "routerhealthchecks": {
       "healthchecks": [
         {
           "checkname": "connectivity.test",
           "checktype": "basic",
           "details": "Successfully fetched data",
           "lastupdated": "2019-12-16T15:19:47+0530",
           "success": true
         },
         {
           "checkname": "cpu_usage_check.py",
           "checktype": "basic",
           "details": "CPU Usage within limits with current at 2.4%",
           "lastupdated": "2019-12-16T15:19:43+0530",
           "success": true
         },
         {
           "checkname": "dhcp.service",
           "checktype": "basic",
           "details": "service is running",
           "lastupdated": "2019-12-16T15:19:43+0530",
           "success": true
         },
         {
           "checkname": "dhcp_check.py",
           "checktype": "advanced",
           "details": "All 1 VMs are present in dhcphosts.txt",
           "lastupdated": "2019-12-16T15:19:47+0530",
           "success": true
         },
         {
           "checkname": "disk_space_check.py",
           "checktype": "basic",
           "details": "Sufficient free space is 345 MB",
           "lastupdated": "2019-12-16T15:19:46+0530",
           "success": true
         },
         {
           "checkname": "dns_check.py",
           "checktype": "advanced",
           "details": "All 1 VMs are present in /etc/hosts",
           "lastupdated": "2019-12-16T15:19:47+0530",
           "success": true
         },
         {
           "checkname": "gateways_check.py",
           "checktype": "basic",
           "details": "All 1 gateways are reachable via ping",
           "lastupdated": "2019-12-16T15:19:46+0530",
           "success": true
         },
         {
           "checkname": "haproxy_check.py",
           "checktype": "advanced",
           "details": "No data provided to check, skipping",
           "lastupdated": "2019-12-16T15:19:47+0530",
           "success": true
         },
         {
           "checkname": "iptables_check.py",
           "checktype": "advanced",
           "details": "No portforwarding rules provided to check, skipping",
           "lastupdated": "2019-12-16T15:19:47+0530",
           "success": true
         },
         {
           "checkname": "loadbalancing.service",
           "checktype": "basic",
           "details": "service is running",
           "lastupdated": "2019-12-16T15:19:43+0530",
           "success": true
         },
         {
           "checkname": "memory_usage_check.py",
           "checktype": "basic",
           "details": "Memory Usage within limits with current at 23.8486%",
           "lastupdated": "2019-12-16T15:19:43+0530",
           "success": true
         },
         {
           "checkname": "router_version_check.py",
           "checktype": "basic",
           "details": "Template and scripts version match successful",
           "lastupdated": "2019-12-16T15:19:46+0530",
           "success": true
         },
         {
           "checkname": "ssh.service",
           "checktype": "basic",
           "details": "service is running",
           "lastupdated": "2019-12-16T15:19:43+0530",
           "success": true
         },
         {
           "checkname": "webserver.service",
           "checktype": "basic",
           "details": "service is running",
           "lastupdated": "2019-12-16T15:19:43+0530",
           "success": true
         }
       ],
       "routerid": "920452d6-7951-4425-ba2c-aecb2ddaaf6b"
     }
   }
   ```
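
A client consuming this response could pick out failed checks and compare them against the router.health.checks.failures.to.recreate.vr setting roughly as follows. This is only a sketch of the idea, not the management server's logic: `failed_checks` and `should_recreate` are hypothetical names, and the failing entry in the sample is made up for illustration (every check in the real output above succeeded).

```python
import json


def failed_checks(response):
    """Names of checks with success == false in a getRouterHealthCheckResults-style response."""
    checks = response["routerhealthchecks"]["healthchecks"]
    return [c["checkname"] for c in checks if not c["success"]]


def should_recreate(response, recreate_setting):
    """True if any failed check is named in the comma separated setting."""
    names = {n.strip() for n in recreate_setting.split(",") if n.strip()}
    # An empty setting yields an empty set, so recreation is never suggested.
    return any(name in names for name in failed_checks(response))


# Trimmed sample in the shape shown above; the failure is invented.
sample = json.loads("""
{
  "routerhealthchecks": {
    "healthchecks": [
      {"checkname": "cpu_usage_check.py", "checktype": "basic",
       "details": "CPU Usage within limits with current at 2.4%",
       "success": true},
      {"checkname": "dhcp_check.py", "checktype": "advanced",
       "details": "illustrative failure, not real output",
       "success": false}
    ],
    "routerid": "920452d6-7951-4425-ba2c-aecb2ddaaf6b"
  }
}
""")
```

With this sample, `failed_checks(sample)` lists only `dhcp_check.py`, and `should_recreate(sample, "")` is false because an empty setting means recreation is not attempted.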
