anuragaw opened a new pull request #3575: Health check feature for virtual router
URL: https://github.com/apache/cloudstack/pull/3575

We want to support more exhaustive health checks for VRs. This feature helps admins configure health checks and also expands their scope. There are two categories of health checks: basic and advanced (advanced checks are more expensive, so they should be run less frequently).

The following checks have been added with a separate script:

1. Services check (as per existing monitorServices.py) - basic check
2. Disk space check against a threshold - basic check
3. CPU usage check against a threshold - basic check
4. Memory usage check against a threshold - basic check
5. Router template and scripts version check - basic check
6. Connectivity to the gateways from router - basic check
7. DNS config match against MS - advanced check
8. DHCP config match against MS - advanced check
9. HA Proxy config match against MS (internal LB and public LB) - advanced check
10. Port forwarding match against MS in iptables - advanced check

The following global configs were added for configuring health checks:

• "router.health.checks.enabled" - If true, router health checks are allowed to be executed and read. If false, all scheduled checks and API calls for on demand checks are disabled. Default is true.
• "router.health.checks.basic.interval" - Interval in minutes at which basic router health checks are performed. If set to 0, no tests are scheduled. Default is 3 minutes, as per the existing monitor services.
• "router.health.checks.advanced.interval" - Interval in minutes at which advanced router health checks are performed. If set to 0, no tests are scheduled. Default is 10 minutes.
• "router.health.checks.config.refresh.interval" - Interval in minutes at which the router health checks config (such as scheduling intervals, excluded checks, etc.) is updated on virtual routers by the management server. This value should be sufficiently higher (for example 2x) than router.health.checks.basic.interval and router.health.checks.advanced.interval so that there is enough time for new results to be generated for the passed data. Default is 10 minutes.
• "router.health.checks.results.fetch.interval" - Interval in minutes at which router health checks results are fetched by the management server. On each result fetch, the management server evaluates the need to recreate the VR as per the configuration of router.health.checks.failures.to.recreate.vr. This value should be sufficiently higher (for example 2x) than router.health.checks.basic.interval and router.health.checks.advanced.interval so that there is time between new results generation and fetch.
• "router.health.checks.failures.to.recreate.vr" - Health check failures listed in this config are the ones that should cause router recreation. If empty, recreation is not attempted for any health check failure. Possible values are comma separated script names from systemvm’s /root/health_scripts/ (namely - cpu_usage_check.py, dhcp_check.py, disk_space_check.py, dns_check.py, gateways_check.py, haproxy_check.py, iptables_check.py, memory_usage_check.py, router_version_check.py), connectivity.test or services (namely - loadbalancing.service, webserver.service, dhcp.service)
• "router.health.checks.to.exclude" - Health checks that should be excluded when executing scheduled checks on the router. This can be a comma separated list of script names placed in the '/root/health_checks/' folder. Currently the following scripts are placed in the default systemvm template - cpu_usage_check.py, disk_space_check.py, gateways_check.py, iptables_check.py, router_version_check.py, dhcp_check.py, dns_check.py, haproxy_check.py, memory_usage_check.py.
• "router.health.checks.free.disk.space.threshold" - Free disk space threshold (in MB) on the VR below which the check is considered a failure. Default is 100 MB.
• "router.health.checks.max.cpu.usage.threshold" - Max CPU usage threshold (as %) above which the check is considered a failure.
• "router.health.checks.max.memory.usage.threshold" - Max memory usage threshold (as %) above which the check is considered a failure.

API Changes:

* listRouters and listInternalLoadBalancers now optionally take a flag includehealthcheckresults (default false) to fetch the last health check results for the router.
* getRouterHealthCheckResults - a new API added to fetch health check results, with an optional flag performfreshchecks to execute checks on demand. On-demand execution is disabled only if "router.health.checks.enabled" is false. performfreshchecks = true means all data from the management server is sent to the router and fresh checks are executed. If false, the previously executed results are retrieved from the router itself.

Additionally, the feature looks for any executable script in the /root/health_scripts/ directory and includes its result in the JSON output of the overall health checks. This allows custom checks to be added, and custom systemvm templates can also support health checks.

The UI shows a router in alert state if its health checks have failed. The health checks can be triggered manually using the new API added in this feature (both CLI and UI support this).

## Description

Fixes: 3270

## Types of changes

<!--- What types of changes does your code introduce?
Put an `x` in all the boxes that apply: -->
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)

## Screenshots (if appropriate):

## How Has This Been Tested?

Integration tests, manually, CMK, UI

API Changes - New parameters added to list routers:

```
(local) 🐵 > list routers includehealthcheckresults=true filter=id,healthchecksfailed,healthcheckresults
{
  "count": 1,
  "router": [
    {
      "healthcheckresults": [
        { "checkname": "connectivity", "checktype": "basic", "details": "Successfully fetched data", "lastupdated": "2019-12-16T15:14:06+0530", "success": true },
        { "checkname": "cpu_usage_check.py", "checktype": "basic", "details": "CPU Usage within limits with current at 1.7%", "lastupdated": "2019-12-16T15:12:38+0530", "success": true },
        { "checkname": "dhcp.service", "checktype": "basic", "details": "service is running", "lastupdated": "2019-12-16T15:12:38+0530", "success": true },
        { "checkname": "dhcp_check.py", "checktype": "advance", "details": "All 1 VMs are present in dhcphosts.txt", "lastupdated": "2019-12-16T15:12:41+0530", "success": true },
        { "checkname": "disk_space_check.py", "checktype": "basic", "details": "Sufficient free space is 345 MB", "lastupdated": "2019-12-16T15:12:41+0530", "success": true },
        { "checkname": "dns_check.py", "checktype": "advance", "details": "All 1 VMs are present in /etc/hosts", "lastupdated": "2019-12-16T15:12:41+0530", "success": true },
        { "checkname": "gateways_check.py", "checktype": "basic", "details": "All 1 gateways are reachable via ping", "lastupdated": "2019-12-16T15:12:41+0530", "success": true },
        { "checkname": "haproxy_check.py", "checktype": "advance", "details": "No data provided to check, skipping", "lastupdated": "2019-12-16T15:12:41+0530", "success": true },
        { "checkname": "iptables_check.py", "checktype": "advance", "details": "No portforwarding rules provided to check, skipping", "lastupdated": "2019-12-16T15:12:41+0530", "success": true },
        { "checkname": "loadbalancing.service", "checktype": "basic", "details": "service is running", "lastupdated": "2019-12-16T15:12:38+0530", "success": true },
        { "checkname": "memory_usage_check.py", "checktype": "basic", "details": "Memory Usage within limits with current at 23.704%", "lastupdated": "2019-12-16T15:12:38+0530", "success": true },
        { "checkname": "router_version_check.py", "checktype": "basic", "details": "Template and scripts version match successful", "lastupdated": "2019-12-16T15:12:41+0530", "success": true },
        { "checkname": "ssh.service", "checktype": "basic", "details": "service is running", "lastupdated": "2019-12-16T15:12:38+0530", "success": true },
        { "checkname": "webserver.service", "checktype": "basic", "details": "service is running", "lastupdated": "2019-12-16T15:12:38+0530", "success": true }
      ],
      "healthchecksfailed": false,
      "id": "920452d6-7951-4425-ba2c-aecb2ddaaf6b"
    }
  ]
}
```

And added new API - getRouterHealthCheckResults:

```
(local) 🐵 > get routerhealthcheckresults routerid="920452d6-7951-4425-ba2c-aecb2ddaaf6b" performfreshchecks=true
{
  "routerhealthchecks": {
    "healthchecks": [
      { "checkname": "connectivity.test", "checktype": "basic", "details": "Successfully fetched data", "lastupdated": "2019-12-16T15:19:47+0530", "success": true },
      { "checkname": "cpu_usage_check.py", "checktype": "basic", "details": "CPU Usage within limits with current at 2.4%", "lastupdated": "2019-12-16T15:19:43+0530", "success": true },
      { "checkname": "dhcp.service", "checktype": "basic", "details": "service is running", "lastupdated": "2019-12-16T15:19:43+0530", "success": true },
      { "checkname": "dhcp_check.py", "checktype": "advanced", "details": "All 1 VMs are present in dhcphosts.txt", "lastupdated": "2019-12-16T15:19:47+0530", "success": true },
      { "checkname": "disk_space_check.py", "checktype": "basic", "details": "Sufficient free space is 345 MB", "lastupdated": "2019-12-16T15:19:46+0530", "success": true },
      { "checkname": "dns_check.py", "checktype": "advanced", "details": "All 1 VMs are present in /etc/hosts", "lastupdated": "2019-12-16T15:19:47+0530", "success": true },
      { "checkname": "gateways_check.py", "checktype": "basic", "details": "All 1 gateways are reachable via ping", "lastupdated": "2019-12-16T15:19:46+0530", "success": true },
      { "checkname": "haproxy_check.py", "checktype": "advanced", "details": "No data provided to check, skipping", "lastupdated": "2019-12-16T15:19:47+0530", "success": true },
      { "checkname": "iptables_check.py", "checktype": "advanced", "details": "No portforwarding rules provided to check, skipping", "lastupdated": "2019-12-16T15:19:47+0530", "success": true },
      { "checkname": "loadbalancing.service", "checktype": "basic", "details": "service is running", "lastupdated": "2019-12-16T15:19:43+0530", "success": true },
      { "checkname": "memory_usage_check.py", "checktype": "basic", "details": "Memory Usage within limits with current at 23.8486%", "lastupdated": "2019-12-16T15:19:43+0530", "success": true },
      { "checkname": "router_version_check.py", "checktype": "basic", "details": "Template and scripts version match successful", "lastupdated": "2019-12-16T15:19:46+0530", "success": true },
      { "checkname": "ssh.service", "checktype": "basic", "details": "service is running", "lastupdated": "2019-12-16T15:19:43+0530", "success": true },
      { "checkname": "webserver.service", "checktype": "basic", "details": "service is running", "lastupdated": "2019-12-16T15:19:43+0530", "success": true }
    ],
    "routerid": "920452d6-7951-4425-ba2c-aecb2ddaaf6b"
  }
}
```
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
