This is an automated email from the ASF dual-hosted git repository.

rob pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/trafficcontrol.git


The following commit(s) were added to refs/heads/master by this push:
     new 1e0122bcab updates to the trafficcontrol-health-client: (#6724)
1e0122bcab is described below

commit 1e0122bcabb3a589ae7e84d6c5bf5c0684552e51
Author: John J. Rushford <[email protected]>
AuthorDate: Mon Apr 18 09:11:21 2022 -0600

    updates to the trafficcontrol-health-client: (#6724)
    
    - updates the host status commands used by the health-client
      to support both ATS 9 and ATS 10 traffic_ctl commands.
    - updates the parsing of host statuses output to support both
      ATS 9 and ATS 10 output formats.
    - adds a markup-poll-threshold config variable to help with
      health flapping
---
 CHANGELOG.md                        |   2 +
 tc-health-client/README.md          | 104 ++++++++++++----------
 tc-health-client/config/config.go   |   2 +
 tc-health-client/tmagent/tmagent.go | 173 +++++++++++++++++++++++-------------
 4 files changed, 171 insertions(+), 110 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 89f9a8ddf8..fc498e1427 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -56,6 +56,8 @@ The format is based on [Keep a 
Changelog](http://keepachangelog.com/en/1.0.0/).
 - Updated the Profiles Traffic Portal page to use a more performant 
AG-Grid-based table.
 - Updated Go version to 1.18
 - Removed the unused `deliveryservice_tmuser` table from Traffic Ops database
+- Adds updates to the trafficcontrol-health-client to, use new ATS Host status 
formats, detect and use proper
+  traffic_ctl commands, and adds new markup-poll-threshold config.
 
 ## [6.1.0] - 2022-01-18
 ### Added
diff --git a/tc-health-client/README.md b/tc-health-client/README.md
index ff6d42c0de..e397b8cae0 100644
--- a/tc-health-client/README.md
+++ b/tc-health-client/README.md
@@ -71,24 +71,24 @@ Requires Apache TrafficServer 8.1.0 or later.
 
 -f, -\-config-file=config-file 
   
-  Specify the config file to use.  
-  Defaults to /etc/trafficcontro-health-client/tc-health-client.json
+Specify the config file to use.  
+Defaults to /etc/trafficcontro-health-client/tc-health-client.json
 
 -h, -\-help 
 
-  Prints command line usage and exits
+Prints command line usage and exits
 
 -l, -\-logging-dir=logging-directory
 
-  Specify the directory where log files are kept.  The default location
-  is **/var/log/trafficcontrol/**
+Specify the directory where log files are kept.  The default location
+is **/var/log/trafficcontrol/**
 
 -v, -\-verbose
 
-  Logging verbosity.  Errors are logged to the default log file 
-  **/var/log/trafficcontrol/tc-health-client.log**
-  To add Warnings, use -v.  To add Warnings and Informational 
-  logging, use -vv.  Finally you may add Debug logging using -vvv.
+Logging verbosity.  Errors are logged to the default log file 
+**/var/log/trafficcontrol/tc-health-client.log**
+To add Warnings, use -v.  To add Warnings and Informational 
+logging, use -vv.  Finally you may add Debug logging using -vvv.
 
 # CONFIGURATION
 
@@ -109,6 +109,7 @@ Sample configuarion file:
     "tm-proxy-url", "http://sample-http-proxy.cdn.net:80";,
     "to-login-dispersion-factor": 90,
     "unavailable-poll-threshold": 2,
+    "markup-poll-threshold": 1,
     "trafficserver-config-dir": "/opt/trafficserver/etc/trafficserver",
     "trafficserver-bin-dir": "/opt/trafficserver/bin",
     "poll-state-json-log": "/var/log/trafficcontrol/poll-state.json",
@@ -118,85 +119,91 @@ Sample configuarion file:
 
 ### cdn-name 
 
-  The name of the CDN that the Traffic Server host is a member of.
+The name of the CDN that the Traffic Server host is a member of.
 
 ### enable-active-markdowns
 
-  When enabled, the client will actively mark down Traffic Server parents.
-  When disabled, the client will only log that it would have marked down
-  Traffic Server parents.  Down Parents are always marked UP if Traffic Monitor
-  reports them available irregardless of this setting.
+When enabled, the client will actively mark down Traffic Server parents.
+When disabled, the client will only log that it would have marked down
+Traffic Server parents.  Down Parents are always marked UP if Traffic Monitor
+reports them available irregardless of this setting.
 
 ### reason-code
 
-  Use the reason code **active** or **local** when marking down Traffic Server
-  hosts in the Traffic Server **HostStatus** subsystem.
+Use the reason code **active** or **local** when marking down Traffic Server
+hosts in the Traffic Server **HostStatus** subsystem.
 
 ### to-credential-file
 
-  The file where **Traffic Ops** credentials are read.  The file should define 
the 
-  following variables:
+The file where **Traffic Ops** credentials are read.  The file should define 
the 
+following variables:
 
-  * TO_URL="https://trafficops.cdn.com";
-  * TO_USER="touser"
-  * TO_PASS="touser_password"
+* TO_URL="https://trafficops.cdn.com";
+* TO_USER="touser"
+* TO_PASS="touser_password"
 
 ### to-url
 
-  The **Traffic Ops** URL
+The **Traffic Ops** URL
 
 ### to-request-timeout-seconds
 
-  The time in seconds to wait for a query response from both **Traffic Ops** 
and
-  the **Traffic Monitors**
+The time in seconds to wait for a query response from both **Traffic Ops** and
+the **Traffic Monitors**
 
 ### tm-poll-interval-seconds
 
-  The polling interval in seconds used to update **Traffic Server** parent
-  status.
+The polling interval in seconds used to update **Traffic Server** parent
+status.
 
 ### tm-proxy-url
 
-  If not nil, all Traffic Monitor requests will be proxied through this 
-  proxy endpoint.  This is useful when there are large numbers of caches
-  polling a Traffic Monitor and you wish to funnel queries through a caching
-  proxy server to limit direct direct connections to Traffic Monitor.
+If not nil, all Traffic Monitor requests will be proxied through this 
+proxy endpoint.  This is useful when there are large numbers of caches
+polling a Traffic Monitor and you wish to funnel queries through a caching
+proxy server to limit direct direct connections to Traffic Monitor.
 
 ### to-login-dispersion-factor
 
-  This is used to calculate TrafficOps login dispersion.  It is related to the
-  **tm-poll-interval-seconds**.  The login dispersion is computed by 
multiplying
-  **tm-poll-interval-seconds** by the **to-login-dispersion-factor**.  For 
example
-  if to-login-dispersion-factor is 90 and the **tm-poll-interval-seconds** is 
10s
-  the the dispersion modulo window is 900s.
+This is used to calculate TrafficOps login dispersion.  It is related to the
+**tm-poll-interval-seconds**.  The login dispersion is computed by multiplying
+**tm-poll-interval-seconds** by the **to-login-dispersion-factor**.  For 
example
+if to-login-dispersion-factor is 90 and the **tm-poll-interval-seconds** is 10s
+the the dispersion modulo window is 900s.
 
 ### unavailable-poll-threshold
 
-  This controls when an unhealthy parent is marked down.  An unhealthy parent
-  will be marked down when the number of consecutive polls reaches this 
threshold
-  with the parent reported as unhealthy.  The default threshold is 2.
+This controls when an unhealthy parent is marked down.  An unhealthy parent
+will be marked down when the number of consecutive polls reaches this threshold
+with the parent reported as unhealthy.  The default threshold is 2.
+
+### markup-poll-threshold
+
+This controls when a healthy parent is marked up.  An healthy parent
+will be marked up when the number of consecutive polls reaches this threshold
+with the parent reported as healthy.  The default threshold is 1.
 
 ### trafficserver-config-dir
 
-  The location on the host where **Traffic Server** configuration files are 
-  located.
+The location on the host where **Traffic Server** configuration files are 
+located.
 
 ### trafficserver-bin-dir
 
-  The location on the host where **Traffic Server** **traffic_ctl** tool may
-  be found.
+The location on the host where **Traffic Server** **traffic_ctl** tool may
+be found.
 
 ### poll-state-json-log ###
 
-  The full path to the polling state file which contains information 
-  about the current status of parents and the health client configuration.
-  Polling state data is written to this file after each polling cycle when
-  enabled, see **enable-poll-state-log**
+The full path to the polling state file which contains information 
+about the current status of parents and the health client configuration.
+Polling state data is written to this file after each polling cycle when
+enabled, see **enable-poll-state-log**
 
 ### enable-poll-state-log ###
 
-  Enable writing the Polling state to the **poll-state-json-log** after
-  eache polling cycle.  Default **false**, disabled
+Enable writing the Polling state to the **poll-state-json-log** after
+eache polling cycle.  Default **false**, disabled
 
 # Files
 
@@ -209,4 +216,3 @@ Sample configuarion file:
 * Traffic Server **strategies.yaml**
 * Traffic Server **traffic_ctl** command
   
-   
diff --git a/tc-health-client/config/config.go 
b/tc-health-client/config/config.go
index 5255534bb7..99dc1d870c 100644
--- a/tc-health-client/config/config.go
+++ b/tc-health-client/config/config.go
@@ -55,6 +55,7 @@ const (
        DefaultTrafficServerConfigDir   = "/opt/trafficserver/etc/trafficserver"
        DefaultTrafficServerBinDir      = "/opt/trafficserver/bin"
        DefaultUnavailablePollThreshold = 2
+       DefaultMarkupPollThreshold      = 1
 )
 
 type Cfg struct {
@@ -70,6 +71,7 @@ type Cfg struct {
        TmPollIntervalSeconds    string          
`json:"tm-poll-interval-seconds"`
        TOLoginDispersionFactor  int             
`json:"to-login-dispersion-factor"`
        UnavailablePollThreshold int             
`json:"unavailable-poll-threshold"`
+       MarkUpPollThreshold      int             `json:"markup-poll-threshold"`
        TrafficServerConfigDir   string          
`json:"trafficserver-config-dir"`
        TrafficServerBinDir      string          `json:"trafficserver-bin-dir"`
        PollStateJSONLog         string          `json:"poll-state-json-log"`
diff --git a/tc-health-client/tmagent/tmagent.go 
b/tc-health-client/tmagent/tmagent.go
index bb4ebe08c8..1783311eaf 100644
--- a/tc-health-client/tmagent/tmagent.go
+++ b/tc-health-client/tmagent/tmagent.go
@@ -49,6 +49,14 @@ const (
        StrategiesFile = "strategies.yaml"
 )
 
+// this global is used to auto select the
+// proper ATS traffic_ctl command to use
+// when querying host status. for ATS
+// version 10 and greater this will remain
+// at 0.  For ATS version 9, this will be
+// auto updated to 1
+var traffic_ctl_index = 0
+
 type ParentAvailable interface {
        available(reasonCode string) bool
 }
@@ -85,6 +93,7 @@ type ParentStatus struct {
        ManualReason         bool
        LastTmPoll           int64
        UnavailablePollCount int
+       MarkUpPollCount      int
 }
 
 // used to get the overall parent availablity from the
@@ -531,6 +540,7 @@ func (c *ParentInfo) markParent(fqdn string, cacheStatus 
string, available bool)
                activeReason := pv.ActiveReason
                localReason := pv.LocalReason
                unavailablePollCount := pv.UnavailablePollCount
+               markUpPollCount := pv.MarkUpPollCount
 
                log.Debugf("hostName: %s, UnavailablePollCount: %d, available: 
%v", hostName, unavailablePollCount, available)
 
@@ -544,22 +554,31 @@ func (c *ParentInfo) markParent(fqdn string, cacheStatus 
string, available bool)
                                err = c.execTrafficCtl(fqdn, available)
                                if err != nil {
                                        log.Errorln(err.Error())
-                               }
-                               if err == nil {
+                               } else {
                                        hostAvailable = false
+                                       // reset the poll counts
+                                       markUpPollCount = 0
+                                       unavailablePollCount = 0
                                        log.Infof("marked parent %s DOWN, cache 
status was: %s\n", hostName, cacheStatus)
                                }
                        }
                } else { // available
                        // marking the host up
-                       err = c.execTrafficCtl(fqdn, available)
-                       if err == nil {
-                               hostAvailable = true
-                               // reset the unavilable poll count
-                               unavailablePollCount = 0
-                               log.Infof("marked parent %s UP, cache status 
was: %s\n", hostName, cacheStatus)
-                       } else {
+                       markUpPollCount += 1
+                       if markUpPollCount < c.Cfg.MarkUpPollThreshold {
+                               log.Infof("TM indicates %s is available but the 
MarkUpPollThreshold has not been reached", hostName)
                                hostAvailable = false
+                       } else {
+                               err = c.execTrafficCtl(fqdn, available)
+                               if err != nil {
+                                       log.Errorln(err.Error())
+                               } else {
+                                       hostAvailable = true
+                                       // reset the poll counts
+                                       unavailablePollCount = 0
+                                       markUpPollCount = 0
+                                       log.Infof("marked parent %s UP, cache 
status was: %s\n", hostName, cacheStatus)
+                               }
                        }
                }
 
@@ -576,6 +595,7 @@ func (c *ParentInfo) markParent(fqdn string, cacheStatus 
string, available bool)
                        pv.ActiveReason = activeReason
                        pv.LocalReason = localReason
                        pv.UnavailablePollCount = unavailablePollCount
+                       pv.MarkUpPollCount = markUpPollCount
                        c.Parents[hostName] = pv
                        log.Debugf("Updated parent status: %v", pv)
                }
@@ -587,15 +607,35 @@ func (c *ParentInfo) markParent(fqdn string, cacheStatus 
string, available bool)
 // subsystem.
 func (c *ParentInfo) readHostStatus(parentStatus map[string]ParentStatus) 
error {
        tc := filepath.Join(c.TrafficServerBinDir, TrafficCtl)
-
-       cmd := exec.Command(tc, "metric", "match", "host_status")
        var stdout bytes.Buffer
        var stderr bytes.Buffer
-       cmd.Stdout = &stdout
-       cmd.Stderr = &stderr
-       err := cmd.Run()
-       if err != nil {
-               return fmt.Errorf("%s error: %s", TrafficCtl, stderr.String())
+
+       // auto select traffic_ctl command for ATS version 9 or 10 and later
+       for i := traffic_ctl_index; i <= 1; i++ {
+               var err error
+               switch i {
+               case 0: // ATS version 10 and later
+                       cmd := exec.Command(tc, "host", "status")
+                       cmd.Stdout = &stdout
+                       cmd.Stderr = &stderr
+                       err = cmd.Run()
+               case 1: // ATS version 9
+                       cmd := exec.Command(tc, "metric", "match", 
"host_status")
+                       cmd.Stdout = &stdout
+                       cmd.Stderr = &stderr
+                       err = cmd.Run()
+               }
+               if err == nil {
+                       break
+               }
+               if err != nil && i == 0 {
+                       log.Infof("%s command used is not for ATS version 10 or 
later, downgrading to ATS version 9\n", TrafficCtl)
+                       traffic_ctl_index = 1
+                       continue
+               }
+               if err != nil {
+                       return fmt.Errorf("%s error: %s", TrafficCtl, 
stderr.String())
+               }
        }
 
        if len((stdout.Bytes())) > 0 {
@@ -607,55 +647,66 @@ func (c *ParentInfo) readHostStatus(parentStatus 
map[string]ParentStatus) error
                scanner := bufio.NewScanner(bytes.NewReader(stdout.Bytes()))
                for scanner.Scan() {
                        line := strings.TrimSpace(scanner.Text())
-                       if strings.HasPrefix(line, 
"proxy.process.host_status.") {
-                               fields := strings.Split(line, " ")
-                               if len(fields) > 0 {
-                                       fqdnField := strings.Split(fields[0], 
"proxy.process.host_status.")
-                                       if len(fqdnField) > 0 {
-                                               fqdn = fqdnField[1]
+                       fields := strings.Split(line, " ")
+                       /*
+                        * For ATS Version 9, the host status uses internal 
stats and prefixes
+                        * the fqdn field from the output of the traffic_ctl 
host status and metric
+                        * match commands with "proxy.process.host_status".  
Going forward starting
+                        * with ATS Version 10, internal stats are no-longer 
used and the fqdn field
+                        * is no-longer prefixed with the 
"proxy.process.host_status" string.
+                        */
+                       if len(fields) == 2 {
+                               // check for ATS version 9 output.
+                               fqdnField := strings.Split(fields[0], 
"proxy.process.host_status.")
+                               if len(fqdnField) == 2 { // ATS version 9
+                                       fqdn = fqdnField[1]
+                               } else { // ATS version 10 and greater
+                                       fqdn = fqdnField[0]
+                               }
+                               statField := strings.Split(fields[1], ",")
+                               if len(statField) == 5 {
+                                       if strings.HasPrefix(statField[1], 
"ACTIVE:UP") {
+                                               activeReason = true
+                                       } else if 
strings.HasPrefix(statField[1], "ACTIVE:DOWN") {
+                                               activeReason = false
                                        }
-                                       statField := strings.Split(fields[1], 
",")
-                                       if len(statField) == 5 {
-                                               if 
strings.HasPrefix(statField[1], "ACTIVE:UP") {
-                                                       activeReason = true
-                                               } else if 
strings.HasPrefix(statField[1], "ACTIVE:DOWN") {
-                                                       activeReason = false
-                                               }
-                                               if 
strings.HasPrefix(statField[2], "LOCAL:UP") {
-                                                       localReason = true
-                                               } else if 
strings.HasPrefix(statField[2], "LOCAL:DOWN") {
-                                                       localReason = false
-                                               }
-                                               if 
strings.HasPrefix(statField[3], "MANUAL:UP") {
-                                                       manualReason = true
-                                               } else if 
strings.HasPrefix(statField[3], "MANUAL:DOWN") {
-                                                       manualReason = false
-                                               }
+                                       if strings.HasPrefix(statField[2], 
"LOCAL:UP") {
+                                               localReason = true
+                                       } else if 
strings.HasPrefix(statField[2], "LOCAL:DOWN") {
+                                               localReason = false
                                        }
-                                       pstat := ParentStatus{
-                                               Fqdn:                 fqdn,
-                                               ActiveReason:         
activeReason,
-                                               LocalReason:          
localReason,
-                                               ManualReason:         
manualReason,
-                                               LastTmPoll:           0,
-                                               UnavailablePollCount: 0,
+                                       if strings.HasPrefix(statField[3], 
"MANUAL:UP") {
+                                               manualReason = true
+                                       } else if 
strings.HasPrefix(statField[3], "MANUAL:DOWN") {
+                                               manualReason = false
                                        }
-                                       hostName = parseFqdn(fqdn)
-                                       pv, ok := parentStatus[hostName]
-                                       // create the ParentStatus struct and 
add it to the
-                                       // Parents map only if an entry in the 
map does not
-                                       // already exist.
-                                       if !ok {
+                               }
+                               pstat := ParentStatus{
+                                       Fqdn:                 fqdn,
+                                       ActiveReason:         activeReason,
+                                       LocalReason:          localReason,
+                                       ManualReason:         manualReason,
+                                       LastTmPoll:           0,
+                                       UnavailablePollCount: 0,
+                                       MarkUpPollCount:      0,
+                               }
+                               log.Debugf("processed host status record: 
%v\n", pstat)
+                               hostName = parseFqdn(fqdn)
+                               pv, ok := parentStatus[hostName]
+                               // create the ParentStatus struct and add it to 
the
+                               // Parents map only if an entry in the map does 
not
+                               // already exist.
+                               if !ok {
+                                       parentStatus[hostName] = pstat
+                                       log.Infof("added Host '%s' from ATS 
Host Status to the parents map\n", hostName)
+                               } else {
+                                       available := 
pstat.available(c.Cfg.ReasonCode)
+                                       if pv.available(c.Cfg.ReasonCode) != 
available {
+                                               log.Infof("host status for '%s' 
has changed to %s\n", hostName, pstat.Status())
+                                               pstat.LastTmPoll = pv.LastTmPoll
+                                               pstat.UnavailablePollCount = 
pv.UnavailablePollCount
+                                               pstat.MarkUpPollCount = 
pv.MarkUpPollCount
                                                parentStatus[hostName] = pstat
-                                               log.Infof("added Host '%s' from 
ATS Host Status to the parents map\n", hostName)
-                                       } else {
-                                               available := 
pstat.available(c.Cfg.ReasonCode)
-                                               if 
pv.available(c.Cfg.ReasonCode) != available {
-                                                       log.Infof("host status 
for '%s' has changed to %s\n", hostName, pstat.Status())
-                                                       pstat.LastTmPoll = 
pv.LastTmPoll
-                                                       
pstat.UnavailablePollCount = pv.UnavailablePollCount
-                                                       parentStatus[hostName] 
= pstat
-                                               }
                                        }
                                }
                        }

Reply via email to