This is an automated email from the ASF dual-hosted git repository.
rob pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/trafficcontrol.git
The following commit(s) were added to refs/heads/master by this push:
new 1e0122bcab updates to the trafficcontrol-health-client: (#6724)
1e0122bcab is described below
commit 1e0122bcabb3a589ae7e84d6c5bf5c0684552e51
Author: John J. Rushford <[email protected]>
AuthorDate: Mon Apr 18 09:11:21 2022 -0600
updates to the trafficcontrol-health-client: (#6724)
- updates the host status commands used by the health-client
to support both ATS 9 and ATS 10 traffic_ctl commands.
- updates the parsing of host statuses output to support both
ATS 9 and ATS 10 output formats.
- adds a markup-poll-threshold config variable to help with
health flapping
---
CHANGELOG.md | 2 +
tc-health-client/README.md | 104 ++++++++++++----------
tc-health-client/config/config.go | 2 +
tc-health-client/tmagent/tmagent.go | 173 +++++++++++++++++++++++-------------
4 files changed, 171 insertions(+), 110 deletions(-)
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 89f9a8ddf8..fc498e1427 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -56,6 +56,8 @@ The format is based on [Keep a
Changelog](http://keepachangelog.com/en/1.0.0/).
- Updated the Profiles Traffic Portal page to use a more performant
AG-Grid-based table.
- Updated Go version to 1.18
- Removed the unused `deliveryservice_tmuser` table from Traffic Ops database
+- Adds updates to the trafficcontrol-health-client to, use new ATS Host status
formats, detect and use proper
+ traffic_ctl commands, and adds new markup-poll-threshold config.
## [6.1.0] - 2022-01-18
### Added
diff --git a/tc-health-client/README.md b/tc-health-client/README.md
index ff6d42c0de..e397b8cae0 100644
--- a/tc-health-client/README.md
+++ b/tc-health-client/README.md
@@ -71,24 +71,24 @@ Requires Apache TrafficServer 8.1.0 or later.
-f, -\-config-file=config-file
- Specify the config file to use.
- Defaults to /etc/trafficcontro-health-client/tc-health-client.json
+Specify the config file to use.
+Defaults to /etc/trafficcontro-health-client/tc-health-client.json
-h, -\-help
- Prints command line usage and exits
+Prints command line usage and exits
-l, -\-logging-dir=logging-directory
- Specify the directory where log files are kept. The default location
- is **/var/log/trafficcontrol/**
+Specify the directory where log files are kept. The default location
+is **/var/log/trafficcontrol/**
-v, -\-verbose
- Logging verbosity. Errors are logged to the default log file
- **/var/log/trafficcontrol/tc-health-client.log**
- To add Warnings, use -v. To add Warnings and Informational
- logging, use -vv. Finally you may add Debug logging using -vvv.
+Logging verbosity. Errors are logged to the default log file
+**/var/log/trafficcontrol/tc-health-client.log**
+To add Warnings, use -v. To add Warnings and Informational
+logging, use -vv. Finally you may add Debug logging using -vvv.
# CONFIGURATION
@@ -109,6 +109,7 @@ Sample configuarion file:
"tm-proxy-url", "http://sample-http-proxy.cdn.net:80",
"to-login-dispersion-factor": 90,
"unavailable-poll-threshold": 2,
+ "markup-poll-threshold": 1,
"trafficserver-config-dir": "/opt/trafficserver/etc/trafficserver",
"trafficserver-bin-dir": "/opt/trafficserver/bin",
"poll-state-json-log": "/var/log/trafficcontrol/poll-state.json",
@@ -118,85 +119,91 @@ Sample configuarion file:
### cdn-name
- The name of the CDN that the Traffic Server host is a member of.
+The name of the CDN that the Traffic Server host is a member of.
### enable-active-markdowns
- When enabled, the client will actively mark down Traffic Server parents.
- When disabled, the client will only log that it would have marked down
- Traffic Server parents. Down Parents are always marked UP if Traffic Monitor
- reports them available irregardless of this setting.
+When enabled, the client will actively mark down Traffic Server parents.
+When disabled, the client will only log that it would have marked down
+Traffic Server parents. Down Parents are always marked UP if Traffic Monitor
+reports them available irregardless of this setting.
### reason-code
- Use the reason code **active** or **local** when marking down Traffic Server
- hosts in the Traffic Server **HostStatus** subsystem.
+Use the reason code **active** or **local** when marking down Traffic Server
+hosts in the Traffic Server **HostStatus** subsystem.
### to-credential-file
- The file where **Traffic Ops** credentials are read. The file should define
the
- following variables:
+The file where **Traffic Ops** credentials are read. The file should define
the
+following variables:
- * TO_URL="https://trafficops.cdn.com"
- * TO_USER="touser"
- * TO_PASS="touser_password"
+* TO_URL="https://trafficops.cdn.com"
+* TO_USER="touser"
+* TO_PASS="touser_password"
### to-url
- The **Traffic Ops** URL
+The **Traffic Ops** URL
### to-request-timeout-seconds
- The time in seconds to wait for a query response from both **Traffic Ops**
and
- the **Traffic Monitors**
+The time in seconds to wait for a query response from both **Traffic Ops** and
+the **Traffic Monitors**
### tm-poll-interval-seconds
- The polling interval in seconds used to update **Traffic Server** parent
- status.
+The polling interval in seconds used to update **Traffic Server** parent
+status.
### tm-proxy-url
- If not nil, all Traffic Monitor requests will be proxied through this
- proxy endpoint. This is useful when there are large numbers of caches
- polling a Traffic Monitor and you wish to funnel queries through a caching
- proxy server to limit direct direct connections to Traffic Monitor.
+If not nil, all Traffic Monitor requests will be proxied through this
+proxy endpoint. This is useful when there are large numbers of caches
+polling a Traffic Monitor and you wish to funnel queries through a caching
+proxy server to limit direct direct connections to Traffic Monitor.
### to-login-dispersion-factor
- This is used to calculate TrafficOps login dispersion. It is related to the
- **tm-poll-interval-seconds**. The login dispersion is computed by
multiplying
- **tm-poll-interval-seconds** by the **to-login-dispersion-factor**. For
example
- if to-login-dispersion-factor is 90 and the **tm-poll-interval-seconds** is
10s
- the the dispersion modulo window is 900s.
+This is used to calculate TrafficOps login dispersion. It is related to the
+**tm-poll-interval-seconds**. The login dispersion is computed by multiplying
+**tm-poll-interval-seconds** by the **to-login-dispersion-factor**. For
example
+if to-login-dispersion-factor is 90 and the **tm-poll-interval-seconds** is 10s
+the the dispersion modulo window is 900s.
### unavailable-poll-threshold
- This controls when an unhealthy parent is marked down. An unhealthy parent
- will be marked down when the number of consecutive polls reaches this
threshold
- with the parent reported as unhealthy. The default threshold is 2.
+This controls when an unhealthy parent is marked down. An unhealthy parent
+will be marked down when the number of consecutive polls reaches this threshold
+with the parent reported as unhealthy. The default threshold is 2.
+
+### markup-poll-threshold
+
+This controls when a healthy parent is marked up. An healthy parent
+will be marked up when the number of consecutive polls reaches this threshold
+with the parent reported as healthy. The default threshold is 1.
### trafficserver-config-dir
- The location on the host where **Traffic Server** configuration files are
- located.
+The location on the host where **Traffic Server** configuration files are
+located.
### trafficserver-bin-dir
- The location on the host where **Traffic Server** **traffic_ctl** tool may
- be found.
+The location on the host where **Traffic Server** **traffic_ctl** tool may
+be found.
### poll-state-json-log ###
- The full path to the polling state file which contains information
- about the current status of parents and the health client configuration.
- Polling state data is written to this file after each polling cycle when
- enabled, see **enable-poll-state-log**
+The full path to the polling state file which contains information
+about the current status of parents and the health client configuration.
+Polling state data is written to this file after each polling cycle when
+enabled, see **enable-poll-state-log**
### enable-poll-state-log ###
- Enable writing the Polling state to the **poll-state-json-log** after
- eache polling cycle. Default **false**, disabled
+Enable writing the Polling state to the **poll-state-json-log** after
+eache polling cycle. Default **false**, disabled
# Files
@@ -209,4 +216,3 @@ Sample configuarion file:
* Traffic Server **strategies.yaml**
* Traffic Server **traffic_ctl** command
-
diff --git a/tc-health-client/config/config.go
b/tc-health-client/config/config.go
index 5255534bb7..99dc1d870c 100644
--- a/tc-health-client/config/config.go
+++ b/tc-health-client/config/config.go
@@ -55,6 +55,7 @@ const (
DefaultTrafficServerConfigDir = "/opt/trafficserver/etc/trafficserver"
DefaultTrafficServerBinDir = "/opt/trafficserver/bin"
DefaultUnavailablePollThreshold = 2
+ DefaultMarkupPollThreshold = 1
)
type Cfg struct {
@@ -70,6 +71,7 @@ type Cfg struct {
TmPollIntervalSeconds string
`json:"tm-poll-interval-seconds"`
TOLoginDispersionFactor int
`json:"to-login-dispersion-factor"`
UnavailablePollThreshold int
`json:"unavailable-poll-threshold"`
+ MarkUpPollThreshold int `json:"markup-poll-threshold"`
TrafficServerConfigDir string
`json:"trafficserver-config-dir"`
TrafficServerBinDir string `json:"trafficserver-bin-dir"`
PollStateJSONLog string `json:"poll-state-json-log"`
diff --git a/tc-health-client/tmagent/tmagent.go
b/tc-health-client/tmagent/tmagent.go
index bb4ebe08c8..1783311eaf 100644
--- a/tc-health-client/tmagent/tmagent.go
+++ b/tc-health-client/tmagent/tmagent.go
@@ -49,6 +49,14 @@ const (
StrategiesFile = "strategies.yaml"
)
+// this global is used to auto select the
+// proper ATS traffic_ctl command to use
+// when querying host status. for ATS
+// version 10 and greater this will remain
+// at 0. For ATS version 9, this will be
+// auto updated to 1
+var traffic_ctl_index = 0
+
type ParentAvailable interface {
available(reasonCode string) bool
}
@@ -85,6 +93,7 @@ type ParentStatus struct {
ManualReason bool
LastTmPoll int64
UnavailablePollCount int
+ MarkUpPollCount int
}
// used to get the overall parent availablity from the
@@ -531,6 +540,7 @@ func (c *ParentInfo) markParent(fqdn string, cacheStatus
string, available bool)
activeReason := pv.ActiveReason
localReason := pv.LocalReason
unavailablePollCount := pv.UnavailablePollCount
+ markUpPollCount := pv.MarkUpPollCount
log.Debugf("hostName: %s, UnavailablePollCount: %d, available:
%v", hostName, unavailablePollCount, available)
@@ -544,22 +554,31 @@ func (c *ParentInfo) markParent(fqdn string, cacheStatus
string, available bool)
err = c.execTrafficCtl(fqdn, available)
if err != nil {
log.Errorln(err.Error())
- }
- if err == nil {
+ } else {
hostAvailable = false
+ // reset the poll counts
+ markUpPollCount = 0
+ unavailablePollCount = 0
log.Infof("marked parent %s DOWN, cache
status was: %s\n", hostName, cacheStatus)
}
}
} else { // available
// marking the host up
- err = c.execTrafficCtl(fqdn, available)
- if err == nil {
- hostAvailable = true
- // reset the unavilable poll count
- unavailablePollCount = 0
- log.Infof("marked parent %s UP, cache status
was: %s\n", hostName, cacheStatus)
- } else {
+ markUpPollCount += 1
+ if markUpPollCount < c.Cfg.MarkUpPollThreshold {
+ log.Infof("TM indicates %s is available but the
MarkUpPollThreshold has not been reached", hostName)
hostAvailable = false
+ } else {
+ err = c.execTrafficCtl(fqdn, available)
+ if err != nil {
+ log.Errorln(err.Error())
+ } else {
+ hostAvailable = true
+ // reset the poll counts
+ unavailablePollCount = 0
+ markUpPollCount = 0
+ log.Infof("marked parent %s UP, cache
status was: %s\n", hostName, cacheStatus)
+ }
}
}
@@ -576,6 +595,7 @@ func (c *ParentInfo) markParent(fqdn string, cacheStatus
string, available bool)
pv.ActiveReason = activeReason
pv.LocalReason = localReason
pv.UnavailablePollCount = unavailablePollCount
+ pv.MarkUpPollCount = markUpPollCount
c.Parents[hostName] = pv
log.Debugf("Updated parent status: %v", pv)
}
@@ -587,15 +607,35 @@ func (c *ParentInfo) markParent(fqdn string, cacheStatus
string, available bool)
// subsystem.
func (c *ParentInfo) readHostStatus(parentStatus map[string]ParentStatus)
error {
tc := filepath.Join(c.TrafficServerBinDir, TrafficCtl)
-
- cmd := exec.Command(tc, "metric", "match", "host_status")
var stdout bytes.Buffer
var stderr bytes.Buffer
- cmd.Stdout = &stdout
- cmd.Stderr = &stderr
- err := cmd.Run()
- if err != nil {
- return fmt.Errorf("%s error: %s", TrafficCtl, stderr.String())
+
+ // auto select traffic_ctl command for ATS version 9 or 10 and later
+ for i := traffic_ctl_index; i <= 1; i++ {
+ var err error
+ switch i {
+ case 0: // ATS version 10 and later
+ cmd := exec.Command(tc, "host", "status")
+ cmd.Stdout = &stdout
+ cmd.Stderr = &stderr
+ err = cmd.Run()
+ case 1: // ATS version 9
+ cmd := exec.Command(tc, "metric", "match",
"host_status")
+ cmd.Stdout = &stdout
+ cmd.Stderr = &stderr
+ err = cmd.Run()
+ }
+ if err == nil {
+ break
+ }
+ if err != nil && i == 0 {
+ log.Infof("%s command used is not for ATS version 10 or
later, downgrading to ATS version 9\n", TrafficCtl)
+ traffic_ctl_index = 1
+ continue
+ }
+ if err != nil {
+ return fmt.Errorf("%s error: %s", TrafficCtl,
stderr.String())
+ }
}
if len((stdout.Bytes())) > 0 {
@@ -607,55 +647,66 @@ func (c *ParentInfo) readHostStatus(parentStatus
map[string]ParentStatus) error
scanner := bufio.NewScanner(bytes.NewReader(stdout.Bytes()))
for scanner.Scan() {
line := strings.TrimSpace(scanner.Text())
- if strings.HasPrefix(line,
"proxy.process.host_status.") {
- fields := strings.Split(line, " ")
- if len(fields) > 0 {
- fqdnField := strings.Split(fields[0],
"proxy.process.host_status.")
- if len(fqdnField) > 0 {
- fqdn = fqdnField[1]
+ fields := strings.Split(line, " ")
+ /*
+ * For ATS Version 9, the host status uses internal
stats and prefixes
+ * the fqdn field from the output of the traffic_ctl
host status and metric
+ * match commands with "proxy.process.host_status".
Going forward starting
+ * with ATS Version 10, internal stats are no-longer
used and the fqdn field
+ * is no-longer prefixed with the
"proxy.process.host_status" string.
+ */
+ if len(fields) == 2 {
+ // check for ATS version 9 output.
+ fqdnField := strings.Split(fields[0],
"proxy.process.host_status.")
+ if len(fqdnField) == 2 { // ATS version 9
+ fqdn = fqdnField[1]
+ } else { // ATS version 10 and greater
+ fqdn = fqdnField[0]
+ }
+ statField := strings.Split(fields[1], ",")
+ if len(statField) == 5 {
+ if strings.HasPrefix(statField[1],
"ACTIVE:UP") {
+ activeReason = true
+ } else if
strings.HasPrefix(statField[1], "ACTIVE:DOWN") {
+ activeReason = false
}
- statField := strings.Split(fields[1],
",")
- if len(statField) == 5 {
- if
strings.HasPrefix(statField[1], "ACTIVE:UP") {
- activeReason = true
- } else if
strings.HasPrefix(statField[1], "ACTIVE:DOWN") {
- activeReason = false
- }
- if
strings.HasPrefix(statField[2], "LOCAL:UP") {
- localReason = true
- } else if
strings.HasPrefix(statField[2], "LOCAL:DOWN") {
- localReason = false
- }
- if
strings.HasPrefix(statField[3], "MANUAL:UP") {
- manualReason = true
- } else if
strings.HasPrefix(statField[3], "MANUAL:DOWN") {
- manualReason = false
- }
+ if strings.HasPrefix(statField[2],
"LOCAL:UP") {
+ localReason = true
+ } else if
strings.HasPrefix(statField[2], "LOCAL:DOWN") {
+ localReason = false
}
- pstat := ParentStatus{
- Fqdn: fqdn,
- ActiveReason:
activeReason,
- LocalReason:
localReason,
- ManualReason:
manualReason,
- LastTmPoll: 0,
- UnavailablePollCount: 0,
+ if strings.HasPrefix(statField[3],
"MANUAL:UP") {
+ manualReason = true
+ } else if
strings.HasPrefix(statField[3], "MANUAL:DOWN") {
+ manualReason = false
}
- hostName = parseFqdn(fqdn)
- pv, ok := parentStatus[hostName]
- // create the ParentStatus struct and
add it to the
- // Parents map only if an entry in the
map does not
- // already exist.
- if !ok {
+ }
+ pstat := ParentStatus{
+ Fqdn: fqdn,
+ ActiveReason: activeReason,
+ LocalReason: localReason,
+ ManualReason: manualReason,
+ LastTmPoll: 0,
+ UnavailablePollCount: 0,
+ MarkUpPollCount: 0,
+ }
+ log.Debugf("processed host status record:
%v\n", pstat)
+ hostName = parseFqdn(fqdn)
+ pv, ok := parentStatus[hostName]
+ // create the ParentStatus struct and add it to
the
+ // Parents map only if an entry in the map does
not
+ // already exist.
+ if !ok {
+ parentStatus[hostName] = pstat
+ log.Infof("added Host '%s' from ATS
Host Status to the parents map\n", hostName)
+ } else {
+ available :=
pstat.available(c.Cfg.ReasonCode)
+ if pv.available(c.Cfg.ReasonCode) !=
available {
+ log.Infof("host status for '%s'
has changed to %s\n", hostName, pstat.Status())
+ pstat.LastTmPoll = pv.LastTmPoll
+ pstat.UnavailablePollCount =
pv.UnavailablePollCount
+ pstat.MarkUpPollCount =
pv.MarkUpPollCount
parentStatus[hostName] = pstat
- log.Infof("added Host '%s' from
ATS Host Status to the parents map\n", hostName)
- } else {
- available :=
pstat.available(c.Cfg.ReasonCode)
- if
pv.available(c.Cfg.ReasonCode) != available {
- log.Infof("host status
for '%s' has changed to %s\n", hostName, pstat.Status())
- pstat.LastTmPoll =
pv.LastTmPoll
-
pstat.UnavailablePollCount = pv.UnavailablePollCount
- parentStatus[hostName]
= pstat
- }
}
}
}