masaori335 opened a new issue, #13169:
URL: https://github.com/apache/trafficserver/issues/13169
# Report of connect attempts bugs
- affected versions: 10.1.x and smaller
Relevant configs:
- `proxy.config.http.connect_attempts_max_retries`
- `proxy.config.http.connect_attempts_max_retries_suspect_server`
(replaces the deprecated
`proxy.config.http.connect_attempts_max_retries_down_server` —
see PR #13102)
- `proxy.config.http.connect_attempts_rr_retries`
- `proxy.config.http.down_server.cache_time` (the `fail_window`)
PRs:
- https://github.com/apache/trafficserver/pull/12846
- https://github.com/apache/trafficserver/pull/12880
- https://github.com/apache/trafficserver/pull/13092
- https://github.com/apache/trafficserver/pull/13102
## Net effect between 10.1.x and later
On 10.1.x and earlier, the three `connect_attempts_*` configs behaved
erratically:
- `connect_attempts_max_retries_down_server` was effectively unreachable
because the `is_down()`
check was inverted (#12846).
- Single-address origins ignored down-state entirely — `select_best_http`
unconditionally returned
the first (and only) entry (#12880).
- The "mark host DOWN" threshold was gated by `connect_attempts_rr_retries`
instead of
`connect_attempts_max_retries` (#13102).
- The fail-window clock used `client_request_time` instead of the actual
failure time, so the window
could start in the past or be pre-expired (#13102).
- After `select_next_rr()`, `dst_addr` was not updated, so RR "switches"
silently kept hitting the
same target (#13102).
- `mark_host_failure()` ran at end-of-txn against the post-switch
`dns_info.active`, so failures on
A could be charged to B (#13102).
- RR exhaustion terminated retries early even when the current target still
had per-host budget
(#13102).
The combined effect of PRs #12846, #12880, #13092, and #13102 is a coherent
model in which each
config has a well-defined role tied to an explicit `UP / DOWN / SUSPECT`
state for the active
target.
### Worked examples
All examples use the ATS defaults:
| Config | Default |
|-------------------------------------------------|---------|
| `connect_attempts_max_retries` | 3 |
| `connect_attempts_max_retries_suspect_server` | 1 |
| `connect_attempts_rr_retries` | 3 |
| `down_server.cache_time` (`fail_window`) | 60s |
#### Example 1 — Single-address origin goes offline
Client sends repeated requests to an origin whose only A record points to a
down IP. How many
connect attempts does ATS make before the IP is marked DOWN in HostDB?
**10.1.x behavior — IP marked DOWN after 12 connect attempts (3 failed
transactions).**
Why: `mark_host_failure()` is only called from
`do_hostdb_update_if_necessary()`, which is only
reached at the end of a transaction (via
`handle_server_connection_not_open`). So `fail_count`
increments **once per failed transaction**, not once per attempt. The
threshold is
`connect_attempts_rr_retries` (default `3`) — the wrong config, but that's
what 10.1.x uses.
Per-transaction there are `1 + connect_attempts_max_retries = 4` attempts.
| Attempt | Txn | `fail_count` after | marked DOWN? |
|---------|-----|--------------------|--------------|
| 1 | 1 | 0 | no |
| 2 | 1 | 0 | no |
| 3 | 1 | 0 | no |
| 4 | 1 | 1 (end of txn) | no |
| 5 | 2 | 1 | no |
| 6 | 2 | 1 | no |
| 7 | 2 | 1 | no |
| 8 | 2 | 2 (end of txn) | no |
| 9 | 3 | 2 | no |
| 10 | 3 | 2 | no |
| 11 | 3 | 2 | no |
| **12** | 3 | 3 ≥ 3 | **yes** |
And because `select_best_http()` returns `info[0]` unconditionally for
single-address records
(#12880), every one of those 12 attempts actually hits the network. Even
after the IP is marked
DOWN, subsequent requests keep trying it because the single-address code
path ignores down-state.
**Later behavior — IP marked DOWN after 4 connect attempts (1 failed
transaction).**
Why: `do_hostdb_update_if_necessary()` is now called after **each** failure
in
`handle_response_from_server()`, so `fail_count` increments **once per
attempt**. The threshold is
`max_retries + 1 = 4` (for an UP target). Per-transaction there are still `1
+ max_retries = 4`
attempts, so the threshold is reached on the last attempt of the first
failed transaction.
| Attempt | Txn | `fail_count` after | marked DOWN? |
|---------|-----|--------------------|--------------|
| 1 | 1 | 1 | no |
| 2 | 1 | 2 | no |
| 3 | 1 | 3 | no |
| **4** | 1 | 4 ≥ 4 | **yes** |
Subsequent requests within `fail_window` (60s): `select_best_http()` now
honors down-state for
single-address records, so HostDB lookup returns `nullptr` and the SM
reports `OriginDown`
immediately — **zero connect attempts**.
After `fail_window` elapses, the entry is SUSPECT. The next request gets
`connect_attempts_max_retries_suspect_server + 1 = 2` attempts (1 probe + 1
retry). Success → UP;
failure → back to DOWN for another 60s.
#### Example 2 — Round-robin with one bad backend (A fails, B/C healthy)
Defaults apply: `rr_retries = 3`, `max_retries = 3`. A is unresponsive, B
and C are fine. Initial
active = A.
**10.1.x behavior — client gets 502, and the wrong host is blamed.**
Two compounding bugs: (a) after `select_next_rr()` is called, neither
`dns_info.addr` nor
`server_info.dst_addr` is updated, so the next connect still goes to A's
address. (b)
`mark_host_failure()` is called once at end-of-txn against
`dns_info.active`, which by that point is
the *post-switch* host (B), so B's `fail_count` is incremented for A's
failures.
| Attempt | dst_addr (target) | dns_info.active | retry_attempts after |
`(retry+1) % 3 == 0`? | Result |
|---------|-------------------|-----------------|----------------------|----------------------|--------|
| 1 | A | A | 1 | no
| fail |
| 2 | A | A | 2 | no
| fail |
| 3 | A | A → switched to B (`select_next_rr`);
dst_addr untouched | 3 | **yes** | fail |
| 4 | A (still!) | B | 3 ≥ 3, give up | n/a
| fail → 502 to client |
End of txn: `mark_host_failure()` increments `B.fail_count` even though B
was never tried. Repeated
requests do this 3 times → **B gets marked DOWN** while A (the
actually-broken host) keeps getting
hit on every transaction.
**Later behavior — client gets 200; A's failures are correctly recorded
against A.**
`do_hostdb_update_if_necessary()` runs after each failure (so the correct
`dns_info.active` is in
scope), and after `select_next_rr()` the SM explicitly assigns
`dns_info.addr` and
`server_info.dst_addr` to the new target.
| Attempt | dst_addr (target) | dns_info.active | A.fail_count after |
retry_attempts after | RR switch? | Result |
|---------|-------------------|-----------------|--------------------|----------------------|------------|--------|
| 1 | A | A | 1 | 1
| no | fail |
| 2 | A | A | 2 | 2
| no | fail |
| 3 | A | A | 3 | 3
| **yes** → active=B, dst_addr=B | fail |
| **4** | **B** | B | 3 (unchanged) | n/a
| n/a | **success → 200** |
A finishes this transaction with `fail_count = 3` (not yet at the threshold
of
`max_retries + 1 = 4`). The next request that hashes to A bumps it to 4 and
marks A DOWN; subsequent
requests within `fail_window` skip A entirely until the window expires.
If B and C had also been DOWN (no selectable alternate), `select_next_rr()`
returns false and the SM
stays on A using its remaining per-host retry budget instead of giving up —
new in #13102.
**Later behavior**
- Failure timestamp is `ts_clock::now()` at the moment `mark_down` is
called, so the full
`fail_window` is honored regardless of how long the transaction took to
decide the host was bad.
## Summary of bug-fix PRs
### PR #12846 — Fix `HostDBInfo::is_down` condition
**Bug.** `HostDBInfo::is_down()` had an inverted comparison:
```cpp
// Before
return (last_fail != TS_TIME_ZERO) && (last_fail + fail_window < now);
// After
return (last_fail != TS_TIME_ZERO) && (now <= last_fail + fail_window);
```
The pre-fix code reported a host as DOWN only *after* `fail_window` had
elapsed — the opposite of
intent. During the fail window (when the host should be considered blocked)
it returned `false`.
**Behavior change.** A failed host is now correctly treated as DOWN for the
entire `fail_window`
following `last_failure`, and returns to UP only after the window has
elapsed. Every downstream path
that calls `is_down()` (retry budget, RR selection, negative-cached check)
is impacted.
### PR #12880 — Check state of HostDBInfo in `select_best_http`
**Bug.** For single-address (non round-robin) records,
`HostDBRecord::select_best_http()` bypassed
the state check:
```cpp
// Before
best_alive = &info[0]; // unconditional
// After
if (info[0].select(now, fail_window)) {
best_alive = &info[0];
}
```
A single-address host that was marked down would still be returned from
HostDB and reused for the
next request.
**Behavior change.** Single-address origins now honor `down_server` state
the same way RR origins
do: if the only target is DOWN and still within the fail window, HostDB
lookup fails (`nullptr`),
the SM reports `OriginDown`, and no connection attempt is made until the
window expires. The gold
autest `dns_host_down` was updated accordingly (second request returns 500
instead of 502 once the
first failure marked the IP down).
### PR #13092 — Clarify `HostDBInfo` state
**Not a bug fix — a model/refactor.** Introduces an explicit tri-state for
upstream health and
reshapes the API around it:
```cpp
enum class HostDBInfo::State { UP, DOWN, SUSPECT };
```
- **UP** — no known failure; normal selection.
- **DOWN** — `_last_failure` set, `now` is within `fail_window`; not
eligible.
- **SUSPECT** — `fail_window` has elapsed; a probe is permitted. A
successful response transitions
it to UP via `mark_up()`; another failure returns it to DOWN.
API renames: `is_alive` → `is_up`, `mark_active_server_alive` →
`mark_active_server_up`.
`mark_down` now takes `fail_window`. `select()` is replaced by callers using
`is_down()` directly.
`last_failure` / `fail_count` become private atomics (`_last_failure`,
`_fail_count`).
**Behavior change.** The previous two-state model collapsed SUSPECT into UP
once `fail_window`
expired, which meant a recovering host was treated identically to a
never-failed host for retry
sizing. With SUSPECT explicit, callers (notably the retry machinery in
#13102) can apply
`connect_attempts_max_retries_suspect_server` to probing traffic and
`connect_attempts_max_retries`
to UP traffic.
### PR #13102 — Fix connect attempt retries
**Bugs.** The retry machinery in
`HttpTransact::handle_response_from_server()` and
`HttpSM::mark_host_failure()` had several interlocking issues:
1. **Wrong threshold used to mark host down.** `mark_host_failure()` called
`increment_fail_count(..., connect_attempts_rr_retries, ...)`. The RR
switch-over config was
reused as the "mark down" threshold, so a host was marked DOWN after
`connect_attempts_rr_retries` failures instead of
`connect_attempts_max_retries + 1` total
attempts.
2. **Wrong clock for failure timestamp.** `do_hostdb_update_if_necessary()`
recorded the failure at
`t_state.client_request_time` (request-receipt time), not
`ts_clock::now()`. For long requests
the fail window started in the past, sometimes expiring before the
failure was even recorded.
3. **RR exhaustion short-circuited all retries.** When the
`connect_attempts_rr_retries` boundary
was hit and no other RR member was selectable, the old code gave up, even
though the current
target still had per-host retry budget.
4. **`connect_attempts_max_retries_down_server` had no effect in practice.**
It was only consulted
through `is_server_negative_cached()`, which depended on the inverted
`is_down()` from #12846. In
many cases it was either never applied or applied when it shouldn't have
been.
5. **`connect_attempts_max_retries_down_server == 0` blocked the probe.** A
separate code path in
`do_http_server_open()` refused to connect at all when the target was
negative-cached and the
config was zero — preventing the SUSPECT-state probe that the fail window
was designed to
allow.
6. **Retry config range allowed overflow.** `[0-255]` with `uint8_t`
threshold = `max_retries + 1`
wraps to 0.
7. **Config name didn't match its actual effect.** The "down_server" config
never applied to
DOWN-state origins (DOWN-state retries are now hardcoded to 0); it
controls the probe budget for
SUSPECT-state origins. Renamed to
`proxy.config.http.connect_attempts_max_retries_suspect_server`
to align with `HostDBInfo::State::SUSPECT`. The previous
`proxy.config.http.connect_attempts_max_retries_down_server` record is
kept and marked
deprecated; if only it is set, its value is mirrored forward into the new
record with a warning,
and if both are set the new record wins. The legacy
`TS_CONFIG_HTTP_CONNECT_ATTEMPTS_MAX_RETRIES_DOWN_SERVER` plugin enum
continues to work — it
aliases to the same internal field as the new `_SUSPECT_SERVER` enum.
**Behavior change.**
- Retry budget is now driven by HostDB state (via the new helper
`HttpTransact::origin_server_connect_attempts_max_retries`):
- `UP` → `connect_attempts_max_retries`
- `DOWN` → 0 (bail out immediately)
- `SUSPECT` → `connect_attempts_max_retries_suspect_server`
- `mark_host_failure` uses `max_retries + 1` as the attempt budget and
`ts_clock::now()` as the
failure time.
- RR switch and per-host retry are separated: if RR has no selectable
alternate, the SM stays on
the current target and keeps retrying up to the per-host budget rather
than giving up.
- The `do_http_server_open()` early-bail on "negative cached +
down_server=0" is removed — DOWN vs
SUSPECT is now the single source of truth.
- `proxy.config.http.connect_attempts_max_retries`,
`…_max_retries_suspect_server`, and
`…_rr_retries` clamped to `[0-254]` to avoid the `uint8_t` overflow when
adding 1 for the
total-attempt count.
- A new warning is logged at config reconfigure when
`connect_attempts_max_retries_suspect_server == 0` and
`connect_attempts_rr_retries > 0`, because
that combination prevents any probe of a recovering (SUSPECT) origin.
- `proxy.config.http.connect_attempts_max_retries_down_server` is now a
deprecated alias of
`…_suspect_server`. The struct field
`OverridableHttpConfigParams::connect_attempts_max_retries_down_server`
was removed; both
records.yaml entries and both `TSOverridableConfigKey` enum values write
into a single field, so
existing operator configs and plugins keep working while the codebase has
one source of truth.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]