moonchen opened a new pull request, #13310:
URL: https://github.com/apache/trafficserver/pull/13310
## Introduce Clang Thread Safety Analysis, and apply it to two subsystems
### Motivation
A lot of ATS's concurrency contracts live only in comments and `DEBUG`-only
`ink_assert`s — "must hold this mutex," "only call on this thread." The
compiler
can't see them, so the easy path is the one that silently violates them, and
the
result shows up in production as a data race.
Clang's Thread Safety Analysis (`-Wthread-safety`) lets us put those
contracts in
the type system: annotate which mutex guards which data, and the compiler
proves
every access holds the lock. The annotations are compile-time only — zero
runtime
cost, and nothing on GCC.
To show this pays for itself rather than just adding ceremony, this PR also
uses
the new analysis to **find and fix an existing data race** (see below).
### What's in this PR
1. **Annotation infrastructure** (`tsutil`)
- `tsutil/ts_thread_safety.h`: `TS_*` macros wrapping the Clang
attributes;
they expand to nothing off Clang.
- `ts::shared_mutex` becomes a capability, plus `ts::scoped_writer_lock` /
`ts::scoped_reader_lock` RAII guards.
- A unit test that is itself compiled with the analysis as a worked
example.
- A CMake lane: `-Wthread-safety` is on by default for Clang as a
**warning**.
2. **Annotate `SSLOriginSessionCache`** — its session map and queue are
reached
from every thread; they're now marked guarded by the cache mutex, so the
locking discipline is compiler-enforced instead of conventional. The
hand-rolled lock witness on `remove_oldest_session` (a `std::unique_lock`
parameter checked with `owns_lock()`) becomes a compile-time
`TS_REQUIRES`.
3. **Fix an unsynchronized read race in `Metrics::Storage`** — the analysis
caught it (details below).
4. **Gate it in CI** — the `ci`/`branch` presets set
`THREAD_SAFETY_ANALYSIS_AS_ERROR=ON` so findings are errors on the Clang
lanes and gate merges, while local and `dev` builds keep them warnings.
### The bug the analysis caught
`Metrics::Storage::create()`, `createSpan()`, and `current()` access
`_cur_blob` / `_cur_off` / `_blobs` under `_mutex`, but `valid()`,
`lookup(IdType)`, and `name()` read the same fields with **no lock held**
(`rename()` likewise read them before taking the lock). A single `Storage` is
shared by all threads; metrics registered at runtime — a plugin
`TSStatCreate`,
or a config reload — advance those fields while live traffic reads them
through
the plugin stat APIs. That's a data race / UB.
The fix moves `Storage` to `ts::shared_mutex` with reader/writer guards and
marks
the fields guarded by it, so every access holds the lock and the analysis
keeps
it that way. The read/write lock also lets the common lookup path run
concurrently.
### Why ATS-owned guards instead of `std::lock_guard`/`std::unique_lock`
Clang's analysis can only track lock state through types that carry its
annotations, and the standard RAII lock wrappers ATS uses don't:
- `std::unique_lock`, `std::shared_lock`, and `std::scoped_lock` are
intentionally left un-annotated (even in libc++) because they support
deferred locking, move, `release()`, and adopt — dynamic state the analysis
can't follow. ATS almost never needs that flexibility; the common case is a
plain scoped lock.
- `std::lock_guard` *is* annotated, but only in libc++. Relying on it would
tie
whether the analysis works at all to which standard library a build links
against, and produce false positives otherwise.
`ts::scoped_writer_lock` / `ts::scoped_reader_lock` are simple, scoped,
non-movable guards over `ts::shared_mutex` (ATS's own rwlock) — exactly the
shape the analysis can verify — and they behave the same regardless of
standard
library. Use `std::unique_lock`/`shared_lock` where the dynamic features are
genuinely needed, accepting that those sites sit outside the analysis.
### Developer impact
- **Default Clang builds**: warnings only. `-Wno-error=thread-safety` keeps
it a
warning even in `-Werror` builds, so an in-progress annotation never
blocks a
local build.
- **CI** (`THREAD_SAFETY_ANALYSIS_AS_ERROR=ON`, set in the presets): findings
are errors on the Clang lanes and gate the merge, which keeps `master`
clean —
so the default warnings only ever flag a developer's own in-progress
change.
- **GCC**: unaffected. The flag is Clang-only on purpose (GCC doesn't know
`-Wthread-safety` and would itself error if passed it), and the macros
compile
to nothing.
- **Going forward**: prefer the annotated `ts::` guards for scoped locking
and
mark the data they protect; reach for `std::` lock guards only where the
dynamic features are actually needed.
### Testing
- `test_tsutil` includes `test_thread_safety.cc`, compiled with
`-Werror=thread-safety` under Clang as a live example.
- The annotated translation units build clean under `clang -Wthread-safety`
and
under GCC; removing a lock from an annotated reader produces the expected
diagnostic.
### Scope
This is a deliberately small seed: the infrastructure plus two annotated
subsystems, one of which surfaced a real fix. It is meant to establish the
pattern and the CI lane; further annotation (e.g. NetHandler/cache/HostDB
state,
and enforcing "no blocking lock on ET_NET" via a thread-role capability) can
follow incrementally as each subsystem is brought to zero warnings.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]