moonchen opened a new pull request, #13310:
URL: https://github.com/apache/trafficserver/pull/13310

   ## Introduce Clang Thread Safety Analysis, and apply it to two subsystems
   
   ### Motivation
   
   A lot of ATS's concurrency contracts live only in comments and `DEBUG`-only
   `ink_assert`s — "must hold this mutex," "only call on this thread." The 
compiler
   can't see them, so the easy path is the one that silently violates them, and 
the
   result shows up in production as a data race.
   
   Clang's Thread Safety Analysis (`-Wthread-safety`) lets us put those 
contracts in
   the type system: annotate which mutex guards which data, and the compiler 
proves
   every access holds the lock. The annotations are compile-time only — zero 
runtime
   cost, and nothing on GCC.
   
   To show this pays for itself rather than just adding ceremony, this PR also 
uses
   the new analysis to **find and fix an existing data race** (see below).
   
   ### What's in this PR
   
   1. **Annotation infrastructure** (`tsutil`)
      - `tsutil/ts_thread_safety.h`: `TS_*` macros wrapping the Clang 
attributes;
        they expand to nothing off Clang.
      - `ts::shared_mutex` becomes a capability, plus `ts::scoped_writer_lock` /
        `ts::scoped_reader_lock` RAII guards.
      - A unit test that is itself compiled with the analysis as a worked 
example.
      - A CMake lane: `-Wthread-safety` is on by default for Clang as a 
**warning**.
   
   2. **Annotate `SSLOriginSessionCache`** — its session map and queue are 
reached
      from every thread; they're now marked guarded by the cache mutex, so the
      locking discipline is compiler-enforced instead of conventional. The
      hand-rolled lock witness on `remove_oldest_session` (a `std::unique_lock`
      parameter checked with `owns_lock()`) becomes a compile-time 
`TS_REQUIRES`.
   
   3. **Fix an unsynchronized read race in `Metrics::Storage`** — the analysis
      caught it (details below).
   
   4. **Gate it in CI** — the `ci`/`branch` presets set
      `THREAD_SAFETY_ANALYSIS_AS_ERROR=ON` so findings are errors on the Clang
      lanes and gate merges, while local and `dev` builds keep them warnings.
   
   ### The bug the analysis caught
   
   `Metrics::Storage::create()`, `createSpan()`, and `current()` access
   `_cur_blob` / `_cur_off` / `_blobs` under `_mutex`, but `valid()`,
   `lookup(IdType)`, and `name()` read the same fields with **no lock held**
   (`rename()` likewise read them before taking the lock). A single `Storage` is
   shared by all threads; metrics registered at runtime — a plugin 
`TSStatCreate`,
   or a config reload — advance those fields while live traffic reads them 
through
   the plugin stat APIs. That's a data race / UB.
   
   The fix moves `Storage` to `ts::shared_mutex` with reader/writer guards and 
marks
   the fields guarded by it, so every access holds the lock and the analysis 
keeps
   it that way. The read/write lock also lets the common lookup path run
   concurrently.
   
   ### Why ATS-owned guards instead of `std::lock_guard`/`std::unique_lock`
   
   Clang's analysis can only track lock state through types that carry its
   annotations, and the standard RAII lock wrappers ATS uses don't:
   
   - `std::unique_lock`, `std::shared_lock`, and `std::scoped_lock` are
     intentionally left un-annotated (even in libc++) because they support
     deferred locking, move, `release()`, and adopt — dynamic state the analysis
     can't follow. ATS almost never needs that flexibility; the common case is a
     plain scoped lock.
   - `std::lock_guard` *is* annotated, but only in libc++. Relying on it would 
tie
     whether the analysis works at all to which standard library a build links
     against, and produce false positives otherwise.
   
   `ts::scoped_writer_lock` / `ts::scoped_reader_lock` are simple, scoped,
   non-movable guards over `ts::shared_mutex` (ATS's own rwlock) — exactly the
   shape the analysis can verify — and they behave the same regardless of 
standard
   library. Use `std::unique_lock`/`shared_lock` where the dynamic features are
   genuinely needed, accepting that those sites sit outside the analysis.
   
   ### Developer impact
   
   - **Default Clang builds**: warnings only. `-Wno-error=thread-safety` keeps 
it a
     warning even in `-Werror` builds, so an in-progress annotation never 
blocks a
     local build.
   - **CI** (`THREAD_SAFETY_ANALYSIS_AS_ERROR=ON`, set in the presets): findings
     are errors on the Clang lanes and gate the merge, which keeps `master` 
clean —
     so the default warnings only ever flag a developer's own in-progress 
change.
   - **GCC**: unaffected. The flag is Clang-only on purpose (GCC doesn't know
     `-Wthread-safety` and would itself error if passed it), and the macros 
compile
     to nothing.
   - **Going forward**: prefer the annotated `ts::` guards for scoped locking 
and
     mark the data they protect; reach for `std::` lock guards only where the
     dynamic features are actually needed.
   
   ### Testing
   
   - `test_tsutil` includes `test_thread_safety.cc`, compiled with
     `-Werror=thread-safety` under Clang as a live example.
   - The annotated translation units build clean under `clang -Wthread-safety` 
and
     under GCC; removing a lock from an annotated reader produces the expected
     diagnostic.
   
   ### Scope
   
   This is a deliberately small seed: the infrastructure plus two annotated
   subsystems, one of which surfaced a real fix. It is meant to establish the
   pattern and the CI lane; further annotation (e.g. NetHandler/cache/HostDB 
state,
   and enforcing "no blocking lock on ET_NET" via a thread-role capability) can
   follow incrementally as each subsystem is brought to zero warnings.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to