phongn commented on PR #13255: URL: https://github.com/apache/trafficserver/pull/13255#issuecomment-4722967269
Following on from the thread-safety note above — a reviewer asked whether S3-FIFO being "lock-free" in the literature might let us relax the per-stripe mutex for this policy. Short version: not as implemented, and the stripe mutex isn't the RAM cache's lock to relax — but the FIFO structure does buy a measurably shorter critical section *today*, with no re-architecting. Below: the measurement, a safety-check against the existing stripe-contention work, and a sketch of what a real concurrent RAM cache would take. ### 1. Time under the stripe lock, per policy Every `RamCache::get()`/`put()` runs while `stripe->mutex` is held, so per-call cost is exactly the marginal time each policy spends *under the lock*. We measured it through the real `RamCache` interface — so the numbers include the metric-counter atomics and `Ptr` refcounting that genuinely run under the lock — driving all three policies with one identical Zipf stream. RNG/sampling is hoisted out of the timed region, buffers are pre-allocated and reused so allocation doesn't pollute the figures, and the seen filter is disabled for the put comparison so every policy admits the same way. 16 MB cache, 16 KB objects, Ice Lake; absolute ns is host-dependent, the cross-policy ratio is the point. | policy | `get` ns/op | `put` ns/op (realistic Zipf) | get hit rate | |---|---|---|---| | LRU | 117 | 144 | 0.807 | | CLFUS | 117 | 180 | 0.802 | | **S3-FIFO** | **51** | **85** | **0.841** | S3-FIFO's `get` is ~2.3× cheaper because a hit only bumps a 2-bit counter — it never relinks a list, whereas LRU/CLFUS move the entry on every access (pointer-chasing into scattered nodes is most of their cost). It is also the cheapest `put` here, at a higher hit rate. This is consistent with the microbenchmark in the PR description; measuring through the live interface just confirms it holds with the real under-lock bookkeeping included. Two caveats we want to state plainly: - **The worst case flips the put.** Under pure one-hit-wonder churn (every object unique, never reused) S3-FIFO's put rises to ~10.3 µs vs LRU ~5.6 / CLFUS ~6.3 µs — the cost of pushing every object through the small queue and ghost. That is exactly the workload where its scan-resistance pays off, and it is an unusual put pattern, but it is real. - **Magnitude vs. the actual bottleneck.** These are tens of nanoseconds. On a cache *miss*, the stripe lock is held for microseconds (directory probe + AIO scheduling), so the RAM-cache call is a small slice of the critical section; the saving concentrates on the *RAM-hit* path, where the critical section is short. So this is real aggregate contention relief at high RAM-hit rates — not a structural fix. (The benchmark is a `REGRESSION_TEST(ram_cache_timing)` we kept out of this PR, since it compares all policies and belongs with the evaluation harness rather than the focused change. Happy to share it.) ### 2. Safety check: relaxing the stripe lock is a separate, much larger effort Worth being explicit that this is a far bigger change than adding a policy, and the existing design work shows why it shouldn't be attached to a RAM-cache algorithm: - The RAM cache is called *inside* the stripe critical section, not alongside it. `CacheVC::handleRead` asserts `stripe->mutex` is held before `load_from_ram_cache()`, and `put` runs in the aggregation-write path under the same lock. The lock exists for the on-disk **directory** (and the agg buffer); the RAM cache is nested within it — so there is nothing to relax "for S3-FIFO." - **#12788** frames the real target: `open_read()` serializing on `stripe->mutex`, with a proposed two-tier `StripeSM`/`Stripe` split and an RW lock — a core restructuring. - Both attempts target the **directory**, not the RAM cache, and both hit the two hazards a concurrent RAM cache would also face. **#12601** (closed, pending deeper analysis) took a shared `dir_mutex`, but `Directory::probe()` conditionally writes (it deletes invalid dir entries), so a read lock isn't sufficient. **#12794** (the OpenDir `shared_mutex` approach) only stays correct by making `OpenDirEntry` reference-counted so an entry can't be freed under a reader holding a `Ptr`. Those two lessons — "read-only" paths that conditionally write, and cross-thread object lifetime — are exactly what a concurrent RAM cache must solve too. So the directory line of work is both the higher-leverage lever and the prerequisite learning. Our suggestion is to land S3-FIFO as a policy here (for the shorter critical section now) and keep lock-relaxation on the #12788/#12794 track, where it benefits every policy. ### 3. If we later want a concurrent / independently-locked RAM cache Sketching the design space — none of which should be coupled to S3-FIFO specifically; it should live at the `RamCache` layer so every policy benefits: - **Sharded RAM cache with its own locks (pragmatic first step).** Partition the RAM cache by key hash into N independent sub-caches, each with its own mutex, separate from `stripe->mutex`. No lock-free data structures required — just finer locks — and it captures most of the contention relief. The real work is byte-budget accounting across shards (the global `ram_cache.size` budget has to be split or coordinated) and accepting that eviction becomes per-shard rather than global (a small hit-rate effect). Lowest-risk, algorithm-agnostic. - **Lock-free / concurrent RAM cache (larger).** Atomic counters, a concurrent or sharded hash map, sharded FIFO queues, an atomic frequency counter, and — the hard part — safe reclamation (epoch/hazard pointers, or refcounted entries exactly as #12794 needed) so eviction can't free an entry a reader holds. Here the reviewer's intuition is right: S3-FIFO is a good substrate, because FIFO-plus-a-counter is far friendlier to concurrency than LRU's per-hit reordering — that is the basis of the paper's scalability claims. But it's a substantial, separate project. - **Decoupling the RAM probe from the stripe lock.** To serve a RAM hit without taking `stripe->mutex` at all, `handleRead` would have to probe the RAM cache before/outside the stripe critical section, which interleaves with the same OpenDir/directory ordering #12794 is untangling — i.e. coupled to the directory work, not independent of it. Recommendation: do it generically, and measure first. The time-under-lock data above says the RAM-cache op is a small fraction of the disk-miss critical section, so before investing in a concurrent RAM cache it's worth profiling where `stripe->mutex` is actually held (the directory PRs suggest the probe dominates). That keeps a major change evidence-driven. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
