phongn commented on PR #13255:
URL: https://github.com/apache/trafficserver/pull/13255#issuecomment-4722967269

   Following on from the thread-safety note above — a reviewer asked whether 
S3-FIFO being "lock-free" in the literature might let us relax the per-stripe 
mutex for this policy. Short version: not as implemented, and the stripe mutex 
isn't the RAM cache's lock to relax — but the FIFO structure does buy a 
measurably shorter critical section *today*, with no re-architecting. Below: 
the measurement, a safety-check against the existing stripe-contention work, 
and a sketch of what a real concurrent RAM cache would take.
   
   ### 1. Time under the stripe lock, per policy
   
   Every `RamCache::get()`/`put()` runs while `stripe->mutex` is held, so 
per-call cost is exactly the marginal time each policy spends *under the lock*. 
We measured it through the real `RamCache` interface — so the numbers include 
the metric-counter atomics and `Ptr` refcounting that genuinely run under the 
lock — driving all three policies with one identical Zipf stream. RNG/sampling 
is hoisted out of the timed region, buffers are pre-allocated and reused so 
allocation doesn't pollute the figures, and the seen filter is disabled for the 
put comparison so every policy admits the same way. 16 MB cache, 16 KB objects, 
Ice Lake; absolute ns is host-dependent, the cross-policy ratio is the point.
   
   | policy | `get` ns/op | `put` ns/op (realistic Zipf) | get hit rate |
   |---|---|---|---|
   | LRU | 117 | 144 | 0.807 |
   | CLFUS | 117 | 180 | 0.802 |
   | **S3-FIFO** | **51** | **85** | **0.841** |
   
   S3-FIFO's `get` is ~2.3× cheaper because a hit only bumps a 2-bit counter — 
it never relinks a list, whereas LRU/CLFUS move the entry on every access 
(pointer-chasing into scattered nodes is most of their cost). It is also the 
cheapest `put` here, at a higher hit rate. This is consistent with the 
microbenchmark in the PR description; measuring through the live interface just 
confirms it holds with the real under-lock bookkeeping included.
   
   Two caveats we want to state plainly:
   
   - **The worst case flips the put.** Under pure one-hit-wonder churn (every 
object unique, never reused) S3-FIFO's put rises to ~10.3 µs vs LRU ~5.6 / 
CLFUS ~6.3 µs — the cost of pushing every object through the small queue and 
ghost. That is exactly the workload where its scan-resistance pays off, and it 
is an unusual put pattern, but it is real.
   - **Magnitude vs. the actual bottleneck.** These are tens of nanoseconds. On 
a cache *miss*, the stripe lock is held for microseconds (directory probe + AIO 
scheduling), so the RAM-cache call is a small slice of the critical section; 
the saving concentrates on the *RAM-hit* path, where the critical section is 
short. So this is real aggregate contention relief at high RAM-hit rates — not 
a structural fix.
   
   (The benchmark is a `REGRESSION_TEST(ram_cache_timing)` we kept out of this 
PR, since it compares all policies and belongs with the evaluation harness 
rather than the focused change. Happy to share it.)
   
   ### 2. Safety check: relaxing the stripe lock is a separate, much larger 
effort
   
   Worth being explicit that this is a far bigger change than adding a policy, 
and the existing design work shows why it shouldn't be attached to a RAM-cache 
algorithm:
   
   - The RAM cache is called *inside* the stripe critical section, not 
alongside it. `CacheVC::handleRead` asserts `stripe->mutex` is held before 
`load_from_ram_cache()`, and `put` runs in the aggregation-write path under the 
same lock. The lock exists for the on-disk **directory** (and the agg buffer); 
the RAM cache is nested within it — so there is nothing to relax "for S3-FIFO."
   - **#12788** frames the real target: `open_read()` serializing on 
`stripe->mutex`, with a proposed two-tier `StripeSM`/`Stripe` split and an RW 
lock — a core restructuring.
   - Both attempts target the **directory**, not the RAM cache, and both hit 
the two hazards a concurrent RAM cache would also face. **#12601** (closed, 
pending deeper analysis) took a shared `dir_mutex`, but `Directory::probe()` 
conditionally writes (it deletes invalid dir entries), so a read lock isn't 
sufficient. **#12794** (the OpenDir `shared_mutex` approach) only stays correct 
by making `OpenDirEntry` reference-counted so an entry can't be freed under a 
reader holding a `Ptr`.
   
   Those two lessons — "read-only" paths that conditionally write, and 
cross-thread object lifetime — are exactly what a concurrent RAM cache must 
solve too. So the directory line of work is both the higher-leverage lever and 
the prerequisite learning. Our suggestion is to land S3-FIFO as a policy here 
(for the shorter critical section now) and keep lock-relaxation on the 
#12788/#12794 track, where it benefits every policy.
   
   ### 3. If we later want a concurrent / independently-locked RAM cache
   
   Sketching the design space — none of which should be coupled to S3-FIFO 
specifically; it should live at the `RamCache` layer so every policy benefits:
   
   - **Sharded RAM cache with its own locks (pragmatic first step).** Partition 
the RAM cache by key hash into N independent sub-caches, each with its own 
mutex, separate from `stripe->mutex`. No lock-free data structures required — 
just finer locks — and it captures most of the contention relief. The real work 
is byte-budget accounting across shards (the global `ram_cache.size` budget has 
to be split or coordinated) and accepting that eviction becomes per-shard 
rather than global (a small hit-rate effect). Lowest-risk, algorithm-agnostic.
   - **Lock-free / concurrent RAM cache (larger).** Atomic counters, a 
concurrent or sharded hash map, sharded FIFO queues, an atomic frequency 
counter, and — the hard part — safe reclamation (epoch/hazard pointers, or 
refcounted entries exactly as #12794 needed) so eviction can't free an entry a 
reader holds. Here the reviewer's intuition is right: S3-FIFO is a good 
substrate, because FIFO-plus-a-counter is far friendlier to concurrency than 
LRU's per-hit reordering — that is the basis of the paper's scalability claims. 
But it's a substantial, separate project.
   - **Decoupling the RAM probe from the stripe lock.** To serve a RAM hit 
without taking `stripe->mutex` at all, `handleRead` would have to probe the RAM 
cache before/outside the stripe critical section, which interleaves with the 
same OpenDir/directory ordering #12794 is untangling — i.e. coupled to the 
directory work, not independent of it.
   
   Recommendation: do it generically, and measure first. The time-under-lock 
data above says the RAM-cache op is a small fraction of the disk-miss critical 
section, so before investing in a concurrent RAM cache it's worth profiling 
where `stripe->mutex` is actually held (the directory PRs suggest the probe 
dominates). That keeps a major change evidence-driven.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to