sollhui opened a new pull request, #61272:
URL: https://github.com/apache/doris/pull/61272
### Problem
The LRU-K (K=2) implementation in LRUCache uses a uint32_t hash value as the
key for the visits list map (_visits_lru_cache_map), instead of the full cache
key string:
// Before
using visits_lru_cache_key = uint32_t; // only 32-bit hash
With ~27,000 entries (typical for SegmentCache), the collision probability
can be estimated via the birthday problem:
$$P(\text{at least one collision}) \approx 1 - e^{-n^2 / 2m}$$
where $n = 27000$ (number of entries), $m = 2^{32}$ (hash space):
$$P \approx 1 - e^{-27000^2 ;/; (2 \times 2^{32})} = 1 - e^{-729000000 ;/;
8589934592} = 1 - e^{-0.0849} \approx \mathbf{8.1%}$$
This means roughly 1 in 12 caches with ~27,000 segments will have at least
one hash collision at any given time, causing two types of incorrect behavior:
1. False promotion (early entry into main cache)
Segment B whose hash(B) == hash(A) finds A's existing visits list entry on
its first access and gets promoted to the main cache, bypassing the LRU-K
requirement of two distinct accesses.
2. Lost visits record (delayed entry)
When B's first access collides with A's hash and removes A's visits entry, A
is forced to restart the two-access sequence from scratch, even though it was
already waiting for promotion.
The lookup() path has the same issue — when a miss occurs, it moves the
visits list entry to the front of the LRU using hash as the key, so it may
refresh the wrong entry's recency position under collision.
### Fix
Change visits_lru_cache_key from uint32_t to std::string, using the full
cache key string as the map key. Update _lru_k_insert_visits_list(), lookup(),
and insert() accordingly.
// After
using visits_lru_cache_key = std::string; // full key string, zero collision
The visits list holds at most a few hundred to a few thousand entries (its
virtual footprint is bounded by _capacity), so the additional per-entry string
overhead is negligible compared to the actual segment data being cached.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]