sollhui opened a new pull request, #61272:
URL: https://github.com/apache/doris/pull/61272

   ### Problem
   The LRU-K (K=2) implementation in LRUCache uses a uint32_t hash value as the 
key for the visits list map (_visits_lru_cache_map), instead of the full cache 
key string:
   
   
   ```cpp
   // Before
   using visits_lru_cache_key = uint32_t;  // only a 32-bit hash
   ```

   With ~27,000 entries (typical for SegmentCache), the collision probability can be estimated via the birthday problem:
   
   $$P(\text{at least one collision}) \approx 1 - e^{-n^2 / 2m}$$
   
   where $n = 27000$ (number of entries), $m = 2^{32}$ (hash space):
   
   $$P \approx 1 - e^{-27000^2 / (2 \times 2^{32})} = 1 - e^{-729000000 / 8589934592} = 1 - e^{-0.0849} \approx \mathbf{8.1\%}$$
   
   This means roughly 1 in 12 caches with ~27,000 segments will have at least 
one hash collision at any given time, causing two types of incorrect behavior:
   
   1. False promotion (early entry into main cache)
   
   A segment B with hash(B) == hash(A) finds A's existing visits-list entry on its first access and is promoted to the main cache immediately, bypassing the LRU-K requirement of two distinct accesses.
   
   2. Lost visits record (delayed entry)
   
   When B's first access collides with A's hash and removes A's visits entry, A 
is forced to restart the two-access sequence from scratch, even though it was 
already waiting for promotion.
   
   The lookup() path has the same issue — when a miss occurs, it moves the 
visits list entry to the front of the LRU using hash as the key, so it may 
refresh the wrong entry's recency position under collision.
   
   ### Fix
   Change visits_lru_cache_key from uint32_t to std::string, using the full 
cache key string as the map key. Update _lru_k_insert_visits_list(), lookup(), 
and insert() accordingly.
   
   
   ```cpp
   // After
   using visits_lru_cache_key = std::string;  // full key string, zero collisions
   ```

   The visits list holds at most a few hundred to a few thousand entries (its virtual footprint is bounded by _capacity), so the additional per-entry string overhead is negligible compared to the actual segment data being cached.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

