Hello Igniters,

I want to share some thoughts about supporting eviction on new page memory storage.

I think, we should reconsider eviction policies in their current state. Current pluggable eviction policy mechanism allows to determine which key will be evicted from the cache and when eviction should be performed. There are two issues about it:

1) Such mechanism tends to make off-heap storage highly fragmented: batch of small evicted entries can accumulate needed total amount of free space in FreeList, but it still would be impossible to store one big entry.

2) Growth of cache size implies creating O(n) heap objects and links to maintain eviction policy, which causes linear growth of GC load. That doesn't suit us: getting rid of GC overhead was one of the main reasons for keeping cache data in off-heap memory.

I propose to use "whole page eviction, off-heap eviction metadata" as default mode for eviction. It's almost non-deterministic, because actual eviction result depends on how data is packed into pages, but is more efficient, and shouldn't cause long-term performance degradation due to fragmentation.

I can suggest two ways to implement it:

1) *K-random-LRU
*The idea is to allocate big off-heap array, to track timestamp of last usage for each data page. When data page on address X is accessed, we store current timestamp in X / PAGE_SIZE array position. When there's a need for eviction, we randomly choose K indices of array (if some of indices point to non-data pages, re-choose them) and evict data page with minimum timestamp. In case K=5, we'll evict page from 17% least recently used set with 50% probability <http://math.stackexchange.com/questions/786392/expectation-of-minimum-of-n-i-i-d-uniform-random-variables>. Non-data pages can be simply recognized by having zero in corresponding position of timestamp array. If entry is too big and occupies more than one page, only first page will be tracked, and other will be considered as non-data pages. *K-random-LRU-2* is perspective variant of this approach. We'll evict page with minimum time of penultimate access. Implementation is mostly the same, the difference is that we should store two timestamps for each data page. LRU-2 outperforms <http://www.cs.cmu.edu/%7Echristos/courses/721-resources/p297-o_neil.pdf> LRU by resolving "one-hit wonder" problem: if page is accessed very rarely, but accidentally accessed once, it's "protected" from eviction for long time. Memory overhead: timestamp can be packed into 4 bytes. 100GB off-heap cache with standard 2KB pages requires 200MB of additional memory (400MB in case of LRU-2 strategy).*
*

2)*Adaptation of CLOCK-Pro
*CLOCK-Pro (article <https://www.usenix.org/legacy/event/usenix05/tech/general/full_papers/jiang/jiang_html/html.html>, presentation <http://www.slideshare.net/huliang64/clockpro>) is modern page replacement algorithm, used in actual OS kernels. It is an approximation of LIRS <http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-02-6.pdf> with significant advantage for our case: instead of doubly linked list, which is hard to maintain in off-heap memory, CLOCK-Pro uses only circular list and three pointers. However, we can't use CLOCK-Pro as is: there's difference between OS paged memory and Ignite off-heap storage. In OS, if memory page is swapped out, it still contains the same data, with same relevance and patterns of access. That's why keeping metadata for swapped-out page (see Non-resident Cold Pages set) works. In Ignite, in case we evict page, all data is discarded; entries with same keys may appear in the cache again, but they will be evenly distributed across the whole storage. I propose to use simplified version of CLOCK-Pro with only two pointers and two page sets: hot and cold. Non-resident pages won't be tracked. To implement it, we should allocate big off-heap array to track flags for each page. We should store (hot, cold) bit, (accessed, not accessed) bit and one bit indicating that the page is data page, totally 3 bits per page. 100GB off-heap cache with standard 2KB pages requires ~20MB of additional memory, or 50MB in case we don't want to shrink metadata for multiple pages in one byte. Memory overhead is really small, but we can't predict average and maximum time for one eviction, it has to be discovered during an experiment (in hypothetical worst case, clock hand can travel through the whole array). Also, hot/cold ratio can affect performance dramatically and should be tuned well.

Note about classic "key" eviction: fair per-entry policies (FIFO, LRU, Sorted) may be useful for some users, and we may want to support them as well. It's no problem at all to implement them on-heap (more than, current implementations of EvictionPolicy will work in if we make minor changes in the code). In off-heap, FIFO is easy to implement (circular buffer), but LRU (and more effective LIRS, if someone wants it) require doubly linked lists. The only option I see is using our BPlusTree with some kind of implicit key, but this implies O(log n) overhead on each entry read access.

Another note: current GridCacheEvictionManager have eviction backup synchronization mechanism. Do we want to support it for off-heap storage as well?

Please let me know if my understanding of any algorithm or part of the system is wrong.

--
Best Regards,
Ivan Rakov

Reply via email to