[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223479#comment-14223479 ]
Ariel Weisberg commented on CASSANDRA-7438:
-------------------------------------------

I think that for caches the behavior you most want to avoid is a slowly growing heap. People hate that because it's unpredictable and they don't know when it's going to stop. You can always start with jemalloc, get the feature working, and then iterate on memory management.

Fixed block sizes is a baby-and-bathwater scenario for getting the desirable fixed-memory-utilization property. When you want to build everything out of fixed-size pages, you have to slot the pages or use some other internal page-management strategy so you can pack multiple things together and rewrite pages as they fragment. You also need size-tiered free lists and per-page fragmentation metadata so you can find partially free pages. That kind of thing only makes sense in ye olden database land, where rewriting an already dirty page is cheaper than spending more IOPs. In memory you can relocate objects. Memcached used to have the problem that instead of the heap growing, the cache would lose capacity to fragmentation. Facebook implemented slab rebalancing in their fork, and then Memcached did its own implementation. The issue was internal fragmentation due to having too many slabs of the wrong size.

For Robert

* Executor service shutdown: I never really got why it takes a timeout, nor why there is no blocking version. 99% of the time, if it doesn't shut down within the timeout it's a bug and you don't want to ignore it. We are pedantic about everything else, why not this? It's also unused right now.
* Stats could go into an atomic long array with padding. It really depends on the access pattern: you want data that is read/written at the same time to sit on the same cache line. These are global counters, so they will be contended by everyone accessing the cache; better that writers only have to pull in one cache line holding all the counters than several, waiting for exclusive access before writing to each one. Also consider LongAdder.
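To make the LongAdder suggestion concrete, here is a minimal sketch of a stats holder built on java.util.concurrent.atomic.LongAdder, which internally stripes its cells to dodge the contention a single shared counter (or a tightly packed AtomicLongArray) would see. The class and method names here are illustrative, not from the patch.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stats holder; each LongAdder stripes updates across
// per-thread cells, so hot global counters don't fight over one cache line.
final class CacheStats {
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();
    private final LongAdder evictions = new LongAdder();

    void recordHit()      { hits.increment(); }
    void recordMiss()     { misses.increment(); }
    void recordEviction() { evictions.increment(); }

    // sum() is only weakly consistent under concurrent updates,
    // which is fine for monitoring counters.
    long hitCount()  { return hits.sum(); }
    long missCount() { return misses.sum(); }

    double hitRate() {
        long h = hitCount(), m = missCount();
        return (h + m) == 0 ? 1.0 : (double) h / (h + m);
    }
}
```

The trade-off versus a padded AtomicLongArray is that LongAdder makes writes cheap and reads (sums) a little more expensive, which matches the access pattern of cache stats: updated on every operation, read only when someone asks for metrics.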
* If you want to do your own memory-management strategy, I'd suggest something like segregated storage as in Boost.Pool, with size tiers for each power of two and each power of two plus the previous power of two. You can CAS the head of the free list for each tier to make it thread safe, and lock only when allocating a new block rather than when popping from the free list. This won't adapt to changing size distributions; for that, stuff needs to be relocatable.
* I'll bet you could use a stamped-lock pattern so readers might not have to lock at all. I think getting it working with just a lock is fine for now.
* I am not sure shrinking is very important. The table is pretty dense and should be a small portion of total memory once all the other memory is accounted for. You would need a lot of tiny cache entries to really bloat the table, and then the population distribution would need to change to make that a waste.
* LRU lists per segment seems like it's not viable. That isn't a close enough approximation to LRU, since we want at most two or three entries per partition.
* There are some loops of very similar byte munging in HashEntryAccess.
* The periodic cleanup check is maybe not so nice. An edge trigger via a CASed field would be nicer, and would let you move the threshold up to > 80%, since on a big-memory machine that is a lot of wasted cache space. Walking the entire LRU could take several seconds, but if it is amortized across a lot of expiration maybe it is OK.
* Some rehash-required checking is duplicated in OHCacheImpl.

For Vijay

* sun.misc.Hashing doesn't seem to exist for me; maybe a Java 8 issue?
* The queue really needs to be bounded, even with striping; producer and consumer could proceed at different rates.
* Tasks submitted to executor services via submit will wrap the result, including exceptions, in a Future, which silently discards them. The library might take a listener for these errors at initialization time, or if it is going to be C* specific it could use the wrapped runnable or similar.
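The swallowed-exception point above can be sketched as a thin wrapper that catches task failures and hands them to a listener supplied at construction time, instead of letting them disappear inside an unread Future. This is only one possible shape for the idea; ReportingExecutor and its names are hypothetical, not from either patch.

```java
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;

// Hypothetical wrapper: failures in fire-and-forget tasks reach a listener
// instead of being captured in a Future nobody ever calls get() on.
final class ReportingExecutor {
    private final ExecutorService delegate;
    private final Consumer<Throwable> errorListener;

    ReportingExecutor(ExecutorService delegate, Consumer<Throwable> errorListener) {
        this.delegate = delegate;
        this.errorListener = errorListener;
    }

    void execute(Runnable task) {
        delegate.execute(() -> {
            try {
                task.run();
            } catch (Throwable t) {
                // With submit(), this throwable would be stored in the Future
                // and silently dropped when the Future is discarded.
                errorListener.accept(t);
            }
        });
    }

    void shutdown() { delegate.shutdown(); }
}
```

Using execute() rather than submit() matters here: execute() doesn't wrap the task in a FutureTask, so nothing intercepts the exception before the wrapper does.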
* A lot of locking that was spin locking (which, unbounded, I don't think is great) is now blocking locking. There is no adaptive spinning if you don't use synchronized. If you are already using Unsafe, maybe you could do monitor enter/exit; I've never tried it.
* It looks like concurrent calls to rehash could cause the table to rehash twice, since the rebalance field is not CASed. You should do the volatile read, and then attempt the CAS (this avoids putting the cache line into the exclusive state every time).
* StatsHolder: same AtomicLongArray suggestion. Also consider LongAdder.
* In Segment.java, in the replace path, AtomicLong.addAndGet is called back to back; it could be called once with the math already done. I believe each of those stalls processing until the store buffers have flushed. The put path does something similar and could have the same optimization.
* Having the table (segments) on heap is pretty undesirable to me. Happy to be proved wrong, but I think a flyweight over off-heap memory would be better.
* If the expiration lock is already locked, some other thread is doing the expiration work. You might keep a semaphore for puts that bypass the lock so other threads can move on during expiration. I suppose after the first few evictions new puts will move on anyway. This would show up in a profiler if it were happening.
* hotN looks like it could lock for quite a while (hundreds of milliseconds, or seconds) depending on the size of N. You don't need to use a linked list for the result; just allocate an array list of size N. Maybe hotN should be able to yield, possibly leaving behind an iterator that evictors would have to repair. It may also depend on how top-N handles duplicate or multiple versions of keys. Alternatively, hotN could take a read lock, and writers could skip the cache?

For LRUC, I think the segments need to go off heap unless benchmarks say they don't for the cache sizes we care about.
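The "volatile read, then CAS" pattern suggested above for the rehash flag can be sketched roughly as follows. The class and field names (RehashGuard, rehashing) are illustrative stand-ins, not the names in the patch; the point is that losers of the race return after a plain read, without forcing the flag's cache line into the exclusive state.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of guarding a rehash so concurrent callers trigger it exactly once.
final class RehashGuard {
    private final AtomicBoolean rehashing = new AtomicBoolean(false);
    int rehashCount; // exposed only so the example is observable

    void maybeRehash() {
        // Cheap volatile read first: if someone is already rehashing,
        // bail out without a CAS (a CAS would bounce the cache line
        // into the exclusive state on every call).
        if (rehashing.get())
            return;
        // Exactly one caller wins the CAS and performs the rehash.
        if (rehashing.compareAndSet(false, true)) {
            try {
                doRehash();
            } finally {
                rehashing.set(false);
            }
        }
    }

    private void doRehash() {
        rehashCount++; // stand-in for: allocate larger table, move entries
    }
}
```

The same read-before-CAS shape applies to the StatsHolder counters and any other hot shared flag: test with a load first, and only attempt the read-modify-write when it can plausibly succeed.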
I need to step through it in the debugger, watch it go, and look at code coverage from the tests. OHC needs a better allocator story and a better LRU-approximation story, and we need to see whether the maintenance thread keeps up at large cache sizes with a lot of cores. Same issue there: running through the debugger and looking at coverage.

> Serializing Row cache alternative (Fully off heap)
> --------------------------------------------------
>
>                 Key: CASSANDRA-7438
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>         Environment: Linux
>            Reporter: Vijay
>            Assignee: Vijay
>              Labels: performance
>             Fix For: 3.0
>
>         Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is partially off heap; keys are still stored in the JVM heap as ByteBuffers.
> * There is a higher GC cost for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better results, but this requires careful tuning.
> * Memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off heap and use JNI to interact with the cache. We might want to ensure that the new implementation matches the existing APIs (ICache), and the implementation needs safe memory access, low memory overhead, and as few memcpys as possible.
> We might also want to make this cache configurable.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)