[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223479#comment-14223479 ]
Ariel Weisberg commented on CASSANDRA-7438:
-------------------------------------------

I think that for caches the behavior you most want to avoid is a slowly growing heap. People hate that because it's unpredictable and they don't know when it's going to stop. You can always start with jemalloc, get the feature working, and then iterate on memory management.

Fixed block sizes is a baby-and-bathwater scenario for getting the desirable fixed-memory-utilization property. When you want to build everything out of fixed-size pages, you have to slot the pages or use some other internal page-management strategy so you can pack multiple things together and rewrite pages as they fragment. You also need size-tiered free lists and per-page fragmentation metadata so you can find partially free pages. That kind of thing only makes sense in ye olden database land, where rewriting an already dirty page is cheaper than spending more IOPs. In memory you can relocate objects. Memcached used to have the problem that instead of the heap growing, the cache would lose capacity to fragmentation. Facebook implemented slab rebalancing in their fork, and then Memcached did its own implementation. The issue was internal fragmentation due to having too many slabs of the wrong size.

For Robert

* Executor service shutdown: I never really got why it takes a timeout, nor why there is no blocking version. 99% of the time, if it doesn't shut down within the timeout it's a bug and you don't want to ignore it. We are pedantic about everything else, why not this? It's also unused right now.
* Stats could go into an atomic long array with padding. It really depends on the access pattern: you want data that is read/written at the same time to sit on the same cache line. These are global counters, so they will be contended by everyone accessing the cache; better that writers only have to pull in one cache line holding all the counters than several, waiting for exclusive access before writing to each one. Also consider LongAdder.
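To make the LongAdder suggestion concrete, here is a minimal sketch of a stats holder built on java.util.concurrent.atomic.LongAdder, which internally stripes its cells to dodge the contention a single shared counter (or a tightly packed AtomicLongArray) would see. The class and method names here are illustrative, not from the patch.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stats holder; each LongAdder stripes updates across
// per-thread cells, so hot global counters don't fight over one cache line.
final class CacheStats {
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();
    private final LongAdder evictions = new LongAdder();

    void recordHit()      { hits.increment(); }
    void recordMiss()     { misses.increment(); }
    void recordEviction() { evictions.increment(); }

    // sum() is only weakly consistent under concurrent updates,
    // which is fine for monitoring counters.
    long hitCount()  { return hits.sum(); }
    long missCount() { return misses.sum(); }

    double hitRate() {
        long h = hitCount(), m = missCount();
        return (h + m) == 0 ? 1.0 : (double) h / (h + m);
    }
}
```

The trade-off versus a padded AtomicLongArray is that LongAdder makes writes cheap and reads (sums) a little more expensive, which matches the access pattern of cache stats: updated on every operation, read only when someone asks for metrics.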
* If you want to do your own memory-management strategy, I'd suggest something like segregated storage as in Boost.Pool, with size tiers for each power of two and each power of two plus the previous power of two. You can CAS the head of the free list for each tier to make it thread safe, and lock only when allocating a new block rather than when popping from the free list. This won't adapt to changing size distributions; for that, stuff needs to be relocatable.
* I'll bet you could use a stamped-lock pattern so readers might not have to lock at all. I think getting it working with just a lock is fine for now.
* I am not sure shrinking is very important. The table is pretty dense and should be a small portion of total memory once all the other memory is accounted for. You would need a lot of tiny cache entries to really bloat the table, and then the population distribution would need to change to make that a waste.
* LRU lists per segment seems like it's not viable. That isn't a close enough approximation to LRU, since we want at most two or three entries per partition.
* There are some loops of very similar byte munging in HashEntryAccess.
* The periodic cleanup check is maybe not so nice. An edge trigger via a CASed field would be nicer, and would let you move the threshold up to > 80%, since on a big-memory machine that is a lot of wasted cache space. Walking the entire LRU could take several seconds, but if it is amortized across a lot of expiration maybe it is OK.
* Some rehash-required checking is duplicated in OHCacheImpl.

For Vijay

* sun.misc.Hashing doesn't seem to exist for me; maybe a Java 8 issue?
* The queue really needs to be bounded, even with striping; producer and consumer could proceed at different rates.
* Tasks submitted to executor services via submit will wrap the result, including exceptions, in a Future, which silently discards them. The library might take a listener for these errors at initialization time, or if it is going to be C* specific it could use the wrapped runnable or similar.
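The swallowed-exception point above can be sketched as a thin wrapper that catches task failures and hands them to a listener supplied at construction time, instead of letting them disappear inside an unread Future. This is only one possible shape for the idea; ReportingExecutor and its names are hypothetical, not from either patch.

```java
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;

// Hypothetical wrapper: failures in fire-and-forget tasks reach a listener
// instead of being captured in a Future nobody ever calls get() on.
final class ReportingExecutor {
    private final ExecutorService delegate;
    private final Consumer<Throwable> errorListener;

    ReportingExecutor(ExecutorService delegate, Consumer<Throwable> errorListener) {
        this.delegate = delegate;
        this.errorListener = errorListener;
    }

    void execute(Runnable task) {
        delegate.execute(() -> {
            try {
                task.run();
            } catch (Throwable t) {
                // With submit(), this throwable would be stored in the Future
                // and silently dropped when the Future is discarded.
                errorListener.accept(t);
            }
        });
    }

    void shutdown() { delegate.shutdown(); }
}
```

Using execute() rather than submit() matters here: execute() doesn't wrap the task in a FutureTask, so nothing intercepts the exception before the wrapper does.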
* A lot of locking that was spin locking (which, unbounded, I don't think is great) is now blocking locking. There is no adaptive spinning if you don't use synchronized. If you are already using Unsafe, maybe you could do monitor enter/exit; I've never tried it.
* It looks like concurrent calls to rehash could cause the table to rehash twice, since the rebalance field is not CASed. You should do the volatile read, and then attempt the CAS (this avoids putting the cache line into the exclusive state every time).
* StatsHolder: same AtomicLongArray suggestion. Also consider LongAdder.
* In Segment.java, in the replace path, AtomicLong.addAndGet is called back to back; it could be called once with the math already done. I believe each of those stalls processing until the store buffers have flushed. The put path does something similar and could have the same optimization.
* Having the table (segments) on heap is pretty undesirable to me. Happy to be proved wrong, but I think a flyweight over off-heap memory would be better.
* If the expiration lock is already locked, some other thread is doing the expiration work. You might keep a semaphore for puts that bypass the lock so other threads can move on during expiration. I suppose after the first few evictions new puts will move on anyway. This would show up in a profiler if it were happening.
* hotN looks like it could lock for quite a while (hundreds of milliseconds, or seconds) depending on the size of N. You don't need to use a linked list for the result; just allocate an array list of size N. Maybe hotN should be able to yield, possibly leaving behind an iterator that evictors would have to repair. It may also depend on how top-N handles duplicate or multiple versions of keys. Alternatively, hotN could take a read lock, and writers could skip the cache?

For LRUC, I think the segments need to go off heap unless benchmarks say they don't for the cache sizes we care about.
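The "volatile read, then CAS" pattern suggested above for the rehash flag can be sketched roughly as follows. The class and field names (RehashGuard, rehashing) are illustrative stand-ins, not the names in the patch; the point is that losers of the race return after a plain read, without forcing the flag's cache line into the exclusive state.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of guarding a rehash so concurrent callers trigger it exactly once.
final class RehashGuard {
    private final AtomicBoolean rehashing = new AtomicBoolean(false);
    int rehashCount; // exposed only so the example is observable

    void maybeRehash() {
        // Cheap volatile read first: if someone is already rehashing,
        // bail out without a CAS (a CAS would bounce the cache line
        // into the exclusive state on every call).
        if (rehashing.get())
            return;
        // Exactly one caller wins the CAS and performs the rehash.
        if (rehashing.compareAndSet(false, true)) {
            try {
                doRehash();
            } finally {
                rehashing.set(false);
            }
        }
    }

    private void doRehash() {
        rehashCount++; // stand-in for: allocate larger table, move entries
    }
}
```

The same read-before-CAS shape applies to the StatsHolder counters and any other hot shared flag: test with a load first, and only attempt the read-modify-write when it can plausibly succeed.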
I need to step through it in the debugger, watch it go, and look at code coverage from the tests. OHC needs a better allocator story and a better LRU-approximation story, and we need to see whether the maintenance thread keeps up at large cache sizes with a lot of cores. Same issue there: running through the debugger and looking at coverage.

> Serializing Row cache alternative (Fully off heap)
> --------------------------------------------------
>
>                 Key: CASSANDRA-7438
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>         Environment: Linux
>            Reporter: Vijay
>            Assignee: Vijay
>              Labels: performance
>             Fix For: 3.0
>
>         Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is partially off heap; keys are still stored in the JVM heap as ByteBuffers.
> * There is a higher GC cost for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better results, but this requires careful tuning.
> * Memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off heap and use JNI to interact with the cache. We might want to ensure that the new implementation matches the existing APIs (ICache), and the implementation needs safe memory access, low memory overhead, and as few memcpys as possible.
> We might also want to make this cache configurable.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)