[
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196296#comment-14196296
]
Ariel Weisberg commented on CASSANDRA-7438:
-------------------------------------------
bq. No we don't. We have locks per segment; this is very similar to lock
striping in Java's ConcurrentHashMap.
Thanks for clearing that up.
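For anyone following along, per-segment striping looks roughly like this (a
minimal sketch with hypothetical names, not lruc's actual code): the key's hash
picks a segment, and only that segment's lock is taken, so operations on
different segments never contend.
{code:java}
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of lock striping, in the style of ConcurrentHashMap.
final class StripedCache<K, V>
{
    private final ReentrantLock[] locks; // length is a power of two

    StripedCache(int segments)
    {
        locks = new ReentrantLock[segments];
        for (int i = 0; i < segments; i++)
            locks[i] = new ReentrantLock();
    }

    private ReentrantLock lockFor(Object key)
    {
        int h = key.hashCode();
        h ^= (h >>> 16);                      // spread the hash bits
        return locks[h & (locks.length - 1)]; // mask down to a segment index
    }

    V get(K key)
    {
        ReentrantLock lock = lockFor(key);
        lock.lock(); // only this segment is blocked; others proceed in parallel
        try
        {
            return lookupInSegment(key);
        }
        finally
        {
            lock.unlock();
        }
    }

    private V lookupInSegment(K key) { return null; } // per-segment table, elided
}
{code}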
bq. Not really, we lock globally when we reach 100% of the space, and we free
entries until usage is back down to 80%, spreading the overhead across other
threads based on whoever holds the item partition lock. It won't be hard to make
this part of the queue thread, and I will try it for the next release of lruc.
OK, that makes sense. 20% of the cache could be many milliseconds of work if you
are using many gigabytes of cache. That's not a great thing to foist on a random
victim thread. If you hand that off to the queue thread, though, I think you run
into another issue: the ring buffer doesn't appear to check whether the queue is
full, and the queue thread could go out to lunch for a while. Not a big deal,
but finer-grained scheduling will probably be necessary.
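To make the full-queue concern concrete, here is a rough sketch of the kind of
check I mean (hypothetical names, nothing from the lruc code): when the eviction
queue is full, the producer does a bounded slice of work inline instead of
blocking indefinitely or silently dropping the request.
{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch, not lruc's actual scheduler.
final class EvictionScheduler
{
    private final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(1024);

    void submit(Runnable evictionSlice)
    {
        if (!queue.offer(evictionSlice))
        {
            // Queue thread is out to lunch: apply backpressure by running a
            // bounded slice of the eviction work on the caller's thread.
            evictionSlice.run();
        }
    }

    // The queue thread drains work in small slices, so no single thread has
    // to absorb a multi-millisecond 20%-of-the-cache eviction in one shot.
    void drainLoop() throws InterruptedException
    {
        while (!Thread.currentThread().isInterrupted())
            queue.take().run();
    }
}
{code}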
bq. If you look at the code, it is close to memcached. Actually I started off
stripping down the memcached code so we could run it in-process instead of as a
separate process, removing the global locks in queue reallocation etc., and it
eventually diverged too much from it. The other reason it doesn't use slab
allocators is that we wanted the memory allocator to do the right thing; we have
already tested Cassandra with jemalloc.
Ah very cool.
jemalloc is not a moving allocator, whereas it looks like memcached's slabs
implement rebalancing to accommodate changes in the size distribution. That
would actually be one of the really nice things to keep, IMO. On large-memory
systems with a cache that scales and performs, you would end up dedicating as
much RAM as possible to the row cache/key cache rather than the page cache,
since the page cache is not as granular (correct me if the story for C* is
different). If you dedicate 80% of RAM to the cache, that doesn't leave a lot of
headroom for fragmentation. By using a heap allocator you also lose the ability
to enforce hard, predictable limits on memory used by the cache, since you
didn't map the memory yourself. I could be totally off base, and jemalloc might
be good enough.
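The hard-limit point is easy to illustrate: if the cache allocates its own
memory, it can account for every byte and refuse allocations past a budget. A
minimal sketch, assuming reflective access to sun.misc.Unsafe and hypothetical
names throughout:
{code:java}
import java.lang.reflect.Field;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: enforce a hard cap by accounting for every allocation.
final class BudgetedAllocator
{
    private static final sun.misc.Unsafe UNSAFE = loadUnsafe();
    private final long capacityBytes;
    private final AtomicLong used = new AtomicLong();

    BudgetedAllocator(long capacityBytes)
    {
        this.capacityBytes = capacityBytes;
    }

    /** Returns an address, or 0 if the allocation would blow the hard limit. */
    long allocate(long bytes)
    {
        if (used.addAndGet(bytes) > capacityBytes)
        {
            used.addAndGet(-bytes); // roll back; the caller must evict first
            return 0;
        }
        return UNSAFE.allocateMemory(bytes);
    }

    void free(long address, long bytes)
    {
        UNSAFE.freeMemory(address);
        used.addAndGet(-bytes);
    }

    private static sun.misc.Unsafe loadUnsafe()
    {
        try
        {
            Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (sun.misc.Unsafe) f.get(null);
        }
        catch (Exception e)
        {
            throw new AssertionError(e);
        }
    }
}
{code}
This caps the bytes you hand out; a truly hard RSS limit would mean mapping the
region yourself, which is exactly the ability you give up with a heap allocator.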
bq. There are some comments above which give the reasoning for it (please see
the above comments). PS: I believe there were some tickets on the current
RowCache complaining about the overhead.
I don't have a performance beef with JNI, especially the way you have done it,
which I think is pretty efficient. I think the overhead of JNI (one or two
slightly more expensive function calls) would be eclipsed by things like the
cache misses, coherence traffic, and pipeline stalls that are part of accessing
and maintaining a concurrent cache (Java or C++). It's all just intuition
without comparative microbenchmarks of the two caches. Java might look a little
faster just due to allocator performance, but we know you pay for that in other
ways.
I think what you have made scratches the itch for a large cache quite well, and
beats the status quo. I don't agree that Unsafe couldn't do the exact same thing
with no on-heap references.
The hash table, ring buffer, and individual item entries are all malloced, and
you can do that from Java using Unsafe. You don't need to implement a ring
buffer because you can use the Disruptor. I also wonder whether splitting the
cache into several instances, each with its own coarse lock, wouldn't result in
simpler and (I know performance is not an issue) fast enough code. I don't want
to advocate doing something different for performance, but rather to point out
that there is the possibility of a relatively simple implementation via Unsafe.
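Something like the following is what I have in mind (a hypothetical sketch, not
a proposed patch): route each key to one of N independent off-heap instances and
take that instance's single lock for the whole operation.
{code:java}
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: several cache instances, one coarse lock each.
final class SegmentedOffHeapCache
{
    private final OffHeapLruInstance[] instances;

    SegmentedOffHeapCache(int n, long bytesPerInstance)
    {
        instances = new OffHeapLruInstance[n];
        for (int i = 0; i < n; i++)
            instances[i] = new OffHeapLruInstance(bytesPerInstance);
    }

    byte[] get(byte[] key)
    {
        OffHeapLruInstance inst = instances[indexFor(key)];
        inst.lock.lock(); // one coarse lock covers the map, LRU list, and stats
        try
        {
            return inst.get(key);
        }
        finally
        {
            inst.lock.unlock();
        }
    }

    private int indexFor(byte[] key)
    {
        int h = Arrays.hashCode(key);
        h ^= (h >>> 16);
        return (h & 0x7fffffff) % instances.length;
    }

    // Each instance owns a malloced hash table and LRU list via Unsafe; elided.
    static final class OffHeapLruInstance
    {
        final ReentrantLock lock = new ReentrantLock();

        OffHeapLruInstance(long capacityBytes) {}

        byte[] get(byte[] key) { return null; } // placeholder
    }
}
{code}
With enough instances the coarse locks are rarely contended, and everything that
happens under a lock is plain loads and stores.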
You could coalesce all the contended fields for each instance (stats, lock
field, LRU head) into a single cache line, and then rely on a single barrier
when releasing the coarse-grained lock. Fine-grained locking and CASing result
in several pipeline stalls, because the memory barriers implicit in each
operation require the store buffers to drain. There may even be a suitable
off-heap map implementation out there already.
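Concretely, because the cache maps its own memory, the layout of that hot cache
line is exact (unlike Java object fields, which the JVM may reorder). A
hypothetical sketch of such a header, again assuming reflective sun.misc.Unsafe
access:
{code:java}
import java.lang.reflect.Field;

// Hypothetical sketch: per-instance hot state packed into one 64-byte line.
final class InstanceHeader
{
    private static final sun.misc.Unsafe UNSAFE = loadUnsafe();

    static final int LOCK     = 0;  // int:  0 = free, 1 = held
    static final int HITS     = 4;  // int stats, guarded by the lock
    static final int MISSES   = 8;  // int
    static final int LRU_HEAD = 16; // long: address of the LRU head entry
    static final int LRU_TAIL = 24; // long
    static final int SIZE     = 64; // pad to a full line to avoid false sharing

    static boolean tryLock(long base)
    {
        // null base object + absolute address = off-heap CAS
        return UNSAFE.compareAndSwapInt(null, base + LOCK, 0, 1);
    }

    static void unlock(long base)
    {
        // A single volatile store is the one release barrier that publishes
        // all the plain writes (stats, LRU pointers) made while locked.
        UNSAFE.putIntVolatile(null, base + LOCK, 0);
    }

    static void recordHit(long base)
    {
        // Plain read-modify-write is fine: the coarse lock is held.
        UNSAFE.putInt(base + HITS, UNSAFE.getInt(base + HITS) + 1);
    }

    private static sun.misc.Unsafe loadUnsafe()
    {
        try
        {
            Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (sun.misc.Unsafe) f.get(null);
        }
        catch (Exception e)
        {
            throw new AssertionError(e);
        }
    }
}
{code}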
> Serializing Row cache alternative (Fully off heap)
> --------------------------------------------------
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Environment: Linux
> Reporter: Vijay
> Assignee: Vijay
> Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is partially off heap; keys are still stored in the
> JVM heap as ByteBuffers.
> * There is a higher GC cost for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better
> results, but this requires careful tuning.
> * Memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off
> heap and use JNI to interact with the cache. We might want to ensure that the
> new implementation matches the existing APIs (ICache), and the implementation
> needs to have safe memory access, low memory overhead, and as few memcpys as
> possible.
> We might also want to make this cache configurable.