[
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195418#comment-14195418
]
Ariel Weisberg edited comment on CASSANDRA-7438 at 11/4/14 12:06 AM:
---------------------------------------------------------------------
RE refcount:
I think hazard pointers (I've never used them personally) are the no-GC, no-refcount
way of handling this. The count also won't be fetched twice when it is uncontended,
which it will be in many cases, since it should be decref'd as soon as the data is
copied.
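To make that concrete, here is a minimal hazard-pointer sketch (hypothetical, in Java with GC doing the actual reclamation; an off-heap version would free the memory once a retired node is unprotected). Readers publish the pointer they are about to dereference in a per-thread slot; writers defer reclaiming an unlinked node until no slot still publishes it:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Minimal hazard-pointer sketch: one published slot per reader thread.
final class HazardPointers<T> {
    private final AtomicReference<T>[] slots;
    private final ThreadLocal<List<T>> retired = ThreadLocal.withInitial(ArrayList::new);

    @SuppressWarnings("unchecked")
    HazardPointers(int maxThreads) {
        slots = new AtomicReference[maxThreads];
        for (int i = 0; i < maxThreads; i++)
            slots[i] = new AtomicReference<>();
    }

    // Reader: publish the pointer, then re-check the source still holds it.
    T protect(int threadId, AtomicReference<T> src) {
        T t;
        do {
            t = src.get();
            slots[threadId].set(t);
        } while (t != src.get()); // re-validate after publishing
        return t;
    }

    void release(int threadId) {
        slots[threadId].set(null);
    }

    // Writer: a node already unlinked from the structure is reclaimed only
    // once no reader slot still publishes it (here GC frees it; off heap
    // this is where the memory would go back to the allocator).
    void retire(T node) {
        List<T> r = retired.get();
        r.add(node);
        r.removeIf(n -> !isProtected(n));
    }

    private boolean isProtected(T node) {
        for (AtomicReference<T> s : slots)
            if (s.get() == node) return true;
        return false;
    }
}
{code}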
I think that with the right QA work this solves the problem of running
arbitrarily large caches. That means running a validating workload in
continuous integration that demonstrates the cache doesn't lock up, leak, or
return the wrong answer. I would probably test directly against the cache to
get more iterations in.
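As a sketch of what I mean by testing directly against the cache: make the value a pure function of the key, so any stale or corrupt read shows up as a wrong answer, then hammer it from many threads. (ConcurrentHashMap here is just a stand-in for the cache implementation under test; all names are made up.)
{code:java}
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Validating workload sketch: value == key * 31 always, so any other
// non-null result from get() is a wrong answer.
final class CacheValidator {
    public static void main(String[] args) throws Exception {
        ConcurrentMap<Long, Long> cache = new ConcurrentHashMap<>(); // stand-in for the cache under test
        int threads = 8, keys = 1024;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong wrongAnswers = new AtomicLong();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                for (int i = 0; i < 1_000_000; i++) {
                    long k = rnd.nextLong(keys);
                    if (rnd.nextBoolean()) {
                        cache.put(k, k * 31);
                    } else {
                        Long v = cache.get(k);
                        if (v != null && v != k * 31)
                            wrongAnswers.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        System.out.println("wrong answers: " + wrongAnswers.get());
    }
}
{code}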
RE Implementation as a library via JNI:
We give up something by using JNI, so it only makes sense if we get something
else in return. The QA and release work created by JNI is pretty large. You
really need a plan for running something like Valgrind against a
comprehensive suite of tests. Valgrind doesn't run well with Java AFAIK, so you
end up doing things like running the native code in a separate process and
writing an interface amenable to that. Valgrind is also slow enough that
if you run all your tests against a configuration that uses it heavily, you
end up with timeouts and many hours of test runtime, plus the time spent
interpreting results.
Unsafe is worse in some respects because there is no Valgrind, and I can attest
that debugging an off-heap red-black tree is not fun.
I am not clear on why JNI is justified. It really seems like this could be
written against Unsafe, and then it would work on any platform. There are no
libraries or system calls in use that are only accessible via JNI. JNI
would make more sense if we were pulling in existing code like memcached that
already handles memory pooling, fragmentation, and concurrency. If it were in
Java you could use the Disruptor for the queue and would only need to implement
a thread-safe off-heap hash table.
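For a sense of what the Unsafe version involves, a minimal off-heap open-addressing long->long table might look like the following. This is a hypothetical single-threaded sketch: it assumes non-zero keys, a power-of-two capacity that never fills, and none of the locking or expiration discussed below.
{code:java}
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Off-heap open-addressing table, 16 bytes per slot: 8-byte key (0 = empty)
// followed by an 8-byte value. Linear probing on collision.
final class OffHeapLongTable {
    private static final Unsafe U = unsafe();
    private final long base; // address of the slot array
    private final int mask;  // capacity - 1; capacity must be a power of two

    OffHeapLongTable(int capacity) {
        mask = capacity - 1;
        base = U.allocateMemory(capacity * 16L);
        U.setMemory(base, capacity * 16L, (byte) 0); // mark all slots empty
    }

    void put(long key, long value) { // key must be non-zero; table must not fill
        for (int i = (int) key & mask; ; i = (i + 1) & mask) {
            long slot = base + i * 16L;
            long k = U.getLong(slot);
            if (k == 0 || k == key) {
                U.putLong(slot, key);
                U.putLong(slot + 8, value);
                return;
            }
        }
    }

    long get(long key, long missing) {
        for (int i = (int) key & mask; ; i = (i + 1) & mask) {
            long slot = base + i * 16L;
            long k = U.getLong(slot);
            if (k == key) return U.getLong(slot + 8);
            if (k == 0) return missing;
        }
    }

    void free() {
        U.freeMemory(base);
    }

    private static Unsafe unsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new AssertionError(e);
        }
    }
}
{code}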
RE Performance and implementation:
What kind of hardware was the benchmark run on? Server-class NUMA? I am just
wondering whether there are enough cores to bring out any scalability issues in
the cache implementation.
It would be nice to see a benchmark that showed the on heap cache falling over
while the off heap cache provides good performance.
The comments that follow aren't particularly relevant if performance is already
satisfactory under the configurations that matter.
Given the use of a heap allocator and locking, it might not make sense to have a
background thread do expiration. I think that splitting the cache into several
instances, with one lock around each instance, might result in less contention
overall, and it would scale up in a more straightforward way.
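The shape I have in mind, sketched on heap with LinkedHashMap standing in for each instance's LRU (the real thing would put the off-heap cache behind each segment's lock):
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// N independent LRU instances, one lock each; keys hash to a segment, so
// contention drops roughly by the segment count.
final class SegmentedCache<K, V> {
    private final Segment<K, V>[] segments;

    @SuppressWarnings("unchecked")
    SegmentedCache(int nSegments, int perSegmentCapacity) { // nSegments: power of two
        segments = new Segment[nSegments];
        for (int i = 0; i < nSegments; i++)
            segments[i] = new Segment<>(perSegmentCapacity);
    }

    private Segment<K, V> segmentFor(K key) {
        int h = key.hashCode();
        return segments[(h ^ (h >>> 16)) & (segments.length - 1)];
    }

    public V get(K key) { return segmentFor(key).get(key); }
    public void put(K key, V value) { segmentFor(key).put(key, value); }

    // synchronized stands in for the one lock per instance.
    private static final class Segment<K, V> {
        private final LinkedHashMap<K, V> lru;

        Segment(int capacity) {
            lru = new LinkedHashMap<K, V>(16, 0.75f, true) { // access order = LRU
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > capacity;
                }
            };
        }

        synchronized V get(K key) { return lru.get(key); }
        synchronized void put(K key, V value) { lru.put(key, value); }
    }
}
{code}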
It appears that some common operations will hit a global lock in may_expire()
quite frequently. There also seem to be other globally shared, frequently
mutated cache lines in the write path, such as the stats.
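If this were written in Java, as suggested above, the stats half is already solved by the JDK: LongAdder keeps striped, padded cells, so the hot write path never serializes on a shared counter cache line. A sketch:
{code:java}
import java.util.concurrent.atomic.LongAdder;

// Striped stats: increments hit per-thread cells, sum() folds them on read.
// Reads may be slightly stale, which is fine for cache stats.
final class CacheStats {
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();
    private final LongAdder evictions = new LongAdder();

    void hit() { hits.increment(); }
    void miss() { misses.increment(); }
    void evicted() { evictions.increment(); }

    long hitCount() { return hits.sum(); }
    long missCount() { return misses.sum(); }
    long evictionCount() { return evictions.sum(); }
}
{code}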
Is there something subtle in the locking that makes the custom queue and maps
necessary, or could you use the equivalents from Intel TBB and still make it
work? That would hypothetically be less code to QA and maintain.
I still need to dig more, but I am also not clear on why locks are necessary
for individual items. It looks like there is a table of them for all the
entries? My intuition is that it could be done without a lock, or at least
without a discrete lock per item. Striping against a padded pool of locks might
make sense if that isn't going to cause deadlocks. Apparently every
pthread_mutex_t is 40 bytes, according to a random Stack Overflow post. It
might make sense to store a lock field on the same cache line as the refcount,
or in the bucket of the hash table.
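To illustrate that last idea: a single 64-bit header word per entry, with the low bit as the lock and the remaining bits as the refcount, both updated by CAS, avoids a separate 40-byte mutex entirely. A hypothetical sketch, with AtomicLong standing in for a CAS'd long at the start of the entry:
{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Header word layout: bit 0 = lock, bits 1..63 = refcount.
final class EntryHeader {
    private static final long LOCK = 1L;
    private final AtomicLong word = new AtomicLong(); // off heap: a CAS'd long at entry offset 0

    void ref() { word.addAndGet(2); }                 // +1 ref, never touches the lock bit

    long unref() { return word.addAndGet(-2) >>> 1; } // returns remaining refs

    // May fail spuriously if a ref lands between the read and the CAS;
    // callers retry.
    boolean tryLock() {
        long w = word.get();
        return (w & LOCK) == 0 && word.compareAndSet(w, w | LOCK);
    }

    void unlock() {
        long w;
        do {
            w = word.get();
        } while (!word.compareAndSet(w, w & ~LOCK));
    }
}
{code}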
Another implementation question: do we want to use C++11? It would remove a
lot of platform- and compiler-specific code.
> Serializing Row cache alternative (Fully off heap)
> --------------------------------------------------
>
> Key: CASSANDRA-7438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Environment: Linux
> Reporter: Vijay
> Assignee: Vijay
> Labels: performance
> Fix For: 3.0
>
> Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is partially off heap; keys are still stored in
> the JVM heap as ByteBuffers.
> * There is a higher GC cost for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better
> results, but this requires careful tuning.
> * The memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off
> heap and use JNI to interact with the cache. We might want to ensure that the
> new implementation matches the existing API (ICache), and the implementation
> needs to have safe memory access, low memory overhead, and as few memcpys as
> possible.
> We might also want to make this cache configurable.