[ 
https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920405#comment-13920405
 ] 

Pavel Yaskevich commented on CASSANDRA-6689:
--------------------------------------------

bq. Zero copy would certainly be nice to retain. Otherwise why not just 
ref-count the memory?

I'm not sure what there is to retain if we do that copy anyway when we send to 
the wire. Wouldn't we be better off with ref counting and copying, but with 
separate allocators for temporary things (e.g. write/read buffers) and for 
longer-term ones (e.g. memtable, compression buffers)?
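
To make the "ref count and copy" alternative concrete, here is a minimal sketch. The class `RefCountedChunk` and its methods are hypothetical names invented for illustration, not anything in Cassandra or in the patch; it assumes the chunk's content is the buffer's remaining bytes.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not a Cassandra class.
final class RefCountedChunk {
    private final ByteBuffer memory;
    // The owner (e.g. the memtable) holds the first reference.
    private final AtomicInteger refs = new AtomicInteger(1);
    // Called once, when the last reference is dropped (e.g. return to a pool).
    private final Runnable onFree;

    RefCountedChunk(ByteBuffer memory, Runnable onFree) {
        this.memory = memory;
        this.onFree = onFree;
    }

    /** Tries to take a reference; returns false if the chunk was already freed. */
    boolean retain() {
        while (true) {
            int n = refs.get();
            if (n == 0) return false;
            if (refs.compareAndSet(n, n + 1)) return true;
        }
    }

    /** Drops a reference; the last release hands the memory back. */
    void release() {
        int n = refs.decrementAndGet();
        if (n == 0) onFree.run();
        else if (n < 0) throw new IllegalStateException("released too many times");
    }

    /** Copies the content into a temporary buffer, e.g. one bound for the wire. */
    void copyTo(ByteBuffer dst) {
        if (!retain()) throw new IllegalStateException("chunk already freed");
        try {
            // duplicate() leaves the stored buffer's position and limit untouched.
            dst.put(memory.duplicate());
        } finally {
            release();
        }
    }
}
```

The reference is only held for the duration of the copy, so nothing outside the owner needs to be tracked after the bytes land in the temporary buffer.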

So can you please answer the question: what is the biggest advantage of 
tracking everything and introducing a separate GC, a sort of RCU (I'm looking 
at you, OpOrder), vs. mem copy + different allocators for different things, 
given a fixed-size pre-allocated buffer pool off heap (we can also do COW for 
some of the chunks)? Copying is *not* a big problem (regarding 
throughput/latency) compared to other operations (do a simple memcpy test and 
see how many MB/s you can get copying from one pre-allocated pool to another). 
There is a trade-off, of course, as we are going to spend more memory to store 
temporary things, but as we have a fixed number of threads it is going to work 
out the same way as buffering open files in the steady system state... 
Temporary memory allocated by readers is *exactly* what we should be managing 
in the first place, because readers allocate the most and it is always the 
biggest concern for us (ParNew pauses are reaching 300 ms at 1 sec. intervals). 
That's why I'm saying let's use a "low level" allocator with its own arena and 
chunk sizes for things like reads, as we originally wanted, and have a separate 
"low level" allocator for the memtable: preallocate at startup, return buffers 
to the pool when we serialize for the wire, and go from there with global 
thresholds etc.
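
A rough sketch of the fixed-size pre-allocated pool and the copy test mentioned above, in Java rather than raw memcpy. `ChunkPool` and `CopyBench` are hypothetical names, the chunk count and size are arbitrary, and the copy assumes both pools use the same chunk size. It prints whatever throughput your machine gives; no number is claimed here.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical fixed-size pool of equally sized off-heap chunks, allocated once up front.
final class ChunkPool {
    private final ArrayBlockingQueue<ByteBuffer> free;
    private final int chunkSize;

    ChunkPool(int chunks, int chunkSize) {
        this.chunkSize = chunkSize;
        this.free = new ArrayBlockingQueue<>(chunks);
        for (int i = 0; i < chunks; i++)
            free.add(ByteBuffer.allocateDirect(chunkSize));
    }

    /** Returns a cleared chunk, or null when the pool is exhausted (caller decides how to back off). */
    ByteBuffer acquire() {
        ByteBuffer b = free.poll();
        if (b != null) b.clear();
        return b;
    }

    /** Hands a chunk back, e.g. after it has been serialized to the wire. */
    void release(ByteBuffer b) {
        if (b.capacity() != chunkSize || !free.offer(b))
            throw new IllegalArgumentException("buffer does not belong to this pool");
    }

    int available() { return free.size(); }
}

final class CopyBench {
    /** Copies one chunk into another `rounds` times and returns the observed MB/s. */
    static double copyThroughput(ChunkPool from, ChunkPool to, int rounds) {
        ByteBuffer src = from.acquire();
        ByteBuffer dst = to.acquire();
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            src.clear();
            dst.clear();
            dst.put(src); // bulk copy; assumes dst is at least as large as src
        }
        long elapsed = Math.max(1, System.nanoTime() - start);
        double mb = (double) rounds * src.capacity() / (1024.0 * 1024.0);
        from.release(src);
        to.release(dst);
        return mb / (elapsed / 1e9);
    }

    public static void main(String[] args) {
        ChunkPool memtablePool = new ChunkPool(4, 1 << 20);
        ChunkPool readPool = new ChunkPool(4, 1 << 20);
        copyThroughput(memtablePool, readPool, 200); // warm-up pass
        System.out.printf("%.0f MB/s%n", copyThroughput(memtablePool, readPool, 2000));
    }
}
```

Because the pools never grow, exhaustion shows up as a null from `acquire()`, which is where global thresholds or back-pressure would hook in.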

All that said, I would suggest that we do a separate "low level" allocator in 
this ticket (the same way as jemalloc), plug it into the memtable, and do a 
copy on the read path. Once that step is done, we can plug the allocator into 
all of the serialization/deserialization logic that we have for data going 
from/to the wire, and the last step would be to use the allocator for all of 
the sstable read operations.
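
As a sketch of the first step only (allocator plugged into the memtable, copy on the read path), something along these lines. `LowLevelAllocator`, `DirectAllocator` and `SketchMemtable` are hypothetical names, the allocator just wraps `ByteBuffer.allocateDirect`, and the read copy goes to the heap for simplicity; in the proposal it would presumably come from the separate read allocator instead.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical allocator seam; names are illustrative, not Cassandra's API.
interface LowLevelAllocator {
    ByteBuffer allocate(int size);
}

// Simplest possible stand-in: no arenas or pooling, just off-heap allocation.
final class DirectAllocator implements LowLevelAllocator {
    public ByteBuffer allocate(int size) { return ByteBuffer.allocateDirect(size); }
}

final class SketchMemtable {
    private final LowLevelAllocator allocator;
    private final ConcurrentSkipListMap<String, ByteBuffer> rows = new ConcurrentSkipListMap<>();

    SketchMemtable(LowLevelAllocator allocator) { this.allocator = allocator; }

    /** Write path: copy the value into allocator-owned memory. */
    void put(String key, ByteBuffer value) {
        ByteBuffer stored = allocator.allocate(value.remaining());
        stored.put(value.duplicate());
        stored.flip();
        rows.put(key, stored);
    }

    /** Read path: copy out, so callers never hold a reference into memtable memory. */
    ByteBuffer get(String key) {
        ByteBuffer stored = rows.get(key);
        if (stored == null) return null;
        ByteBuffer copy = ByteBuffer.allocate(stored.remaining());
        copy.put(stored.duplicate());
        copy.flip();
        return copy;
    }
}
```

The later steps would swap other call sites (wire serialization, sstable reads) onto the same interface without touching the memtable again.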

All this might seem irrelevant to some people, but I can say that after 
fire-fighting Cassandra for some time across different use cases, it's not the 
memtable that creates most of the noise and memory pressure in the system 
(even though it uses a big chunk of the heap) but the reads and internode 
communication (especially the latter).

> Partially Off Heap Memtables
> ----------------------------
>
>                 Key: CASSANDRA-6689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>             Fix For: 2.1 beta2
>
>         Attachments: CASSANDRA-6689-small-changes.patch
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
