[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920405#comment-13920405 ]
Pavel Yaskevich commented on CASSANDRA-6689:
--------------------------------------------

bq. Zero copy would certainly be nice to retain. Otherwise why not just ref-count the memory?

I'm not sure what there is to retain here if we do that copy anyway when we send to the wire. Wouldn't we be better off with ref counting and copying, but with separate allocators for temporary things (e.g. write/read buffers) and for longer-lived ones (e.g. memtable, compression buffers)? So can you please answer the question: what is the biggest advantage of tracking everything and introducing a separate GC and a sort of RCU (I'm looking at you, OpOrder), versus mem copy plus different allocators for different things, given a fixed-size pre-allocated buffer pool off heap (we can also do COW for some of the chunks)?

Copying would *not* be a big problem (regarding throughput/latency) compared to other operations (do a simple memcpy test and see how many MB/s you can get copying from one pre-allocated pool to another). There is a trade-off, of course, as we are going to spend more memory storing temporary things, but as we have a fixed number of threads it is going to work out the same way as buffering open files does in the steady system state.

Temporary memory allocated by readers is *exactly* what we should be managing in the first place, because readers allocate the most and that is always the biggest concern for us (ParNew pauses are reaching 300 ms in 1 sec. intervals). That's why I'm saying let's use a "low level" allocator for things like reads, with its own arena and chunk sizes, as we originally wanted, and have a separate "low level" allocator for the memtable: preallocate at startup, return buffers to the pool when we serialize for the wire, and go from there with global thresholds etc.
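
For illustration only, the "pre-allocated pool plus copy on read" idea above could be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not Cassandra code and not part of any patch on this ticket: the class `FixedBufferPool`, the method `copyForRead`, and the chunk counts and sizes are all hypothetical names chosen for the example.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;

/**
 * Hypothetical fixed-size pool of off-heap chunks, all allocated up front.
 * One instance per use (e.g. one for the memtable, one for reads).
 */
final class FixedBufferPool {
    private final ArrayBlockingQueue<ByteBuffer> free;
    private final int chunkSize;

    FixedBufferPool(int chunkCount, int chunkSize) {
        this.chunkSize = chunkSize;
        this.free = new ArrayBlockingQueue<>(chunkCount);
        // Preallocate every chunk at startup; nothing is allocated afterwards.
        for (int i = 0; i < chunkCount; i++) {
            free.add(ByteBuffer.allocateDirect(chunkSize));
        }
    }

    /** Returns a cleared chunk, or null when the pool is exhausted. */
    ByteBuffer acquire() {
        ByteBuffer chunk = free.poll();
        if (chunk != null) {
            chunk.clear();
        }
        return chunk;
    }

    /** Hands a chunk back to the pool; the caller must not use it afterwards. */
    void release(ByteBuffer chunk) {
        if (!chunk.isDirect() || chunk.capacity() != chunkSize) {
            throw new IllegalArgumentException("chunk does not belong to this pool");
        }
        // offer() fails only when the pool is already full, so this catches
        // over-release in that case but not every possible double release.
        if (!free.offer(chunk)) {
            throw new IllegalStateException("pool is already full");
        }
    }

    /** Number of chunks currently available. */
    int available() {
        return free.size();
    }

    /**
     * Copy on read: copies the remaining bytes of src into a chunk taken
     * from readPool. src's position and limit are left untouched.
     */
    static ByteBuffer copyForRead(ByteBuffer src, FixedBufferPool readPool) {
        if (src.remaining() > readPool.chunkSize) {
            throw new IllegalArgumentException("value larger than one chunk");
        }
        ByteBuffer dst = readPool.acquire();
        if (dst == null) {
            throw new IllegalStateException("read pool exhausted");
        }
        dst.put(src.duplicate()); // duplicate() keeps src's own position intact
        dst.flip();               // make the copied bytes readable
        return dst;
    }
}
```

In this sketch a reader would call `copyForRead(cell, readPool)` and then `readPool.release(copy)` once the bytes have been serialized for the wire, so the memtable chunk is never referenced outside the read itself. Ref counting, COW, and global thresholds are deliberately left out.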

All that said, I would suggest that in this ticket we build a separate "low level" allocator (in the same way as jemalloc), plug it into the memtable, and do a copy on the read path. Once that step is done, we can plug the allocator into all of the serialization/deserialization logic for data going to and from the wire, and the last step would be to use the allocator for all of the sstable read operations.

All this might seem irrelevant to some people, but I can say that after fire-fighting Cassandra for some time across different use cases, it's not the memtable that creates most of the noise and memory pressure in the system (even though it uses a big chunk of the heap) but the reads and internode communication (especially the latter).

> Partially Off Heap Memtables
> ----------------------------
>
> Key: CASSANDRA-6689
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Benedict
> Assignee: Benedict
> Fix For: 2.1 beta2
>
> Attachments: CASSANDRA-6689-small-changes.patch
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)

--
This message was sent by Atlassian JIRA
(v6.2#6252)