[
https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920405#comment-13920405
]
Pavel Yaskevich commented on CASSANDRA-6689:
--------------------------------------------
bq. Zero copy would certainly be nice to retain. Otherwise why not just
ref-count the memory?
I'm not sure what there is to retain here if we do that copy when we send to
the wire. Wouldn't we be better off with ref counting and a copy, but with
separate allocators for temporary things (e.g. write/read buffers) and for
longer-term ones (e.g. memtable, compression buffers)?
So can you please answer the question: what is the biggest advantage of
tracking everything, introducing a separate GC and a sort of RCU (I'm looking
at you, OpOrder), vs. mem copy + different allocators for different things,
given a fixed-size pre-allocated buffer pool off heap (we can also do COW for
some of the chunks)? Copy is *not* a big problem (regarding throughput/latency)
compared to other operations (do a simple memcpy test and see how many MB/s
you can get copying from one pre-allocated pool to another). There is a
trade-off of course, as we are going to spend more memory to store temporary
things, but as we have a fixed number of threads it is going to work out the
same way as buffering open files does in the steady system state... Temporary
memory allocated by readers is *exactly* what we should be managing in the
first place, because readers allocate the most and that is always the biggest
concern for us (ParNew pauses are reaching 300 ms in 1 sec. intervals). That's
why I'm saying let's use a "low level" allocator for things like reads, with
its own arena and chunk sizes, as we originally wanted, and have a separate
"low level" allocator for the memtable: preallocate at startup, return buffers
to the pool when we serialize for the wire, and go from there with global
thresholds etc.
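For anyone who wants to try the memcpy test mentioned above, here is a minimal
sketch. It assumes plain direct ByteBuffers stand in for the two pre-allocated
off-heap pools; the class name CopyThroughput and the buffer sizes are made up
for illustration and are not taken from the patch. The number it prints depends
entirely on the machine and JVM, so treat it as a rough indication only.

```java
import java.nio.ByteBuffer;

/** Hypothetical micro-test: bulk copy between two pre-allocated direct buffers. */
final class CopyThroughput {

    /** Copies all of src into dst `rounds` times and returns the observed MB/s. */
    static double measure(ByteBuffer src, ByteBuffer dst, int rounds) {
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            src.clear();   // reset position/limit; contents are untouched
            dst.clear();
            dst.put(src);  // bulk copy of src.remaining() bytes
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        double megabytes = (double) src.capacity() * rounds / (1024.0 * 1024.0);
        return megabytes / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        int size = 16 * 1024 * 1024;  // assumed size of each "pool", 16 MB
        ByteBuffer src = ByteBuffer.allocateDirect(size);
        ByteBuffer dst = ByteBuffer.allocateDirect(size);
        measure(src, dst, 10);        // warm-up rounds, result discarded
        System.out.printf("%.0f MB/s%n", measure(src, dst, 50));
    }
}
```

This is not a rigorous benchmark (no JMH, a single thread, one copy size), but
it is enough to compare the cost of a copy with the other operations on the
read path.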
After all that said, I would suggest that we do a separate "low level"
allocator in this ticket (the same way as jemalloc), plug it into the memtable,
and do a copy on the read path. Once that step is done we can plug the
allocator into all of the serialization/deserialization logic we have for data
going from/to the wire, and the last step would be to use the allocator for all
of the sstable read operations.
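To make the first step concrete, below is a rough sketch of a fixed-size,
pre-allocated off-heap pool with a copy on the read path. FixedBufferPool and
its methods are hypothetical names invented for this illustration; nothing
here is existing Cassandra code or the design in the attached patch, and a real
allocator would need arenas, multiple chunk sizes and global thresholds on top.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;

/** Hypothetical pool of equally sized direct buffers, allocated once at startup. */
final class FixedBufferPool {
    private final ArrayBlockingQueue<ByteBuffer> free;
    private final int chunkSize;

    FixedBufferPool(int chunks, int chunkSize) {
        this.chunkSize = chunkSize;
        this.free = new ArrayBlockingQueue<>(chunks);
        for (int i = 0; i < chunks; i++)
            free.add(ByteBuffer.allocateDirect(chunkSize)); // preallocate everything
    }

    /** Returns a cleared chunk, or null if the pool is exhausted. */
    ByteBuffer acquire() {
        ByteBuffer chunk = free.poll();
        if (chunk != null)
            chunk.clear();
        return chunk;
    }

    /** Hands a chunk back, e.g. once its contents were serialized to the wire. */
    void release(ByteBuffer chunk) {
        if (!chunk.isDirect() || chunk.capacity() != chunkSize)
            throw new IllegalArgumentException("buffer does not belong to this pool");
        if (!free.offer(chunk))
            throw new IllegalStateException("pool is already full");
    }

    int available() {
        return free.size();
    }

    /**
     * Copy-on-read: copies the remaining bytes of src into a chunk from this
     * pool. Returns null if no chunk is free. src's position is left unchanged.
     */
    ByteBuffer copyFrom(ByteBuffer src) {
        if (src.remaining() > chunkSize)
            throw new IllegalArgumentException("value larger than chunk size");
        ByteBuffer dst = acquire();
        if (dst == null)
            return null;
        dst.put(src.duplicate()); // duplicate() keeps src's own position intact
        dst.flip();               // make the copied bytes readable
        return dst;
    }
}
```

With one pool for the memtable and another for reads, a reader would call
copyFrom() on the read pool with a memtable buffer as the source, and release()
the copy after writing it to the wire, so the memtable chunk never has to be
tracked beyond the copy.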
All this might seem irrelevant to some people, but I can say that after
fire-fighting Cassandra for some time across different use cases, it's not the
memtable which creates most of the noise and memory pressure in the system
(even though it uses a big chunk of heap) but the reads and internode
communication (especially the latter).
> Partially Off Heap Memtables
> ----------------------------
>
> Key: CASSANDRA-6689
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Benedict
> Assignee: Benedict
> Fix For: 2.1 beta2
>
> Attachments: CASSANDRA-6689-small-changes.patch
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)