[
https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921110#comment-13921110
]
Pavel Yaskevich commented on CASSANDRA-6689:
--------------------------------------------
bq. I've stated clearly what this introduces as a benefit: overwrite workloads
no longer cause excessive flushes
If you copy out of the memtable buffer beforehand, you can clearly return that
space to the allocator once it's overwritten or otherwise becomes useless in
the process of merging columns with the previous row contents.
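As a minimal sketch of what I mean (hypothetical names, not the actual patch):

{code:java}
import java.nio.ByteBuffer;

// Hypothetical illustration only: merging an incoming column value over an
// existing one. The new value is copied into memtable-owned memory, and the
// overwritten copy is handed straight back to the allocator, since under a
// copy-on-read scheme nothing else can still reference it.
final class CopyingMerge
{
    interface Allocator
    {
        ByteBuffer allocate(int size);
        void free(ByteBuffer buffer); // returns the space to the free list
    }

    static ByteBuffer merge(Allocator allocator, ByteBuffer existing, ByteBuffer incoming)
    {
        ByteBuffer copy = allocator.allocate(incoming.remaining());
        copy.put(incoming.duplicate());
        copy.flip();
        if (existing != null)
            allocator.free(existing); // overwritten, so recycle immediately
        return copy;
    }
}
{code}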
bq. Your next sentence states how this is a large cause of memory consumption,
so surely we should be using that memory if possible for other uses (returning
it to the buffer cache, or using it internally for more caching)?
It doesn't state that it is a *large cause of memory consumption*; it states
that copying has an additional cost, but in the steady state we won't be
allocating over the limit because of the properties of the system that we
have, namely the fixed number of threads.
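For illustration (numbers assumed, not taken from any benchmark): with 32
concurrent write threads each merging a row of at most 8 MB, the transient
copy overhead is bounded by 32 * 8 MB = 256 MB at any instant, however long
the node runs, because the fixed thread count caps the number of in-flight
copies.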
bq. Are you performing a full object tree copy, and doing this with a running
system to see how it affects the performance of other system components? If
not, it doesn't seem to be a useful comparison. Note that this will still
create a tremendous amount of heap churn, as most of the memory used by objects
right now is on-heap. So copying the records is almost certainly no better for
young gen pressure than what we currently do - in fact, it probably makes the
situation worse.
Do you mean this? Let's say we copy a Cell (or Column) object, which is one
level deep, so we just allocate additional space for the object headers and do
a copy; most of the work would be spent copying the data (name/value) anyway.
Since we want to live inside ParNew, see how many such allocations you can do
in, e.g., 1 second, then wipe the whole thing and do it again. We are doing
mlockall too, which should make that even faster, as we are sure the heap is
pre-faulted already.
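Roughly the experiment I mean, as a sketch; the Cell shape and buffer sizes
here are illustrative stand-ins, not Cassandra's actual classes:

{code:java}
import java.nio.ByteBuffer;

// Sketch of the experiment: count one-level-deep Cell copies per second, drop
// everything, repeat. All garbage should stay in the young generation (ParNew).
final class CopyThroughput
{
    static final class Cell
    {
        final ByteBuffer name, value;
        Cell(ByteBuffer name, ByteBuffer value) { this.name = name; this.value = value; }

        Cell copy()
        {
            // one level deep: new object header plus copies of name/value payloads
            ByteBuffer n = ByteBuffer.allocate(name.remaining()).put(name.duplicate());
            ByteBuffer v = ByteBuffer.allocate(value.remaining()).put(value.duplicate());
            n.flip();
            v.flip();
            return new Cell(n, v);
        }
    }

    public static void main(String[] args)
    {
        Cell template = new Cell(ByteBuffer.allocate(16), ByteBuffer.allocate(64));
        for (int round = 0; round < 10; round++)
        {
            long copies = 0, sink = 0;
            long end = System.nanoTime() + 1_000_000_000L;
            while (System.nanoTime() < end)
            {
                Cell c = template.copy();
                sink += c.value.capacity(); // keep the copy from being optimized away
                copies++;
            }
            // wipe: nothing from this round survives into the next one
            System.out.println("round " + round + ": " + copies + " copies/sec (sink=" + sink + ")");
        }
    }
}
{code}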
bq. It may not be causing the young gen pressure you're seeing, but it
certainly offers some benefit here by keeping more rows in memory so recent
queries are more likely to be answered with zero allocation, so reducing young
gen pressure; it is also a foundation for improving the row cache and
introducing a shared page cache which could bring us closer to zero allocation
reads. _And so on...._
I'm not sure how this would help in the case of the row cache: once a
reference is added to the row cache, the memtable has to hang around until
that row is purged. So if there is a long-lived row (written once, read
multiple times) in each of the regions (and we reclaim based on regions),
wouldn't that keep the memtable around longer than expected?
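To illustrate the concern, a sketch with hypothetical names; nothing here is
from the actual row cache code:

{code:java}
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the concern: if reclamation is per region, a single long-lived row
// referenced by the row cache pins its whole region, and a memtable spanning
// several such regions stays alive until all of them drain.
final class Region
{
    private final ByteBuffer slab = ByteBuffer.allocateDirect(1 << 20);
    private final AtomicInteger liveRows = new AtomicInteger();

    // e.g. the row cache takes a reference into this region
    ByteBuffer shareRow(int offset, int length)
    {
        liveRows.incrementAndGet();
        ByteBuffer row = slab.duplicate();
        row.position(offset);
        row.limit(offset + length);
        return row.slice();
    }

    void releaseRow() { liveRows.decrementAndGet(); } // called when the row is purged

    boolean canRecycle()
    {
        // one write-once/read-many row is enough to keep this false indefinitely
        return liveRows.get() == 0;
    }
}
{code}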
bq. It's also not clear to me how you would be managing the reclaim of the
off-heap allocations without OpOrder, or do you mean to only use off-heap
buffers for readers, or to ref-count any memory as you're reading it? Not using
off-heap memory for the memtables would negate the main original point of this
ticket: to support larger memtables, thus reducing write amplification.
Ref-counting incurs overhead linear to the size of the result set, much like
copying, and is also fiddly to get right (not convinced it's cleaner or
neater), whereas OpOrder incurs overhead proportional to the number of times
you reclaim. So if you're using OpOrder, all you're really talking about is a
new RefAction: copyToAllocator() or something. So it doesn't notably reduce
complexity, it just reduces the quality of the end result.
In terms of memory usage, copying adds an additional linear cost, yes, but at
the same time it makes the system's behavior more controllable/predictable,
which is what ops usually care about. Even on the artificial stress test,
performance seems to be lower once the off-heap feature is enabled, which is
no surprise once you look at how much complexity it actually adds.
bq. Also, I'd love to see some evidence for this (particularly the latter). I'm
not disputing it, just would like to see what caused you to reach these
conclusions. These definitely warrant separate tickets IMO, but if you have
evidence for it, it would help direct any work.
Well, it seems like you have never operated a real Cassandra cluster, have
you? All of the problems that I have listed here are well known; you can even
simulate them with Docker VMs by making the internal network gradually slower.
There is no built-in back-pressure mechanism, so right now Cassandra would
accept a bunch of operations at normal speed (if the outgoing link is
physically different from the internal one) but then would suddenly just stop
accepting anything and fail internally because of a GC storm caused by all of
the internode buffers hanging around.
> Partially Off Heap Memtables
> ----------------------------
>
> Key: CASSANDRA-6689
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Benedict
> Assignee: Benedict
> Fix For: 2.1 beta2
>
> Attachments: CASSANDRA-6689-small-changes.patch
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)
--
This message was sent by Atlassian JIRA
(v6.2#6252)