[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13946318#comment-13946318 ] Benedict commented on CASSANDRA-6689: - Agreed. Pushed updated version to branch, which also modifies the FileUtils.clean() code to be the neater direct cast, and to check for the possibility by actually attempting a clean of a DirectByteBuffer. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13946494#comment-13946494 ] Tupshin Harper commented on CASSANDRA-6689: --- +1 Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13946986#comment-13946986 ] Jonathan Ellis commented on CASSANDRA-6689: --- You good with the latest, Pavel? Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13947001#comment-13947001 ] Pavel Yaskevich commented on CASSANDRA-6689: Can we at least make return code 0 so automated tools can interpret this correctly? Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13947020#comment-13947020 ] Jonathan Ellis commented on CASSANDRA-6689: --- Sure. Can you do that on commit? I think Benedict is gone for the night. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13947123#comment-13947123 ] Pavel Yaskevich commented on CASSANDRA-6689: Sure, I will do -1 return there. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13947219#comment-13947219 ] Jonathan Ellis commented on CASSANDRA-6689: --- Thanks [~benedict] [~xedin] [~krummas]! Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13945906#comment-13945906 ] Benedict commented on CASSANDRA-6689: - I've updated the repository to include your remaining changes. One last change I did not include was your suppression of errors from calling the direct cleaner: IMO this is dangerous, and the user should immediately find out that we do not have full support for off-heap functionality by crashing, so that they can reconfigure to use the on-heap allocator only. If we cannot clean, we are at risk of completely trashing the file system page cache through competition with our direct buffers. Better to fail badly ASAP. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13945949#comment-13945949 ] Pavel Yaskevich commented on CASSANDRA-6689: bq. One last change I did not include was your suppression of errors from calling the direct cleaner: IMO this is dangerous, and the user should immediately find out that we do not have full support for off-heap functionality by crashing, so that they can reconfigure to use the on-heap allocator only. If we cannot clean, we are at risk of completely trashing the file system page cache through competition with our direct buffers. Better to fail badly ASAP. First of all it would not fail badly ASAP because that cause ClassCast or NoSuchMethod exception, which would just get logged but the executing thread. It's better to error to the log with something useful e.g. that the cleanup is not going to happen rather then fails with ClassCastException which doesn't say much to the users the same way we do for FileUtils. [~jbellis] Can you remind us what is the general policy on the crashing? :) Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13945955#comment-13945955 ] Pavel Yaskevich commented on CASSANDRA-6689: After thinking about this once again, it would be better if we can detect (e.g. FileUtils.isCleanerAvailable()) that on startup and just fail over to the heap allocation, also we can reuse FileUtils.clean to accept ByteBuffer... Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13946168#comment-13946168 ] Jonathan Ellis commented on CASSANDRA-6689: --- bq. it would be better if we can detect (e.g. FileUtils.isCleanerAvailable()) that on startup and just fail over to the heap allocation That's along the lines of what I'd expect as a user. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944136#comment-13944136 ] Benedict commented on CASSANDRA-6689: - bq. One more thing I forgot to mention which is for consideration, we might take advantage of SlabAllocator (on-heap) to accommodate all of the temporary buffers created by copying from Memtable so when that allocation spills to oldgen it would create less fragmentation in there. This is likely to require as much complexity as RefAction to ensure resource cleanup, at which point we might as well just have proper zero copy. As to your suggested changes, I am fine with most of the generics changes, but -1 on several of your renames and your config note changes: unslabbed_heap_buffers is *not* intended to be included in the config file, as it is very much discouraged. Including slab in the name adds nothing, whereas buffers permits other non-buffer implementations in future without confusion in naming/renaming. I might change offheap_buffers to direct_buffers for consistency with the JDK though. heapAllocOnly is uglier and no better than allocateOnHeap. If you want, make it allocateOnHeapOnly. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944176#comment-13944176 ] Benedict commented on CASSANDRA-6689: - bq. unslabbed_heap_buffers is not intended to be included in the config file I realise that my comments in the config file do imply a non-slabbed allocation method, but I propose this is fixed instead of renaming + including the unslabbed option. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944213#comment-13944213 ] Pavel Yaskevich commented on CASSANDRA-6689: bq. This is likely to require as much complexity as RefAction to ensure resource cleanup, at which point we might as well just have proper zero copy. I has thinking of thing as easy as replacing HeapAllocator.instance with SlabAllocator.instance (which would only have a heap allocation enabled). bq. As to your suggested changes, I am fine with most of the generics changes, but -1 on several of your renames and your config note changes: unslabbed_heap_buffers is not intended to be included in the config file, as it is very much discouraged. Including slab in the name adds nothing, whereas buffers permits other non-buffer implementations in future without confusion in naming/renaming. I might change offheap_buffers to direct_buffers for consistency with the JDK though. If you have unslabbed_heap_buffers in your code then you better document it in the config, because people would find it and create a issue for that. Rather from that I don't care much about names as soon as they present what they do - {on}heap_slab - means that those buffers are going to be allocated in the on-heap slabs, same goes for offheap_slab (I was thinking of using direct_slab but all of the code has it as offheap), jvm_default is just a default way of allocating single buffers. bq. heapAllocOnly is uglier and no better than allocateOnHeap. If you want, make it allocateOnHeapOnly. I'm fine with allocateOnHeapOnly if you like big names better, just allocateOnHeap doesn't coalesce with a usage of the variable. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944214#comment-13944214 ] Benedict commented on CASSANDRA-6689: - bq. because people would find it and create a issue for that My understanding (not my policy) is that this is explicitly not a default supported option (in the current state of affairs) because of the expectation of increased heap fragmentation and GC death spiral, which is why I'm happy with the ugly name for that one option. bq. I has thinking of thing as easy as replacing HeapAllocator.instance with SlabAllocator.instance (which would only have a heap allocation enabled). In this case what is to be gained? Seems likely it would only increase the need for old gen collections, as we wouldn't be able to do any reuse? Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944217#comment-13944217 ] Pavel Yaskevich commented on CASSANDRA-6689: bq. My understanding (not my policy) is that this is explicitly not a default supported option (in the current state of affairs) because of the expectation of increased heap fragmentation and GC death spiral, which is why I'm happy with the ugly name for that one option. So we can go both ways - remove it completely from the allocation type enum and don't support it or leave it in place and add it to documentation so users can make informed decision weather to use it or not meanwhile default would still be heap_slab or similar name. bq. In this case what is to be gained? Seems likely it would only increase the need for old gen collections, as we wouldn't be able to do any reuse? It could perform the same function as previous heap slab, for par new it doesn't make much different but if slab slips to the old get then it would create less fragmentation comparing to individual buffers getting promoted (plus we can pack some fresh allocations inside), but that's just an idea and isn't critical for this ticket... Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944247#comment-13944247 ] Jonathan Ellis commented on CASSANDRA-6689: --- bq. So we can go both ways - remove it completely from the allocation type enum and don't support it or leave it in place and add it to documentation Why are those our only choices? It's the way it is for a reason; for a very few users, it's legitimately useful for running on the Azul JVM or for creating huge numbers of CFs, but for mainstream users it's a Very Bad Idea. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944262#comment-13944262 ] Pavel Yaskevich commented on CASSANDRA-6689: [~jbellis] Sorry if I want clear, by 'it' i mean unslabbed_buffers or jvm_default as I call it. I think we should mention it as legitimate option in the config instead of having that it only in the code... Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944310#comment-13944310 ] Jonathan Ellis commented on CASSANDRA-6689: --- Right, I'm saying that it's undocumented right now (CASSANDRA-5935) and I'd rather leave it that way. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944314#comment-13944314 ] Pavel Yaskevich commented on CASSANDRA-6689: Ok, I don't like that we don't mention things like that in the config with big fat warning, but if you guys think opposite (as you have more context) I am fine with leaving it out... Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942954#comment-13942954 ] Sylvain Lebresne commented on CASSANDRA-6689: - bq. Reviewing under the assumption that 1-3 will go in 2.1 and the rest into 3.0 So at the risk of adding more debate, I do think one of the point of separating things is also that we can evaluate things separately. Typically, I would be happy to see a quick evaluation of what #1+#2 gives us, and what #3 adds to the table. If 1+2 is borderline unusable without 3 (and if that's the case, it should be easy to cook up simple graph that shows it), then fine, let's do 3 in 2.1. But if it turns out that the difference is no that big, I would personally be keen to move #3 to 3.0, cause we're shoving enough big patches in 2.1 post beta1 as it is imo. To put it another way, if we're making steps, let's try to be true to them, and not make too strong assumptions about what goes where, because as far as I'm concerned, nothing should never be set in stone until it's committed (and even then really). Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942958#comment-13942958 ] Benedict commented on CASSANDRA-6689: - Can everyone please settle on what functionality is desired for this ticket/release, so I do not need to keep reorganising it. I am in the middle of rebasing and reorganising it again, but I will wait until some consensus is reached. If we only want #2 (#1 is useless on its own), the patch set will be quite different; I will drop all of the refactoring. If we plan to include #3 (note that this quite provably reduces fixed on-heap overhead by a minimum of 75%, and easily 90% for clustering columns), I will retain #1 with some minor tweaks, and do my best to further make the commits incremental, but I am not going to completely reorder everything for no good reason. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942966#comment-13942966 ] Sylvain Lebresne commented on CASSANDRA-6689: - Well, this was kind of my point. If we decide to make steps that are properly defined and self contained (which is my understanding of what people are asking here, and what I though was agreed on truly), then we do not have to decide right away if we sign for step 2 by taking step 1. Otherwise, what's the point of separating it in steps? Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942968#comment-13942968 ] Benedict commented on CASSANDRA-6689: - bq. what's the point of separating it in steps? The point here was to make review easier. This is one single ticket. If we want to split it out into multiple tickets, please let's agree that and then I'll do it. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942995#comment-13942995 ] Jonathan Ellis commented on CASSANDRA-6689: --- That is a good way to think about it -- we are asking for patches or patchsets that are self contained enough to be split out into a ticket apiece if necessary. If the divisions I proposed are a poor fit for that, by all means let's come up with ones that make better sense. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943005#comment-13943005 ] Benedict commented on CASSANDRA-6689: - FTR, this is unlikely to improve review, as ridding ourselves of #1 means #3 will simply be much bigger. However it seems this is the consensus, so proceeding. This may take some time. I will reopen CASSANDRA-6694 to take step #2, and have this ticket take #1. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943285#comment-13943285 ] Benedict commented on CASSANDRA-6689: - Patch for this ticket available [here|https://github.com/belliottsmith/cassandra/tree/iss-6689-final] I have made absolutely no unnecessary changes or refactors. There is one commit only, as there was no good way to make any of the changes for this specific commit whilst leaving it in a compilable state. For CASSANDRA-6694 I will split the patch up into multiple commits, but this is likely to take quite some time. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943299#comment-13943299 ] Jonathan Ellis commented on CASSANDRA-6689: --- Thanks! That looks fairly straightforward now. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943787#comment-13943787 ] Pavel Yaskevich commented on CASSANDRA-6689: One more thing I forgot to mention which is for consideration, we might take advantage of SlabAllocator (on-heap) to accommodate all of the temporary buffers created by copying from Memtable so when that allocation spills to oldgen it would create less fragmentation in there. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-final-changes.patch, CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941498#comment-13941498 ] Marcus Eriksson commented on CASSANDRA-6689: [~xedin] it is here: https://github.com/belliottsmith/cassandra/tree/iss-6689-3 Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942480#comment-13942480 ] Pavel Yaskevich commented on CASSANDRA-6689: [~benedict] I looked at the [#1|https://github.com/belliottsmith/cassandra/tree/iss-6689-1] and I have a concern... Can you elaborate why did you introduce all these Impl static classes? That looks especially interesting in context of Cell.Impl where e.g. ExpiringCell.Impl extends Cell.Impl all of the methods of which are static and then overrides Cell.Impl methods with it's own implementations, why can't this be solved by making Cell and descendants abstract classes and overriding required methods? Another example I see is in DecoratedKey - only Impl method used outside is Impl.compareTo(IPartitioner, ByteBuffer, RowPosition), which could be changed (moved) to be a static method of RowPosition and would call DecorateKey.compareTo(DecoratedKey) when needed, so there is no jump from Impl.compareTo(IPartitioner, ByteBuffer, RowPosition) to DecorateKey.compareTo(RowPosition) and back to Impl.compareTo(DecoratedKey, RowPosition). One more nit thing, in DecoratedKey.java there is no need to mark token() and key() explicitly abstract, also token() is already defined in RingPosition so no need to declare it in DecoratedKey. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942490#comment-13942490 ] Benedict commented on CASSANDRA-6689: - See my comment explaining [here|https://issues.apache.org/jira/browse/CASSANDRA-6694?focusedCommentId=13904708page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13904708] bq. One more nit thing, in DecoratedKey.java there is no need to mark token() and key() explicitly abstract, also token() is already defined in RingPosition so no need to declare it in DecoratedKey. The latter is an artefact of merging the getToken() and token() into the same method, and I agree. Although there's no harm in either, happy to change them. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942521#comment-13942521 ] Pavel Yaskevich commented on CASSANDRA-6689: How many times do I have to ask for the same thing in this ticket, *let's keep the changes to the point of the current ticket, please*?... As we don't need Impl for this ticket (if we need it at all in that way is the topic of the separate discussion, did we at least test how much do we actually safe by doing it that way? I think [~jbellis] with me on this one after reading CASSANDRA-6694) let's remove it, the less code we have to read the faster review process would go. So if the NativeCell is going to be introduced in #3 that means all of that Impl stuff should go there too. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942528#comment-13942528 ] Benedict commented on CASSANDRA-6689: - This ticket includes everything up to and including #3, so we DO need it for this ticket. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942550#comment-13942550 ] Pavel Yaskevich commented on CASSANDRA-6689: Let me repeat this again, if we need Impl for #3 it should be in #3 not #1. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942553#comment-13942553 ] Benedict commented on CASSANDRA-6689: - That is not repeating yourself, that is a different statement. You said anything not needed for this ticket. I introduced #1 specifically, prior to starting the split, as introducing all of the necessary refactors. I do not intend to change it again now, it has been waiting long enough. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942573#comment-13942573 ] Jonathan Ellis commented on CASSANDRA-6689: --- Benedict, 1-3 were intended to be distinct steps so that review could be correspondingly simplified. It does look to me like there is no clean breakdown as such in your branch, e.g. commits named in progress muddy the water considerably. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942576#comment-13942576 ] Pavel Yaskevich commented on CASSANDRA-6689: Here is the quote from my previous comment: _As we don't need Impl for this ticket before #3 (if we need it at all in that way is the topic of the separate discussion, did we at least test how much do we actually safe by doing it that way? I think Jonathan Ellis with me on this one after reading CASSANDRA-6694) let's remove it, the less code we have to read the faster review process would go. *So if the NativeCell is going to be introduced in #3 that means all of that Impl stuff should go there too*._ So first of all it's not clear if we even want to move in Impl direction as mentioned in CASSANDRA-6694, secondly we all agree that we are going to keep this to the point with roadmap we have so each of the points is individually deliverable/commitable, which essentially means less work for everybody and the ticket moves on faster, thirdly it looks like 4 people are reaching to one target reflected at least in Jonathan's roadmap where we want to get off-heap memtables with on-heap read side copy and so on _incrementally_, and you are just trying to do as such refactoring/changes as possible in all three of the bullet points, it looks that way because iss-6689-{1,2,3} all have at least *3K* additions and *2K* deletions, concerns about the size of those changes have already been raised by multiple people multiple times and yet we are still at the same spot with this. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942623#comment-13942623 ] Pavel Yaskevich commented on CASSANDRA-6689: Just for the reference, [a good example for the patch-set done right|https://github.com/iamaleksey/cassandra/compare/apache:cassandra-2.1...6506] all commits are to the point and built on top of each other, even thought it's ~4K total I can review whole thing quickly because there is no noise and it's efficient to traceback in isolated changes. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940231#comment-13940231 ] Pavel Yaskevich commented on CASSANDRA-6689: [~jbellis] I am, sorry, I got mixed up in something else but I'm going to look at it mid this week (tomorrow or thursday). Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940983#comment-13940983 ] Pavel Yaskevich commented on CASSANDRA-6689: [~benedict] So what is the branch to look at for #1? https://github.com/belliottsmith/cassandra/tree/iss-6689-1 still has DataGroup inside, which should have been removed after Marcus's review, shouldn't it? Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940202#comment-13940202 ] Jonathan Ellis commented on CASSANDRA-6689: --- Are you still planning to look at this soon, Pavel? Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13931656#comment-13931656 ] Benedict commented on CASSANDRA-6689: - Pushed an updated version with your changes merged, the cassandra.yaml reverted, and a theoretical deadlock risk eliminated in KeysSearcher and CompositesSearcher Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13932277#comment-13932277 ] Pavel Yaskevich commented on CASSANDRA-6689: Thanks, [~benedict]! I'm going to take a look at this tomorrow (i hope). Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13930239#comment-13930239 ] Benedict commented on CASSANDRA-6689: - Uploaded a fixed version of #3 to the same repository, to fix a couple of bugs spotted by [~krummas] Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13930396#comment-13930396 ] Marcus Eriksson commented on CASSANDRA-6689: Reviewing under the assumption that 1-3 will go in 2.1 and the rest into 3.0, otherwise there is a some of stuff in #1 that should be in #3 etc, but leaving that aside for now. Main point when reviewing was that I found myself trying to wrap my head around the Group concept several times, especially since it is not actually adding any functionality at this stage (I know it will when we do GC). We should probably remove it since it adds indirection that we don't need right now. Pushed a branch with the DataGroup and various Group classes in o.a.c.u.memory removed here: https://github.com/krummas/cassandra/commits/bes/6689-3.1 , wdyt? Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13930404#comment-13930404 ] Benedict commented on CASSANDRA-6689: - [~krummas]: It will make me sad, but you're absolutely right, it isn't necessary just yet. Only thing to question is the change of default in conf/cassandra.yaml, but guessing this is a debugging oversight. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13924780#comment-13924780 ] Benedict commented on CASSANDRA-6689: - Note we will probably want to either include CASSANDRA-6781 at this point, or introduce another change to copy all records on-heap during memtable flush, else we may see the slower flush performance Marcus indicated. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923770#comment-13923770 ] Benedict commented on CASSANDRA-6689: - bq. One native Buffer each for name and value. Probably we'll stick to a Buffer per name component here, rather than per name, for simplicity at this step. This timeline translates to mostly what I suggested already, with a little tweaking, so I'll work with that (I have a new step 1, though): # I will introduce the - .data refactor as my starting point, and refactors of .memory in 6694, all ignoring any Native implementations; # Introduce a NativeByteBufferAllocator, and always copy onto heap if reading from it # Introduce NativeCell, and a simplified NativeAllocator (i.e. without any magic to reclaim it) # Introduce RefAction + RefAction driven reclaim # Introduce GC Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923907#comment-13923907 ] Benedict commented on CASSANDRA-6689: - Okay, so #1 is now [here|https://github.com/belliottsmith/cassandra/tree/iss-6689-1] It also includes necessary patches not yet incorporated from CASSANDRA-6692, and redefines CellName.dataSize() to return the size on disk, not the size of the components, but otherwise only performs all the necessary refactoring. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13924194#comment-13924194 ] Benedict commented on CASSANDRA-6689: - #2 is now [here|https://github.com/belliottsmith/cassandra/tree/iss-6689-2] Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13924471#comment-13924471 ] Benedict commented on CASSANDRA-6689: - [#3|https://github.com/belliottsmith/cassandra/tree/iss-6689-3] Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13924711#comment-13924711 ] Jonathan Ellis commented on CASSANDRA-6689: --- Thanks, Benedict. Let's have Marcus and/or Pavel review those and create new tickets for 4/5. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922259#comment-13922259 ] Benedict commented on CASSANDRA-6689: - bq. Well, it seems like you never operated a real Cassandra cluster, did you? You seem to have interpreted my query as an attack on the veracity of your statement. It was not. I only wanted more specific facts that could be used to target a solution, and preferably a new ticket on which to discuss them. This discussion has all the hallmarks of approaching unproductivity, so after this I do not think I have anything useful to add to the discussion, and will leave the committers to decide whether or not to include this work or to wait for you to produce your alternative: # Any scheme that copies data will inherently incur larger GC pressure, as we then copy for memtable reads as well as disk reads. Object overhead is in fact _larger_ than the payload for many workloads, so even if we have arenas this effect is not eliminated or even appreciably ameliorated. # Temporary reader space (and hence your approach) is *not* predictable: it is not proportional to the number of readers, but to the number and size of columns the readers read. In fact it is larger than this, as we probably have to copy anything we *might* want to use (given the way the code is encapsulated, this is what I do currently when copying on-heap - anything else would introduce notable complexity), not just columns that end up in the result set. # We appear to be in agreement that your approach has higher costs associated with it. Further, copying potentially GB/s of (randomly located) data around destroys the CPU cache, reduces peak memory bandwidth by inducing strobes, consumes bandwidth directly, wastes CPU cycles waiting for the random lookups; all to no good purpose. We should be reducing these costs, not introducing more. # It is simply not clear, despite your assertion of clarity, how you would reclaim any freed memory without separate GC (what else is GC but this reclamation?), however you want to call it, when it will be interspersed with non-freed memory, nor how you would guard the non-atomic copying (ref-counting, OpOrder, Lock: what?). Without this information it is not clear to me that it would be any simpler either. # Your approach is currently (still poorly defined) vaporware. Some further advantages specific to my approach: # Pauseless operation, so improved predictability # Absolute bound on memory utilisation, that can be rolled out to other data structures, further improving overall performance predictability # Lock-freedom and low overhead, so we move closer to being able to answer queries directly from the messaging threads themselves, improving latency and throughput An alternative approach needs, IMO, to demonstrate a clear superiority to the patch that is already available, especially when it will incur further work to produce. It is not clear to me that your solution is superior in any regard, nor any simpler. It also seems to be demonstrably less predictable and more costly, so I struggle to see how it could be considered preferable. Also: bq. would that keep memtable around longer than expected I'm not sure why you suppose this would be so. We can already happily reclaim any subportion of a region or memtable, so there is no reason to think this would be necessary, even if they resided in the same structure. bq. there seems to be a low once off-heap feature is enabled which is no surprise once you look at how much complexity does it actually add. This is certainly addressable. The off-heap feature by itself I have performance tested somewhat, and competes with Java GC for throughput (beating it as number of live objects increases), whilst being _pauseless_, so the complexity you refer to is no slouch and highly unlikely to be the culprit. There are issues with the way we manage IO for direct byte buffers, but I have addressed these in CASSANDRA-6781. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922830#comment-13922830 ] Pavel Yaskevich commented on CASSANDRA-6689: You probably still don't understand my point so let me clarify, I only care about 3 things: maintainability, consistency, performance. This is a big chunk of code which somebody has to maintain, which allows of inconsistent style (can do it with refferer = null or maybe somehow else situation, maybe put RefAction as null in the argument or maybe RefAction.impossible() but it really need to look throught it to make sure that it check for null everywhere and so on), and brings it's own assumptions e.g. _ in front, also adding a poor performance. Before that is addressed, I'm -1 of this. vnodes were a big chunk of work but people were able to split into roadmap and successfully finish, so I don't see any reason why we can't do it here. bq. Any scheme that copies data will inherently incur larger GC pressure, as we then copy for memtable reads as well as disk reads. Object overhead is in fact larger than the payload for many workloads, so even if we have arenas this effect is not eliminated or even appreciably ameliorated. For disk reads we have to copy even for mmap, so we don't keep any references on deletion time and files can be safely deallocated. So why don't copy directly to the memory allocated by the pool?... Object overhead would stay inside ParNew bounds (for ( p999)) so object allocation is relatively cheap comparing to everything else, that's the goal of JVM as a whole. bq. Temporary reader space (and hence your approach) is not predictable: it is not proportional to the number of readers, but to the number and size of columns the readers read. In fact it is larger than this, as we probably have to copy anything we might want to use (given the way the code is encapsulated, this is what I do currently when copying on-heap - anything else would introduce notable complexity), not just columns that end up in the result set. Doesn't matter how many emphasises you put here it won't make it this argument stronger because, as the main idea is to have those pools of a fixed size which would create back-pressure to client in the situations of heavy load which is exactly what operators want - go gradually slower without extreme latency disturbance. bq. We appear to be in agreement that your approach has higher costs associated with it. Further, copying potentially GB/s of (randomly located) data around destroys the CPU cache, reduces peak memory bandwidth by inducing strobes, consumes bandwidth directly, wastes CPU cycles waiting for the random lookups; all to no good purpose. We should be reducing these costs, not introducing more. Let's say we live in the modern NUMA world, so we are going to do the following pin the group threads to CPU cores so we have fixed scope of allocation of different things, that why there is no significant bus pressure for copy among other things JVM/Cassandra does with memory (not even significant cache coherency traffic). bq. It is simply not clear, despite your assertion of clarity, how you would reclaim any freed memory without separate GC (what else is GC but this reclamation?), however you want to call it, when it will be interspersed with non-freed memory, nor how you would guard the non-atomic copying (ref-counting, OpOrder, Lock: what?). Without this information it is not clear to me that it would be any simpler either. The same way as jemalloc or any other allocator does it, it least that is not reinventing the wheel. bq. Pauseless operation, so improved predictability What do you mean by this, we still leave on the JVM, do we not? Also what would it do in the low memory situation? allocate from heap? wait? This is not pauseless operation. bq. Lock-freedom and low overhead, so we move closer to being able to answer queries directly from the messaging threads themselves, improving latency and throughput We won't be able to answer queries directly from the messaging threads for the number of reasons not even indirectly related to your approach, at least for not breaking SEDA, which also supposed to be a safe guide for over utilization. bq. An alternative approach needs, IMO, to demonstrate a clear superiority to the patch that is already available, especially when it will incur further work to produce. It is not clear to me that your solution is superior in any regard, nor any simpler. It also seems to be demonstrably less predictable and more costly, so I struggle to see how it could be considered preferable. Overall, I'm not questioning the idea of being able to track what goes where would be great, I'm questioning implementation and trade-offs comparing to other approaches. Partially Off Heap Memtables
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922861#comment-13922861 ] Benedict commented on CASSANDRA-6689: - bq. Before that is addressed, I'm -1 of this These are already addressed in CASSANDRA-6694. bq. Object overhead would stay inside ParNew bounds (for ( p999)) The more we rely on staying within ParNew, the more often we are going to exceed it; and reducing the number of ParNew runs is also a good thing. You said you have 300ms ParNew pauses, occuring every second? So reducing the max latency and total latency is surely a good thing? bq. as the main idea is to have those pools of a fixed size How does this work without knowing the maximum size of a result set? We can't have a client block forever because we didn't provide enough room in the pools. Potentially we could have it error, but this seems inelegant to me, when it can be avoided. It also seems a suboptimal way to introduce back pressure, since it only affects concurrent reads / large reads. We should raise a ticket specifically to address back pressure, IMO, and try to come up with a good all round solution to the problem. bq. Let's say we live in the modern NUMA world, so we are going to do the following pin the group threads to CPU cores so we have fixed scope of allocation of different things, that why there is no significant bus pressure for copy among other things JVM/Cassandra does with memory It would be great to be more NUMA aware, but this is not about traffic over the interconnect, but simply with the arrays/memory banks themselves, and doesn't address any of the other negative consequences. You'll struggle to get more than a few GB/s bandwidth out of a modern CPU given that we are copying object trees (even shallow ones - they're still randomly distributed), and we don't want to waste any of that if we can avoid it bq. What do you mean by this, we still leave on the JVM, do we not? Also what would it do in the low memory situation? allocate from heap? wait? This is not pauseless operation. I did not mean to imply pauseless globally, but the memory reclaim operations introduced here are pauseless, thus reducing pauses overall, as whenever we would have had a pause from ParNew/FullGC to reclaim, we would not here. bq. We won't be able to answer queries directly from the messaging threads for the number of reasons not even indirectly related to your approach, at least for not breaking SEDA, which also supposed to be a safe guide for over utilization. I'm not sure why you think this would be a bad thing. It would only help for CL=1, but we are often benchmarked using this, so it's an important thing to be fast on if possible, and there are definitely a number of our users who are okay with CL=1 for whom faster responses would be great. Faster query answering should reduce over-utilisation, assuming some back-pressure built in to MessagingService or the co-ordinator managing its outstanding proxied requests to ensure it isn't overwhelmed by the responses. bq. The same way as jemalloc or any other allocator does it, it least that is not reinventing the wheel. Do you mean you would use jemalloc for every allocation? In which case there are further costs incurred for crossing the JNA barrier so frequently, almost certainly outweighing any benefit to using jemalloc. Otherwise we would need to maintain free-lists ourselves, or perform compacting GC. Personally I think compacting GC is actually much simpler. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922983#comment-13922983 ] Pavel Yaskevich commented on CASSANDRA-6689: bq. These are already addressed in CASSANDRA-6694. Is there a branch/patch to see all of the changes involved? bq. The more we rely on staying within ParNew, the more often we are going to exceed it; and reducing the number of ParNew runs is also a good thing. You said you have 300ms ParNew pauses, occuring every second? So reducing the max latency and total latency is surely a good thing? I'm not trying to imply that we should rely on the ParNew, I'm just saying that all of the read/write requests are short lived enough to stay inside young generation region, even if we slip the effect are masked to all other long term allocations that we do which get promoted. bq. How does this work without knowing the maximum size of a result set? We can't have a client block forever because we didn't provide enough room in the pools. Potentially we could have it error, but this seems inelegant to me, when it can be avoided. It also seems a suboptimal way to introduce back pressure, since it only affects concurrent reads / large reads. We should raise a ticket specifically to address back pressure, IMO, and try to come up with a good all round solution to the problem. Let the users specify directly or if not specified just take a guess based on total system memory, plus we can add an option to extend in the run time, of any problem that uses database there is a capacity planing stage and use-case spec or at least experimentation which would allow to size pools correctly. bq. I did not mean to imply pauseless globally, but the memory reclaim operations introduced here are pauseless, thus reducing pauses overall, as whenever we would have had a pause from ParNew/FullGC to reclaim, we would not here. Sorry but I still don't get it, do you mean lock-free/non-blocking or that it does no syscalls or something similar? But that doesn't matter for pauses as much as allocation throughput and fragmentation of Java GC. bq. I'm not sure why you think this would be a bad thing. It would only help for CL=1, but we are often benchmarked using this, so it's an important thing to be fast on if possible, and there are definitely a number of our users who are okay with CL=1 for whom faster responses would be great. Faster query answering should reduce over-utilisation, assuming some back-pressure built in to MessagingService or the co-ordinator managing its outstanding proxied requests to ensure it isn't overwhelmed by the responses. The fact is that we have SEDA at least as a first line of defense for over-utilization, so local reads a scheduled directly a different stage, we shouldn't be trying to do anything directly in the messaging stage, it adds another complications not related to this very ticket. bq. Do you mean you would use jemalloc for every allocation? In which case there are further costs incurred for crossing the JNA barrier so frequently, almost certainly outweighing any benefit to using jemalloc. Otherwise we would need to maintain free-lists ourselves, or perform compacting GC. Personally I think compacting GC is actually much simpler. As I mentioned, there is a jemalloc implementation in Netty project already which is pure Java, so we at least should consider it before trying to re-invent. bq. It would be great to be more NUMA aware, but this is not about traffic over the interconnect, but simply with the arrays/memory banks themselves, and doesn't address any of the other negative consequences. You'll struggle to get more than a few GB/s bandwidth out of a modern CPU given that we are copying object trees (even shallow ones - they're still randomly distributed), and we don't want to waste any of that if we can avoid it I'm still not sure how worse it would make the things, Java is the worst of cache locality with it's object placement anyway but we are not going to be copying deep trees. Let me outline the steps that I want to see to be taken to make incremental, which is how we usually do things for Cassandra project: # code an off-heap allocator or use existing one like on of the ByteBufAlloc implementations (evaluate new vs. existing); # Change memtables to use allocator for step #1 and copy data to heap buffers when it's read from memtable so it's easy to track the life time of buffers; # Do an extensive testing to check how horrible the copy really is for performance, find ways to optimize; # If everything is bad, switch from copy to reference tracking (in all of the commands, native protocol etc.); # Do an extensive testing to check if it improves the situation; # Change serialization/deserialization to use new allocator (pooled buffer instead of always allocating on heap); # Same as
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923029#comment-13923029 ] Benedict commented on CASSANDRA-6689: - bq. Is there a branch/patch to see all of the changes involved? Yes, [offheap2c+6781|https://github.com/belliottsmith/cassandra/tree/offheap2c+6781] and [offheap2c|https://github.com/belliottsmith/cassandra/tree/offheap2c+6781] The former includes performance enhancements for DirectByteBuffer use, from a separate ticket. bq. I'm just saying that all of the read/write requests are short lived enough to stay inside young generation region Well, yes and no. We have to wait until any client finishes processing the data, so there's no absolute guarantee they'll not survive. But either way, ParNew pauses are almost as bad as full GC, only they happen much more often. 300ms pauses are not a good thing, and if we can then reduce the size of YG so that when these pauses do happen they're shorter (or, say, maybe even use G1GC) then that's even better. bq. Sorry but I still don't get it, do you mean lock-free/non-blocking or that it does no syscalls or something similar? But that doesn't matter for pauses as much as allocation throughput and fragmentation of Java GC. I mean no worker threads have to stop in order for memory to be reclaimed. There is no STW for reclaiming memory. This is unrelated to the lock freedom. bq. Java is the worst of cache locality with it's object placement anyway Well, exactly. That's my point here :-) bq. it adds another complications not related to this very ticket. I have no intention of doing it in this ticket, just indicating it as a very useful improvement. bq. which is how we usually do things for Cassandra project It looks like you really have two steps, if I boil it down: 1) implement copying approach; 2) if slow, implement this approach? The issue with this, really, though, is that if we simply allocate ByteBuffers off-heap like we have in this ticket, we get none of the benefit of increased memtable capacity, since the on-heap overheads are still huge. Since that's one of the main goals here, it seems problematic to lose out on it - we don't need to test to find out what the result would be. This ticket was supposed to only be a stepping stone. Possibly we could scale CASSANDRA-6694 back somewhat, removing any support for RefAction.refer(), and always performing a copy onto heap from the NativeCell implementations, and spending some time ripping out any of the GC or any of the code at Memtable discard for coping with RefAction.refer(). But honestly this seems like a waste of effort to me, as the majority of the code would remain, we'll just not have as good a final solution. But it could be done if that is the community preference. We could probably split it up into further commits, but each commit adds the potential for more errors in my book, when we have a good solution that is ready to go. Maximally separated timeline for separated commits of 6694 would be: # introduce concurrency primitives # introduce .memory and .data refactor, but only for ByteBuffer allocators, and RefAction, but only allocateOnHeap # introduce all .data.Native* implementations and a cut down native allocator, using OpOrder to guard copying like we currently do for referencing (we need to use something, and it is simpler than ref counting or anything else) # introduce RefAction.refer(), GC, etc. (i.e. final patch) I would rather not split it up as, as I say, each new patch is an opportunity to mess up, but it could be done. We can do performance testing to our hearts content at each stage, although personally I think such testing would not be sufficient to demonstrate no benefit to the current approach, as even with little benefit seen it presupposes no future performance benefit is capped by the slower solution. So I would push for the final patch anyway. That said, I would be surprised if we did not see any improvement by comparison. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923147#comment-13923147 ] Pavel Yaskevich commented on CASSANDRA-6689: Doing everything together is even bigger opportunity to mess up. Let me clarify my point, hopefully you will hear it this time as you plead for committers to decide so I'm representing. From my side nothing from this ticket would go in until we separate all of the work into multiple committable stages, namely (but not necessarily exactly like that): # Make use off-heap memory inside of Memtables, which delivers increased capacity and less frequent/painful CMS (commit #1): a. Determine if we actually need to re-implement custom allocator or can go with existing one e.g. netty's implementation b. plug allocator from #1 into Memtable for memory usage tracking, do all necessary bookkeeping to release memory back to the pool once it's no longer needed (overwrite, flush etc.), make Memtable memory life time limited by copying data from Memtable on the read path. # Optimizations after #1 which determine how big of impact the copying actually makes comparing to original situation on different workloads and how good is ref tracking comparing to that (commit #2) Delivers - memory reference tracking (RefAction, GC + what ever is needed) (if needed) # Further enhancements to allocators/ref accounting (commit #3) This way we can reuse already existing RefAction and related code that we already have, commit #1 doesn't require any substantial changes in the way we treat memtables and adds decoupled allocator framework, commit #2 plugs reference tracking into all previous changes, commit #3 further improves things if needed. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923157#comment-13923157 ] Benedict commented on CASSANDRA-6689: - I think my suggested split meets your requirement, but I'm not absolutely sure. My 1-2 are refactors to bring abstractions closer; 3 is your step 1.b; 4+5 is your step 2. The only difference is I assume we want our own allocator, since we won't be able to use Netty's for steps 4+5, and that we allocate Native* within the memtable to save substantially more space than allocating ByteBuffer would. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923186#comment-13923186 ] Benedict commented on CASSANDRA-6689: - Whilst it would be nice to depend on an external project here, it doesn't resolve the problem that it's heavier-weight on heap than what we can achieve ourselves is. In CASSANDRA-6694 we really try to push the heap overhead to its minimum, and every byte counts; also, we need an extra field we can manipulate for maintaining the allocation list for GC, when we come to introduce it. So we really do need control of it, otherwise we'll only get us as far as step 1. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923188#comment-13923188 ] Pavel Yaskevich commented on CASSANDRA-6689: Can you elaborate why is it heavier-weight on heap? Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923201#comment-13923201 ] Benedict commented on CASSANDRA-6689: - The smallest direct byte buffer from Netty is on the order of 64-bytes minimum, whereas we manage 24-bytes; plus for maintaining the allocation list we then need to have a separate list, rather than chaining the allocations, so at least another 4-bytes overhead (more, but in best case it tends to 4). [https://github.com/netty/netty/blob/master/buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java] [https://github.com/netty/netty/blob/master/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeDirectByteBuf.java] Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923561#comment-13923561 ] Jonathan Ellis commented on CASSANDRA-6689: --- It seems like we have a pretty broad consensus (Pavel + Sylvain + Aleksey; Marcus?) that we should break this up more. How about something like this? # Trading some performance for much bigger memtables is useful, so introduce off-heap memtables as an option, default to off. No GC, just free after flush. One native Buffer each for name and value. Copy onto heap when reading. # Add NativeCell a la 6694 to reduce at-rest on-heap overhead. Still copying into an on-heap Cell for reads. Still no GC. # Add the simplest possible GC to avoid copying into on-heap Cells. If we still have to flush for overwrite-heavy workloads that's okay. # Full GC that doesn't need to flush unless we really have used up our memory in live Cells. # Get rid of on-heap memtables. I'm waving my hands a bit b/c I'm not sure what 3 looks like but I think/hope we can probably come up with something simpler than what we have today as an intermediate step. And maybe 5 comes after 3. But I think 1 and 2 make sense for 2.1b2 with the others for 3.0. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923642#comment-13923642 ] Marcus Eriksson commented on CASSANDRA-6689: Sounds good, though i would have been good with shipping the whole patch but marking the off-heap parts as experimental and defaulting to HeapSlabPool Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920720#comment-13920720 ] Benedict commented on CASSANDRA-6689: - bq. sort of RCU (i'm looking at you OpOrder) What do you mean here? If you mean read-copy-update, OpOrder is nothing like this. bq. I'm not sure what is to retain here if we do that copy when we send to the wire Ultimately, doing this copying before sending to the wire is something I would like to avoid. Using the RefAction.allocateOnHeap() on top of this copying sees wire transfer speeds for thrift drop by about 10% in my fairly rough-and-ready benchmarks, so obviously copying has a cost. Possibly this cost is due to unavoidably copying data you don't necessarily want to serialise, but it seems to be there. Ultimately if we want to get in-memory read operations to 10x their current performance, we can't go cutting any corners. bq. introducing separate gc I've stated clearly what this introduces as a benefit: overwrite workloads no longer cause excessive flushes bq. things but as we have a fixed number of threads it is going to work out the same way as for buffering open files in the steady system state Your next sentence states how this is a large cause of memory consumption, so surely we should be using that memory if possible for other uses (returning it to the buffer cache, or using it internally for more caching)? bq. Temporary memory allocated by readers is exactly what we should be managing at the first place because they allocate the most and it always the biggest concern for us I agree we should be moving to managing this as well, however I disagree about how we should be managing it. In the medium term we should be bringing the buffer cache in process, so that we can answer some queries without handing off to the mutation stage (anything known to be non-blocking and fast should be answered immediately by the thread that processed the connection), at which point we will benefit from shared use of the memory pool, and concrete control over how much memory readers are using, and zero-copy reads from the buffer cache. I hope we may be able to do this for 3.0. bq. do a simple memcpy test and see how much mb/s can you get from copying from one pre-allocated pool to another Are you performing a full object tree copy, and doing this with a running system to see how it affects the performance of other system components? If not, it doesn't seem to be a useful comparison. Note that this will still create a tremendous amount of heap churn, as most of the memory used by objects right now is on-heap. So copying the records is almost certainly no better for young gen pressure than what we currently do - in fact, *it probably makes the situation worse*. bq. it's not the memtable which creates the most of the noise and memory presure in the system (even tho it uses big chunk of heap) It may not be causing the young gen pressure you're seeing, but it certainly offers some benefit here by keeping more rows in memory so recent queries are more likely to be answered with zero allocation, so reducing young gen pressure; it is also a foundation for improving the row cache and introducing a shared page cache which could bring us closer to zero allocation reads. It's also not clear to me how you would be managing the reclaim of the off-heap allocations without OpOrder, or do you mean to only use off-heap buffers for readers, or to ref-count any memory as you're reading it? Not using off-heap memory for the memtables would negate the main original point of this ticket: to support larger memtables, thus reducing write amplification. Ref-counting incurs overhead linear to the size of the result set, much like copying, and is also fiddly to get right (not convinced it's cleaner or neater), whereas OpOrder incurs overhead proportional to the number of times you reclaim. So if you're using OpOrder, all you're really talking about is a new RefAction: copyToAllocator() or something. So it doesn't notably reduce complexity, it just reduces the quality of the end result. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920721#comment-13920721 ] Benedict commented on CASSANDRA-6689: - bq. but the reads and internode communication (especially the latter). Also, I'd love to see some evidence for this (particularly the latter). I'm not disputing it, just would like to see what caused you to reach these conclusions. These definitely warrant separate tickets IMO, but if you have evidence for it, it would help direct any work. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13921110#comment-13921110 ] Pavel Yaskevich commented on CASSANDRA-6689: bq. I've stated clearly what this introduces as a benefit: overwrite workloads no longer cause excessive flushes If you do a copy before of the memtable buffer, you can clearly put it back to the allocator once it's overwritten or becomes otherwise useless, in the process of merging columns with previous row contents. bq. Your next sentence states how this is a large cause of memory consumption, so surely we should be using that memory if possible for other uses (returning it to the buffer cache, or using it internally for more caching)? It doesn't state that is a *large cause of memory consumption*, it states that it has additional cost but it the steady state it don't be allocating over the limit because of the properties of the system that we have, namely the fixed number of threads. bq. Are you performing a full object tree copy, and doing this with a running system to see how it affects the performance of other system components? If not, it doesn't seem to be a useful comparison. Note that this will still create a tremendous amount of heap churn, as most of the memory used by objects right now is on-heap. So copying the records is almost certainly no better for young gen pressure than what we currently do - in fact, it probably makes the situation worse. Do you mean this? Let's say we copy a Cell (or Column object), which is 1 level deep so just allocate additional space for the object headers and do a copy, most of the work would be spend by doing a copy of the data (name/value) anyway, so as we want to live inside of ParNew, see how many such allocations you will be able to do in e.g. 1 second then wipe the whole thing and do it again. We are doing mlockall too which should make that even faster as we are sure that heap is pre-faulted already. bq. It may not be causing the young gen pressure you're seeing, but it certainly offers some benefit here by keeping more rows in memory so recent queries are more likely to be answered with zero allocation, so reducing young gen pressure; it is also a foundation for improving the row cache and introducing a shared page cache which could bring us closer to zero allocation reads. _And so on_ I'm not sure how this would help in the case of row cache, once reference is added to the row cache it means that memtable would hang in there until that row is purged, so if there is a long lived row (write once, read multiple times) in each of the regions (and we reclaim based on regions) would that keep memtable around longer than expected? bq. It's also not clear to me how you would be managing the reclaim of the off-heap allocations without OpOrder, or do you mean to only use off-heap buffers for readers, or to ref-count any memory as you're reading it? Not using off-heap memory for the memtables would negate the main original point of this ticket: to support larger memtables, thus reducing write amplification. Ref-counting incurs overhead linear to the size of the result set, much like copying, and is also fiddly to get right (not convinced it's cleaner or neater), whereas OpOrder incurs overhead proportional to the number of times you reclaim. So if you're using OpOrder, all you're really talking about is a new RefAction: copyToAllocator() or something. So it doesn't notably reduce complexity, it just reduces the quality of the end result. In terms of memory usage copy adds additional linear cost yes but at the same time it makes the system behavior more controllable/predictable which is what ops usually care about where, even on the artificial stress test, there seems to be a low once off-heap feature is enabled which is no surprise once you look at how much complexity does it actually add. bq. Also, I'd love to see some evidence for this (particularly the latter). I'm not disputing it, just would like to see what caused you to reach these conclusions. These definitely warrant separate tickets IMO, but if you have evidence for it, it would help direct any work. Well, it seems like you never operated a real Cassandra cluster, did you? All of the problems that I have listed here are well known, you can even simulate this with docker VMs and making internal network gradually slower, there is no back pressure mechanism built-in so right now Cassandra would accept a bunch or operations on the normal speed (if the outgoing link is physically different than internal) but suddenly would just stop accepting anything and fail internally because of GC storm caused by all of the internode buffers hanging around. Partially Off Heap Memtables Key: CASSANDRA-6689 URL:
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13919225#comment-13919225 ] Benedict commented on CASSANDRA-6689: - bq. What I want is to see more control over how we deal with cleaner threads, as the main point to be able to run with smaller heaps, and thread stacks could consume ~300K we are risking to OOM in certain situations because we don't control whole thread life time, that would also help for people in environments guided by ulimit or similar. Just for clarity, it's the _collector_ threads, not the _cleaner_ threads you're worried about, right? Because there is only ever one cleaner thread. Like I say, this is mostly trying to help with an anti-pattern, so I'm not sure it's worth addressing too aggressively in this ticket. In real terms you'll struggle to spin up very many threads at all, but let's put an arbitrary limit of 100 threads on the pool for now. I don't think this is worth offering configuration options to the user here. bq. Please do think about it, we need to try to reduce the complexity of this to minimize bus factor As I said, I (still) can't see a good way to use this information to simplify matters usefully. If we flattened Pool, AllocatorGroup and Allocator all into the one object, and we had a per-CF memory limit, we would be able to simplify all of the SubPool and SubAllocator memory management considerably, but then we have the issue that there's no global management of memory resources, nor shared re-use. I don't think this would simplify the actual recycling of memory regions that is introduced in this ticket at all though, but it's possible I'm missing something. Suggestions always welcome :-) bq. Exactly, maybe it's time to re-evaluate trade-off between copying data to per-allocated regions with limited ttl vs. tracking data all around the place. This is the opposite suggestion, right? :-) It is definitely something worth exploring in future. bq. Have you had a chance to do more on vs. off heap performance testing? I'm not sure Marcus was performance testing. Personally I'm most interested in correctness right now: performance should follow (in particular, slow flushing is certainly a bug which can be fixed). Certainly for overwrite workloads I've demonstrated the reduced flushing on my box, but further testing of the impact is likely to be a long drawn out process, and probably beyond scope of this ticket IMO. Performance testing scope might include: - Establish how much heap limit reduction this permits - Establish a severe GC pause inducing workload to see if this mitigates - Establish if G1GC is made viable with this, and if not, what remaining impacts there are to adoption that might be lifted - Establish how much improvement to overwrite workloads (or mixed workloads, where reduced overwrite workload memory consumption = reduced non-overwrite workload flushing) - Establish new reasonable upper memory limit for heap and memtable (both independently), for different workloads. e.g. large data payload workloads may now practicably support much larger memtables and dramatically reduced flushing - Establish impact on upper *data* limit for different workloads, with same memory limit as olde, and the impact on size-of-flush and concomitant reduction in write amplification Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13919511#comment-13919511 ] Marcus Eriksson commented on CASSANDRA-6689: bq. As I said, I (still) can't see a good way to use this information to simplify matters usefully. +1, my initial review-comment was probably badly phrased, the patch adds a complicated feature and it will take effort to learn these things for anyone working with the Cassandra code base, but I think it will be worth it in the end bq. I'm not sure Marcus was performance testing. no, not at all, just ran stress a couple of times to get ballpark numbers Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920328#comment-13920328 ] Pavel Yaskevich commented on CASSANDRA-6689: So the only question that I still have - what is the main goal of the ticket? Is it to move to zero copy for memtables or minimize number of allocations on java heap? If it's the latter then I don't see why we can't have separate allocator for memtable and use a separate allocator for operations which would do a copy from memtable memory, that way we don't have to track as such as we do right now and allocator interface is going to be better encapsulated (basically expose alloc/release), I'm not sure if we need sub pools or groups of allocators in there, I really like how jemalloc and others do it also Netty already has implementation of jemalloc (ByteBufAlloc) so maybe we can reuse that?... Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920357#comment-13920357 ] Benedict commented on CASSANDRA-6689: - bq. So the only question that I still have - what is the main goal of the ticket? Is it to move to zero copy for memtables or minimize number of allocations on java heap? Zero copy would certainly be nice to retain. Otherwise why not just ref-count the memory? bq. allocator interface is going to be better encapsulated (basically expose alloc/release) I don't follow how this would improve encapsulation, sorry. Could you sketch an example? It sounds to me like this actually would create more complexity outside of the main body of code. It might reduce complexity a little where the fun stuff happens, but is it all really so difficult to understand that we want to start opting for less good solutions that require more work? Also makes it difficult to impose tight caps on memory used, as we can't practicably impose/manage temporary memory allocated by readers. We already have this problem for disk reads, but no point opening up further doors for OOM. bq. I'm not sure if we need sub pools or groups of allocators in there, I really like how jemalloc and others do it also Netty already has implementation of jemalloc (ByteBufAlloc) so maybe we can reuse that?... I'm unclear how this addresses the concerns of resource management. The ByteBufAllocator appears to simply allocate byte buffers, which by itself is a pretty simple problem. jemalloc is also just a low level replacement to malloc, so I'm not sure where it enters the discussion? Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920405#comment-13920405 ] Pavel Yaskevich commented on CASSANDRA-6689: bq. Zero copy would certainly be nice to retain. Otherwise why not just ref-count the memory? I'm not sure what is to retain here if we do that copy when we send to the wire. Wouldn't we better of with ref count and copy but having separate allocator for temporary things (e.g. write/read buffers) and for more long term (e.g. memtable, compression buffers)? So can you please answer the question - what is the biggest advantage of tracking everything, introducing separate gc, sort of RCU (i'm looking at you OpOrder) vs. mem copy + different allocators for different things if there is a fixed size pre-allocated buffer pool off heap (we can also do COW for some of the chunks)? Copy is *not* be a big problem (regarding throughput/latency) comparing to other operations (do a simple memcpy test and see how much mb/s can you get from copying from one pre-allocated pool to another), there is a trade-off of course as we are going to spend more memory to store temporary things but as we have a fixed number of threads it is going to work out the same way as for buffering open files in the steady system state... Temporary memory allocated by readers is *exactly* what we should be managing at the first place because they allocate the most and it always the biggest concern for us (ParNew pauses are reaching 300 ms in 1 sec. intervals), that's why I'm saying let's use low level allocator for things like reads with it's own arena and chunk sizes, as we originally wanted and have a separate low level allocator for the memtable, preallocate at the startup, return buffers to the pool when we serialize for the wire and go from there with global thresholds etc. After all that said, I would suggest that we do a separate low level allocator in this ticket (the same way as jemalloc) and plug it into the memtable and do a copy on read path, once that step is done we can plug-in allocator for all of the serialization/deserialization logic that we have when it's going from/to the wire and the last step would be to use allocator for all of the sstable read operations. All this might seem irrelevant for some people but I can say that after fire-fighting Cassandra for some time for different use cases, it's not the memtable which creates the most of the noise and memory presure in the system (even tho it uses big chunk of heap) but the reads and internode communication (especially the latter). Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918701#comment-13918701 ] Pavel Yaskevich commented on CASSANDRA-6689: I read through most of the code in offheap1b skipping queue implementations add there and I have couple of questions: I see this pattern in the code (e.g. {RangeSlice, Read}VerbHandler.java) where Referrer is allocated try { operation(); referrer = null } finally { referrer.setDone() if (referrer != null); } where documentation says that that setDone() should be called when we are done with the referrer, can you elaborate why do we need to explicitly set it to null? Pool.java - can we schedule cleaners to a static thread pool (or something similar) instead of creating a thread per cleaner? With would add more control in the situation when we flush memtables in short intervals. Related question, why can't we have per-cf (or global) pool instead of creating it per memtable, might be possible to remove allocator groups if we do that? Also I just want to throw an idea here maybe there is something useful in it - as we have fixed number of stages and threads in them and don't normally share much between them, maybe it would be possible to use that as an advantage for allocator gc? What I mean is, it might be possible to track life-time of the allocation not by how buffer was allocated but where it was allocated, more like what RCU... Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918726#comment-13918726 ] Benedict commented on CASSANDRA-6689: - bq. can you elaborate why do we need to explicitly set it to null? I do this in places where I'm handing off control of the referrer to another thread/context, so we only call setDone() in the case of an error before we did this. Admittedly, looking at these places now, it looks like replacing the finally block with a catch (Throwable t) block would be fine, and given we can do untyped rethrow there isn't anything lost by doing this. Probably an old habit from Java 6 to do it the way I did. bq. With would add more control in the situation when we flush memtables in short intervals. If I understand you correctly, you mean to deal with the situation where we want to flush memtables quick enough that the single cleaner thread may become the bottleneck? In which case, I don't think this should ever be a problem: the work done by the cleaner itself is very minimal unless it's got some GC to do, in which case it is potentially non-trivial, but still unlikely to be a bottleneck. There might be a case for handing off the GC candidate selection to another thread, except that we expect it to certainly be quicker than any flush, and we want to wait for it to complete before deciding if flushing is actually worthwhile, so I'm not sure it buys us much. Maybe we could do this and continue to check the memory state isn't worsening, and if it is just force a flush without waiting for the GC, but it probably is too unlikely to be of benefit to be worth the extra complexity. That is, unless you meant some other benefit I missed? bq. Related question, why can't we have per-cf (or global) pool instead of creating it per memtable, might be possible to remove allocator groups if we do that? The pool _is_ global; a per-cf pool would have the issue of poor sharing of resources, though, no? I would probably like to see us move to allocators surviving across memtable replacement, so that we could then flatten allocator group with allocator, but I'd prefer to save that for a later date. bq. Also I just want to throw an idea here maybe there is something useful in it - as we have fixed number of stages and threads in them and don't normally share much between them, maybe it would be possible to use that as an advantage for allocator gc? What I mean is, it might be possible to track life-time of the allocation not by how buffer was allocated but where it was allocated, more like what RCU... Not really sure what you're suggesting here, could you elaborate? Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918785#comment-13918785 ] Pavel Yaskevich commented on CASSANDRA-6689: bq. If I understand you correctly, you mean to deal with the situation where we want to flush memtables quick enough that the single cleaner thread may become the bottleneck? In which case, I don't think this should ever be a problem: the work done by the cleaner itself is very minimal unless it's got some GC to do, in which case it is potentially non-trivial, but still unlikely to be a bottleneck. There might be a case for handing off the GC candidate selection to another thread, except that we expect it to certainly be quicker than any flush, and we want to wait for it to complete before deciding if flushing is actually worthwhile, so I'm not sure it buys us much. Maybe we could do this and continue to check the memory state isn't worsening, and if it is just force a flush without waiting for the GC, but it probably is too unlikely to be of benefit to be worth the extra complexity. That is, unless you meant some other benefit I missed? I can image the situation when we frequently switch memtables which, in current code, starts a new thread, so I wonder how quickly we would be able to see a slowdown affect caused by that + potentially OOM on stack space... bq. The pool is global; a per-cf pool would have the issue of poor sharing of resources, though, no? I would probably like to see us move to allocators surviving across memtable replacement, so that we could then flatten allocator group with allocator, but I'd prefer to save that for a later date. I'm thinking maybe it would lessen internal complexity if we do (sort-of) arena allocation inside instead of having groups and sub pools? bq. Not really sure what you're suggesting here, could you elaborate? It's just an idea so people can just ignore it but what I meant there is maybe it would be a better first step for us to consider making all of the operations that we do more self-contained so we can track exactly where it started and ends (by operation I mean all types of reads, writes), which could potentially make allocator job easier as we would be able to container per operation memory... Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918822#comment-13918822 ] Benedict commented on CASSANDRA-6689: - bq. I can image the situation when we frequently switch memtables which, in current code, starts a new thread, so I wonder how quickly we would be able to see a slowdown affect caused by that + potentially OOM on stack space... It only starts a new thread if there is GC work to be done, room to do it in, and the static pool doesn't have any threads that haven't timed out. The cleaner thread itself is started once and remains forever. If you mean you're worried about flushing so many memtables that all have GC work possible on them and that this would spam the collector thread pool, I guess I agree it's a theoretical possibility, but it would have to be pretty extreme. We can spin a thousand threads and not break a sweat, but even with tens of thousands of memtables we'll hit our memory limit and stop being able to do any more GC before we cause major problems. That said, we could safely impose a cap on the pool if we wanted. Set it largeish so it isn't a bottleneck, but prevent this kind of problem. bq. I'm thinking maybe it would lessen internal complexity if we do (sort-of) arena allocation inside instead of having groups and sub pools? Probably not appreciably. Most of complexity is likely to remain, as it's mostly about dealing with the interactions of writers with readers, and those will be mostly the same regardless, since readers can touch anything at any time. The allocator group does currently allow for some slight _optimisations_, but I don't immediately see a way that merging with pool would reduce _complexity_. I will give it some thought though. On this vein, I would like to see thread local allocation buffers, but again this is an optimisation rather than a complexity reducer. bq. It's just an idea so people can just ignore it but what I meant there is maybe it would be a better first step for us to consider making all of the operations that we do more self-contained so we can track exactly where it started and ends (by operation I mean all types of reads, writes), which could potentially make allocator job easier as we would be able to container per operation memory... We're currently dealing with allocations that necessarily outlive the duration of the operation, so I'm not sure how well that would apply here, but possibly you have some specific examples? It may be that we decide to use this code for creating the mutations from thrift/native/MS etc., though, in which case it would be very helpful. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918825#comment-13918825 ] Benedict commented on CASSANDRA-6689: - bq. Almost forgot, I have some minor changes to the code: removed couple of used methods, added couple of RefAction.impossible() instead of null, changed Memtable code so it avoids double cast, and very minor typo. Thanks. I'll incorporate these into CASSANDRA-6694 instead if that's okay with you? Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918866#comment-13918866 ] Pavel Yaskevich commented on CASSANDRA-6689: bq. It only starts a new thread if there is GC work to be done, room to do it in, and the static pool doesn't have any threads that haven't timed out. The cleaner thread itself is started once and remains forever. If you mean you're worried about flushing so many memtables that all have GC work possible on them and that this would spam the collector thread pool, I guess I agree it's a theoretical possibility, but it would have to be pretty extreme. We can spin a thousand threads and not break a sweat, but even with tens of thousands of memtables we'll hit our memory limit and stop being able to do any more GC before we cause major problems. That said, we could safely impose a cap on the pool if we wanted. Set it largeish so it isn't a bottleneck, but prevent this kind of problem. What I want is to see more control over how we deal with cleaner threads, as the main point to be able to run with smaller heaps, and thread stacks could consume ~300K we are risking to OOM in certain situations because we don't control whole thread life time, that would also help for people in environments guided by ulimit or similar. bq. I will give it some thought though. On this vein, I would like to see thread local allocation buffers, but again this is an optimisation rather than a complexity reducer. Please do think about it, we need to try to reduce the complexity of this to minimize bus factor :) bq. We're currently dealing with allocations that necessarily outlive the duration of the operation, so I'm not sure how well that would apply here, but possibly you have some specific examples? It may be that we decide to use this code for creating the mutations from thrift/native/MS etc., though, in which case it would be very helpful. Exactly, maybe it's time to re-evaluate trade-off between copying data to per-allocated regions with limited ttl vs. tracking data all around the place. bq. Thanks. I'll incorporate these into CASSANDRA-6694 instead if that's okay with you? Sure. [~krummas] Have you had a chance to do more on vs. off heap performance testing? Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Attachments: CASSANDRA-6689-small-changes.patch Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13915840#comment-13915840 ] Sylvain Lebresne commented on CASSANDRA-6689: - 2 remarks that I remarked while skimming over the patch (those are more codestyle than anything but well): * there's tons of method whose name starts with an underscore. I'd really rather avoid that since it's inconsistent with the code base (and given the time it took us a few years back to remove all underscores to get the code consistent, I'd rather not reintroduce some now). * why changing all the ByteBuffer.hasArray() calls to !ByteBuffer.isDirect()? In almost all cases, hasArray() is called to guard some latter calls to array() and/or arrayOffset() and the [ByteBuffer javadoc|http://docs.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html#hasArray()] is pretty explicit on the fact that hasArray() is *the* correct method to call in that case. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13915948#comment-13915948 ] Benedict commented on CASSANDRA-6689: - bq. there's tons of method whose name starts with an underscore. I'd really rather avoid that since it's inconsistent with the code base (and given the time it took us a few years back to remove all underscores to get the code consistent, I'd rather not reintroduce some now). These are almost all within one class: AbstractMemory, also a couple in NativeAllocation. The reason for this is that these are the base classes for building off-heap objects, so we need methods for the implementation of the interfaces these objects will represent, but these methods absolutely should not escape or interfere with the abstractions we are implementing. As an example, we need a size() method in CellName, and an internal _size() method in NativeAllocation. They mean two completely different things, and one of them is only interesting to the internals of any native implementation. I want to avoid any such clash, or any overhead in establishing that there is no such clash, and so underscores seem the best solution. We also need to provide lots of internal methods for getting/putting data/fields from the off-heap region the object is made up of, and I don't want any confusion about what methods mess with this, and which methods are actual functionality. There are a lot of these methods, and putting internal as a prefix or something else doesn't seem any better to me. bq. why changing all the ByteBuffer.hasArray() calls to !ByteBuffer.isDirect()? In almost all cases, hasArray() is called to guard some latter calls to array() and/or arrayOffset() and the ByteBuffer javadoc is pretty explicit on the fact that hasArray() is the correct method to call in that case. This is a valid criticism, and in fact I should roll back this change for this patch. It was necessary for CASSANDRA-6689 as we were abusing ByteBuffer, and breaking this contract. This is no longer the case, now that we use NativeAllocation instead. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13916011#comment-13916011 ] Sylvain Lebresne commented on CASSANDRA-6689: - bq. There are a lot of these methods, and putting internal as a prefix or something else doesn't seem any better to me. Well, it's more consistent with the code style, and consistency is better. It's imo also more explicit since you had to use the word internal to describe what the underscore were standing for in your comment. And if size() and _size() mean two completely different things, I think having more than one character separating them is not a bad thing. But honestly the latter points are almost irrelevant, code base style consistency is (to me at least). Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13916020#comment-13916020 ] Benedict commented on CASSANDRA-6689: - bq. code base style consistency is (to me at least). Well, readability of the functional methods is important to me, so having dozens of internalGetLongVolatile or internalSize calls around the place seems an unnecessary price for such an isolated contradiction to the code style, but that's just my take. It seems pretty clear that all of the _ methods are internal, to me. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13916028#comment-13916028 ] Sylvain Lebresne commented on CASSANDRA-6689: - I give up. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13916063#comment-13916063 ] Jonathan Ellis commented on CASSANDRA-6689: --- bq. it's more consistent with the code style, and consistency is better. It's imo also more explicit since you had to use the word internal to describe what the underscore were standing for in your comment +1 Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13916071#comment-13916071 ] Benedict commented on CASSANDRA-6689: - I should note these underscore concerns all apply to CASSANDRA-6694, so I will apply the change there, once we have any other concerns here cleared up. As merging these changes with those in CASSANDRA-6694 has been a pain already, and I don't want to make that worse. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13916545#comment-13916545 ] Pavel Yaskevich commented on CASSANDRA-6689: So [~benedict] what's the latest branch, is it offheap1b? On the side note, +1 with [~slebresne], let's incapsulate and name everything properly, I worked on renaming here already and don't want to history to repeat itself. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13916720#comment-13916720 ] Benedict commented on CASSANDRA-6689: - [~xedin] the latest latest is offheap2c, which I have just uploaded, and is for ticket 6694, building on this ticket. I think it's the better one to review, personally, but if you want to start with this ticket first offheap1b is the latest, yes. To possibly aid the reviewers to understand the decisions taken here, I will outline briefly the main areas of code that have been changed, and why the approach was selected. I will leave details about _how_ it works to be addressed by the documentation in the codebase, and won't cover if or why we want off-heap memtables, only the reasons for implementing them in this way. First, though, I'd like to potentially put some words in Marcus' mouth, and suggest that when he says it is complicated/involved he's mostly talking about the complexity in reasoning about thread ordering/behaviour, as we do not use locks anywhere, and we endeavour to communicate between threads as little as possible, and we use the JMM to its fullest (and in this ticket only, abuse ByteBuffer badly, and depend sometimes on x86 ordering guarantees rather than the JMM). As such it is definitely a complex patch, but actually almost all of the complexity is in OffHeapCleaner (NativeCleaner in 6694) and OffHeapAllocator (NativeAllocator), with little extra snippets in the concurrency utilities, in Referrer/s, and also in the work introduced in CASSANDRA-5549 which is probably new to everyone reviewing. This is not complexity in the normal sense of code sprawl (although there are lots of places touched by this, mostly these are simple modifications), but in that it requires sitting and thinking hard about the model and the object lifecycles as documented. This is why I have spent a great deal of time specifying the assumptions and safety concerns at any of the danger points. This kind of complexity is difficult to justify in a simple paragraph or three, and either stands or falls by the code itself and its design decisions; I don't think I can summarise it here any better than the code comments, and I have no alternative implementation to compare and contrast with. In this patch (ignoring CASSANDRA-6694 for now) there are three main areas of changes: 1) New concurrency utilities 2) OffHeapAllocator et al 3) RefAction changes across the codebase As to 1, these simply help to make the implementation easier and safer. NBQ adds a number of useful behaviours, such as CAS-like modification, safe iterator removal, multiple views on the same queue. These facilities improve clarity and obviousness in a number of places. WaitQueue and OpOrder are also improved from those introduced in CASSANDRA-5549. OpOrder prevents garbage accumulation from old Group objects being kept floating around in Referrer instances (and makes it faster); WaitQueue reduces the cost of wake-ups by eliminating a race through use of NBQ. All of the concurrency utilities could be split off into a separate patch if we wanted. As to 2 and 3, as a quick starting point which I will refer back to, I will paste a quick bit about the memory lifecycle from the code: {code} * 1) Initially memory is managed like malloc/free: it is considered referenced at least until the allocation is *matched by a corresponding free() * 2) The memory is then protected by the associated OpOrder(s); an allocation is only available for collection *once all operations started prior to the free() have completed. These are effectively short lived transactions that *should always complete in a timely fashion (never block or wait indefinitely) * 3) During this time any operation protected by the read/write OpOrder may optionally 'ref' {@link Referrer} an object, *or objects, that each reference (in the normal java sense) some allocations; once the read operation finishes these *will have been registered with one or more GC roots {@link Referrers} - one per participating allocator group (CFS). * 4) When a GC (or allocator discard) occurs, any extant {@link Referrer} that were created during an operation that *began prior to the collect phase are walked, and any regions that are reachable from any of these 'refs' are *switched to a refcount phase. When each ref completes it decrements the count of any such regions it reached. *Once this count hits 0, the region is finally eligible for reuse. {code} Now, as Jonathan mentioned, in CASSANDRA-5549, we introduced the OpOrder concurrency primitive. This provides a mechanism for cleaning up on-heap memtables because we can guarantee that no new writes will touch them. They can also guarantee no new reads will touch them. The problem is that a read's lifetime lasts beyond
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13914452#comment-13914452 ] Benedict commented on CASSANDRA-6689: - Okay, pushed a patch [here|https://github.com/belliottsmith/cassandra/tree/offheap1b] which includes all of your branch, is up-to-date with 2.1, and also: bq. In NBQV, drainTo does not return the correct number of inserted items. I've fixed this, and also merged NQB.drainTo() to some extent with NBQV.drainTo, to optimise it a little as well, and added some long tests to NonBlockingQueueTest to cover it bq. Make OpOrder.Group and RefAction implement AutoClosable to be able to do try-with-resources? I've introduced this change, and I've updated all of the try/finally blocks to use it. Much neater. Things I haven't touched: bq. AtomicReferenceArrayUpdater.shift(), really? I'm really not sure what we can improve with this? bq. I flush ~40MB/s with HeapSlabPool but only ~15MB/s with OffHeapPool. I think we should split this into another ticket. I suspect we might be able to improve this by introducing an extension of DataOutput which can accept a ByteBuffer. bq. GC metrics! Time spent, amount of data reallocated etc. Also think this deserves its own ticket. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13913657#comment-13913657 ] Pavel Yaskevich commented on CASSANDRA-6689: I agree with [~krummas] regarding having more granular to-the-point patches, plus it would be nice to merge this with 6694 so we can review/test all together especially considering the amount of changes. I will try to get some time to review this in upcoming days and that change would help us all a lot. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13913686#comment-13913686 ] Benedict commented on CASSANDRA-6689: - bq. more granular to-the-point patches, plus it would be nice to merge this with 6694 so we can review/test all together I don't want to seem obtuse here, I want to help as much as I can, but it really is one or the other. Even this patch would be difficult to break up, but 6694 makes it all even more intertwined. Note that excluding the minor suggestions made by [~krummas], 6694 is built on top of this, so contains this patch. I will be posting a patch shortly that addresses his concerns, but they're pretty small changes, so shouldn't affect the review process. I'll merge any changes with 6694 when I post. If we can think of a good split to perform I'm all for it, but literally the only thing I can think of is to disable GC in the first commit. But if we commit 6694 excluding GC, it's basically a waste of time to separate it, as it's a small fraction of the changes, and almost all of the complicated stuff needs to be done for disposal after flush without GC. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13911646#comment-13911646 ] Benedict commented on CASSANDRA-6689: - As far as performance testing is concerned, I would say we should test CASSANDRA-6694, as this should be faster in general (excluding its caveats), and also make sure we're running with cassandra.paranoidgc=off (as this has negative performance implications). Also, there are some issues with performance with places in the codebase we've assumed heap based bytebuffers, and have slow routes otherwise. I didn't see such a reduction in performance in my tests, though, so I will have to investigate that a bit more thoroughly. There is an issue with SequentialWriter that it expects a byte[], so with DBB we are writing one byte at a time. I optimised this single-byte path, though, and found my performance problems disappeared then. Some quick off the cuff answers to some of your points: bq. Should we add QueryProcessor.process(without RefAction) methods to avoid all the RefAction.allocateOnHeap() methods? Or, perhaps it is good to always force people to think about where they want the data. I think the latter. It's not too onerous, and it's better to be forced to think about it IMO. bq. a rebase on 2.1 would be nice, does not work running in a cluster now. CASSANDRA-6694 is fully up-to-date, but I guess if you'd prefer to split it up further, skipping ahead to include more changes might be off the cards :-) I'll see about merging it with your changes and 2.1, and then remerging 6694 as well. bq. Also, just throwing the idea out here, maybe we should make this a separate library? bq. Make OpOrder.Group and RefAction implement AutoClosable to be able to do try-with-resources? Good ideas. bq. Would be nice if we didn't have to copy data to heap for thrift queries, but i guess that is out of scope for now. Yes, and thrift users can continue with heap allocation if they're worried about this overhead. But since it's transient they might want to benefit from the reduced heap size even if it means more frequent ygen collections. I'll have a look over your changes more closely and respond to the rest later. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13911625#comment-13911625 ] Marcus Eriksson commented on CASSANDRA-6689: Overall this is a very complicated/involved change that we really need to test/benchmark (I can't make it go faster than HeapSlabPool)/figure out if we want it in. That being said, it looks very solid. The biggest risks are, I guess (and where I spent most of my time when reviewing), that we start corrupting/freeing used/leaking offheap memory, but as far as I can tell, the patch avoids that. We should do long-running tests to verify it as well. It would probably have been nice to split this patch up in smaller pieces, say doing the GC in a separate ticket and only discarding entire regions on flush here, but I understand it would be difficult to separate the things. A commit on top of this pushed to https://github.com/krummas/cassandra/commits/marcuse/6689-3 with comment fixups/small changes. * GC metrics! Time spent, amount of data reallocated etc. * In NBQV, drainTo does not return the correct number of inserted items. * Class structure in WaitQueue a bit confusing (mixing static/non-static inner classes, signalledUpdater in WaitQueue while updating field in RegisteredSignal. Small refactor in my branch, feel free to ignore if you prefer your way. * Should we add QueryProcessor.process(without RefAction) methods to avoid all the RefAction.allocateOnHeap() methods? Or, perhaps it is good to always force people to think about where they want the data. * Config file wrong, repeated memtable_cleanup_threshold * Passing null as RefAction to Rows constructor (ListPermissionsStatement and Rows.subcodec * Make OpOrder.Group and RefAction implement AutoClosable to be able to do try-with-resources? * AtomicReferenceArrayUpdater.shift(), really? :) * CommitState unused * Would be nice if we didn't have to copy data to heap for thrift queries, but i guess that is out of scope for now. -- silly single-node perf testing (on my laptop): -- * Seems we get about 10% less write op rate throughput with OffHeapPool. * I flush ~40MB/s with HeapSlabPool but only ~15MB/s with OffHeapPool. * Running stress leads to client TimedOutException during flushing (I guess due to the slower flushing) Oh, and a rebase on 2.1 would be nice, does not work running in a cluster now. Also, just throwing the idea out here, maybe we should make this a separate library? It would make fixing bugs etc in it much more painful, but with a bit of thinking it could be useful for external users. Probably not possible for a 2.1 timeframe, but maybe in the future. I'll spend some more time on this, running actual tests etc. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 beta2 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13903185#comment-13903185 ] Benedict commented on CASSANDRA-6689: - Pushed a slightly revised version [here|https://github.com/belliottsmith/cassandra/tree/offheap1], as didn't override a couple of methods in the Cell heirarchy. It's also merged with latest trunk. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899081#comment-13899081 ] Marcus Eriksson commented on CASSANDRA-6689: [~jasobrown] [~xedin] I wouldn't mind more sets of eyes on this patch, so, if you have time, please take a look! Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899096#comment-13899096 ] Benedict commented on CASSANDRA-6689: - There are some natural boundaries if you want to share the burden. Everything inside of utils.concurrent is pretty isolated from everything outside, so could easily be vetted independently. Also, the utilisation of Referrer/RefAction is probably going to be a painstaking thing to vet (that's what makes the majority of small touches outside of the main changes), and quite independent of their declarations. We just need to be certain we always use the correct type of RefAction, and never let one disappear somewhere - OutboundTCPConnection and native transport writing are the two danger areas here (also Memtable flushing needs a bit of care, but is definitely less scary). The most difficult thing to review is going to be the main body of work inside of utils.memory, however. This is pretty hardcore lock-free stuff, and the thing we're looking for is _unintended_ race conditions (there are lots of intended races) - in particular pay attention to the way in which we now asynchronously manage the subpool and suballocator (ledgers of how much we've allocated / claimed / are reclaiming), and obviously most importantly that we never accidentally overwrite data that is being read elsewhere. This should all hopefully be very clearly documented both at the level of abstraction and the individual points where interesting / dangerous things happen. But try to figure it out for yourself as well, in case I and my tests missed something. I will be doing further tests in the near future, but I much prefer to catch things by eye if possible. Always feel free to throw up a this bit isn't well explained flag and I'll try to improve it. I want this stuff to be as clearly self documenting as possible. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13898564#comment-13898564 ] Benedict commented on CASSANDRA-6689: - I've uploaded a patch [here|https://github.com/belliottsmith/cassandra/tree/offheap.merge] In the simplest terms this boils down to a new off heap allocator. However it is much more involved than that, as we need to now manage the life cycle of the memory in question. I won't go into excessive detail here, because I have tried to do so as much as possible in the code itself. For full details see the comments in OffHeapCleaner, RefAction/Referrer, OffHeapAllocator, OffHeapRegion. The basic thrust of the approach is that we guard accesses to the memory through the OpOrder synchronisation primitive introduced with the CASSANDRA-5549. However to prevent stalls and make management of the memory easier, we have some further stages to the lifecycle which are much more GC-like. A corollary is that we also effectively perform GC of the data in the memtable, so that space that is no longer needed as a result of overwritten records can be made available again without first flushing the memtable. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Reporter: Benedict Assignee: Benedict Fix For: 2.1 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13898665#comment-13898665 ] Pavel Yaskevich commented on CASSANDRA-6689: Thanks, [~jbellis], I'm already in the watcher list :) Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6689) Partially Off Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13898661#comment-13898661 ] Jason Brown commented on CASSANDRA-6689: I can jump in if [~krummas] is busy. Partially Off Heap Memtables Key: CASSANDRA-6689 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1 Move the contents of ByteBuffers off-heap for records written to a memtable. (See comments for details) -- This message was sent by Atlassian JIRA (v6.1.5#6160)