[ 
https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891761#comment-16891761
 ] 

Aleksey Yeschenko commented on CASSANDRA-15202:
-----------------------------------------------

Cheers. Addressed most in a separate commit, with a few exceptions:

bq. Use {{ByteOrder.LITTLE_ENDIAN}} for off heap?

Don't want to change the protocol in any way in this patch - just internal 
cleanup and efficiency. And make it trivially cherry-pickable for 3.0 without 
breaking compatibility in-between minors - for those who would want this 
improvement in their 3.0-based branches.
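
For illustration only, a minimal sketch of why switching byte order is not a purely internal change: the same multi-byte value ends up as different bytes in the buffer. The class and method names are hypothetical and this is not code from the patch; it only relies on standard {{java.nio}} behaviour (newly allocated buffers default to big-endian).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class ByteOrderSketch {
    // Writes one int into a direct (off-heap) buffer using the given byte
    // order and returns the raw bytes exactly as they sit in the buffer.
    static byte[] encode(int value, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocateDirect(Integer.BYTES).order(order);
        buf.putInt(0, value);
        byte[] out = new byte[Integer.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.get(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Same value, two different byte layouts.
        System.out.println(Arrays.toString(encode(0x01020304, ByteOrder.BIG_ENDIAN)));
        System.out.println(Arrays.toString(encode(0x01020304, ByteOrder.LITTLE_ENDIAN)));
    }
}
```

Any reader that expects one layout would misread bytes written with the other, which is why the order is left untouched here.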

bq. {{RandomPartitioner.MAXIMUM_TOKEN_SIZE}}: use {{(bitLength + 7) / 8}}?

Why? {{bitLength() / 8 + 1}} is taken verbatim from {{BigInteger#toByteArray()}}.
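
A small sketch of where the two expressions differ, for illustration. The class and method names are hypothetical, and 2^127 is used as the sample value on the assumption that it matches the partitioner's maximum token. The two formulas agree unless {{bitLength()}} is a multiple of 8, in which case the two's-complement encoding needs one extra byte for the sign bit.

```java
import java.math.BigInteger;

public class TokenSizeSketch {
    // Size of the two's-complement encoding, sign bit included.
    static int signedSize(BigInteger v) {
        return v.bitLength() / 8 + 1;
    }

    // The suggested alternative: bytes needed for the magnitude bits only.
    static int magnitudeSize(BigInteger v) {
        return (v.bitLength() + 7) / 8;
    }

    public static void main(String[] args) {
        BigInteger max = BigInteger.valueOf(2).pow(127); // assumed sample value
        System.out.println("bitLength      = " + max.bitLength());
        System.out.println("toByteArray()  = " + max.toByteArray().length);
        System.out.println("signedSize     = " + signedSize(max));
        System.out.println("magnitudeSize  = " + magnitudeSize(max));
    }
}
```

For this value {{magnitudeSize}} comes out one byte short of what {{toByteArray()}} actually produces, so a buffer sized with it would be too small for the serialized form.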

bq. {{prefer_offheap_merkle_trees}} - why prefer?

Primarily to decouple from the actual partitioner setting, as we don't support 
off-heap representation for at least BOP.

If all else LGTY, will commit once I've beefed up test coverage a little.

> Deserialize merkle trees off-heap
> ---------------------------------
>
>                 Key: CASSANDRA-15202
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15202
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Repair
>            Reporter: Jeff Jirsa
>            Assignee: Aleksey Yeschenko
>            Priority: Normal
>             Fix For: 4.0
>
>         Attachments: offheap-mts-gc.png
>
>
> CASSANDRA-14096 made the first step to address the heavy on-heap footprint of 
> merkle trees on repair coordinators - by reducing the time frame over which 
> they are referenced, and by more intelligently limiting depth of the trees 
> based on available heap size.
> That alone improves GC profile and prevents OOMs, but doesn’t address the 
> issue entirely. The coordinator still must hold all the trees on heap at once 
> until it’s done diffing them with each other, which has a negative effect, 
> and, by reducing depth, we lose precision and thus cause more overstreaming 
> than before.
> One way to improve the situation further is to build on CASSANDRA-14096 and 
> move the trees entirely off-heap. This is a trivial endeavor, given that we 
> are dealing with what should be full binary trees (though in practice aren’t 
> quite, yet). This JIRA makes the first step in that direction - by moving just 
> deserialisation off-heap, while construction on the replicas remains on-heap.
> Additionally, the proposed patch fixes the issue of replica coordinators 
> sending merkle trees to themselves over loopback, costing us a ser/deser 
> round trip per tree.
> Please note that there is more room for improvement here, and depending on 
> 4.0 timeline those improvements may or may not land in time. To name a few:
> - with some minor modifications to init(), we can make sure that no matter 
> the range, the tree is *always* perfectly full; this would allow us to get 
> rid of child pointers in inner nodes, as child node addresses will be 
> trivially calculable given the fixed size of nodes
> - the trees can be easily constructed off-heap so long as you run init() to 
> pre-size the tree to find out how large a buffer you need
> - on-wire format doesn’t need to stream inner nodes, only leaves, and, 
> really, only the hashes of the leaves
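
The three ideas listed above can be sketched together as follows. This is a hypothetical illustration, not the layout used by the patch: the class name, the 32-byte node size, the breadth-first node ordering, and XOR as the function combining child hashes are all assumptions made for the example. It pre-sizes a direct buffer from the depth alone, computes child addresses arithmetically instead of storing pointers, and rebuilds inner nodes from leaf hashes only.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class PerfectTreeSketch {
    static final int HASH_SIZE = 32; // assumed fixed node size in bytes

    final int depth;         // leaves live at this depth; the root is depth 0
    final ByteBuffer buffer; // off-heap storage, nodes in breadth-first order

    PerfectTreeSketch(int depth) {
        this.depth = depth;
        // Pre-size: the node count depends only on the depth (small depths only here).
        this.buffer = ByteBuffer.allocateDirect(nodeCount(depth) * HASH_SIZE);
    }

    // A perfect binary tree with leaves at `depth` has 2^(depth+1) - 1 nodes.
    static int nodeCount(int depth) { return (1 << (depth + 1)) - 1; }

    // No child pointers: positions follow from the parent's index.
    static int leftChild(int index)  { return 2 * index + 1; }
    static int rightChild(int index) { return 2 * index + 2; }
    static int firstLeaf(int depth)  { return (1 << depth) - 1; }
    static int offset(int index)     { return index * HASH_SIZE; }

    // Stores the hash of the leaf-th leaf (0-based, left to right).
    void setLeaf(int leaf, byte[] hash) {
        int base = offset(firstLeaf(depth) + leaf);
        for (int b = 0; b < HASH_SIZE; b++) {
            buffer.put(base + b, hash[b]);
        }
    }

    // Derives every inner node bottom-up from its two children (XOR is an assumption).
    void rebuildInnerNodes() {
        for (int i = firstLeaf(depth) - 1; i >= 0; i--) {
            int l = offset(leftChild(i));
            int r = offset(rightChild(i));
            for (int b = 0; b < HASH_SIZE; b++) {
                buffer.put(offset(i) + b, (byte) (buffer.get(l + b) ^ buffer.get(r + b)));
            }
        }
    }

    byte[] hashAt(int index) {
        byte[] out = new byte[HASH_SIZE];
        for (int b = 0; b < HASH_SIZE; b++) {
            out[b] = buffer.get(offset(index) + b);
        }
        return out;
    }

    // Helper for the demo: a hash with every byte set to the same value.
    static byte[] filled(int value) {
        byte[] h = new byte[HASH_SIZE];
        Arrays.fill(h, (byte) value);
        return h;
    }

    public static void main(String[] args) {
        PerfectTreeSketch tree = new PerfectTreeSketch(2); // 7 nodes, 4 leaves
        int[] leafValues = {1, 2, 4, 8};
        for (int i = 0; i < leafValues.length; i++) {
            tree.setLeaf(i, filled(leafValues[i]));
        }
        tree.rebuildInnerNodes();
        System.out.println("root byte 0 = " + tree.hashAt(0)[0]);
    }
}
```

With this layout only the leaf hashes would need to travel over the wire; the receiver can run the equivalent of {{rebuildInnerNodes()}} to recover the rest.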



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
