[
https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525569#comment-17525569
]
Branimir Lambov commented on CASSANDRA-15510:
---------------------------------------------
The code LGTM, but I can't push the latest changes from [my
branch|https://github.com/blambov/cassandra/tree/CASSANDRA-15510-4.0] to
Benjamin's pull request, or open a pull request against his branch.
[~blerer], could you copy the two last commits over so that we can run CI and
merge this?
> BTree: Improve Building, Inserting and Transforming
> ---------------------------------------------------
>
> Key: CASSANDRA-15510
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15510
> Project: Cassandra
> Issue Type: Improvement
> Components: Local/Other
> Reporter: Benedict Elliott Smith
> Assignee: Benedict Elliott Smith
> Priority: Normal
> Fix For: 4.0.x, 4.x
>
> Time Spent: 10h
> Remaining Estimate: 0h
>
> This work was originally undertaken as a follow-up to CASSANDRA-15367 to
> ensure performance is strictly improved, but it may no longer be needed for
> that purpose. It’s still hugely impactful, however. It remains to be
> decided where this should land.
> The current {{BTree}} implementation is suboptimal in a number of ways, with
> very little focus having been given to its performance besides its
> memory-occupancy. This patch aims to address that, specifically improving
> the performance and allocations involved in: building, transforming and
> inserting into a tree.
> To facilitate this work, the {{BTree}} definition is modified slightly, so
> that we can perform some simple arithmetic on tree sizes. Specifically,
> trees of depth n are defined to have a maximum capacity of {{branchFactor^n -
> 1}}, which translates into capping the number of leaf children at
> {{branchFactor-1}}, as opposed to {{branchFactor}}. Since {{branchFactor}}
> is a power of 2, this permits fast tree size arithmetic, enabling some of
> these changes.
> h2. Building
> The static build method has been modified to utilise dedicated
> {{buildPerfect}} methods that build either perfectly dense or perfectly
> sparse sub-trees. These perfect trees all share their {{sizeMap}} with each
> other, and can be built more efficiently than trees of arbitrary size. The
> specifics are described in detail in the comments, but this building block
> can be used to construct trees of any size, using at most one child at each
> level that is not either perfectly sparse or perfectly dense. Bulk methods
> are used where possible.
> For large trees this can produce up to 30x throughput improvement and 30%
> allocation reduction vs 3.0 (TBC, and to be tested vs 4.0).
> {{FastBuilder}} is introduced for building a tree in-order (or in reverse)
> without duplicate elements to resolve, without necessarily knowing the size
> upfront. This meets the needs of most use cases. Data is built directly
> into nodes, with up to one already-constructed node, and one partially
> constructed node, on each level, being mutated to share their contents in the
> event of insufficient data to populate the tree. These builders are
> thread-locally shared. These leads to minimal copying, the same sharing of
> {{sizeMap}} as above, zero wasted allocations, and results in minimal
> difference in performance between utilising the less-ergonomic static build
> and builder approach.
> For large trees this leads to ~4.5x throughput improvement, and 70% reduction
> in allocations vs a normal Builder. For small trees performance is
> comparable, but allocations similarly reduced.
> h2. Inserting
> It turns out that we only ever insert another tree into a tree, so we exploit
> this to implement an efficient union of two trees, operating on them directly
> via stacks in the transformer, instead of via a collection interface. A
> builder-like object is introduced that shares functionality with
> {{FastBuilder}}, and permits us to build the result of the union directly
> into the final nodes, reusing as much of the original trees as possible.
> Bulk methods are used where possible.
> The result is not _uniformly_ faster, but is _significantly_ faster on
> average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean
> improvement of 10x. Worst reduction is 30%, and it may be that we can
> isolate and alleviate that. Allocations are also reduced significantly, with
> a median of 30% and mean of 42% for the tested workloads. As the trees get
> larger the improvement drops, but remains uniformly lower.
> h2. Transforming
> Transformations garbage overhead is minimal, i.e. the main allocations are
> those necessary to represent the new tree. It is significantly faster and
> particularly more efficient when removing elements, utilising the shared
> functionality of the builder and transformer objects to define an efficient
> builder that reuses as much of the original tree as possible.
> We also introduce dedicated {{transform}} methods (that forbid returning
> {{null}}), and {{BiFunction}} transformations to permit efficient follow-ups.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]