[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629583#comment-14629583 ] Branimir Lambov commented on CASSANDRA-8915: The rebased and squashed code is [here|https://github.com/blambov/cassandra/tree/8195-rebase] with [testall|http://cassci.datastax.com/job/blambov-8195-rebase-testall/] and [dtest|http://cassci.datastax.com/job/blambov-8195-rebase-dtest/]. Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor Labels: compaction, performance Fix For: 3.x Time Spent: 0.2h Remaining Estimate: 0h The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629604#comment-14629604 ] Benedict commented on CASSANDRA-8915: - Thanks. [Committed|https://github.com/apache/cassandra/commit/c29001b94df769fe428f414c042daffbb6dd209d]. For future reference, our commit form usually ends with a line like: patch by branimir; reviewed by benedict for CASSANDRA-8915 It's not a perfect git history, but better to do our best to stick to it when we can (it also helps later analysis of commit/review). I think we should endeavour to create more descriptive commits alongside, though, like you have done here. Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor Labels: compaction, performance Fix For: 3.x Time Spent: 0.2h Remaining Estimate: 0h The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627840#comment-14627840 ] Benedict commented on CASSANDRA-8915: - [~blambov]: I realise this isn't marked as critical for 3.0, but it's a nice improvement and the only hurdle is a rebase / flatten - otherwise I'm happy and ready to commit. Would you mind prepping that when you have a moment? Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor Labels: compaction, performance Fix For: 3.x Time Spent: 0.2h Remaining Estimate: 0h The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570791#comment-14570791 ] Benedict commented on CASSANDRA-8915: - Thanks for taking the time to provide this. I have some comments and suggestions for clarifying the proof for future readers. It is always tricky to find your bearings in a new proof, and I found myself confused by some subtle ambiguities that are hard to spot as the author. * It would help to state that {{=}} refers to the partial order over key equality, not node identity (the first sentence in particular lead to some double-takes) * stands for and predicate (and use of a function-like identifier) to descibe {{σ(N)}} is confusing. This caused a surprising number of misunderstandings that took a good hour to crystalise in the realisation that {{σ(N)}} meant the flag or property that encodes the result of the predicate {{ρ(N) = N}}, not a short-hand for that predicate * It would help to state {{P ≙ ep-heap(M,N)}} to indicate the ep-heap P rooted at N that leads to M, to avoid confusion with the similarity of rooted at and leading to * I also find that switching case to capitals for heaps/structures and retaining lower case for nodes/values is helpful for clarity. P and R are in the same run of variables as N, but refer to a different category of things. i.e. I would have found it clearer to follow the statement {{P ≙ ep-heap(m,n)}}. * \{\{\}\} can then be used to mark the boundaries of mathematical expressions (so lower case chars are still easily parsed as not prose) bq. 3. ... If ¬σ(P), set σ(R) := (P = R), otherwise leave σ(R) unchanged. Perhaps clarify, that σ(P) = σ(R), so already correct. I realise you state this later, but here is where the question is raised. The later sentence is also very hard to parse, and could be simplified with the earlier short explanation. bq. 3. ... Use the algorithm recursively to *turn* the two ep-subheaps leading to P at its children and the new value C P into an ep-heap leading to P Consider *merge*, to clarify that this step restores the binary-tree property. I haven't yet corroborated the code against the proof, but will do so shortly. I would consider supplying the initial proof outline with its three cases along with the code, so that we can annotate each major branch with a short demonstration of invariant maintenance. Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor Labels: compaction, performance Fix For: 3.x The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570983#comment-14570983 ] Benedict commented on CASSANDRA-8915: - np - they were just that: suggestions. Hopefully anyone who would be confused by the aspects that I was will be helped sufficiently by the combination of your updates and my comment. I've finished vetting the code. I really like this implementation: not only does it ace the comparison tests, it is very clear to read. Perhaps reading the proof made it clearer, but I found it as - or easier - to read than the non-optimised variant. I've pushed some small comments and one minor cleanup (merging two near-identical paths in the replaceAndSink method). I've also applied my earlier updates to MergeIteratorComparisonTest. I ran these against the other two versions, and found it to meet or beat on every test (or be close enough that it didn't matter), which matches with expectation. I'm happy to merge once we squash and get clean CI runs. Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor Labels: compaction, performance Fix For: 3.x The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570981#comment-14570981 ] Branimir Lambov commented on CASSANDRA-8915: Updated above with some of your suggestions. bq. Perhaps clarify, that σ(P) = σ(R), so already correct. This is not entirely true, {{(σ(P) P = R) = σ(R)}}. I would find it confusing to put that info there, though, I prefer to split actions from proof. bq. It would help to state {{P ≙ ep-heap(M,N)}} ... and more. I hesitate to add this; I think the way it is now is enough to convince people that it is sound even if it is not very precise (esp wrt bottom). Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor Labels: compaction, performance Fix For: 3.x The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569392#comment-14569392 ] Branimir Lambov commented on CASSANDRA-8915: I wasn't completely sure of the correctness of the flagging approach and I wrote the following proof to make sure I'm not missing something: {panel} In the discussion below ρ(N) stands for parent of N and σ(N) stands for the 'equalParent' predicate, i.e. σ(N) ↔ ρ(N) = N. We define ep-heap leading to M to be a binary tree of nodes where each node N satisfies ρ(N) ≤ N σ(N) ↔ ρ(N) = N and the root has M as its parent. Full ep-heaps are ep-heaps leading to top, where 'top' is assumed to be a special value strictly smaller than any other (thus σ(R) is always false for the root of a full ep-heap). For simplicity we will take all heaps to end at a special node called 'bottom' whose value is assumed to be greater than any other value. The critical step here is adding a new value C M to combine two existing well-formed ep-heaps P and R leading to the same M into a bigger single well-formed ep-heap leading to M. Without loss of generality we can take P ≤ R (otherwise we can swap the places of P and R in the discussion below). We distinguish three cases: # C P (necessarily ¬σ(P)): We place C at the top by setting ρ(P), ρ(R) := C, ρ(C) := M and leaving σ(P), σ(R) := false (Note that σ(R) cannot be true as that contradicts with C M). σ(C) := false. The ep-heap rooted at C is well-formed leading to M. # C = P (necessarily ¬σ(P)): We place C at the top, setting ρ(P), ρ(R) := C, ρ(C) := M, σ(P) := true and σ(R) := (P = R). σ(C) := false. The heap rooted at C is well-formed as C = P σ(P) as well as C = P ≤ R C = R if and only if σ(R). # C P (σ(P) can be true or false): Set ρ(R) := P. If ¬σ(P), set σ(R) := (P = R), otherwise leave σ(R) unchanged. Use the algorithm recursively to turn the two ep-subheaps leading to P at its children and the new value C P into an ep-heap leading to P. Let its root be Q. The necessary conditions for an element of an ep-heap hold for Q. They are also true for R as P ≤ R and either σ(P) in which case P = M and σ(R) if and only if R = M = P, or ¬σ(P) and σ(R) is explicitly set to be true if and only if P = R. We also have ρ(P) = M as well as P = M if and only if σ(P), thus the constructed ep-heap rooted at P leads to M. The recursion always ends for finite heaps as we must reach the first case due to the ordering of bottom. For performance the order of comparing elements and identifying the case to use can be changed to sequentially checking: - σ(P): apply pt. 3 for P and return (P must be ≤ R). No update of σ(P)/σ(R) necessary. - σ(R): apply pt. 3 for R and return (R must now be P). No update of σ(P)/σ(R) necessary. - compare P and R. - If R P, perform the next steps with P and R swapped. - compare C and P. - if C P apply pt. 1. - if C = P apply pt. 2. - if C P apply pt. 3. Once an ep-heap with root R is formed, it trivially follows that for each element E in the heap E = R if and only if σ is true for E and the chain of parents of E leading to but not including the root. To consume the equal elements we can thus walk the sub-tree formed by the children which have σ set. To advance the elements equal to the root, we can: - Process the elements in the deepest level of the σ sub-tree in any order. For each of them: -- Advance the element. Let the new value be C. -- The children P, R of the element must have ¬σ(P), ¬σ(R). Overwrite ρ(P) and ρ(R) with top*. P and R now form ep-heaps leading to top. -- Use the algorithm to combine P, R and C into a new ep-heap leading to top. -- Let its root be Q. Because the heap leads to top we have ¬σ(Q). - Continue with the next deepest level of the hierarchy until the top is reached. \* The algorithm never actually uses the values of the parent function ρ, only the fact that σ is set correctly with respect to that parent. This means this step does not need to be actually performed. {panel} Updated [the branch|https://github.com/apache/cassandra/compare/trunk...blambov:8915-mergeiterator] to use this solution and cleaned the code a bit. Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor Labels: compaction, performance Fix For: 3.x The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}}
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528493#comment-14528493 ] Branimir Lambov commented on CASSANDRA-8915: bq. My optimisation... We could instead perform a merge, but even with this imperfect implementation the result is typically half as many comparisons as NoEqual, and never more. I think this is better served by other heap types (e.g. Fibonacci), which I was planning to explore at a later time. Do you prefer to do this now? You have convinced me, though, that avoiding that second comparison for {{consume}} is something that must be done, so I added a [{{FlagEqual}} variation|https://github.com/apache/cassandra/blob/545aa9c66e568b853673ea40da05f13bd3c86a73/src/java/org/apache/cassandra/utils/MergeIteratorFlagEqual.java] of the non-collapsing approach that flags equality when it finds it. The flag is then sufficient for {{consume}} and skipping over equal elements in the heap, providing equal comparison complexity with {{AllEqual}} but without the need to remove/readd iterators. The benchmarks show it to be on par with {{AllEqual}} in the random cases, and better than {{NoEqual}} in the limited overlap scenarios. Also, I don't see any links in your comment. Could I also look at your experiments? Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor Fix For: 3.x The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528503#comment-14528503 ] Benedict commented on CASSANDRA-8915: - My mistake, I thought I linked it. My version with updated tests and optimization is [here|https://github.com/belliottsmith/cassandra/tree/mergeiterator-comparison] That sounds like a much simpler solution; if we can get even 50% of the benefit with that approach I'm happy. I'll take a look shortly Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor Fix For: 3.x The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523753#comment-14523753 ] Benedict commented on CASSANDRA-8915: - Oh, one other minor quibble: would you mind applying the project codestyle to your IDE? Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor Fix For: 3.x The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523745#comment-14523745 ] Benedict commented on CASSANDRA-8915: - First off, I agree that the MergeIteratorAllEqual is unfortunately complex, though I haven't yet attempted to fully reason about its behaviour. Are we certain there's no way to reduce its complexity? If you don't think it can be improved complexity-wise from its current position, I'll find some time to chew on it myself as well, so we have at least had two distinct attempts to reduce that burden. However from a performance point of view, I have two major quibbles: * There is a simple optimisation of AllEqual, which I have introduced, and it is now universally faster (or as fast) * I don't _think_ that's the best characterisation of LCS (please feel free to prove me wrong, though, as I may have a poor mental model of LCS) My optimisation as stands is actually pretty rubbish; it could be done much better. What I do is construct a new heap from the equal candidates, and then push down the top element and bubble up the remainder. All this does really is compress the equal items, push down the one most likely to stay at the head, and expand the remainder from the smallest item upwards so minimizing the number of swaps. We could instead perform a merge, but even with this imperfect implementation the result is typically half as many comparisons as NoEqual, and never more. As to LCS: to my knowledge, it only ever compacts two levels together, but within each level there is no compaction/comparison, they are simply concatenated together. Between the levels there's no guarantee of overlap, and the values will typically be randomly distributed since we default to hashed partitioning. Given each level is 10 times larger than the previous, we can also expect that the number of _runs_ of equality are very low. I've introduced simulation of varying numbers of levels being compacted together, along with varying numbers of L0 sstables in the mix, and with varying levels of overlap between the levels (100%, 50%, 0%; here I mean percentage of values in a level which definitely occur in a lower level). These all come out as either the same or faster under both versions of AllEqual. Of course the statement likely stands for a variety of workloads with many CQL row overwrites, so catering for the case of many runs of overlaps is still an important one. bq. but please do not hesitate to disagree or bring other concerns/questions/scenarios to the discussion. ditto :) Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor Fix For: 3.x The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514183#comment-14514183 ] Branimir Lambov commented on CASSANDRA-8915: [~benedict]: I invested some time exploring the linked list of colliding iterators idea. [Here|https://github.com/apache/cassandra/compare/trunk...blambov:mergeiterator-comparison] you can find a branch that includes and can be used to compare four versions of the iterator: - {{MergeIteratorNoEqual}}: my initial implementation, without collapsing equal iterators - {{MergeIteratorAllEqual}}: fully implements collapse of equal iterators - {{MergeIterator.ManyToOne}}: mixture of the two above, only collapsing when it is easy - {{MergeIteratorComparisonTest.MergeIteratorPQ}}: the old implementation What I can tell from the performance tests: - AllEqual is noticeably faster for randomly spaced inputs, as well as for completely non-colliding iterators - NoEqual, on the other hand, is better when the overlap is organized in sections (as in LeveledCompaction) - the mixture is always slower than either The explanation is that AllEqual can avoid the equality comparisons in {{consume}} (halving the number of comparisons in the no-overlap case), but if there are collisions it must pay the price when advancing multiple equal iterators, having to place all but one at the end of the heap and possibly having to pull them up all the way to the top. The latter is not a disadvantage for random data, but IMHO Cassandra's use cases do produce or impose structure where it would be. As the AllEqual implementation is also quite a bit more complicated and harder to follow, the code I would choose to continue with is the original, NoEqual version, but please do not hesitate to disagree or bring other concerns/questions/scenarios to the discussion. Other heap structures may improve this further, perhaps this can be revisited when there is need to go further and/or we have more time on our hands? [~Stefania]: Good stuff! It is orthogonal to this improvement-- it will work just as well or even better here, in practically the same way that you have implemented it. The bound-to-actual transition will only sink it down from the top, which is not going to be wasting much, practically nothing if the bound is good enough. We don't want to open iterators before they have reached the top as opening is incomparably slower than any waste the mergeIterator can have. Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516046#comment-14516046 ] Stefania commented on CASSANDRA-8915: - Great to hear it's orthogonal, thanks! Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394244#comment-14394244 ] Stefania commented on CASSANDRA-8915: - In case you guys have not seen it yet, please check the changes proposed by CASSANDRA-8180, specifically this comment here: https://issues.apache.org/jira/browse/CASSANDRA-8180?focusedCommentId=14381674page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14381674. The idea is that there will be two type of candidates, one greedy that knows its first value as it is the case right now. Another one, lazy, that gets compared based on a less accurate lower bound. What this means is that once this lazy candidate is picked, only then will it access the iterator to determine the exact first value, which could be much higher that the initial lower bound. The way I implemented this with the present implementation of the merge iterator is to add the lazy candidate back to the priority queue after it has calculated its first accurate value. It's not very elegant however and it is kind of wasteful. If too complex to merge both approaches in one algorithm, we can always specialize a separate merge iterator implementation for supporting lazy candidates. Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394250#comment-14394250 ] Benedict commented on CASSANDRA-8915: - I perhaps should have commented when I first saw the link. It should be quite viable to merge the behaviours; the Candidate just needs to have a flag indicating if the value is real or not, and to just discard the not-real values it encounters. Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393517#comment-14393517 ] Benedict commented on CASSANDRA-8915: - Suggestion: convert each Candidate into a linked-list of colliding iterators, so that consume() becomes guaranteed O(1). In the event of many equal items, this would incur only linear costs, and only on push down, rather than logarithmic costs on both push down and advance. This would particularly help the partition level (sstable/memtable) merge, as we are likely to encounter the same DecoratedKey many times. Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14358578#comment-14358578 ] Benedict commented on CASSANDRA-8915: - For this test, I would suggest running with a small memtable size (to amplify compaction costs), and running stress against a schema very large partition with multiple clustering components, and using the revisit parameter, and the sorted visitation order, to spread each partition sequentially across multiple flushes. Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14358560#comment-14358560 ] Branimir Lambov commented on CASSANDRA-8915: Patch is available [here|https://github.com/apache/cassandra/compare/trunk...blambov:8915-mergeiterator]. The code turned out to be somewhat more complicated than I expected because of the need to support lazy advance of the underlying iterators. The new iterator is between several times and 10-20% faster depending on the overlap of the inputs and the complexity of comparisons. I wasn't able to demonstrate the effect in cstar_perf (see [this|http://cstar.datastax.com/graph?stats=fb747232-c819-11e4-91b4-42010af0688fmetric=op_rateoperation=1_write] but also [this|http://cstar.datastax.com/graph?stats=ad2dbdb6-c843-11e4-a8b4-42010af0688fmetric=op_rateoperation=1_write]). Any suggestions on how to test compaction performance? At the moment the patch fails {{BlacklistingCompactionsTest}} because of CASSANDRA-8960. Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348795#comment-14348795 ] Branimir Lambov commented on CASSANDRA-8915: Yes, what I'm suggesting is a replacement of the priority queue with a custom version that supports {{swapHead}}. It actually doesn't have to be a full PQ implementation, {{swapHead}} is the only operation needed (the initial heap construction can be done by sorting). Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348800#comment-14348800 ] Benedict commented on CASSANDRA-8915: - +1 then Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8915) Improve MergeIterator performance
[ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348732#comment-14348732 ] Benedict commented on CASSANDRA-8915: - It sounds like you're suggesting a full priority queue implementation with a swapHead() (or similar) method that heapifies the new value (possibly terminating immediately), but could you confirm? This seems preferable to me, but it could be taken that this proposal is for a simple wrapper that just extracts the head of the queue and wraps an existing implementation, which would only help situations with mostly disjoint files (so any single file with wide overlap would result in close to prior performance). Improve MergeIterator performance - Key: CASSANDRA-8915 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915 Project: Cassandra Issue Type: Improvement Reporter: Branimir Lambov Assignee: Branimir Lambov Priority: Minor The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged). This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation. This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction). -- This message was sent by Atlassian JIRA (v6.3.4#6332)