[
https://issues.apache.org/jira/browse/CASSANDRA-9988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116265#comment-16116265
]
Benedict commented on CASSANDRA-9988:
-------------------------------------
bq. very bad average performance
Could you clarify exactly what the graph is showing? Is it what we agreed to
test, i.e. a full consumption of the iterator via calls to {{next(<key>)}} ?
Because I'm pretty sure the "average performance" of binarysearch in this case
is worse by a much larger margin.
We really need a range of data points to make a decision on, so we can put the
whole thing to bed for good.
We can also (if we care) tune exp. search's variance by setting the first
distance it jumps. We currently jump to the immediately following item - if we
jump 4 items first, we take between 3 and 7 comparisons, as opposed to between
1 and 9. However, there is advantage to higher variance, since the lower cost
operations are going to be encountered more likely when we consume multiple
items from the iterator.
bq. optimized to search the first and second value
exp. search is optimised for any _sequence_ of values. i.e., if we want just a
random 8 items in a 32 item iterator, we should expect exp. search to win on
average. *If you would like, you could modify your test overall to simply
search for a random number of items, randomly selected from the iterator's
contents.* This should favour exp. search, esp. as the number of random items
increases. In real systems, selections are more likely to bias towards
adjacent items than purely random, so any such advantage will probably be
understated.
bq. complicate code, hard to maintain
I did indicate my changes assumed your implementation was valid (I did not
attempt to verify it, and the original looks to be the source of the bug).
Since there was minimal effort on both our parts to just test the hypothesis, I
don't think this should be a major concern. This code snippet is trivial to
add perfect test coverage for, so we could randomly generate code until it
worked if we wanted.
bq. Dir.DESC will have ... performance degradation
Clearly, the algorithm as stands would not be viable for DESC, which I did
indicate in my prior comment. I had hoped we could simply validate its
behaviour and then decide if it is worth writing one that is agnostic to
direction.
bq. I'm not familiar with the whole cassandra code base, so not sure how often
we search for the adjacent value
As previously indicated, I'm pretty sure we, for _every query_, consume
iterators in a manner that touches adjacent (or nearly) items for most of an
iterator's contents. We do this for result row serialization and
deserialization, and we do it for query column selection. i.e., we do this a
great deal. That is, unless my recollection is faulty or the codebase has
changed significantly.
> Introduce leaf-only iterator
> ----------------------------
>
> Key: CASSANDRA-9988
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9988
> Project: Cassandra
> Issue Type: Sub-task
> Reporter: Benedict
> Assignee: Jay Zhuang
> Priority: Minor
> Labels: patch
> Fix For: 4.0
>
> Attachments: 9988-3tests.png, 9988-data.png, 9988-result2.png,
> 9988-result3.png, 9988-result.png, 9988-test-result3.png,
> 9988-test-result-expsearch.xlsx, 9988-test-result-raw.png,
> 9988-test-result.xlsx, 9988-trunk-new.txt, 9988-trunk-new-update.txt,
> trunk-9988.txt
>
>
> In many cases we have small btrees, small enough to fit in a single leaf
> page. In this case it _may_ be more efficient to specialise our iterator.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]