[ 
https://issues.apache.org/jira/browse/CASSANDRA-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13948721#comment-13948721
 ] 

Benedict edited comment on CASSANDRA-6933 at 3/27/14 1:04 AM:
--------------------------------------------------------------

bq. OTOH for the common case of "select *" [the Thrift equivalent anyway] 
either is going to be worse than just linearly scanning from i.

Note that I always compare the next element before applying the doubling 
optimisation, so we're never worse than a linear scan.

I had wanted to use gradually increasing range sizes of 1, sqrt(n), n to keep 
the overhead barely measurable and performance near optimal, but I think the 
current approach is a close enough approximation. Tracking the maximum number 
of keys skipped so far would probably also work as an alternative.

I must admit my math is too rusty to prove this formally, but if you think 
through the obvious worst cases, the current algorithm cannot degrade much 
beyond bsearch: we never re-scan the same range, and we immediately fall back 
to a full bsearch over the remaining range. Since we double the size of each 
probed range, the worst case is something like the selected keys being spaced 
at doubling-plus-one distances, so that every lookup does the most extra work 
possible. In that case there are about lg N extra ranges examined, costing 
approximately lg N, 1/2 lg N, 1/4 lg N, etc. comparisons each, which sums to 
roughly 2 lg N extra comparisons; i.e. the cost of two extra bsearches over 
the whole range.
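For illustration, here is a minimal Java sketch of the strategy described 
above (hypothetical names, not the actual patch): compare the next element 
first, then gallop in doubling ranges from the last visited position, and 
fall back to a plain bsearch over just the narrowed range.

```java
import java.util.Arrays;

public class GallopingSearch {
    /**
     * Finds `key` in the sorted array `a`, searching only from index `from`
     * (the previously visited position); returns its index, or -1 if absent.
     */
    static int searchFrom(int[] a, int from, int key) {
        int n = a.length;
        if (from >= n)
            return -1;
        // 1. Compare the next element first, so dense selections
        //    ("select *") cost no more than a linear scan.
        if (a[from] == key)
            return from;
        if (a[from] > key)
            return -1;
        // 2. Gallop: probe offsets 1, 2, 4, 8, ... past `from` until we
        //    reach an element >= key (or run off the end). We never re-scan
        //    a range: each probe starts past the last element proven < key.
        int offset = 1;
        while (from + offset < n && a[from + offset] < key)
            offset <<= 1;
        int lo = from + (offset >> 1) + 1;        // last probe was still < key
        int hi = Math.min(from + offset, n - 1);  // first probe >= key, or end
        // 3. Fall back to a full bsearch over only the remaining range.
        int idx = Arrays.binarySearch(a, lo, hi + 1, key);
        return idx >= 0 ? idx : -1;
    }
}
```

Because each probed range doubles, the fallback bsearch only ever covers a 
range of the same order as the distance just skipped, which is what bounds 
the extra work.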

On the other hand, in the worst case without the optimisation we're possibly 
doing lg N + lg (N - 1) + lg (N - 2) + ... extra comparisons, i.e. lg (N!), 
which by Stirling's approximation is roughly N lg N: certainly more.
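To put rough numbers on that comparison (a back-of-envelope check, not a 
benchmark; method names are made up): for N = 1024, two extra bsearches cost 
about 2 lg N = 20 comparisons, while the unoptimised sum lg N + lg (N - 1) + 
... = lg (N!) comes to nearly 9000.

```java
public class ExtraComparisons {
    // ~2 lg N: the bound argued above for the doubling strategy.
    static double gallopExtra(int n) {
        return 2 * Math.log(n) / Math.log(2);
    }

    // lg N + lg (N - 1) + ... + lg 1 = lg (N!): a fresh full bsearch
    // over the (slowly) shrinking remainder of the range for every key.
    static double naiveExtra(int n) {
        double sum = 0;
        for (int i = 1; i <= n; i++)
            sum += Math.log(i) / Math.log(2);
        return sum;
    }
}
```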



was (Author: benedict):
bq. OTOH for the common case of "select *" [the Thrift equivalent anyway] 
either is going to be worse than just linearly scanning from i.

Note that I always do a comparison of the next element before applying my next 
optimisation.

I had wanted to do gradually increasing ranges of 1, sqrt(n), n to ensure 
overhead stayed barely measurable and performance optimal, but I think the 
current option is a close enough approximation. Possibly tracking the max 
number skipped would also be fine as an alternative.

I must admit my math is rusty for proving this, but if you think through some 
obvious worst cases, the performance of the current algorithm cannot degrade 
much beyond bsearch: we never re-scan the same range, and we immediately fall 
back to full bsearch over the remaining range; since we double the previous 
space, the worst case behaviour is something like having doubling + 1 distances 
between the keys we want to select, so that we always do the most extra work 
possible. In this case there are lg2 N extra ranges looked at, each approx. lg 
N, 1/2 lg N, 1/4 lg N, etc comparisons. Which is roughly 2 lg N extra 
comparisons. So it's the cost of doing two extra bsearch over the whole range.

On the other hand in the worst case without the optimisation we're possibly 
doing lg N + lg (N - 1) + lg (N - 2) ... extra comparisons... which I don't 
know how to sum, but is certainly more.


> Optimise Read Comparison Costs in collectTimeOrderedData
> --------------------------------------------------------
>
>                 Key: CASSANDRA-6933
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6933
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>            Priority: Minor
>              Labels: performance
>             Fix For: 2.1
>
>
> Introduce a new SearchIterator construct, which can be obtained from a 
> ColumnFamily, which permits efficiently iterating a subset of the cells in 
> ascending order. Essentially, it saves the previously visited position and 
> searches from there, but also tries to avoid searching the whole remaining 
> space if possible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)