[jira] [Comment Edited] (CASSANDRA-6933) Optimise Read Comparison Costs in collectTimeOrderedData

Benedict (JIRA) Wed, 26 Mar 2014 18:09:07 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13948721#comment-13948721
 ]


Benedict edited comment on CASSANDRA-6933 at 3/27/14 1:07 AM:
--------------------------------------------------------------

bq. OTOH for the common case of "select *" [the Thrift equivalent anyway] 
either is going to be worse than just linearly scanning from i.

Note that I always do a comparison of the next element before applying my next 
optimisation, so we're no worse than a linear scan

I had wanted to do gradually increasing ranges of 1, sqrt\(n\), n to ensure 
overhead stayed barely measurable and performance optimal, but I think the 
current option is a close enough approximation. Possibly tracking the max 
number skipped would also be fine as an alternative.

I must admit my math is rusty for proving this, but if you think through some 
obvious worst cases, the performance of the current algorithm cannot degrade 
much beyond bsearch: we never re-scan the same range, and we immediately fall 
back to full bsearch over the remaining range; since we double the previous 
space, the worst case behaviour is something like having doubling + 1 distances 
between the keys we want to select, so that we always do the most extra work 
possible. In this case there are lg2 N extra ranges looked at, each approx. lg 
N, lg 0.5N, lg 0.25N, etc comparisons. Which is < 2 lg N extra comparisons. So 
it's less than the cost of doing two extra bsearch over the whole range.

On the other hand in the worst case without the optimisation we're possibly 
doing lg N + lg (N - 1) + lg (N - 2) ... extra comparisons... which I don't 
know how to sum, but is certainly more.



was (Author: benedict):
bq. OTOH for the common case of "select *" [the Thrift equivalent anyway] 
either is going to be worse than just linearly scanning from i.

Note that I always do a comparison of the next element before applying my next 
optimisation, so we're no worse than a linear scan

I had wanted to do gradually increasing ranges of 1, sqrt\(n\), n to ensure 
overhead stayed barely measurable and performance optimal, but I think the 
current option is a close enough approximation. Possibly tracking the max 
number skipped would also be fine as an alternative.

I must admit my math is rusty for proving this, but if you think through some 
obvious worst cases, the performance of the current algorithm cannot degrade 
much beyond bsearch: we never re-scan the same range, and we immediately fall 
back to full bsearch over the remaining range; since we double the previous 
space, the worst case behaviour is something like having doubling + 1 distances 
between the keys we want to select, so that we always do the most extra work 
possible. In this case there are lg2 N extra ranges looked at, each approx. lg 
N, 1/2 lg N, 1/4 lg N, etc comparisons. Which is roughly 2 lg N extra 
comparisons. So it's the cost of doing two extra bsearch over the whole range.

On the other hand in the worst case without the optimisation we're possibly 
doing lg N + lg (N - 1) + lg (N - 2) ... extra comparisons... which I don't 
know how to sum, but is certainly more.


> Optimise Read Comparison Costs in collectTimeOrderedData
> --------------------------------------------------------
>
>                 Key: CASSANDRA-6933
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6933
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>            Priority: Minor
>              Labels: performance
>             Fix For: 2.1
>
>
> Introduce a new SearchIterator construct, which can be obtained from a 
> ColumnFamily, which permits efficiently iterating a subset of the cells in 
> ascending order. Essentially, it saves the previously visited position and 
> searches from there, but also tries to avoid searching the whole remaining 
> space if possible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (CASSANDRA-6933) Optimise Read Comparison Costs in collectTimeOrderedData

Reply via email to