[jira] [Commented] (CASSANDRA-6933) Optimise Read Comparison Costs in collectTimeOrderedData

Benedict (JIRA) Wed, 02 Apr 2014 14:42:05 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958211#comment-13958211
 ]


Benedict commented on CASSANDRA-6933:
-------------------------------------

Note that I am discussing the _worst case_ behaviour here. i.e. the maximal 
degradation of behaviour is just about no worse than a couple of extra binary 
search, and realistically it is likely to be less than one binary search extra 
cost (across the range we query everytime without the optimisation) - so it 
only has to get it right approximately once in order to be better. I definitely 
think it is the correct behaviour, given that the potential upside is higher 
than the maximal downside. It's not really designed for catching "uniformly 
distributed" - it just assumes it for simple modelling purposes. Uniformly 
distributed is actually a misnomer anyway - this would imply dividing the 
search space by the names we're searching for; in reality it simply needs to 
pick a sensible number to search over that is smaller than the whole range, so 
it simply picks twice the last delta. 

Basically, it's predicated on the fact that you must by definition rarely have 
to search the entire range for the answer you want, since we expect the range 
should contain multiple answers we're looking for (if it doesn't, we'll finish 
early anyway and all optimisations are moot). All we want to do is reduce the 
number of times we do a full binary search - any sensible number (sqrt(n), 
max(diff), last(diff)) are all suitable - so long as we pick one of them and 
only search a subrange.

The numbers I gave above I think demonstrated this clearly - it offers almost 
no increased cost even when it gets it completely wrong, but it reduces the 
worst case algorithmic complexity of the total work to O(lg2(m).lg2(n)) from m 
lg(n) - which is an order of magnitude reduction. This is definitely a good 
idea.

> Optimise Read Comparison Costs in collectTimeOrderedData
> --------------------------------------------------------
>
>                 Key: CASSANDRA-6933
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6933
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>            Priority: Minor
>              Labels: performance
>             Fix For: 2.1
>
>         Attachments: 6933-v3.txt
>
>
> Introduce a new SearchIterator construct, which can be obtained from a 
> ColumnFamily, which permits efficiently iterating a subset of the cells in 
> ascending order. Essentially, it saves the previously visited position and 
> searches from there, but also tries to avoid searching the whole remaining 
> space if possible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (CASSANDRA-6933) Optimise Read Comparison Costs in collectTimeOrderedData

Reply via email to