Benedict commented on CASSANDRA-9988:

bq. very bad average performance

Could you clarify exactly what the graph is showing?  Is it what we agreed to 
test, i.e. a full consumption of the iterator via calls to {{next(<key>)}} ?  
Because I'm pretty sure the "average performance" of binarysearch in this case 
is worse by a much larger margin.

We really need a range of data points to make a decision on, so we can put the 
whole thing to bed for good.

We can also (if we care) tune exp. search's variance by setting the first 
distance it jumps.  We currently jump to the immediately following item - if we 
jump 4 items first, we take between 3 and 7 comparisons, as opposed to between 
1 and 9.  However, there is advantage to higher variance, since the lower cost 
operations are going to be encountered more likely when we consume multiple 
items from the iterator. 

bq. optimized to search the first and second value

exp. search is optimised for any _sequence_ of values.  i.e., if we want just a 
random 8 items in a 32 item iterator, we should expect exp. search to win on 
average.  *If you would like, you could modify your test overall to simply 
search for a random number of items, randomly selected from the iterator's 
contents.*  This should favour exp. search, esp. as the number of random items 
increases.  In real systems, selections are more likely to bias towards 
adjacent items than purely random, so any such advantage will probably be 

bq. complicate code, hard to maintain

I did indicate my changes assumed your implementation was valid (I did not 
attempt to verify it, and the original looks to be the source of the bug).  
Since there was minimal effort on both our parts to just test the hypothesis, I 
don't think this should be a major concern.  This code snippet is trivial to 
add perfect test coverage for, so we could randomly generate code until it 
worked if we wanted.

bq. Dir.DESC will have ... performance degradation 

Clearly, the algorithm as stands would not be viable for DESC, which I did 
indicate in my prior comment.  I had hoped we could simply validate its 
behaviour and then decide if it is worth writing one that is agnostic to 

bq. I'm not familiar with the whole cassandra code base, so not sure how often 
we search for the adjacent value

As previously indicated, I'm pretty sure we, for _every query_, consume 
iterators in a manner that touches adjacent (or nearly) items for most of an 
iterator's contents.  We do this for result row serialization and 
deserialization, and we do it for query column selection.  i.e., we do this a 
great deal.  That is, unless my recollection is faulty or the codebase has 
changed significantly.

> Introduce leaf-only iterator
> ----------------------------
>                 Key: CASSANDRA-9988
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9988
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Benedict
>            Assignee: Jay Zhuang
>            Priority: Minor
>              Labels: patch
>             Fix For: 4.0
>         Attachments: 9988-3tests.png, 9988-data.png, 9988-result2.png, 
> 9988-result3.png, 9988-result.png, 9988-test-result3.png, 
> 9988-test-result-expsearch.xlsx, 9988-test-result-raw.png, 
> 9988-test-result.xlsx, 9988-trunk-new.txt, 9988-trunk-new-update.txt, 
> trunk-9988.txt
> In many cases we have small btrees, small enough to fit in a single leaf 
> page. In this case it _may_ be more efficient to specialise our iterator.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to