[ 
https://issues.apache.org/jira/browse/HBASE-9778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923650#comment-13923650
 ] 

Lars Hofhansl commented on HBASE-9778:
--------------------------------------

Thanks Ram. Yes on all fronts.

When we have 100 cols and select only col98, a seek is more efficient; with 
this option we'd do N next()'s for nothing before we seek anyway.

The main point of this option would be to get a good improvement when the next 
column/version of interest can be reached with a few next()'s, while limiting 
the downside. We can always seek (0), or never seek (MAXINT), and can tune 
everything in between. I assume most folks would set this somewhere between 5 
and 10... The upside of saving a seek is large; the downside of a few extra 
next()'s is not so bad.
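
To make the knob concrete, here is a toy sketch of the idea, not the patch 
itself. Sorted column qualifiers stand in for KVs, a binary search stands in 
for a seek, and ToyScanner, advanceTo and maxNexts are made-up names that do 
not exist in HBase. It tries up to maxNexts next()'s and only then falls back 
to a seek:

```java
import java.util.Arrays;

// Toy model only: sorted column qualifiers stand in for KVs.
// ToyScanner, advanceTo and maxNexts are hypothetical names, not HBase API.
final class ToyScanner {
    private final String[] kvs; // must be sorted
    private int pos = 0;
    int nexts = 0; // next() calls issued so far
    int seeks = 0; // seeks issued so far

    ToyScanner(String[] sortedKvs) {
        this.kvs = sortedKvs;
    }

    /**
     * Moves to the first entry >= target. Tries at most maxNexts next() calls,
     * then falls back to one seek. Returns the entry reached, or null if the
     * scanner ran off the end.
     */
    String advanceTo(String target, int maxNexts) {
        int tried = 0;
        while (tried < maxNexts && pos < kvs.length
                && kvs[pos].compareTo(target) < 0) {
            pos++;
            nexts++;
            tried++;
        }
        if (pos < kvs.length && kvs[pos].compareTo(target) < 0) {
            // Budget used up and still short of the target: seek instead.
            int idx = Arrays.binarySearch(kvs, pos, kvs.length, target);
            pos = idx >= 0 ? idx : -idx - 1;
            seeks++;
        }
        return pos < kvs.length ? kvs[pos] : null;
    }

    public static void main(String[] args) {
        String[] row = {"c00", "c01", "c02", "c03", "c04",
                        "c05", "c06", "c07", "c08", "c09"};
        // 0 = always seek, Integer.MAX_VALUE = never seek, 5 = in between.
        for (int maxNexts : new int[] {0, 5, Integer.MAX_VALUE}) {
            ToyScanner s = new ToyScanner(row);
            String hit = s.advanceTo("c08", maxNexts);
            System.out.println("maxNexts=" + maxNexts + " hit=" + hit
                    + " nexts=" + s.nexts + " seeks=" + s.seeks);
        }
    }
}
```

With a nearby target (say c03 from c00) and maxNexts=5 the seek is avoided 
entirely; with a far target the next()'s are wasted and the seek happens 
anyway, which is the col98 case above.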

Many seeks that just skip to the next KV are bad, 1000 next()'s are bad, but 5 
next()'s + 1 reseek is not much worse than a reseek.
Clients with more information about their data (such as Phoenix, or certain 
time series databases, etc) can use that to set a good value here.

Of course the absolute worst case would be when this is set to 5 and all 
columns are 6 KVs away from each other (due to versions of intermediary 
columns). We'd do 5 next()'s and then a seek.
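
The worst case can be counted with a small sketch. This is only a 
back-of-the-envelope model under the assumption that each wanted column is a 
fixed number of KVs from the previous one; WorstCaseCount, cost and maxNexts 
are hypothetical names, not anything in the patch:

```java
// Back-of-the-envelope model, not HBase code. All names are hypothetical.
final class WorstCaseCount {
    /**
     * gaps[i] is how many KVs ahead the i-th wanted column is.
     * Returns {total next() calls, total seeks} for a given maxNexts.
     */
    static long[] cost(int[] gaps, int maxNexts) {
        long nexts = 0;
        long seeks = 0;
        for (int gap : gaps) {
            if (gap <= maxNexts) {
                nexts += gap;      // reached with next() calls alone
            } else {
                nexts += maxNexts; // wasted next() calls ...
                seeks++;           // ... followed by the seek anyway
            }
        }
        return new long[] {nexts, seeks};
    }

    public static void main(String[] args) {
        int[] gaps = {6, 6, 6, 6}; // every wanted column is 6 KVs away
        for (int maxNexts : new int[] {0, 5, 6}) {
            long[] c = cost(gaps, maxNexts);
            System.out.println("maxNexts=" + maxNexts
                    + " nexts=" + c[0] + " seeks=" + c[1]);
        }
    }
}
```

In this model a setting of 5 pays for 20 next()'s on top of the same 4 seeks 
that a setting of 0 would issue, while a setting of 6 trades all 4 seeks for 
24 next()'s.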

Lemme see how I can phrase that better in the Javadoc.

(I had thought about only doing this at the StoreFile level and issuing 
next()'s only when the key we're looking for falls into the same block, but by 
the time we reach the StoreFileScanner we have done most of the work for the 
seek anyway, so it turned out to be not very helpful.)


> Avoid seeking to next column in ExplicitColumnTracker when possible
> -------------------------------------------------------------------
>
>                 Key: HBASE-9778
>                 URL: https://issues.apache.org/jira/browse/HBASE-9778
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>         Attachments: 9778-0.94-v2.txt, 9778-0.94-v3.txt, 9778-0.94-v4.txt, 
> 9778-0.94-v5.txt, 9778-0.94.txt, 9778-trunk-v2.txt, 9778-trunk-v3.txt, 
> 9778-trunk.txt
>
>
> The issue of slow seeking in ExplicitColumnTracker was brought up by 
> [~vrodionov] on the dev list.
> My idea here is to avoid the seeking if we know that there aren't many 
> versions to skip.
> How do we know? We'll use the column family's VERSIONS setting as a hint. If 
> VERSIONS is set to 1 (or maybe some value < 10) we'll avoid the seek and call 
> SKIP repeatedly.
> HBASE-9769 has some initial numbers for this approach:
> Interestingly it depends on which column(s) is (are) selected.
> Some numbers: 4m rows, 5 cols each, 1 cf, 10 bytes values, VERSIONS=1, 
> everything filtered at the server with a ValueFilter. Everything measured in 
> seconds.
> Without patch:
> ||Wildcard||Col 1||Col 2||Col 4||Col 5||Col 2+4||
> |6.4|8.5|14.3|14.6|11.1|20.3|
> With patch:
> ||Wildcard||Col 1||Col 2||Col 4||Col 5||Col 2+4||
> |6.4|8.4|8.9|9.9|6.4|10.0|
> Variation here was +- 0.2s.
> So with this patch scanning is 2x faster than without in some cases, and 
> never slower. No special hint needed, beyond declaring VERSIONS correctly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
