[
https://issues.apache.org/jira/browse/HBASE-9778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801457#comment-13801457
]
Lars Hofhansl commented on HBASE-9778:
--------------------------------------
No, it would only be worse if there are many versions. What this avoids is the
SEEK_NEXT_COL, which seeks to the next column anticipating an unknown number of
versions. If there are only a few versions per column, the reseek is more
expensive than simply skipping over them.
However, there is no sure way to know the exact number of versions. Using
MAX_VERSIONS as a hint works in all cases but the outliers, for example when a
single column is updated frequently. In that case the memstore will be filled
with many versions of the same column and we should seek to the next column.
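
To make the hint concrete, here is a minimal sketch of the decision, not the code from the attached patches. The class name, the method name and the cutoff of 10 are assumptions for illustration (the issue description only says "1 (or maybe some value < 10)"), and the enum merely mirrors the SKIP and SEEK_NEXT_COL codes discussed here.

```java
// Hypothetical sketch; this is NOT the actual ExplicitColumnTracker code.
public class VersionHintSketch {
    // Stand-ins for the two match codes discussed in this issue.
    enum MatchCode { SKIP, SEEK_NEXT_COL }

    // Assumed cutoff; the issue description only suggests "some value < 10".
    static final int SMALL_VERSIONS_THRESHOLD = 10;

    // Decide how to move past a column once the tracker is done with it.
    // maxVersions is the column family's VERSIONS setting, used purely as a hint.
    static MatchCode afterColumnDone(int maxVersions) {
        // Few versions expected: stepping over them is cheaper than a reseek.
        // Many versions possible: seek straight to the next column.
        return maxVersions < SMALL_VERSIONS_THRESHOLD
                ? MatchCode.SKIP
                : MatchCode.SEEK_NEXT_COL;
    }

    public static void main(String[] args) {
        System.out.println(afterColumnDone(1));    // SKIP
        System.out.println(afterColumnDone(1000)); // SEEK_NEXT_COL
    }
}
```

A static hint like this cannot see the outlier above: with VERSIONS=1 and a frequently updated column, the sketch still answers SKIP even though the memstore holds many versions.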
Your example shows another current shortcoming, though. If we only wanted
to select 1 or 2 of the 10000 columns, the tracker would go over all 10000
columns and issue 10000 seeks. It should rather seek directly to the next column
it cares about.
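
A rough sketch of what "seek directly to the next column it cares about" could look like, assuming the requested columns are kept in a sorted set. Strings stand in for HBase's byte[] qualifiers, and the class and method names are hypothetical; this is not how the tracker is currently implemented.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical sketch; names and the String-based columns are illustrative only.
public class NextWantedColumnSketch {
    // Returns the smallest requested column >= current,
    // or null when no requested column remains in this row.
    static String nextWanted(TreeSet<String> wanted, String current) {
        return wanted.ceiling(current);
    }

    public static void main(String[] args) {
        // Two requested columns out of a row with 10000 columns.
        TreeSet<String> wanted =
                new TreeSet<>(Arrays.asList("col04000", "col09000"));

        System.out.println(nextWanted(wanted, "col00000")); // col04000
        System.out.println(nextWanted(wanted, "col04001")); // col09000
        System.out.println(nextWanted(wanted, "col09001")); // null
    }
}
```

With a lookup like this, the number of seeks per row would be bounded by the number of requested columns rather than by the number of columns stored in the row.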
> Avoid seeking to next column in ExplicitColumnTracker when possible
> -------------------------------------------------------------------
>
> Key: HBASE-9778
> URL: https://issues.apache.org/jira/browse/HBASE-9778
> Project: HBase
> Issue Type: Bug
> Reporter: Lars Hofhansl
> Assignee: Lars Hofhansl
> Attachments: 9778-0.94.txt, 9778-0.94-v2.txt, 9778-0.94-v3.txt,
> 9778-trunk.txt, 9778-trunk-v2.txt, 9778-trunk-v3.txt
>
>
> The issue of slow seeking in ExplicitColumnTracker was brought up by
> [~vrodionov] on the dev list.
> My idea here is to avoid the seeking if we know that there aren't many
> versions to skip.
> How do we know? We'll use the column family's VERSIONS setting as a hint. If
> VERSIONS is set to 1 (or maybe some value < 10) we'll avoid the seek and call
> SKIP repeatedly.
> HBASE-9769 has some initial numbers for this approach:
> Interestingly it depends on which column(s) is (are) selected.
> Some numbers: 4m rows, 5 cols each, 1 cf, 10 bytes values, VERSIONS=1,
> everything filtered at the server with a ValueFilter. Everything measured in
> seconds.
> Without patch:
> ||Wildcard||Col 1||Col 2||Col 4||Col 5||Col 2+4||
> |6.4|8.5|14.3|14.6|11.1|20.3|
> With patch:
> ||Wildcard||Col 1||Col 2||Col 4||Col 5||Col 2+4||
> |6.4|8.4|8.9|9.9|6.4|10.0|
> Variation here was +- 0.2s.
> So with this patch scanning is 2x faster than without in some cases, and
> never slower. No special hint needed, beyond declaring VERSIONS correctly.