[
https://issues.apache.org/jira/browse/HBASE-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chao Shi updated HBASE-9000:
----------------------------
Attachment: hbase-9000-port-fb.patch
The attached patch is a port of the linear seek code from the 0.89-fb branch (with
minor changes). I'm not sure whether 20 is a good default value for the maximum
number of linear seeks.
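For readers who have not opened the patch, the idea can be sketched roughly as follows. This is an illustration only, not the code in the attachment: the class LinearReseekSketch, its methods, and the use of plain String keys in a ConcurrentSkipListSet (instead of KeyValues in the real kvset and snapshot) are all assumptions made to keep the example self-contained. The constant 20 mirrors the default discussed above.

```java
import java.util.Iterator;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical stand-in for a memstore scanner; not the actual HBase class.
class LinearReseekSketch {
    // Assumed default for the maximum number of linear steps per reseek.
    static final int MAX_LINEAR_SEEKS = 20;

    private final NavigableSet<String> keys;
    private Iterator<String> it;
    private String current; // key the scanner is positioned on, or null if exhausted

    LinearReseekSketch(NavigableSet<String> keys) {
        this.keys = keys;
        this.it = keys.iterator();
        this.current = it.hasNext() ? it.next() : null;
    }

    String peek() {
        return current;
    }

    // Moves forward to the first key >= target. Returns false if there is none.
    boolean reseek(String target) {
        // Phase 1: step forward linearly, at most MAX_LINEAR_SEEKS times.
        for (int i = 0; i < MAX_LINEAR_SEEKS && current != null; i++) {
            if (current.compareTo(target) >= 0) {
                return true;
            }
            current = it.hasNext() ? it.next() : null;
        }
        if (current == null) {
            return false; // ran off the end during the linear phase
        }
        if (current.compareTo(target) >= 0) {
            return true; // the last linear step landed on or past the target
        }
        // Phase 2: target is still ahead, fall back to a logarithmic seek.
        it = keys.tailSet(target, true).iterator();
        current = it.hasNext() ? it.next() : null;
        return current != null;
    }

    public static void main(String[] args) {
        NavigableSet<String> keys = new ConcurrentSkipListSet<>();
        for (int i = 0; i < 100; i++) {
            keys.add(String.format("row%03d", i));
        }
        LinearReseekSketch scanner = new LinearReseekSketch(keys);
        scanner.reseek("row005"); // near target: resolved in the linear phase
        System.out.println(scanner.peek());
        scanner.reseek("row090"); // far target: falls back to tailSet
        System.out.println(scanner.peek());
    }
}
```

The trade-off is that a nearby target costs only a few iterator steps, while a distant target pays for up to MAX_LINEAR_SEEKS wasted steps before the skip-list lookup.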
Benchmark result:
||operation||trunk||w/ patch||
|reseek to next row|5.92 us|6.71 us|
|reseek to next column|3.735 us|0.569 us|
Configuration:
rows: 100000
columns per row: 10
versions: 3
size of row-key: 8
size of qualifier: 8
size of value: 8
bq. In all fairness, we should not divide the runtime by the number of ops. The
whole point of seeking is to reduce the number of ops
In fact, the cost of next is listed here only for reference (e.g. to tune the
limit of linear seeks) and should not be compared to the costs of reseeks. In our
use case, which scans a single row with a very large offset and a small limit, the
cost of a single reseek is more meaningful, as we can directly multiply it by
the offset. I can understand that in some other cases, the total time may be more
important.
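As a rough illustration of that multiplication: the per-reseek costs below are the "reseek to next column" figures from the table above, on the assumption that this is the operation that dominates when skipping columns within one row. The offset of 10,000 and the class name ReseekCostEstimate are made up for the example.

```java
// Hypothetical helper; only the two cost constants come from the benchmark table.
class ReseekCostEstimate {
    // Cost of one "reseek to next column" in microseconds.
    static final double TRUNK_US = 3.735;
    static final double PATCHED_US = 0.569;

    // Estimated time in milliseconds to skip `offset` columns at `costUs` each.
    static double skipCostMillis(long offset, double costUs) {
        return offset * costUs / 1000.0;
    }

    public static void main(String[] args) {
        long offset = 10_000; // assumed offset, not a measured workload
        System.out.printf("trunk:    %.2f ms%n", skipCostMillis(offset, TRUNK_US));
        System.out.printf("w/ patch: %.2f ms%n", skipCostMillis(offset, PATCHED_US));
    }
}
```

Because both estimates scale linearly with the offset, their ratio stays the same for any offset.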
In any case, the goal of the benchmark program is to evaluate the performance
gain from linear search, so we can compare these numbers with and without the
patch. The percentage of improvement does not change.
I like [~lhofhansl]'s idea of passing a hint from ScanQueryMatcher, which
should also benefit StoreFileScanner. I think we could also save some statistical
information at the time an HFile is written, such as the average #versions or
#columns, which would help us determine whether a "reseek to next row" is really
far enough to justify a reseek.
> Linear reseek in Memstore
> -------------------------
>
> Key: HBASE-9000
> URL: https://issues.apache.org/jira/browse/HBASE-9000
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 0.89-fb
> Reporter: Shane Hogan
> Priority: Minor
> Fix For: 0.89-fb
>
> Attachments: hbase-9000-benchmark-program.patch,
> hbase-9000-port-fb.patch
>
>
> This is to address the linear reseek in MemStoreScanner. Currently reseek
> iterates over the kvset and the snapshot linearly by just calling next
> repeatedly. The new solution is to do this linear seek up to a configurable
> maximum number of times and then, if the seek is not yet complete, fall back
> to a logarithmic seek.