[
https://issues.apache.org/jira/browse/HBASE-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121352#comment-13121352
]
[email protected] commented on HBASE-4465:
------------------------------------------------------
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2180/#review2360
-----------------------------------------------------------
Ship it!
This is some really awesome work, Mikhail and Liyin. You guys have taken our
codez to a new level with all the changes you guys are making. And the fake KV
idea is super elegant. Nice job! This is going to be a major performance win
across so many applications.
Bring back the old get!
- Jonathan
On 2011-10-05 18:00:03, Mikhail Bautin wrote:
bq.
bq.
bq. (Updated 2011-10-05 18:00:03)
bq.
bq.
bq. Review request for hbase.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. Previously, if we had several StoreFiles for a column family in a region,
we would seek in each of them and only then merge the results, even though the
row/column we are looking for might only be in the most recent (and the
smallest) file. Now we prioritize reads so that the most recent file is checked
first. This is done with a "lazy seek", which pretends that the next value in
the StoreFile is (seekRow, seekColumn, lastTimestampInStoreFile). Because KVs
with newer timestamps sort first, this fake KV is no later in the KV order than
anything that might actually occur in the file for that row/column. So if we
don't find the result in the files checked earlier, the fake KV bubbles up to
the top of the KV heap and only then is a real seek done. This is expected to
significantly reduce the amount of disk IO (as of 09/22/2011 we are doing dark
launch testing and measurement).
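
The lazy-seek idea described above can be illustrated with a small standalone
sketch. This is not the code from the patch: the class and method names
(LazySeekSketch, FileScanner, lazySeek, enforceSeek) are hypothetical, a key
is reduced to (row, column, timestamp), and an in-memory sorted set stands in
for a StoreFile. Ties between files, which the real code would have to break
somehow, are ignored here.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

class LazySeekSketch {
    /** Simplified key (assumption: real KeyValues carry more fields). */
    static final class Key {
        final String row; final String col; final long ts;
        Key(String row, String col, long ts) { this.row = row; this.col = col; this.ts = ts; }
    }

    /** KV order: row ascending, column ascending, timestamp descending (newest first). */
    static final Comparator<Key> KV_ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c == 0) c = a.col.compareTo(b.col);
        if (c == 0) c = Long.compare(b.ts, a.ts);
        return c;
    };

    /** Hypothetical stand-in for a StoreFile scanner; the sorted set plays the file. */
    static final class FileScanner {
        final TreeSet<Key> cells = new TreeSet<>(KV_ORDER);
        final long maxTs;      // highest timestamp stored in this file
        Key current;           // fake key after lazySeek, real key after enforceSeek
        boolean realSeekDone;
        int realSeeks;         // counts the seeks that would cost disk IO

        FileScanner(Key... keys) {
            cells.addAll(Arrays.asList(keys));
            long max = Long.MIN_VALUE;
            for (Key k : keys) max = Math.max(max, k.ts);
            maxTs = max;
        }

        /** Pretend the next key is (row, col, maxTs); no IO happens here. */
        void lazySeek(String row, String col) {
            current = new Key(row, col, maxTs);
            realSeekDone = false;
        }

        /** Replace the fake key with the first real cell at or after it. */
        void enforceSeek() {
            current = cells.ceiling(current);
            realSeekDone = true;
            realSeeks++;
        }
    }

    /** Returns the newest cell for (row, col), or null if no file has one. */
    static Key get(List<FileScanner> files, String row, String col) {
        PriorityQueue<FileScanner> heap =
            new PriorityQueue<>((a, b) -> KV_ORDER.compare(a.current, b.current));
        for (FileScanner f : files) {
            f.lazySeek(row, col);
            heap.add(f);
        }
        while (!heap.isEmpty()) {
            FileScanner top = heap.poll();
            if (!top.realSeekDone) {
                // A fake key reached the top of the heap: only now do the real seek.
                top.enforceSeek();
                if (top.current != null) heap.add(top);
                continue;
            }
            // A real key on top sorts before every remaining fake lower bound.
            Key k = top.current;
            return (k.row.equals(row) && k.col.equals(col)) ? k : null;
        }
        return null;
    }

    public static void main(String[] args) {
        FileScanner newest = new FileScanner(new Key("row1", "colA", 30));
        FileScanner middle = new FileScanner(new Key("row1", "colA", 10), new Key("row2", "colA", 20));
        FileScanner oldest = new FileScanner(new Key("row1", "colA", 5));
        Key hit = get(Arrays.asList(newest, middle, oldest), "row1", "colA");
        int seeks = newest.realSeeks + middle.realSeeks + oldest.realSeeks;
        System.out.println("ts=" + hit.ts + " realSeeks=" + seeks);
    }
}
```

In this toy setup the lookup is answered from the newest file with a single
real seek; the other two scanners never leave their fake keys. When the cell
exists only in an older file, the fake keys of the newer files reach the top
of the heap first and each is resolved with a real seek, so the result is the
same as with eager seeking.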
bq.
bq. This is joint work with Liyin Tang – huge thanks to him for many helpful
discussions on this and the idea of putting fake KVs with the highest timestamp
of the StoreFile in the scanner priority queue.
bq.
bq.
bq. This addresses bug HBASE-4465.
bq. https://issues.apache.org/jira/browse/HBASE-4465
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. src/main/java/org/apache/hadoop/hbase/KeyValue.java aa34006
bq. src/main/java/org/apache/hadoop/hbase/regionserver/AbstractKeyValueScanner.java 94ddce7
bq. src/main/java/org/apache/hadoop/hbase/regionserver/ColumnCount.java 1be0280
bq. src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java b8d33e8
bq. src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueScanner.java fbcd276
bq. src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java 035f765
bq. src/main/java/org/apache/hadoop/hbase/regionserver/NonLazyKeyValueScanner.java PRE-CREATION
bq. src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java dad278a
bq. src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java abb5931
bq. src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java 31bfea7
bq. src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java 64a6e3e
bq. src/main/java/org/apache/hadoop/hbase/util/CollectionBackedScanner.java 8ad5aab
bq. src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksRead.java b3beabb
bq. src/test/java/org/apache/hadoop/hbase/regionserver/TestMemStore.java 9d2b2a7
bq.
bq. Diff: https://reviews.apache.org/r/2180/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. All unit tests should be passing now. Will rebase and re-run them just in
case.
bq.
bq.
bq. Thanks,
bq.
bq. Mikhail
bq.
bq.
> Lazy-seek optimization for StoreFile scanners
> ---------------------------------------------
>
> Key: HBASE-4465
> URL: https://issues.apache.org/jira/browse/HBASE-4465
> Project: HBase
> Issue Type: Improvement
> Reporter: Mikhail Bautin
> Assignee: Mikhail Bautin
> Labels: optimization, seek
> Fix For: 0.89.20100924, 0.94.0
>