[
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276211#comment-13276211
]
Phabricator commented on HBASE-5987:
------------------------------------
tedyu has commented on the revision "[jira][89-fb] [HBASE-5987] HFileBlockIndex
improvement".
INLINE COMMENTS
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:411 'is to
keep' -> 'keeps'
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:415 'it
means it' -> 'it means that'
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:205
Please add javadoc for the last three parameters
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:208 Can
this method be named getDataBlockInfo() ?
For 'seekTo', I think DataBlock would be the target, not DataBlockInfo.
See comment below w.r.t. naming of DataBlockInfo
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:196
'other attributes' -> 'additional attributes' ?
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:293 'Only
' can be removed.
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockInfo.java:2 No year,
please.
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:306 Can
we use builder pattern to fill out nextIndexedKey ?
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockInfo.java:26 Would
HFileBlockWithInfo be a better name ?
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:480 Should
this be '< 0' ?
src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:2
Please remove year.
src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:44
Please add test category.
REVISION DETAIL
https://reviews.facebook.net/D3237
To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu
> HFileBlockIndex improvement
> ---------------------------
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
> Issue Type: Improvement
> Reporter: Liyin Tang
> Assignee: Liyin Tang
> Attachments: D3237.1.patch,
> screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we found a performance problem: reads are quite slow when
> multiple requests read the same block of data or index.
> Profiling shows that one cause is IdLock contention, which has been
> addressed in HBASE-5898.
> Another issue is that the HFileScanner keeps asking the HFileBlockIndex
> for the data block location of each target key value during the scan
> (reSeekTo), even when the target key value is already in the
> current data block. This makes certain index blocks very hot,
> especially during a sequential scan.
> To solve this issue, we propose the following:
> First, we propose to look ahead one more block index entry so that the
> HFileScanner knows the start key value of the next data block. If the
> target key value for the scan (reSeekTo) is "smaller" than the start kv of
> the next data block, the target key value is very likely in
> the current data block (if it is not in the current data block, the start kv
> of the next data block should be returned. +Indexing on the start key has some
> defects here+), and the scanner shall NOT query the HFileBlockIndex in this
> case. On the contrary, if the target key value is "bigger", then it shall
> query the HFileBlockIndex. This improvement should reduce the hotness of the
> HFileBlockIndex and avoid some unnecessary IdLock contention and index block
> cache lookups.
> Second, we propose to push this idea a little further: the
> HFileBlockIndex shall index on the last key value of each data block instead
> of the start key value. The motivation is to solve the HBASE-4443
> issue (avoid seeking to the "previous" block when the key you are interested
> in is the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of
> data block N, there is no way to know for sure whether the target key value
> is in data block N or N-1, so the scanner has to seek from data block N-1.
> However, if the block index is based on the last key value of each data block
> and the target key value is between the last key value of data block N-1 and
> that of data block N, then the target key value must be in data block N.
> As long as HBase only supports forward scans, it makes
> more sense to index on the last key value than on the start key value.
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.
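
The two proposals in the description above can be illustrated with a small, self-contained sketch. It is not the HFileBlockIndex code from the patch: the class BlockIndexSketch, its methods, and the use of plain String keys are all hypothetical stand-ins. It assumes a forward scan over sorted targets, one sorted index key per data block, and that the scanner stays in the current block whenever the target is "smaller" than the first key of the next block.

```java
import java.util.Arrays;

// Hypothetical stand-in for a block index; not the HFileBlockIndex API.
class BlockIndexSketch {

    // First-key index: returns the last block whose first key <= target,
    // or -1 if the target sorts before every block.
    public static int seekByFirstKey(String[] firstKeys, String target) {
        int pos = Arrays.binarySearch(firstKeys, target);
        return pos >= 0 ? pos : -(pos + 1) - 1;
    }

    // Last-key index: returns the first block whose last key >= target,
    // or -1 if the target sorts after every block.
    public static int seekByLastKey(String[] lastKeys, String target) {
        int pos = Arrays.binarySearch(lastKeys, target);
        if (pos >= 0) {
            return pos;
        }
        int insertion = -(pos + 1);
        return insertion < lastKeys.length ? insertion : -1;
    }

    // Counts how often a forward scan consults the index. With lookahead,
    // the scan remembers the first key of the next block and skips the
    // index while the target is still smaller than that key.
    public static int countIndexLookups(String[] firstKeys,
                                        String[] sortedTargets,
                                        boolean useLookahead) {
        int lookups = 0;
        int currentBlock = -1; // no block loaded yet
        for (String target : sortedTargets) {
            boolean canSkip = useLookahead
                && currentBlock >= 0
                && (currentBlock == firstKeys.length - 1
                    || target.compareTo(firstKeys[currentBlock + 1]) < 0);
            if (!canSkip) {
                currentBlock = seekByFirstKey(firstKeys, target);
                lookups++;
            }
        }
        return lookups;
    }

    public static void main(String[] args) {
        // Three blocks holding the keys a-c, d-f and g-h.
        String[] firstKeys = {"a", "d", "g"};
        String[] lastKeys = {"c", "f", "h"};
        String[] targets = {"a", "b", "c", "d", "e", "f", "g", "h"};

        System.out.println(countIndexLookups(firstKeys, targets, false)); // 8
        System.out.println(countIndexLookups(firstKeys, targets, true));  // 3

        // "cc" falls in the gap between block 0 and block 1.
        System.out.println(seekByFirstKey(firstKeys, "cc")); // 0
        System.out.println(seekByLastKey(lastKeys, "cc"));   // 1
    }
}
```

In this toy layout the sequential scan consults the index once per block instead of once per key. For the gap key "cc", the first-key index lands on block 0, so a scanner would have to read that block and then step forward, while the last-key index points directly at block 1, the block holding the next key at or after the target.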