Liyin Tang created HBASE-5987:
---------------------------------
Summary: HFileBlockIndex improvement
Key: HBASE-5987
URL: https://issues.apache.org/jira/browse/HBASE-5987
Project: HBase
Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
We recently found a performance problem: reads are quite slow when multiple
requests read the same data block or index block.
Profiling shows that one cause is IdLock contention, which has been addressed
in HBASE-5898.
Another issue is that the HFileScanner keeps asking the HFileBlockIndex for
the data block location of each target key value during the scan (reSeekTo),
even when the target key value is already in the current data block. This
makes certain index blocks very hot, especially during a sequential scan.
To solve this issue, we propose the following:
First, we propose looking ahead one more block index entry so that the
HFileScanner knows the start key value of the next data block. If the target
key value for the scan (reSeekTo) is "smaller" than the start kv of the next
data block, the target key value is very likely in the current data block (if
it is not in the current data block, then the start kv of the next data block
should be returned. +Indexing on the start key has some defects here+), and
the scanner shall NOT query the HFileBlockIndex in this case. Conversely, if
the target key value is "bigger", it shall query the HFileBlockIndex. This
improvement should reduce the hotness of the HFileBlockIndex and avoid some
unnecessary IdLock contention and index block cache lookups.
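A minimal sketch of the lookahead decision, under simplifying assumptions: keys
are plain Strings, the block index is a sorted array of block start keys, and
the class and method names (LookaheadScannerSketch, reseekTo, lookupBlock) are
hypothetical and not the actual HBase API. It only models when the index is
consulted, not how key values are read from a block.

```java
import java.util.Arrays;

// Hypothetical, simplified model of the proposal; not the real HFileScanner.
class LookaheadScannerSketch {
    private final String[] blockStartKeys; // stand-in for the HFileBlockIndex
    private int currentBlock = 0;
    int indexLookups = 0;                  // counts how often the index is queried

    LookaheadScannerSketch(String[] blockStartKeys) {
        this.blockStartKeys = blockStartKeys;
    }

    // Start key of the block after the current one, or null for the last block.
    private String nextBlockStartKey() {
        int next = currentBlock + 1;
        return next < blockStartKeys.length ? blockStartKeys[next] : null;
    }

    // Stand-in for an index query: the last block whose start key is <= target.
    private int lookupBlock(String target) {
        indexLookups++;
        int pos = Arrays.binarySearch(blockStartKeys, target);
        if (pos >= 0) {
            return pos;
        }
        int insertionPoint = -pos - 1;
        return Math.max(insertionPoint - 1, 0);
    }

    // Forward-only reseek; assumes target is not before the current position.
    int reseekTo(String target) {
        String nextStart = nextBlockStartKey();
        if (nextStart == null || target.compareTo(nextStart) < 0) {
            // Target is "smaller" than the next block's start key:
            // stay in the current block and skip the index query.
            return currentBlock;
        }
        currentBlock = lookupBlock(target);
        return currentBlock;
    }
}
```

With block start keys {"a", "f", "m"}, reseeking to "b" and then "d" stays in
block 0 without touching the index; only a reseek to "g" or beyond triggers a
lookup.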
Second, we propose pushing this idea a little further: the HFileBlockIndex
shall index on the last key value of each data block instead of the start key
value. The motivation is to solve the HBASE-4443 issue (avoid seeking to the
"previous" block when the key you are interested in is the first one of a
block) as well as +the defects mentioned above+.
For example, if the target key value is "smaller" than the start key value of
data block N, there is no way to tell for sure whether the target key value is
in data block N or N-1, so the scan has to seek from data block N-1. However,
if the block index is based on the last key value of each data block and the
target key value is between the last key value of data block N-1 and that of
data block N, then the target key value must be in data block N.
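The last-key lookup described above can be sketched as follows. This is an
illustration only: keys are plain Strings, and LastKeyIndexSketch and blockFor
are hypothetical names, not part of HBase. The index holds the last key of each
data block in sorted order, and the lookup returns the first block whose last
key is greater than or equal to the target.

```java
import java.util.Arrays;

// Hypothetical, simplified model of a block index keyed on each block's last key.
class LastKeyIndexSketch {
    // Returns the first block whose last key is >= target,
    // or -1 if the target is past the last key in the file.
    static int blockFor(String[] blockLastKeys, String target) {
        int pos = Arrays.binarySearch(blockLastKeys, target);
        // A negative result encodes the insertion point as (-(insertionPoint) - 1).
        int block = pos >= 0 ? pos : -pos - 1;
        return block < blockLastKeys.length ? block : -1;
    }
}
```

With last keys {"e", "l", "t"}, a target of "f" lies between the last key of
block 0 and the last key of block 1, so the lookup resolves to block 1 without
visiting block 0.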
As long as HBase supports only forward scans, it makes more sense to index on
the last key value than on the start key value.