[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276941#comment-13276941
 ] 

Phabricator commented on HBASE-5987:
------------------------------------

mbautin has commented on the revision "[jira][89-fb] [HBASE-5987] 
HFileBlockIndex improvement".

  Looks good! A few minor comments inline. Also, please submit the diff with 
lint (using "arc diff --preview" instead of "arc diff --only").

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/HConstants.java:545 Please add a 
comment that the actual value is irrelevant because this is always compared by 
reference.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:437-440 
This documentation is still confusing. Is i "the ith position", or is the 
actual key "the ith position"? I would say i is the "position" and the returned 
key is the "key at the ith position".
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:413 Clarify 
the meaning of "is equal", i.e. that it must be exactly the same object, not 
just an equal byte array.
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:63 
This is unnecessary (we don't use compression by default).
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:77 
It is not "schemMetricSnapshot", it is "schemaMetricSnapshot" ("schem" is not a 
word).
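
The two comments about comparison "by reference" rest on the difference
between identity and content equality for byte arrays. A minimal sketch of
that difference follows; the constant name END_OF_INDEX_MARKER and the class
name are hypothetical and are not taken from the patch.

```java
import java.util.Arrays;

public class SentinelKeySketch {
    // Hypothetical marker constant. Its contents are irrelevant because
    // callers only ever test it with ==, never with Arrays.equals.
    static final byte[] END_OF_INDEX_MARKER = new byte[0];

    // True only when the caller passes the marker object itself.
    static boolean isMarker(byte[] key) {
        return key == END_OF_INDEX_MARKER;
    }

    public static void main(String[] args) {
        byte[] lookalike = new byte[0];
        // Same contents as the marker, but a different object.
        System.out.println(Arrays.equals(lookalike, END_OF_INDEX_MARKER));
        System.out.println(isMarker(lookalike));
        System.out.println(isMarker(END_OF_INDEX_MARKER));
    }
}
```

An array that merely has the same bytes is not treated as the marker, which
is why the documentation should say "the same object" rather than "is equal".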

REVISION DETAIL
  https://reviews.facebook.net/D3237

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu

                
> HFileBlockIndex improvement
> ---------------------------
>
>                 Key: HBASE-5987
>                 URL: https://issues.apache.org/jira/browse/HBASE-5987
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Liyin Tang
>            Assignee: Liyin Tang
>         Attachments: D3237.1.patch, D3237.2.patch, 
> screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we found a performance problem: reads are quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is IdLock contention, which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner keeps asking the HFileBlockIndex 
> for the data block location of each target key value during the scan 
> process (reSeekTo), even when the target key value is already in the 
> current data block. This makes certain index blocks very HOT, 
> especially during a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to look ahead one more block index entry so that the 
> HFileScanner knows the start key value of the next data block. If the 
> target key value for the scan (reSeekTo) is "smaller" than the start kv of 
> the next data block, the target key value is very likely in 
> the current data block (if it is not in the current data block, then the 
> start kv of the next data block should be returned. +Indexing on the start 
> key has some defects here+), and the scanner shall NOT query the 
> HFileBlockIndex in this case. On the contrary, if the target key value is 
> "bigger", then it shall query the HFileBlockIndex. This improvement should 
> help to reduce the hotness of the HFileBlockIndex and avoid some 
> unnecessary IdLock contention or index block cache lookups.
> Second, we propose to push this idea a little further: the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to the "previous" block when the key you are interested 
> in is the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> data block N, there is no way to know for sure whether the target key value 
> is in data block N or N-1, so the scanner has to seek from data block N-1. 
> However, if the block index is based on the last key value of each data 
> block and the target key value is between the last key value of data block 
> N-1 and that of data block N, then the target key value can only be in data 
> block N. 
> As long as HBase only supports forward scans, it makes more sense to index 
> on the last key value than on the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.
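
The two proposals in the quoted description can be sketched as follows. This
is only an illustration under simplifying assumptions: keys are plain Strings
rather than KeyValues, the index is a single sorted array, and every class
and method name here is hypothetical rather than part of HBase.

```java
import java.util.Arrays;

public class BlockIndexSketch {
    // Start-key index: the block whose start key is the greatest one that is
    // <= target, or -1 if target sorts before every block.
    static int findByStartKey(String[] startKeys, String target) {
        int pos = Arrays.binarySearch(startKeys, target);
        return pos >= 0 ? pos : -pos - 2; // (insertion point) - 1
    }

    // Last-key index: the first block whose last key is >= target, or -1 if
    // target sorts after every block.
    static int findByLastKey(String[] lastKeys, String target) {
        int pos = Arrays.binarySearch(lastKeys, target);
        int idx = pos >= 0 ? pos : -pos - 1; // insertion point
        return idx < lastKeys.length ? idx : -1;
    }

    // Lookahead check from the first proposal: when the target sorts before
    // the start key of the next block, the reseek can stay in the current
    // block without consulting the index.
    static boolean canSkipIndexLookup(String target, String nextBlockStartKey) {
        return target.compareTo(nextBlockStartKey) < 0;
    }

    public static void main(String[] args) {
        // Three blocks holding the key ranges a..c, e..g and k..m.
        String[] startKeys = {"a", "e", "k"};
        String[] lastKeys = {"c", "g", "m"};

        // "d" falls in the gap between block 0 and block 1.
        System.out.println(findByStartKey(startKeys, "d")); // block 0
        System.out.println(findByLastKey(lastKeys, "d"));   // block 1

        // Scanner positioned in block 1; the next block starts at "k".
        System.out.println(canSkipIndexLookup("f", "k")); // true
        System.out.println(canSkipIndexLookup("l", "k")); // false
    }
}
```

For the gap key "d", the start-key index points at block 0, so the scanner
has to walk forward into block 1, while the last-key index points directly at
block 1, where the next key at or after "d" lives.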

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
