optimize/avoid seeking to "previous" block when key you are interested in is
the first one of a block
-----------------------------------------------------------------------------------------------------
Key: HBASE-4443
URL: https://issues.apache.org/jira/browse/HBASE-4443
Project: HBase
Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
This issue primarily affects cases where you are storing large blobs, i.e. when
blocks contain a small number of keys, so the chance that the key you are
looking for is the first key of a block is higher.
Say, you are looking for "row/col", and "row/col/ts=5" is the latest version of
the key in the HFile and is at the beginning of block X.
The search for the key is done by looking for "row/col/ts=Long.MAX_VALUE", but
this will land us in block X-1 (because ts=Long.MAX_VALUE sorts ahead of ts=5),
only to find that there is no matching "row/col" in block X-1; we then
advance to block X to return the value.
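
To make the problem concrete, here is a minimal sketch of a sparse
first-key block index. It is not HBase code: keys are modeled as
(row, col, -ts) tuples so that newer timestamps sort first, and the names
sort_key, block_for_seek and LONG_MAX are hypothetical.

```python
from bisect import bisect_right

LONG_MAX = 2**63 - 1  # stands in for Java's Long.MAX_VALUE


def sort_key(row, col, ts):
    # Within one row/col, newer timestamps sort first, so negate ts.
    return (row, col, -ts)


def block_for_seek(first_keys, row, col, ts=LONG_MAX):
    """Return the block a first-key-only index sends the seek to."""
    target = sort_key(row, col, ts)
    # Pick the last block whose first key is <= target.
    return max(bisect_right(first_keys, target) - 1, 0)


# Block 0 plays the role of X-1; block 1 (X) starts with row/col/ts=5.
first_keys = [sort_key("a", "col", 9), sort_key("row", "col", 5)]
landed = block_for_seek(first_keys, "row", "col")
exact = block_for_seek(first_keys, "row", "col", ts=5)
```

With the default ts the seek lands in block 0 (X-1), even though the only
"row/col" entry is the first key of block 1; only a seek with the exact
timestamp goes straight to block 1.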
Seems like we should be able to optimize this somehow.
Some possibilities:
1) Suppose we track that the file contains no deletes. If the CF setting
has MAX_VERSIONS=1, we then know for sure that block X-1 does not contain any
relevant data, and can position the seek directly at block X. [This will also
require the memstore flusher to remove extra versions if MAX_VERSIONS=1 and not
allow the file to contain duplicate entries for the same ROW/COL.] Tracking
deletes will also, in many cases, avoid the seek to the top of the row to look
for DeleteFamily.
2) Have a dense index (1 entry per KV in the index; this might be ok for the
large object case since the index vs. data ratio will still be low).
3) Have the index contain the last KV of each block in addition to the
first KV. This doubles the size of the index though.
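
Options 1 and 3 could look roughly like the sketch below. It is an
illustration under assumptions, not HBase code: keys are (row, col, -ts)
tuples, the no_deletes and max_versions arguments stand in for per-file and
per-CF metadata, and all function names are hypothetical.

```python
from bisect import bisect_right

LONG_MAX = 2**63 - 1  # stands in for Java's Long.MAX_VALUE


def sort_key(row, col, ts):
    # Within one row/col, newer timestamps sort first, so negate ts.
    return (row, col, -ts)


def default_block(first_keys, target):
    # Last block whose first key is <= target (today's behaviour).
    return max(bisect_right(first_keys, target) - 1, 0)


def seek_single_version(first_keys, row, col, no_deletes, max_versions):
    """Option 1: skip block X-1 when the file holds one version, no deletes."""
    i = default_block(first_keys, sort_key(row, col, LONG_MAX))
    nxt = i + 1
    if (no_deletes and max_versions == 1 and nxt < len(first_keys)
            and first_keys[nxt][:2] == (row, col)):
        # The only version of row/col is the first key of the next block.
        return nxt
    return i


def seek_with_last_keys(first_keys, last_keys, row, col):
    """Option 3: the index also stores each block's last key."""
    target = sort_key(row, col, LONG_MAX)
    i = default_block(first_keys, target)
    if target > last_keys[i] and i + 1 < len(first_keys):
        # Target sorts after everything in block i, so go to the next one.
        i += 1
    return i


first_keys = [sort_key("a", "col", 9), sort_key("row", "col", 5)]
last_keys = [sort_key("a", "col", 1), sort_key("row", "col", 2)]
opt1 = seek_single_version(first_keys, "row", "col", True, 1)
opt1_with_deletes = seek_single_version(first_keys, "row", "col", False, 1)
opt3 = seek_with_last_keys(first_keys, last_keys, "row", "col")
```

In this example both options position the seek directly at block 1 (X).
When the file may contain deletes, option 1 falls back to the current
behaviour and starts at block 0.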