[
https://issues.apache.org/jira/browse/HBASE-14509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945304#comment-14945304
]
stack commented on HBASE-14509:
-------------------------------
bq. We could add a method to filter, which is passed an HFile or a FileInfo or
something, and based on that gets to decide whether to include the HFile or
not.
Is the Filter interface, like CPs (coprocessors), operating at too high a level
for ruling hfiles in or out?
bq. The other question is whether HFile is too large of a unit.
On whether an hfile is too large a unit: the block is the next natural
construct. A BF of CQs per block, so we can skip blocks at a time? The sparse
index would go into the current block index as ancillary data rather than being
added at the head of each data block... We already load the hfile index... A BF
per CQ or min/max could be part of this?
bq. Or we punt and just add the building blocks:
Sounds like extra config/options to me... so no (smile). Could we start small?
Add extra generic info to the index -- a BF or min/max -- just so we can skip
blocks as we scan? min/max per hfile would be useful too... so we could skip a
whole hfile (would be a rare event but great when it happens).
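
To make the "start small" idea concrete, here is a rough sketch of skipping blocks by min/max kept alongside block index entries. Everything in it is hypothetical: BlockSkipSketch, BlockEntry, blocksToRead and fileMayContain are illustration-only names, not HBase classes or APIs, values are compared as plain Strings rather than byte[], and it assumes every block holds at least one value for the CQ in question.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only: none of these types exist in HBase. */
public class BlockSkipSketch {

    /** One block-index entry carrying ancillary min/max for a single CQ. */
    static final class BlockEntry {
        final long offset;  // where the data block starts in the file
        final String min;   // smallest value of the CQ seen in this block
        final String max;   // largest value of the CQ seen in this block

        BlockEntry(long offset, String min, String max) {
            this.offset = offset;
            this.min = min;
            this.max = max;
        }

        /** True if the block may hold a value in [lo, hi]; false means skip it. */
        boolean mayContain(String lo, String hi) {
            return max.compareTo(lo) >= 0 && min.compareTo(hi) <= 0;
        }
    }

    /** Offsets of the blocks a scan for values in [lo, hi] still has to read. */
    static List<Long> blocksToRead(List<BlockEntry> index, String lo, String hi) {
        List<Long> offsets = new ArrayList<>();
        for (BlockEntry e : index) {
            if (e.mayContain(lo, hi)) {
                offsets.add(e.offset);
            }
        }
        return offsets;
    }

    /** Whole-file check: if no block can match, the hfile can be skipped. */
    static boolean fileMayContain(List<BlockEntry> index, String lo, String hi) {
        for (BlockEntry e : index) {
            if (e.mayContain(lo, hi)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<BlockEntry> index = new ArrayList<>();
        index.add(new BlockEntry(0L, "apple", "fig"));
        index.add(new BlockEntry(65536L, "grape", "mango"));
        index.add(new BlockEntry(131072L, "nectarine", "plum"));

        // Only the middle block overlaps [kiwi, lemon].
        System.out.println(blocksToRead(index, "kiwi", "lemon"));
        // No block overlaps [quince, zucchini], so the whole file is skipped.
        System.out.println(fileMayContain(index, "quince", "zucchini"));
    }
}
```

A BF per block would answer point lookups with the same skip/read decision; min/max additionally handles range predicates, which is why the sketch uses it.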
> Configurable sparse indexes?
> ----------------------------
>
> Key: HBASE-14509
> URL: https://issues.apache.org/jira/browse/HBASE-14509
> Project: HBase
> Issue Type: Brainstorming
> Reporter: Lars Hofhansl
>
> This idea just popped up today and I wanted to record it for discussion:
> What if we kept sparse column indexes per region or HFile or per configurable
> range?
> I.e. For any given CQ we record the lowest and highest value for a particular
> range (HFile, Region, or a custom range like the Phoenix guide post).
> By tweaking the size of these ranges we can control the size of the index, vs
> its selectivity.
> For example if we kept it by HFile we can almost instantly decide whether we
> need to scan a particular HFile at all to find a particular value in a Cell.
> We can also collect min/max values for each n MB of data, for example when we
> scan the region the first time. Assuming ranges are large enough we can always
> keep the index in memory together with the region.
> Kind of a sparse local index. Might be much easier than the buddy region stuff
> we've been discussing.
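
The index-size vs selectivity trade-off in the description can be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: SparseIndexSketch, Range, build and rowsToScan are made-up names, values are longs in row order for a single CQ, and a "range" is simply a fixed number of consecutive rows rather than n MB of data.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only: not an HBase or Phoenix API. */
public class SparseIndexSketch {

    /** min/max of one CQ over rows [start, end) of a region or HFile. */
    static final class Range {
        final int start;
        final int end;
        final long min;
        final long max;

        Range(int start, int end, long min, long max) {
            this.start = start;
            this.end = end;
            this.min = min;
            this.max = max;
        }
    }

    /** Builds one index entry per rangeSize consecutive rows. */
    static List<Range> build(long[] values, int rangeSize) {
        List<Range> index = new ArrayList<>();
        for (int start = 0; start < values.length; start += rangeSize) {
            int end = Math.min(start + rangeSize, values.length);
            long min = values[start];
            long max = values[start];
            for (int i = start + 1; i < end; i++) {
                min = Math.min(min, values[i]);
                max = Math.max(max, values[i]);
            }
            index.add(new Range(start, end, min, max));
        }
        return index;
    }

    /** Number of rows that still must be scanned to find target. */
    static int rowsToScan(List<Range> index, long target) {
        int rows = 0;
        for (Range r : index) {
            if (r.min <= target && target <= r.max) {
                rows += r.end - r.start;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        long[] values = {5, 7, 42, 40, 9, 3, 41, 8};

        List<Range> coarse = build(values, 4); // small index, less selective
        List<Range> fine = build(values, 2);   // larger index, more selective

        System.out.println(coarse.size() + " entries, scan "
                + rowsToScan(coarse, 41) + " rows");
        System.out.println(fine.size() + " entries, scan "
                + rowsToScan(fine, 41) + " rows");
    }
}
```

With these sample values, halving the range size doubles the number of index entries and halves the rows that must be scanned for the value 41; real data need not behave this neatly.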