[ 
https://issues.apache.org/jira/browse/HBASE-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925706#action_12925706
 ] 

Todd Lipcon commented on HBASE-3162:
------------------------------------

Another way of attacking this problem is a bit more general - something I've 
been thinking about for a while but don't think I ever posted.

Right now we have the ability to do bloom filters on store files to decide 
whether a key exists in the file. It would be useful to add the ability to do a 
bloom filter on a _function_ of the key. In other words, right now, we check:
{code}
if (bloomFilter.mightContain(key)) { look in file }
{code}
but instead we could check:
{code}
if (bloomFilter.mightContain(function(key))) { look in file }
{code}
so that the current implementation is just the special case where the function 
is the identity function.

Getting back to the JIRA at hand, the idea is the following: if you are 
sharding your counters by time, then the key would contain some time 
information. E.g., you might have the counter pageid_1234_20101027_hits to track 
page views for a given day. With current blooms we'd need a lot of bits in the 
bloom filter to get a good false positive rate, since every counter key is 
distinct. If instead the blooms were on just the "20101027" portion of the key, 
there would be very few unique values per file, so the bloom could rule out 
almost every file that holds nothing for that day, with very little overhead.
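
To make the idea concrete, here is a rough, self-contained sketch in plain Java. 
It is not HBase code: the class name KeyFunctionBloomFilter, the datePortion 
helper, the String keys (HBase keys are byte arrays), and the simple double 
hashing are all assumptions made for illustration. It only shows how a bloom 
could be built over function(key) instead of the key itself.

```java
import java.util.BitSet;
import java.util.function.Function;

/**
 * Illustrative sketch only (hypothetical class, not part of HBase):
 * a Bloom filter that indexes function(key) rather than the key itself.
 */
class KeyFunctionBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;
    private final Function<String, String> keyFunction;

    KeyFunctionBloomFilter(int numBits, int numHashes,
                           Function<String, String> keyFunction) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.keyFunction = keyFunction;
    }

    /** Records function(key) in the filter. */
    void add(String key) {
        String value = keyFunction.apply(key);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    /** False means function(key) was definitely never added; true means "maybe". */
    boolean mightContain(String key) {
        String value = keyFunction.apply(key);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    /** Double hashing: the i-th bit position is (h1 + i * h2) mod numBits. */
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /**
     * Hypothetical key function for keys shaped like
     * pageid_1234_20101027_hits: returns the date field ("20101027").
     */
    static String datePortion(String key) {
        return key.split("_")[2];
    }
}
```

Passing k -> k as the key function gives today's behavior (the identity case). 
Passing KeyFunctionBloomFilter::datePortion means that once 
pageid_1234_20101027_hits has been added, every other counter for 20101027 also 
gets a "maybe", while a "no" means the file holds nothing at all for that day.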

Thoughts?

> Add TimeRange support into Increment to optimize for counters that are 
> partitioned on time
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3162
>                 URL: https://issues.apache.org/jira/browse/HBASE-3162
>             Project: HBase
>          Issue Type: Improvement
>          Components: client, regionserver
>    Affects Versions: 0.90.0
>            Reporter: Jonathan Gray
>            Priority: Minor
>
> In many use cases of increments, a given counter is only incremented during a 
> specific window of time (i.e. the counters are partitioned/sharded by time).
> With this kind of schema, you are constantly creating new counters.  When a 
> new counter is "created" (incremented the first time) you will always end up 
> looking at a block from every file in the region because no previous value 
> will exist.  However, with the new TimeRange optimizations that skip files if 
> they don't contain values of the TimeRange you're interested in, we could 
> utilize that information to optimize the Get within the increment.
> This would be optional and an addition to the Increment class.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
