[jira] [Commented] (ACCUMULO-652) support block-based filtering within RFile

Christopher Tubbs (JIRA) Fri, 23 Jan 2015 14:36:54 -0800

    [ https://issues.apache.org/jira/browse/ACCUMULO-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290141#comment-14290141 ]


Christopher Tubbs commented on ACCUMULO-652:
--------------------------------------------

Okay, for now, I've just dropped the fixVersion.

> support block-based filtering within RFile
> ------------------------------------------
>
>                 Key: ACCUMULO-652
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-652
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Adam Fuchs
>            Assignee: Adam Fuchs
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> If we keep some stats about what is in an RFile block, we might be able to 
> implement filters that currently require linear table scans efficiently 
> [O(log N)], with high probability. Two such use cases are timestamp 
> range filtering (e.g. give me everything from last Tuesday) and cell-level 
> security filtering (e.g. give me everything that I can see with my 
> authorizations).
> For the timestamp range filter, we can keep minimum and maximum timestamps 
> across all keys used in a block within the index entry for that block. For 
> the cell-level security filter, we can keep an aggregate label. This could be 
> done using a simplified disjunction of all of the labels in the block. The 
> extra block statistics information can propagate up the index hierarchy as 
> well, giving nice performance characteristics for finding the next matching 
> entry in a file.
> In general, this is a heuristic technique that is good if data tends to 
> naturally cluster in blocks with respect to the way it is queried. Testing 
> its efficacy will require closely emulating real-world use cases -- tests 
> like the continuous ingest test will not be sufficient. We will have to test 
> for a few things:
> # The cost of storing the extra stats in the index is not too high.
> # The performance benefit for common use cases is significant.
> # No unacceptable worst-case behavior is introduced, like bloating 
> the index to ridiculous proportions for any data set.
> Eventually this will all need to be exposed through the Iterator API to be 
> useful, which will be another ticket. 
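
The per-block statistics described in the quoted issue could be sketched roughly as below. This is only an illustration under stated assumptions, not Accumulo code: `BlockStatsSketch`, `BlockStats`, `merge`, `mayContainTimestamp`, `mayBeVisible` and `blocksToRead` are hypothetical names, and the "aggregate label" is simplified to a set of single-token labels rather than real visibility expressions. The same `merge` used to fold keys into a block could also be used to roll child index entries up into a parent entry.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; none of these types exist in Accumulo. */
public class BlockStatsSketch {

    /** Statistics that an index entry might carry for one block (or subtree). */
    public static final class BlockStats {
        public final long minTimestamp;
        public final long maxTimestamp;
        // Assumption: the aggregate label is modeled as the set of distinct
        // single-token labels in the block, read as a disjunction.
        public final Set<String> labels;

        public BlockStats(long minTimestamp, long maxTimestamp, Set<String> labels) {
            this.minTimestamp = minTimestamp;
            this.maxTimestamp = maxTimestamp;
            this.labels = Collections.unmodifiableSet(new HashSet<>(labels));
        }

        /** Stats for a single key. */
        public static BlockStats of(long timestamp, String label) {
            return new BlockStats(timestamp, timestamp, Collections.singleton(label));
        }

        /** Combines two stats: widen the timestamp range and union the labels. */
        public BlockStats merge(BlockStats other) {
            Set<String> union = new HashSet<>(labels);
            union.addAll(other.labels);
            return new BlockStats(Math.min(minTimestamp, other.minTimestamp),
                                  Math.max(maxTimestamp, other.maxTimestamp),
                                  union);
        }

        /** False only if no key in the block can have a timestamp in [start, end]. */
        public boolean mayContainTimestamp(long start, long end) {
            return maxTimestamp >= start && minTimestamp <= end;
        }

        /** False only if no label in the block is empty or held by the reader. */
        public boolean mayBeVisible(Set<String> authorizations) {
            for (String label : labels) {
                if (label.isEmpty() || authorizations.contains(label)) {
                    return true;
                }
            }
            return false;
        }
    }

    /**
     * Returns the positions of blocks that cannot be ruled out by their stats.
     * A flat list is scanned here for clarity; a hierarchical index would apply
     * the same checks to merged stats at each level to skip whole subtrees.
     */
    public static List<Integer> blocksToRead(List<BlockStats> index, long start,
                                             long end, Set<String> authorizations) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < index.size(); i++) {
            BlockStats stats = index.get(i);
            if (stats.mayContainTimestamp(start, end) && stats.mayBeVisible(authorizations)) {
                result.add(i);
            }
        }
        return result;
    }
}
```

Both checks are conservative: a block that passes may still contain no matching key, so the actual filter would still have to run on the blocks that are read. The benefit depends on how well the data clusters by timestamp and label within blocks, as the issue notes.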



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
