Adam Fuchs created ACCUMULO-652:
-----------------------------------
Summary: support block-based filtering within RFile
Key: ACCUMULO-652
URL: https://issues.apache.org/jira/browse/ACCUMULO-652
Project: Accumulo
Issue Type: Bug
Reporter: Adam Fuchs
Assignee: Adam Fuchs
If we keep some statistics about the contents of each RFile block, we might be
able to implement, efficiently [O(log N)] and with high probability, filters
that currently require linear table scans. Two use cases are timestamp range
filtering (e.g. give me everything from last Tuesday) and cell-level security
filtering (e.g. give me everything that I can see with my authorizations).
For the timestamp range filter, we can keep the minimum and maximum timestamps
across all keys in a block within the index entry for that block. For the
cell-level security filter, we can keep an aggregate label. This could be done
using a simplified disjunction of all of the labels in the block. The extra
block statistics can propagate up the index hierarchy as well, so that a seek
for the next matching entry in a file can skip whole index subtrees that
cannot match.
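As a rough illustration of the timestamp case, here is a minimal sketch of
min/max statistics kept per block and merged up a toy index hierarchy. It does
not use the RFile code; IndexNode, mayOverlap, scan and demoTree are
hypothetical names invented for this example, and the blocks hold bare
timestamps rather than real keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockStatsSketch {

    /** Hypothetical index entry carrying min/max timestamps for everything beneath it. */
    static final class IndexNode {
        final long minTs;
        final long maxTs;
        final List<IndexNode> children; // null for a leaf (a data block)
        final long[] timestamps;        // leaf only: timestamps of the keys in the block

        /** Leaf entry: stats are computed from the block's own keys. */
        IndexNode(long[] timestamps) {
            long lo = Long.MAX_VALUE, hi = Long.MIN_VALUE;
            for (long t : timestamps) {
                lo = Math.min(lo, t);
                hi = Math.max(hi, t);
            }
            this.minTs = lo;
            this.maxTs = hi;
            this.children = null;
            this.timestamps = timestamps;
        }

        /** Interior entry: stats are merged up from the child entries. */
        IndexNode(List<IndexNode> children) {
            long lo = Long.MAX_VALUE, hi = Long.MIN_VALUE;
            for (IndexNode c : children) {
                lo = Math.min(lo, c.minTs);
                hi = Math.max(hi, c.maxTs);
            }
            this.minTs = lo;
            this.maxTs = hi;
            this.children = children;
            this.timestamps = new long[0];
        }

        /** False means nothing beneath this entry can fall inside [start, end]. */
        boolean mayOverlap(long start, long end) {
            return minTs <= end && maxTs >= start;
        }
    }

    /** Collects matching timestamps into out; returns how many data blocks were read. */
    static int scan(IndexNode node, long start, long end, List<Long> out) {
        if (!node.mayOverlap(start, end)) {
            return 0; // prune the whole subtree without touching its blocks
        }
        if (node.children == null) {
            for (long t : node.timestamps) {
                if (t >= start && t <= end) {
                    out.add(t);
                }
            }
            return 1;
        }
        int read = 0;
        for (IndexNode c : node.children) {
            read += scan(c, start, end, out);
        }
        return read;
    }

    /** Four made-up blocks under two interior entries and a root. */
    static IndexNode demoTree() {
        IndexNode left = new IndexNode(Arrays.asList(
                new IndexNode(new long[] {10, 12, 15}),
                new IndexNode(new long[] {20, 22})));
        IndexNode right = new IndexNode(Arrays.asList(
                new IndexNode(new long[] {100, 105, 110}),
                new IndexNode(new long[] {21, 200}))); // poorly clustered block
        return new IndexNode(Arrays.asList(left, right));
    }

    public static void main(String[] args) {
        List<Long> out = new ArrayList<>();
        int read = scan(demoTree(), 100, 120, out);
        System.out.println("matches=" + out + " blocksRead=" + read);
    }
}
```

For the range [100, 120] the left subtree is skipped from its merged stats
alone, while the block holding 21 and 200 is still read even though it
contains no match. That false positive is the heuristic part: the stats only
help when timestamps cluster within blocks. An aggregate label could be
carried in the same entries and merged upward in the same way.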
In general, this is a heuristic technique that is good if data tends to
naturally cluster in blocks with respect to the way it is queried. Testing its
efficacy will require closely emulating real-world use cases -- tests like the
continuous ingest test will not be sufficient. We will have to test for a few
things:
# The cost of storing the extra stats in the index is not too high.
# The performance benefit for common use cases is significant.
# No unacceptable worst-case behavior is introduced, like bloating the index
to ridiculous proportions for some data set.
Eventually this will all need to be exposed through the Iterator API to be
useful, which will be another ticket.