[
https://issues.apache.org/jira/browse/ACCUMULO-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13408156#comment-13408156
]
Adam Fuchs commented on ACCUMULO-652:
-------------------------------------
Status:
* RFile index now includes block stats for column visibility and overall
timestamp range
* System-level iterators and Filters now implement the pass-through optional
filter interface (Filterer)
* ColumnVisibility has been updated to support "or" and "and" operations, and
some basic simplifications to save space
* Unit tests exist and succeed for new MultiLevelIndex and RFile implementations
Remaining work:
# Test backwards compatibility of RFile running on previous index versions
# Write a functional test that only works if the block stat filtering is in
place for both timestamp range and column visibility filters
# Run continuous ingest and wikisearch ingest performance comparisons between
the 652 branch and trunk
# Plumb in the new aggregate column visibility size limit variable
# Use a better encoding for aggregate column visibilities to save space in the
index blocks
# Need a better name for the Filterer interface so as not to confuse it with
Filter
# Need to handle the case in which RFile Readers are reused between scans (or
make this contingent on them not being reused)
# As Todd suggests, we should look into extending the block stats that are
collected to include basic information about included column qualifiers,
values, and maybe column families.
# For future extensibility and backwards compatibility, we should look into
abstracting the block stats provider and making it pluggable
> support block-based filtering within RFile
> ------------------------------------------
>
> Key: ACCUMULO-652
> URL: https://issues.apache.org/jira/browse/ACCUMULO-652
> Project: Accumulo
> Issue Type: Bug
> Reporter: Adam Fuchs
> Assignee: Adam Fuchs
>
> If we keep some stats about what is in an RFile block, we might be able to
> efficiently [O(log N)], with high probability, implement filters that
> currently require linear table scans. Two use cases of this include timestamp
> range filtering (i.e. give me everything from last Tuesday) and cell-level
> security filtering (i.e. give me everything that I can see with my
> authorizations).
> For the timestamp range filter, we can keep minimum and maximum timestamps
> across all keys used in a block within the index entry for that block. For
> the cell-level security filter, we can keep an aggregate label. This could be
> done using a simplified disjunction of all of the labels in the block. The
> extra block statistics information can propagate up the index hierarchy as
> well, giving nice performance characteristics for finding the next matching
> entry in a file.
> In general, this is a heuristic technique that is good if data tends to
> naturally cluster in blocks with respect to the way it is queried. Testing
> its efficacy will require closely emulating real-world use cases -- tests
> like the continuous ingest test will not be sufficient. We will have to test
> for a few things:
> # The cost for storing the extra stats in the index are not too expensive.
> # The performance benefit for common use cases is significant.
> # We shouldn't introduce any unacceptable worst-case behavior, like bloating
> the index to ridiculous proportions for any data set.
> Eventually this will all need to be exposed through the Iterator API to be
> useful, which will be another ticket.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira