[
https://issues.apache.org/jira/browse/ACCUMULO-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Keith Turner resolved ACCUMULO-516.
-----------------------------------
Resolution: Fixed
> Column family search with sparse files is painfully long
> --------------------------------------------------------
>
> Key: ACCUMULO-516
> URL: https://issues.apache.org/jira/browse/ACCUMULO-516
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Affects Versions: 1.4.0, 1.3.5
> Reporter: John Vines
> Assignee: Keith Turner
> Priority: Critical
> Fix For: 1.5.0, 1.4.1
>
>
> Background: a tablet with 3 files, coming in at ~500MB, 200MB, and ~20MB. One
> of the files (I believe the smallest) did not have the column of interest at
> all. I was running a query filtering on a column family/qualifier pair. I can
> scan the entirety of the table in ~30 minutes. I aborted a scan for just that
> column after 2 hours.
> Cause: Keith and I investigated. Major compacting the tablet brought a column
> scan down to under 7 minutes. Dumping the largest file and grepping for the
> column of interest revealed a large dead spot for that column, which took
> minutes to grep over. After looking it over, the problem is how we do column
> family filtering. We handle colf filtering below the multi-iterator, which
> handles the merge read between multiple files. We do it at this level because
> we keep column info in the RFile metadata for quick filtering of entire
> files. The problem here is that one of the files has that column, but does
> not have any relevant data over a large stretch. So every time we seek, which
> is for each batch of the query, we go down to the multi-iterator and seek for
> the first hit of the column(s) of interest. This means we are constantly
> spending minutes grabbing a key of interest that sits so far down in the
> stack that we won't merge read it for many, MANY batches.
> Proposed Solution: Split the column family filter into two separate pieces.
> Keep the RFile optimized portion, as it can only occur at this level. But
> move the actual column family filter for files with that column above the
> MultiIterator. This will prevent this constant repetition of a large, painful
> seek.
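
The effect described in the report can be reproduced with a toy model. The
sketch below is not Accumulo code: FileSource, scan, file_a and file_b are
hypothetical names made up for illustration, the "files" are in-memory lists
of (row, column family) keys, and a seek is assumed to be a free index lookup.
It only counts how many entries get read when the column family filter sits
below the merge (each file skips to its own first hit on every batch seek)
versus above the merge (files are merged first, then filtered).

```python
from bisect import bisect_right


class FileSource:
    """Toy stand-in for one sorted file; counts every entry it reads past."""

    def __init__(self, entries):
        self.entries = sorted(entries)  # keys are (row, column_family) tuples
        self.pos = 0
        self.reads = 0

    def seek(self, after_key):
        # Assumed to be a cheap index lookup: position just past after_key.
        self.pos = 0 if after_key is None else bisect_right(self.entries, after_key)

    def peek(self):
        return self.entries[self.pos] if self.pos < len(self.entries) else None

    def advance(self):
        self.reads += 1
        self.pos += 1

    def skip_unwanted(self, wanted):
        # Filter below the merge: walk forward to this file's next hit.
        while self.peek() is not None and self.peek()[1] not in wanted:
            self.advance()


def scan(files, wanted, batch_size, filter_below):
    """Return (matching keys, total entries read), re-seeking on every batch."""
    sources = [FileSource(f) for f in files]
    results, last = [], None
    while True:
        for s in sources:
            s.seek(last)  # each batch starts with a fresh seek of every file
            if filter_below:
                s.skip_unwanted(wanted)
        batch = []
        while len(batch) < batch_size:
            live = [s for s in sources if s.peek() is not None]
            if not live:
                break
            src = min(live, key=lambda s: s.peek())  # merge read: smallest key
            key = src.peek()
            src.advance()
            if filter_below:
                src.skip_unwanted(wanted)
                batch.append(key)
            elif key[1] in wanted:  # filter above the merge
                batch.append(key)
        results.extend(batch)
        if len(batch) < batch_size:
            return results, sum(s.reads for s in sources)
        last = batch[-1]


# file_a has column "x", but only after a long dead spot of "y" entries.
file_a = [(r, "y") for r in range(1000)] + [(r, "x") for r in range(1000, 1010)]
# file_b has "x" densely at the start of the key range.
file_b = [(r, "x") for r in range(100)]

if __name__ == "__main__":
    for below in (True, False):
        keys, reads = scan([file_a, file_b], {"x"}, batch_size=10, filter_below=below)
        label = "below merge" if below else "above merge"
        print(f"filter {label}: {len(keys)} keys returned, {reads} entries read")
```

In this model both placements return the same keys. With the filter below the
merge, file_a walks its dead spot again on every batch seek, while with the
filter above the merge each entry is read once as the merge advances. The toy
does not model the RFile metadata check that lets whole files be skipped,
which the proposal keeps at the lower level.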