Column family search with sparse files is painfully long
--------------------------------------------------------
Key: ACCUMULO-516
URL: https://issues.apache.org/jira/browse/ACCUMULO-516
Project: Accumulo
Issue Type: Bug
Components: tserver
Affects Versions: 1.3.5, 1.4.0
Reporter: John Vines
Assignee: Keith Turner
Priority: Critical
Fix For: 1.4.1
Background: a tablet with 3 files, coming in at ~500MB, 200MB, and ~20MB. One
of the files (I believe the smallest) did not have the column of interest at
all. I ran a query filtering on a column family/qualifier pair. I can scan the
entire table in ~30 minutes, but I aborted a scan for just that column after
2 hours.
Cause: Keith and I investigated. Major compacting the tablet brought the
column scan down to under 7 minutes. Dumping the largest file and grepping for
the column of interest revealed a large dead spot for that column, which took
minutes to grep over. The problem is how we do column family filtering. We
handle colf filtering below the MultiIterator, which handles the merge read
between multiple files. We do it at this level because we keep column info in
the RFile metadata for quick filtering of entire files. The problem here is
that one of the files has that column but has no relevant data over a large
stretch of keys. So every time we seek, which happens for each batch of the
query, we go down below the MultiIterator and seek each file to its first hit
for the column(s) of interest. This means we repeatedly spend minutes fetching
a key that sits so far ahead in the merge order that we won't merge read it
for many, MANY batches.
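
To make the cost pattern concrete, here is a toy Python model of the behavior described above. It is not Accumulo code: FilteredSource, make_files, and scan_in_batches are hypothetical names, keys are reduced to (row, column family) tuples, and "entries read" stands in for seek time. It assumes each batch re-seeks every file just past the last key returned.

```python
import bisect


def make_files():
    # Dense file: every row carries the column family of interest ("cf_x").
    dense = [(row, "cf_x") for row in range(1000)]
    # Sparse file: a long dead spot of other data, then a single "cf_x" key.
    sparse = [(row, "cf_other") for row in range(1000)] + [(1000, "cf_x")]
    return dense, sparse


class FilteredSource:
    """One sorted file with the column family filter applied below the merge."""

    def __init__(self, entries, family):
        self.entries = entries
        self.family = family
        self.pos = 0
        self.entries_read = 0  # stand-in for time spent reading this file

    def _skip_non_matching(self):
        while (self.pos < len(self.entries)
               and self.entries[self.pos][1] != self.family):
            self.pos += 1
            self.entries_read += 1

    def seek(self, start_key):
        # Position at the first key >= start_key, then hunt for the first hit.
        self.pos = bisect.bisect_left(self.entries, start_key)
        self._skip_non_matching()

    def top(self):
        return self.entries[self.pos] if self.pos < len(self.entries) else None

    def next(self):
        self.pos += 1
        self.entries_read += 1
        self._skip_non_matching()


def scan_in_batches(sources, batch_size):
    """Merge-read all sources, re-seeking every source at the start of each batch."""
    results = []
    start_key = (0, "")
    while True:
        for source in sources:
            source.seek(start_key)
        batch = []
        while len(batch) < batch_size:
            live = [s for s in sources if s.top() is not None]
            if not live:
                break
            smallest = min(live, key=lambda s: s.top())
            batch.append(smallest.top())
            smallest.next()
        if not batch:
            return results
        results.extend(batch)
        last_row, last_family = batch[-1]
        # Resume just after the last key returned (exclusive start).
        start_key = (last_row, last_family + "\0")


dense, sparse = make_files()
sources = [FilteredSource(dense, "cf_x"), FilteredSource(sparse, "cf_x")]
hits = scan_in_batches(sources, batch_size=100)
print(len(hits), "hits; sparse file entries read:",
      sources[1].entries_read, "of", len(sparse))
```

In this toy setup the sparse file walks the remainder of its dead spot on every batch, so its read count comes out at several times its own length even though it contributes a single key to the result.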
Proposed Solution: Split the column family filter into two separate pieces.
Keep the RFile-optimized portion, as it can only occur at this level, but move
the actual column family filter for files with that column above the
MultiIterator. This will prevent the constant repetition of a large, painful
seek.
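
A rough Python sketch of the proposed split, under the same kind of simplifying assumptions: keys are (row, column family) tuples, "entries read" stands in for seek time, and FileSource, merged_next, and scan_in_batches are hypothetical names, not Accumulo's iterator API. Files whose metadata lacks the column are dropped up front, and the per-key filter runs above the merge, so a re-seek only positions each file and never hunts through a dead spot.

```python
import bisect


class FileSource:
    """One sorted file, unfiltered; `families` models per-file column metadata."""

    def __init__(self, entries):
        self.entries = entries
        self.families = {family for _, family in entries}
        self.pos = 0
        self.entries_read = 0  # stand-in for time spent reading this file

    def seek(self, start_key):
        # Positioning only: no scanning for the first matching column.
        self.pos = bisect.bisect_left(self.entries, start_key)

    def top(self):
        return self.entries[self.pos] if self.pos < len(self.entries) else None

    def next(self):
        self.pos += 1
        self.entries_read += 1


def merged_next(sources):
    """Pop the smallest key across all sources (the merge read)."""
    live = [s for s in sources if s.top() is not None]
    if not live:
        return None
    smallest = min(live, key=lambda s: s.top())
    key = smallest.top()
    smallest.next()
    return key


def scan_in_batches(sources, family, batch_size):
    # Piece 1: file-level pruning from metadata stays below the merge.
    candidates = [s for s in sources if family in s.families]
    results = []
    start_key = (0, "")
    while True:
        for source in candidates:
            source.seek(start_key)
        batch = []
        while len(batch) < batch_size:
            key = merged_next(candidates)
            if key is None:
                break
            # Piece 2: the per-key column family filter sits above the merge.
            if key[1] == family:
                batch.append(key)
        if not batch:
            return results
        results.extend(batch)
        last_row, last_family = batch[-1]
        # Resume just after the last key returned (exclusive start).
        start_key = (last_row, last_family + "\0")


dense = FileSource([(row, "cf_x") for row in range(1000)])
sparse = FileSource([(row, "cf_other") for row in range(1000)] + [(1000, "cf_x")])
absent = FileSource([(row, "cf_other") for row in range(500)])
hits = scan_in_batches([dense, sparse, absent], "cf_x", batch_size=100)
print(len(hits), "hits; sparse read:", sparse.entries_read,
      "absent read:", absent.entries_read)
```

With the filter above the merge, each entry in the sparse file is read at most once over the whole scan, and the file without the column is never opened for reading. The trade-off in this model is that non-matching entries now flow through the merge instead of being skipped inside each file.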