[ 
https://issues.apache.org/jira/browse/ACCUMULO-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Turner updated ACCUMULO-516:
----------------------------------

    Fix Version/s: 1.5.0
    
> Column family search with sparse files is painfully long
> --------------------------------------------------------
>
>                 Key: ACCUMULO-516
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-516
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.4.0, 1.3.5
>            Reporter: John Vines
>            Assignee: Keith Turner
>            Priority: Critical
>             Fix For: 1.5.0, 1.4.1
>
>
> Background: a tablet with 3 files, coming in at ~500MB, 200MB, and ~20MB. One 
> of the files (I believe the smallest) did not have the column of interest at 
> all. The query filters on a column family/qualifier pair. I can scan the 
> entirety of the table in ~30 minutes, but I aborted a scan for just that 
> column after 2 hours.
> Cause: Keith and I investigated. Major compacting the tablet brought a 
> column scan down to under 7 minutes. Dumping the largest file and grepping 
> for the column of interest revealed a large dead spot for that column, which 
> took minutes to grep over. After looking it over, the problem is how we do 
> column family filtering. We handle colf filtering below the multi-iterator, 
> which handles the merge read between multiple files. We do it at this level 
> because we keep column info in the RFile metadata for quick filtering of 
> entire files. The problem here is that one of the files has that column but 
> has no relevant data across a large range of keys. So every time we seek, 
> which happens for each batch of the query, we go down to the multi-iterator 
> and seek for the first hit of the column(s) of interest. This means we 
> repeatedly spend minutes fetching a key of interest that sits so far ahead 
> in that file that we won't merge read it for many, MANY batches.
> Proposed Solution: Split the column family filter into two separate pieces. 
> Keep the RFile-optimized portion, since skipping entire files can only 
> happen at that level. But move the actual column family filter for files 
> that contain the column above the MultiIterator. This will prevent the 
> constant repetition of a large, painful seek.
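
To make the cost difference concrete, here is a minimal, self-contained sketch of the two filter placements. It is not Accumulo code: Entry, Source, scan, and demoSources are hypothetical names, the "files" are sorted in-memory lists, rows are assumed unique across files, and an index seek is assumed to be free so that only per-entry family checks are counted. The families set stands in for the per-file column metadata that lets a file without the column be skipped entirely.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ColumnFilterPlacement {
    static final class Entry {
        final int row;
        final String family;
        Entry(int row, String family) { this.row = row; this.family = family; }
    }

    /** Sorted in-memory stand-in for one file (hypothetical, not an RFile). */
    static final class Source {
        final List<Entry> entries = new ArrayList<>();
        final Set<String> families = new HashSet<>(); // stand-in for per-file column metadata
        int pos;

        void add(int row, String family) {
            entries.add(new Entry(row, family));
            families.add(family);
        }

        // Index seek to the first entry with row >= startRow (assumed free here).
        void seek(int startRow) {
            int lo = 0, hi = entries.size();
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (entries.get(mid).row < startRow) lo = mid + 1; else hi = mid;
            }
            pos = lo;
        }

        boolean hasTop() { return pos < entries.size(); }
        Entry top() { return entries.get(pos); }

        // Advance to the next entry of the wanted family; returns entries inspected.
        long skipTo(String family) {
            long inspected = 0;
            while (hasTop()) {
                inspected++;
                if (family.equals(top().family)) break;
                pos++;
            }
            return inspected;
        }
    }

    /** Batched merge scan; returns how many entries had their family inspected. */
    static long scan(List<Source> files, String family, int batchSize,
                     boolean filterBelowMerge, List<Integer> out) {
        // File-level pruning: drop files whose metadata lacks the family.
        List<Source> live = new ArrayList<>();
        for (Source s : files) if (s.families.contains(family)) live.add(s);

        long inspected = 0;
        int startRow = 0;
        while (true) {
            // Every batch starts with a fresh seek on every remaining file.
            for (Source s : live) {
                s.seek(startRow);
                if (filterBelowMerge) inspected += s.skipTo(family); // may cross a dead spot
            }
            int emitted = 0;
            while (emitted < batchSize) {
                Source min = null;
                for (Source s : live)
                    if (s.hasTop() && (min == null || s.top().row < min.top().row)) min = s;
                if (min == null) return inspected; // all files exhausted
                Entry e = min.top();
                min.pos++;
                boolean match = true;
                if (filterBelowMerge) inspected += min.skipTo(family);
                else { inspected++; match = family.equals(e.family); } // filter above merge
                if (match) { out.add(e.row); emitted++; }
                startRow = e.row + 1; // next batch resumes after the last consumed row
            }
        }
    }

    static List<Source> demoSources() {
        Source a = new Source(), b = new Source(), c = new Source();
        for (int row = 0; row < 1000; row += 2) a.add(row, "cf");                       // dense in "cf"
        for (int row = 1; row < 1000; row += 2) b.add(row, row == 999 ? "cf" : "other"); // dead spot
        for (int row = 1000; row < 1020; row++) c.add(row, "other");                    // never has "cf"
        List<Source> files = new ArrayList<>();
        files.add(a); files.add(b); files.add(c);
        return files;
    }

    public static void main(String[] args) {
        List<Integer> below = new ArrayList<>(), above = new ArrayList<>();
        long costBelow = scan(demoSources(), "cf", 10, true, below);
        long costAbove = scan(demoSources(), "cf", 10, false, above);
        System.out.println("same results: " + below.equals(above));
        System.out.println("entries inspected, filter below merge: " + costBelow);
        System.out.println("entries inspected, filter above merge: " + costAbove);
    }
}
```

With the filter below the merge, file b walks its dead spot again on every batch seek. With the filter above the merge, the same non-matching entries are still read, but only once over the whole scan, as the merge reaches them.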


        
