[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215789#comment-13215789
 ] 

Mikhail Bautin commented on HBASE-5416:
---------------------------------------

@Max: if you scan the 'flag' column family first, find the rows that you are 
interested in, and query only those rows from the 'snap' column family, you 
will avoid the slowness from scanning every row in 'snap'. With proper 
batching, the two-pass approach should work fine if you don't need atomicity.
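
A minimal sketch of the two-pass idea, for illustration only. It models the two
column families as in-memory sorted maps rather than using the HBase client API,
and the class and method names (TwoPassScanSketch, findMatchingRows,
fetchInBatches) are hypothetical. In a real client, each batch would presumably
become one multi-get round trip.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for a table with two column families; NOT the HBase client API.
class TwoPassScanSketch {

    // Pass 1: scan only the small 'flag' family and collect the keys of matching rows.
    static List<String> findMatchingRows(NavigableMap<String, String> flag, String wanted) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String> e : flag.entrySet()) {
            if (wanted.equals(e.getValue())) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    // Pass 2: fetch only the selected rows from the large 'snap' family, batchSize keys at a time.
    static Map<String, String> fetchInBatches(NavigableMap<String, String> snap,
                                              List<String> keys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        Map<String, String> result = new TreeMap<>();
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            // One sublist stands in for one batched request.
            for (String key : keys.subList(start, end)) {
                String value = snap.get(key);
                if (value != null) {
                    result.put(key, value);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> flag = new TreeMap<>();
        NavigableMap<String, String> snap = new TreeMap<>();
        for (int i = 0; i < 10; i++) {
            String row = "row" + i;
            flag.put(row, i % 5 == 0 ? "1" : "0");
            snap.put(row, "large-value-" + i);
        }
        List<String> keys = findMatchingRows(flag, "1");
        System.out.println(fetchInBatches(snap, keys, 2));
    }
}
```

Because the two passes are separate reads, a row can change between them, which
is why this only fits when atomicity is not required.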

The problem with such deep changes to the scanner framework is that they would 
require comprehensive new unit tests. The included unit test only writes three 
rows and does not really check the new feature (or the old functionality) on a 
large scale. Take a look at TestMultiColumnScanner and TestSeekOptimizations. 
We will need something at least as comprehensive as those tests for this 
improvement, probably even a multithreaded test case to ensure we don't break 
atomicity. If we do not do that testing now, we will still have to do it before 
the next stable release, but it would be unfair to pass the hidden costs of 
testing to those who don't need this particular optimization right now but will 
soon need a stable system for another production release.
                
> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: filters, performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Max Lapan
>         Attachments: 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, 
> Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch
>
>
> When a scan is performed, the whole row is loaded into the result list, and 
> only then is the filter (if one exists) applied to decide whether the row is 
> needed. But when a scan covers several CFs and the filter checks data from 
> only a subset of them, data from the CFs not checked by the filter is not 
> needed at the filter stage. It is needed only once we have decided to include 
> the current row. In such a case we can significantly reduce the amount of IO 
> performed by a scan by loading only the values actually checked by the filter.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch of 
> megabytes) and is used to filter large entries from snap. Snap is very large 
> (10s of GB) and is quite costly to scan. If we need only rows with some flag 
> specified, we use SingleColumnValueFilter to limit the result to a small 
> subset of the region. But the current implementation loads both CFs to 
> perform the scan, when only a small subset is needed.
> The attached patch adds one routine to the Filter interface that allows a 
> filter to specify which CFs are needed for its operation. In HRegion, we 
> separate all scanners into two groups: those needed for the filter and the 
> rest (joined). When a new row is considered, only the needed data is loaded 
> and the filter is applied; only if the filter accepts the row is the rest of 
> the data loaded. On our data, this speeds up such scans 30-50 times. It also 
> gives us a way to better normalize the data into separate columns by 
> optimizing the scans performed.
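
The filter-first loading described above can be sketched roughly as follows.
This is a toy in-memory model, not the code from the attached patch: the class
name FilterFirstScanSketch, the scan method, and the snapReads counter are all
hypothetical, and for simplicity the model only visits rows that are present in
the small family.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.Predicate;

// Toy model of the idea in the description; not the code from the attached patch.
class FilterFirstScanSketch {
    // Number of rows read from the large family during the last scan, as a proxy for I/O.
    static int snapReads = 0;

    // Returns {rowKey, flagValue, snapValue} for every row the filter accepts.
    static List<String[]> scan(NavigableMap<String, String> flags,
                               NavigableMap<String, String> snap,
                               Predicate<String> filter) {
        snapReads = 0;
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : flags.entrySet()) {
            // Step 1: load only the family the filter needs and apply the filter.
            if (!filter.test(e.getValue())) {
                continue; // rejected: 'snap' is never read for this row
            }
            // Step 2: the filter accepted the row, so load the remaining ("joined") family.
            snapReads++;
            rows.add(new String[] {e.getKey(), e.getValue(), snap.get(e.getKey())});
        }
        return rows;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> flags = new TreeMap<>();
        NavigableMap<String, String> snap = new TreeMap<>();
        for (int i = 0; i < 100; i++) {
            String row = String.format("row%03d", i);
            flags.put(row, i % 25 == 0 ? "1" : "0");
            snap.put(row, "large-value-" + i);
        }
        List<String[]> rows = scan(flags, snap, "1"::equals);
        System.out.println(rows.size() + " rows returned, " + snapReads + " snap reads");
    }
}
```

In this model the large family is read once per accepted row instead of once
per scanned row, which is where the saving would come from when the filter is
selective.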

