[jira] [Created] (HBASE-5416) Improve performance of scans with some kind of filters.

Max Lapan (Created) (JIRA) Thu, 16 Feb 2012 12:27:28 -0800

Improve performance of scans with some kind of filters.
-------------------------------------------------------


                 Key: HBASE-5416
                 URL: https://issues.apache.org/jira/browse/HBASE-5416
             Project: HBase
          Issue Type: Improvement
          Components: filters, performance, regionserver
    Affects Versions: 0.94.0
            Reporter: Max Lapan
            Assignee: Max Lapan


When a scan is performed, the whole row is loaded into the result list, and 
only then is the filter (if one exists) applied to decide whether the row is 
needed.

But when a scan covers several CFs and the filter checks data from only a 
subset of them, data from the CFs not checked by the filter is not needed at 
the filter stage. It is needed only once we have decided to include the 
current row. In such a case we can significantly reduce the amount of IO 
performed by a scan by loading only the values actually checked by the filter.

For example, we have two CFs: flags and snap. Flags is quite small (a bunch of 
megabytes) and is used to filter large entries from snap. Snap is very large 
(10s of GB) and is quite costly to scan. If we need only rows with some flag 
set, we use SingleColumnValueFilter to limit the result to a small subset of 
the region. But the current implementation loads both CFs to perform the scan, 
when only a small subset is needed.

The attached patch adds one routine to the Filter interface that lets a filter 
specify which CF is needed for its operation. In HRegion, we separate all 
scanners into two groups: those needed by the filter and the rest (joined). 
When a new row is considered, only the needed data is loaded and the filter is 
applied; only if the filter accepts the row is the rest of the data loaded. On 
our data, this speeds up such scans 30-50 times. It also gives us a way to 
better normalize the data into separate columns by optimizing the scans 
performed.
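
The two-phase loading described above can be illustrated with a small toy 
model. This is only a sketch, not the patch and not HBase code: the class 
TwoPhaseScanSketch, its methods, the in-memory "table" and the familyReads 
counter are all hypothetical names invented for illustration. The flags and 
snap families come from the example above; the sample values are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Toy model of two-phase row loading; hypothetical, not HBase code. */
final class TwoPhaseScanSketch {
    /** Number of per-family reads performed, a rough proxy for IO. */
    int familyReads = 0;

    /** Hypothetical table: row key -> (column family -> value). */
    static Map<String, Map<String, String>> sampleTable() {
        Map<String, Map<String, String>> table = new LinkedHashMap<>();
        table.put("r1", row("0", "large-blob-1"));
        table.put("r2", row("1", "large-blob-2"));
        table.put("r3", row("0", "large-blob-3"));
        return table;
    }

    private static Map<String, String> row(String flags, String snap) {
        Map<String, String> families = new LinkedHashMap<>();
        families.put("flags", flags);
        families.put("snap", snap);
        return families;
    }

    /** Reads the requested families of one stored row, counting each read. */
    private Map<String, String> load(Map<String, String> storedRow,
                                     Set<String> families) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (String family : families) {
            familyReads++;
            String value = storedRow.get(family);
            if (value != null) {
                cells.put(family, value);
            }
        }
        return cells;
    }

    /**
     * Phase 1 loads only the families the filter needs; phase 2 loads the
     * remaining (joined) families, and only for rows the filter accepted.
     */
    List<Map<String, String>> scan(Map<String, Map<String, String>> table,
                                   Set<String> essential,
                                   Set<String> joined,
                                   Predicate<Map<String, String>> filter) {
        List<Map<String, String>> results = new ArrayList<>();
        for (Map<String, String> storedRow : table.values()) {
            Map<String, String> cells = load(storedRow, essential);
            if (!filter.test(cells)) {
                continue; // rejected: the joined families are never read
            }
            cells.putAll(load(storedRow, joined));
            results.add(cells);
        }
        return results;
    }

    public static void main(String[] args) {
        TwoPhaseScanSketch scanner = new TwoPhaseScanSketch();
        List<Map<String, String>> rows = scanner.scan(
                sampleTable(),
                Collections.singleton("flags"),
                Collections.singleton("snap"),
                cells -> "1".equals(cells.get("flags")));
        System.out.println(rows + " using " + scanner.familyReads
                + " family reads");
    }
}
```

With the three sample rows, only one passes the filter, so the sketch performs 
four family reads (three of flags, one of snap) instead of the six needed to 
load both families for every row. The saving grows with the size of the joined 
family and with the fraction of rows the filter rejects.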

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
