[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Lapan updated HBASE-5416:
-----------------------------

    Attachment: Filtered_scans_v7.patch

Implemented a benchmark of joined scanners.
You can run it with {{mvn test -P localTests -Dtest=TestJoinedScanners}}. It 
takes about an hour, so don't forget to increase 
{{forkedProcessTimeoutInSeconds}} in the pom.xml file.

On my notebook I got the following output:

{quote}
2012-06-29 22:12:00,182 INFO  [main] regionserver.TestJoinedScanners(102): Make 
100000 rows, total size = 9765.0 MB
2012-06-29 22:56:51,231 INFO  [main] regionserver.TestJoinedScanners(128): Data 
generated in 2691.048310914 seconds
2012-06-29 23:03:03,865 INFO  [main] regionserver.TestJoinedScanners(152): Slow 
scanner finished in 372.634075184 seconds, got 1000 rows
2012-06-29 23:04:02,443 INFO  [main] regionserver.TestJoinedScanners(172): 
Joined scanner finished in 58.577552657 seconds, got 1000 rows
2012-06-29 23:09:41,837 INFO  [main] regionserver.TestJoinedScanners(195): Slow 
scanner finished in 339.394307354 seconds, got 1000 rows
{quote}

I ran the slow scanner test twice to make sure the difference is not a cache 
effect. So it's about a 5.7 times speedup on this toy data.
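
As a quick check of that figure, the ratios can be recomputed from the timings in the log above. This is only a sketch: the class and method names are made up for illustration, and the numbers are copied from the quoted output. The two slow runs give ratios of roughly 6.4 and 5.8 against the joined run.

```java
public class SpeedupCheck {

    /** Ratio of the baseline time to the improved time. */
    public static double speedup(double baselineSeconds, double improvedSeconds) {
        return baselineSeconds / improvedSeconds;
    }

    public static void main(String[] args) {
        // Timings copied from the TestJoinedScanners log output quoted above.
        double firstSlowRun = 372.634075184;
        double joinedRun = 58.577552657;
        double secondSlowRun = 339.394307354;

        System.out.printf("first slow run vs joined:  %.2f%n",
                speedup(firstSlowRun, joinedRun));
        System.out.printf("second slow run vs joined: %.2f%n",
                speedup(secondSlowRun, joinedRun));
    }
}
```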

                
> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: filters, performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Max Lapan
>         Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, 
> Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, 
> Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, 
> Filtered_scans_v7.patch
>
>
> When a scan is performed, the whole row is loaded into the result list, and 
> only then is the filter (if one exists) applied to decide whether the row is 
> needed.
> But when a scan covers several CFs and the filter checks data from only a 
> subset of them, data from the CFs the filter does not check is not needed at 
> the filter stage. It is needed only once we have decided to include the 
> current row. In that case we can significantly reduce the amount of IO a scan 
> performs by loading only the values the filter actually checks.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch 
> of megabytes) and is used to filter large entries from snap. Snap is very 
> large (tens of GB) and is quite costly to scan. If we need only rows with 
> some flag set, we use SingleColumnValueFilter to limit the result to a small 
> subset of the region. But the current implementation loads both CFs to 
> perform the scan, even though only a small subset is needed.
> The attached patch adds one routine to the Filter interface that lets a 
> filter specify which CFs it needs for its operation. In HRegion, we separate 
> all scanners into two groups: those needed by the filter and the rest 
> (joined). When a new row is considered, only the needed data is loaded and 
> the filter is applied; only if the filter accepts the row is the rest of the 
> data loaded. On our data, this speeds up such scans 30-50 times. It also 
> gives us a way to better normalize the data into separate columns by 
> optimizing the scans performed.
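
The two-phase idea in the description can be sketched in plain Java. This is not the code from the patch and does not use HBase classes: {{Family}}, {{scan}} and the in-memory maps are hypothetical stand-ins, and the sketch assumes every row has a value in the family the filter checks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

public class JoinedScanSketch {

    /** Hypothetical in-memory stand-in for one column family (row key -> value). */
    public static final class Family {
        public final Map<String, String> rows = new TreeMap<>();
        public int loads = 0; // number of values read, a crude proxy for IO

        public String load(String rowKey) {
            loads++;
            return rows.get(rowKey);
        }
    }

    /**
     * Reads the small family for every row, applies the filter to it, and
     * touches the large (joined) family only for rows the filter accepts.
     */
    public static List<String[]> scan(Family essential, Family joined,
                                      Predicate<String> filter) {
        List<String[]> result = new ArrayList<>();
        // Simplification: every row is assumed to exist in the essential family.
        for (String rowKey : new ArrayList<>(essential.rows.keySet())) {
            String flag = essential.load(rowKey);
            if (!filter.test(flag)) {
                continue; // rejected: the large family is never read for this row
            }
            result.add(new String[] {rowKey, flag, joined.load(rowKey)});
        }
        return result;
    }

    public static void main(String[] args) {
        Family flags = new Family();
        Family snap = new Family();
        for (int i = 0; i < 1000; i++) {
            String key = String.format("row%04d", i);
            flags.rows.put(key, i % 100 == 0 ? "1" : "0");
            snap.rows.put(key, "large-value-" + i);
        }
        List<String[]> rows = scan(flags, snap, "1"::equals);
        System.out.println(rows.size() + " rows returned, "
                + flags.loads + " flag loads, " + snap.loads + " snap loads");
    }
}
```

In this toy setup the small family is read for all 1000 rows, while the large one is read only for the 10 rows whose flag matches. That is where the IO saving would come from.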


        
