[
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532037#comment-13532037
]
Ted Yu commented on HBASE-5416:
-------------------------------
With patch v7 applied, I got the following results on a MacBook:
{code}
grep 'scanner finished in' ../testJoinedScanners-output.txt
2012-12-13 20:09:26,809 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 112.792634 seconds, got 100 rows
2012-12-13 20:10:15,726 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 48.915989 seconds, got 100 rows
2012-12-13 20:10:33,006 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 17.280432 seconds, got 100 rows
2012-12-13 20:10:38,514 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 5.508207 seconds, got 100 rows
2012-12-13 20:10:51,095 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 12.580323 seconds, got 100 rows
2012-12-13 20:11:00,517 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 9.422024 seconds, got 100 rows
2012-12-13 20:11:22,650 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 22.132854 seconds, got 100 rows
2012-12-13 20:11:31,890 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 9.23955 seconds, got 100 rows
2012-12-13 20:11:34,421 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 2.531598 seconds, got 100 rows
2012-12-13 20:11:36,694 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 2.272578 seconds, got 100 rows
2012-12-13 20:11:39,197 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 2.502777 seconds, got 100 rows
2012-12-13 20:11:58,269 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 19.071438 seconds, got 100 rows
2012-12-13 20:12:01,043 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 2.774262 seconds, got 100 rows
2012-12-13 20:12:03,317 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 2.273745 seconds, got 100 rows
2012-12-13 20:12:05,981 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 2.664124 seconds, got 100 rows
2012-12-13 20:12:08,574 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 2.593234 seconds, got 100 rows
2012-12-13 20:12:11,130 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 2.555977 seconds, got 100 rows
2012-12-13 20:12:13,381 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 2.250275 seconds, got 100 rows
2012-12-13 20:12:15,721 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 2.340003 seconds, got 100 rows
2012-12-13 20:12:18,075 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 2.354218 seconds, got 100 rows
{code}
I am running the test on Linux.
Will take another look at the patch and test result tomorrow.
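To compare the runs pairwise, the elapsed times can be pulled out of the grep output and turned into slow/joined ratios. The following is only a rough sketch: the class `ScanTimings` and the method `speedups` are hypothetical names, not part of the patch or the test, and the regular expression assumes the "scanner finished in N seconds" wording shown in the log above (it tolerates a line break after "Slow").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading the TestJoinedScanners timings; not HBase code. */
class ScanTimings {
    // Matches "Slow scanner finished in 1.23 seconds" even if the line wraps after "Slow".
    private static final Pattern ENTRY =
        Pattern.compile("(Slow|Joined)\\s+scanner finished in ([0-9.]+) seconds");

    /** Returns the slow/joined time ratio for each consecutive Slow, Joined pair. */
    static List<Double> speedups(String log) {
        List<Double> ratios = new ArrayList<>();
        Matcher m = ENTRY.matcher(log);
        double slow = Double.NaN; // time of the most recent unpaired Slow run
        while (m.find()) {
            double seconds = Double.parseDouble(m.group(2));
            if (m.group(1).equals("Slow")) {
                slow = seconds;
            } else if (!Double.isNaN(slow)) {
                ratios.add(slow / seconds);
                slow = Double.NaN;
            }
        }
        return ratios;
    }

    public static void main(String[] args) {
        // First pair of entries from the output above.
        String log =
            "2012-12-13 20:09:26,809 INFO [main] regionserver.TestJoinedScanners(172): Slow\n"
            + "scanner finished in 112.792634 seconds, got 100 rows\n"
            + "2012-12-13 20:10:15,726 INFO [main] regionserver.TestJoinedScanners(172):\n"
            + "Joined scanner finished in 48.915989 seconds, got 100 rows\n";
        for (double ratio : speedups(log)) {
            System.out.printf("%.2fx%n", ratio);
        }
    }
}
```

A ratio above 1 means the joined scanner was faster for that pair; a ratio below 1 (as in the 2.502777 vs 19.071438 pair) means it was slower.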
> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
> Issue Type: Improvement
> Components: Filters, Performance, regionserver
> Affects Versions: 0.90.4
> Reporter: Max Lapan
> Assignee: Max Lapan
> Fix For: 0.96.0
>
> Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt,
> Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch,
> Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch,
> Filtered_scans_v7.patch, HBASE-5416-v7-rebased.patch
>
>
> When a scan is performed, the whole row is loaded into the result list, and
> only then is the filter (if any) applied to decide whether the row is needed.
> But when a scan covers several CFs and the filter checks data from only a
> subset of them, data from the CFs not checked by the filter is not needed at
> the filter stage; it is needed only once we have decided to include the
> current row. In such cases we can significantly reduce the amount of IO
> performed by a scan by loading only the values actually checked by the filter.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch of
> megabytes) and is used to filter large entries from snap. Snap is very large
> (10s of GB) and is quite costly to scan. If we need only rows with some flag
> specified, we use SingleColumnValueFilter to limit the result to a small
> subset of the region. But the current implementation loads both CFs to
> perform the scan, even though only a small subset is needed.
> The attached patch adds one routine to the Filter interface to allow a filter
> to specify which CFs are needed for its operation. In HRegion, we separate all
> scanners into two groups: those needed for the filter and the rest (joined).
> When a new row is considered, only the needed data is loaded and the filter is
> applied; only if the filter accepts the row is the rest of the data loaded. On
> our data, this speeds up such scans 30-50 times. It also gives us a way to
> better normalize the data into separate columns by optimizing the scans
> performed.
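
The two-phase loading described above can be sketched in plain Java. This is a minimal, self-contained illustration and not the code from the patch: `JoinedScanSketch`, `loadLarge`, and `scan` are hypothetical names, in-memory maps stand in for the per-CF store scanners, and a `Predicate` stands in for the filter. A counter records how often the large family is read.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of the two-phase ("joined") scan idea; not HBase code. */
class JoinedScanSketch {
    /** Counts reads of the large family so the saving is observable. */
    static int largeFamilyLoads = 0;

    /** Stand-in for reading one row's data from the large column family. */
    static String loadLarge(Map<String, String> snap, String rowKey) {
        largeFamilyLoads++;
        return snap.get(rowKey);
    }

    /**
     * Phase 1: evaluate the filter on the small family the filter needs.
     * Phase 2: load the large (joined) family only for rows the filter accepts.
     */
    static List<String> scan(Map<String, String> flags,
                             Map<String, String> snap,
                             Predicate<String> filter) {
        List<String> results = new ArrayList<>();
        for (Map.Entry<String, String> row : flags.entrySet()) {
            if (!filter.test(row.getValue())) {
                continue; // rejected: the large family is never touched
            }
            results.add(row.getKey() + "=" + loadLarge(snap, row.getKey()));
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, String> flags = new LinkedHashMap<>();
        Map<String, String> snap = new LinkedHashMap<>();
        for (int i = 0; i < 10; i++) {
            flags.put("row" + i, i % 5 == 0 ? "hot" : "cold");
            snap.put("row" + i, "blob" + i);
        }
        List<String> rows = scan(flags, snap, "hot"::equals);
        System.out.println(rows + ", large-family loads: " + largeFamilyLoads);
    }
}
```

In this toy run only the rows whose flag matches trigger a read of the large family, which is where the IO saving in the description comes from; the fewer rows the filter accepts, the larger the saving.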