[
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532465#comment-13532465
]
Ted Yu commented on HBASE-5416:
-------------------------------
{code}
KeyValueHeap storeHeap = null;
+ KeyValueHeap joinedHeap = null;
{code}
Add a comment explaining the role of joinedHeap.
joinedHeap is only referenced in RegionScannerImpl, so it can be private.
{code}
+ // state flag which indicates when joined heap data gather interrupted due to scan limits
{code}
'data gather interrupted' -> 'data gathering is interrupted'
{code}
+ // Here we separate all scanners into two lists - first is scanners,
+ // providing data required by the filter to operate (scanners list) and
{code}
'first is scanners, providing' -> 'first are scanners that provide'
{code}
+ * @return true if limit reached, false ovewise.
{code}
'ovewise' -> 'otherwise'
{code}
+ if (isEmptyRow /* TODO: || filterRow() is gone in trunk */) {
{code}
Can the TODO be removed?
{code}
}
+ else {
{code}
nit: move the else up onto the same line as the closing }
{code}
+ // As the data obtained from two independent heaps, we need to
{code}
'the data obtained' -> 'the data is obtained'
{code}
+ // Result list population was interrupted by limits, we need to restart it on next() invocation.
{code}
nit: wrap long line.
{code}
+ // the case when SingleValueExcludeFilter is used.
{code}
'SingleValueExcludeFilter' -> 'SingleColumnValueExcludeFilter'
{code}
+ private KeyValue peekKv() {
+ return this.joinedHeapHasMoreData ? this.joinedHeap.peek() : this.storeHeap.peek();
+ }
{code}
The above method is only called once. Consider merging its body into the caller.
{code}
+ public void testScanner_JoinedScannersAndLimits() throws IOException {
{code}
nit: JoinedScannersAndLimits -> JoinedScannersWithLimits
For TestJoinedScanners.java, remove the year from the license header.
{code}
+ LOG.info("Make " + Long.toString(rows_to_insert) + " rows, total size = " + Float.toString(rows_to_insert * large_bytes / 1024 / 1024) + " MB");
...
+ LOG.info("Data generated in " + Double.toString((System.nanoTime() - time) / 1000000000.0) + " seconds");
...
+ SingleColumnValueFilter flt = new SingleColumnValueFilter(cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
...
+ + " scanner finished in " + Double.toString(timeSec) + " seconds, got " + Long.toString(rows_count/2) + " rows");
{code}
nit: wrap long lines
> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
> Issue Type: Improvement
> Components: Filters, Performance, regionserver
> Affects Versions: 0.90.4
> Reporter: Max Lapan
> Assignee: Max Lapan
> Fix For: 0.96.0
>
> Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt,
> Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch,
> Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch,
> Filtered_scans_v7.patch, HBASE-5416-v7-rebased.patch
>
>
> When a scan is performed, the whole row is loaded into the result list, and
> only then is the filter (if one exists) applied to decide whether the row is
> needed.
> But when a scan covers several CFs and the filter checks data from only a
> subset of them, data from the CFs the filter does not check is not needed at
> the filter stage; it is needed only once we have decided to include the
> current row. In that case we can significantly reduce the amount of IO a scan
> performs by loading only the values the filter actually checks.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch
> of megabytes) and is used to filter large entries from snap. Snap is very
> large (tens of GB) and is quite costly to scan. If we need only rows with
> some flag set, we use SingleColumnValueFilter to limit the result to a small
> subset of the region. But the current implementation loads both CFs to
> perform the scan, even though only a small subset is needed.
> The attached patch adds one routine to the Filter interface that lets a
> filter specify which CFs it needs for its operation. In HRegion, we separate
> all scanners into two groups: those needed by the filter and the rest
> (joined). When a new row is considered, only the needed data is loaded and
> the filter is applied; only if the filter accepts the row is the rest of the
> data loaded. On our data, this speeds up such scans 30-50 times. It also
> gives us a way to better normalize the data into separate columns by
> optimizing the scans performed.
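
The two-phase idea in the description above can be illustrated with a small, self-contained sketch. This is not HBase code and is not taken from the patch: the class TwoPhaseScanSketch, the scan method, the joinedLoads counter, and the use of plain maps in place of column-family scanners are all hypothetical names chosen for illustration. It only shows the ordering the description names: check the filter against the small "essential" data first, and read the large "joined" data only for rows the filter accepts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Illustrative sketch only; none of these names come from HBase or the patch. */
class TwoPhaseScanSketch {
    /** Counts how many times the large ("joined") family was read. */
    static int joinedLoads = 0;

    /**
     * essential: row key -> value of the small family the filter checks.
     * joined:    row key -> value of the large family, read only on demand.
     */
    static List<String> scan(Map<String, String> essential,
                             Map<String, String> joined,
                             Predicate<String> filter) {
        List<String> results = new ArrayList<>();
        for (Map.Entry<String, String> row : essential.entrySet()) {
            // Phase 1: evaluate the filter on the essential data alone.
            if (!filter.test(row.getValue())) {
                continue; // rejected rows never touch the large family
            }
            // Phase 2: the row is accepted, so load the rest of its data.
            joinedLoads++;
            String rest = joined.get(row.getKey());
            results.add(row.getKey() + ":" + row.getValue() + ":" + rest);
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, String> flags = new LinkedHashMap<>();
        flags.put("r1", "yes");
        flags.put("r2", "no");
        flags.put("r3", "yes");
        Map<String, String> snap = new LinkedHashMap<>();
        snap.put("r1", "big1");
        snap.put("r2", "big2");
        snap.put("r3", "big3");
        System.out.println(scan(flags, snap, "yes"::equals));
        System.out.println("joined loads: " + joinedLoads);
    }
}
```

In this toy setup the large family is read twice for three rows, because only two rows pass the filter; the saving grows with the fraction of rows the filter rejects.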