[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

ramkrishna.s.vasudevan (JIRA) Thu, 20 Dec 2012 10:05:13 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537208#comment-13537208
 ]


ramkrishna.s.vasudevan commented on HBASE-5416:
-----------------------------------------------

Ok, grokked the patch.
{code}
  if (!scan.getAllowLazyCfLoading()
      || this.filter == null
      || this.filter.isFamilyEssential(entry.getKey())) {
{code}

Move the this.filter == null check so that it is the first condition, because 
when you don't have filters the joinedHeap is not going to be used at all, right?
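
A minimal sketch of the suggested ordering, relying on Java's short-circuit ||. 
The class name, the method name and the Predicate standing in for 
filter.isFamilyEssential() are hypothetical and are not taken from the patch:

```java
import java.util.function.Predicate;

final class EssentialFamilyCheck {
    /**
     * Returns true when a column family must be read up front.
     * The null check comes first, so a scan without a filter never
     * evaluates the lazy-loading flag or the filter itself.
     * `filter` is a hypothetical stand-in for filter.isFamilyEssential().
     */
    static boolean loadEagerly(boolean allowLazyCfLoading,
                               Predicate<String> filter,
                               String family) {
        return filter == null
            || !allowLazyCfLoading
            || filter.test(family);
    }
}
```

The result is the same for any ordering of the three operands; only the 
evaluation order changes, so the no-filter case returns at the first test.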
{code}
  correct_row = this.joinedHeap.seek(
      KeyValue.createFirstOnRow(currentRow, offset, length));
{code}
So here we seek the joinedHeap to the KV just before the row we got in the 
current next() call?
After this, suppose that due to limits joinedHeapHasMoreData is set to true. 
Now, when the next call comes:
{code}
else if (joinedHeapHasMoreData) {
          joinedHeapHasMoreData =
            populateResult(this.joinedHeap, limit, currentRow, offset, length, metric);
          return true;
{code}
I think we should look at the return value of populateResult, and if it returns 
false we may need to check whether we have reached the stopRow, right?
Filters need not be checked at this point anyway.
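
To make the question concrete, here is a toy sketch of the control flow being 
suggested. It is not the HRegion code: the lists of strings, the 
nextRowIsPastStopRow parameter and both method bodies are hypothetical 
stand-ins, and it assumes that returning false is how the scanner signals that 
it is done:

```java
import java.util.List;

final class JoinedHeapDrainSketch {
    /**
     * Hypothetical stand-in for populateResult: moves up to `limit`
     * pending cells of the current row into `results` and reports
     * whether cells of that row are still pending.
     */
    static boolean populateResult(List<String> pending, int limit,
                                  List<String> results) {
        int n = Math.min(limit, pending.size());
        for (int i = 0; i < n; i++) {
            results.add(pending.remove(0));
        }
        return !pending.isEmpty();
    }

    /** Returns true if the caller should call again, false if the scan is done. */
    static boolean nextChunk(List<String> pending, int limit,
                             List<String> results,
                             boolean nextRowIsPastStopRow) {
        boolean joinedHeapHasMoreData = populateResult(pending, limit, results);
        if (joinedHeapHasMoreData) {
            return true;              // same row continues on the next call
        }
        // Row fully drained: consult the stop row instead of returning true blindly.
        return !nextRowIsPastStopRow;
    }
}
```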

One more thing: if I ask for lazy loading in my Scan but my filter is NOT an 
SCVF (SingleColumnValueFilter) or another filter that implements 
isFamilyEssential, then the scan goes through the normal flow.  Maybe we need 
to document this clearly, as a user may think that setting that property alone 
is going to give a better-optimized scan.
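
A small sketch of why the flag alone changes nothing. It assumes that a filter 
which does not implement isFamilyEssential is treated as needing every family; 
the interface, the two filter classes and the joinedFamilies helper are 
hypothetical, with family names borrowed from the issue description:

```java
import java.util.ArrayList;
import java.util.List;

final class LazyCfLoadingSketch {
    /** Hypothetical filter contract: by default every family is essential. */
    interface FamilyAwareFilter {
        default boolean isFamilyEssential(String family) {
            return true;
        }
    }

    /** Narrows the essential set to the family it actually checks. */
    static final class FlagsOnlyFilter implements FamilyAwareFilter {
        @Override
        public boolean isFamilyEssential(String family) {
            return "flags".equals(family);
        }
    }

    /** Does not override isFamilyEssential, so it keeps the default. */
    static final class PlainFilter implements FamilyAwareFilter {
    }

    /** Families whose loading can be deferred until the filter accepts the row. */
    static List<String> joinedFamilies(boolean allowLazyCfLoading,
                                       FamilyAwareFilter filter,
                                       List<String> families) {
        List<String> joined = new ArrayList<>();
        for (String family : families) {
            boolean essential = filter == null
                || !allowLazyCfLoading
                || filter.isFamilyEssential(family);
            if (!essential) {
                joined.add(family);
            }
        }
        return joined;
    }
}
```

With FlagsOnlyFilter and lazy loading enabled, only "snap" is deferred; with 
PlainFilter nothing is deferred even though the flag is set.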

Regarding the TestHRegion test cases: they do not actually test the behaviour 
of joinedScanners.  Is that intended? The test case names suggest that they 
do.  
I will leave it to other scan experts to decide whether this can go in.  
Overall, a very good improvement.

Thanks to Max, Sergey and Ted.

                
> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: Filters, Performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Sergey Shelukhin
>             Fix For: 0.96.0
>
>         Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, 
> Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, 
> Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, 
> Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, 
> HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch
>
>
> When a scan is performed, the whole row is loaded into the result list, and 
> only then is the filter (if one exists) applied to decide whether the row is 
> needed.
> But when a scan covers several CFs and the filter checks data from only a 
> subset of them, data from the CFs not checked by the filter is not needed 
> at the filter stage; it is needed only once we have decided to include the 
> current row. In such a case we can significantly reduce the amount of IO 
> performed by a scan by loading only the values actually checked by the filter.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch 
> of megabytes) and is used to filter large entries from snap. Snap is very 
> large (10s of GB) and is quite costly to scan. If we need only rows with 
> some flag specified, we use SingleColumnValueFilter to limit the result to a 
> small subset of the region. But the current implementation loads both CFs to 
> perform the scan, when only a small subset is needed.
> The attached patch adds one routine to the Filter interface to allow a filter 
> to specify which CFs are needed for its operation. In HRegion, we separate 
> all scanners into two groups: those needed for the filter and the rest 
> (joined). When a new row is considered, only the needed data is loaded, the 
> filter is applied, and only if the filter accepts the row is the rest of the 
> data loaded. On our data, this speeds up such scans 30-50 times. It also 
> gives us a way to better normalize the data into separate columns by 
> optimizing the scans performed.

