[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

Raymond Liu (JIRA) Mon, 04 Mar 2013 18:03:15 -0800

    [ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592938#comment-13592938 ]


Raymond Liu commented on HBASE-4433:
------------------------------------

To figure out how much overhead the seek adds, I read some more of the code. My 
table is major compacted, and in that situation the lazy seek approach does not 
seem to help, since only one scanner is involved. Each time, this scanner still 
goes through a lazy seek, is added to the heap, sorted, and polled out for a 
second, real seek. That introduces one extra lazy seek and the construction of 
a second fake key for the seek. The best path would be to seek directly, 
without the lazy seek, when only one StoreFileScanner is involved (or one 
StoreFileScanner plus one MemStoreScanner?). I tweaked the code a little to 
find out how much this affects the result. The scan time dropped from 260s to 
240s for include_and_seek, though that is still far from the 190s for include 
then seek, since one seek is still involved, and a seek is more expensive than 
a next.

However, I find it hard to get things right if you want to switch from lazy 
seek to non-lazy seek later. I will read more code to try to find a solution.
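
As a rough illustration of the single-scanner case, here is a toy model in 
Java. It is only a sketch: CountingScanner, SeekPolicy and the 
bypassLazyWhenSingle flag are hypothetical names made up for this example and 
are not HBase classes or settings. It only counts how many lazy and real seeks 
each path performs.

```java
import java.util.List;

// Hypothetical toy model; this is NOT HBase's KeyValueHeap or StoreFileScanner.
final class CountingScanner {
    int lazySeeks = 0;
    int realSeeks = 0;
    private boolean pendingRealSeek = false;

    // Lazy seek: remember a fake position and defer the real seek.
    void lazySeek(String key) {
        lazySeeks++;
        pendingRealSeek = true;
    }

    // Real seek: actually position the scanner on the key.
    void seek(String key) {
        realSeeks++;
        pendingRealSeek = false;
    }

    // Called once the scanner surfaces at the top of the heap.
    void enforceSeek(String key) {
        if (pendingRealSeek) {
            seek(key);
        }
    }
}

final class SeekPolicy {
    // Assumes the scanner list is non-empty.
    static void reseek(List<CountingScanner> scanners, String key,
                       boolean bypassLazyWhenSingle) {
        if (bypassLazyWhenSingle && scanners.size() == 1) {
            // Only one scanner: nothing to compare against, so seek directly.
            scanners.get(0).seek(key);
            return;
        }
        for (CountingScanner s : scanners) {
            s.lazySeek(key);
        }
        // In this toy model the first scanner is the one that reaches the
        // top of the heap and therefore has to do the real seek.
        scanners.get(0).enforceSeek(key);
    }
}
```

With a single scanner, the default path costs one lazy seek plus one real 
seek, while the bypass path costs one real seek only.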
                
> avoid extra next (potentially a seek) if done with column/row
> -------------------------------------------------------------
>
>                 Key: HBASE-4433
>                 URL: https://issues.apache.org/jira/browse/HBASE-4433
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Kannan Muthukkaruppan
>             Fix For: 0.92.0
>
>
> [Noticed this in 89, but quite likely true of trunk as well.]
> When we are done with the requested column(s) the code still does an extra 
> next() call before it realizes that it is actually done. This extra next() 
> call could potentially result in an unnecessary extra block load. This is 
> likely to be especially bad for CFs where the KVs are large blobs where each 
> KV may be occupying a block of its own. So the next() can often load a new 
> unrelated block unnecessarily.
> --
> For the simple case of reading say the top-most column in a row in a single 
> file, where each column (KV) was say a block of its own-- it seems that we 
> are reading 3 blocks, instead of 1 block!
> I am working on a simple patch and with that the number of seeks is down to 
> 2. 
> [There is still an extra seek left.  I think there were two levels of 
> extra/unnecessary next() we were doing without actually confirming that the 
> next was needed. One at the StoreScanner/ScanQueryMatcher level which this 
> diff avoids. I think the other is at hfs.next() (at the storefile scanner 
> level) that's happening whenever an HFile scanner serves out data-- and 
> perhaps that's the additional seek that we need to avoid. But I want to 
> tackle this optimization first as the two issues seem unrelated.]
> -- 
> The basic idea of the patch I am working on/testing is as follows. The 
> ExplicitColumnTracker currently returns "INCLUDE" to the ScanQueryMatcher if 
> the KV needs to be included and then if done, only in the next call it 
> returns the appropriate SEEK_NEXT_COL or SEEK_NEXT_ROW hint. For the cases 
> when ExplicitColumnTracker knows it is done with a particular column/row, the 
> patch attempts to combine the INCLUDE code and done hint into a single match 
> code-- INCLUDE_AND_SEEK_NEXT_COL and INCLUDE_AND_SEEK_NEXT_ROW.
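
The combined match codes described in the quoted issue can be sketched as 
follows. This is a simplified illustration and not the actual patch: 
ToyColumnTracker and its check method are hypothetical, and the sketch assumes 
one version per column and qualifiers arriving in sorted order. Only the match 
code names come from the issue description.

```java
import java.util.List;

enum MatchCode {
    INCLUDE,
    SEEK_NEXT_COL,
    SEEK_NEXT_ROW,
    INCLUDE_AND_SEEK_NEXT_COL,
    INCLUDE_AND_SEEK_NEXT_ROW
}

// Hypothetical stand-in for a column tracker; assumes one version per column.
final class ToyColumnTracker {
    private final List<String> wanted; // requested qualifiers, sorted
    private int index = 0;

    ToyColumnTracker(List<String> sortedWanted) {
        this.wanted = sortedWanted;
    }

    MatchCode check(String qualifier) {
        while (index < wanted.size()) {
            int cmp = qualifier.compareTo(wanted.get(index));
            if (cmp < 0) {
                // This KV sorts before the next requested column: skip it.
                return MatchCode.SEEK_NEXT_COL;
            }
            if (cmp == 0) {
                index++;
                // The tracker already knows whether it is done, so it returns
                // the include and the seek hint in a single code instead of
                // a plain INCLUDE followed by a hint on the next call.
                return index == wanted.size()
                        ? MatchCode.INCLUDE_AND_SEEK_NEXT_ROW
                        : MatchCode.INCLUDE_AND_SEEK_NEXT_COL;
            }
            // The requested column is absent from this row; try the next one.
            index++;
        }
        return MatchCode.SEEK_NEXT_ROW;
    }
}
```

Under the one-version assumption, a plain INCLUDE is never returned here; it 
is kept in the enum only to show the code that the combined values replace.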

