[jira] Commented: (HADOOP-1439) Add endRow parameter to HClient#obtainScanner

James Kennedy (JIRA) Wed, 27 Jun 2007 14:22:46 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508658
 ]


James Kennedy commented on HADOOP-1439:
---------------------------------------

Michael suggested that HADOOP-1531 (RowFilters), once finished, could be used 
to achieve the above functionality.

As the RowFilter impl is right now, using a regexp on each key encountered may 
be an expensive way to do it.

In the above example, even if the endRow functionality works, how do you know 
where the end row is? How do you know when you leave the google domain?

It seems to me that there may be several restrictions a user may want to apply 
to row-keys:
1) Specify a range using start/end keys, assuming you know what they are.
2) Specify a range using a start key and a "page size".  This is useful for 
retrieving data in pages, e.g. displaying to a UI as the user clicks next/last 
page.
3) Specify a criterion, e.g. regular expressions or more basic string 
comparison.

Fortunately my RowFilterInterface design can be used to generalize the above.  
In the Google example, I could create a custom RowFilter implementation that 
can do domain name comparison more efficiently than general regular expression 
matching.  Pass that via the client as you would any other RowFilter impl.  
Only thing to make sure of is that the custom impl is in the classpath of the 
HRegionServer too.
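
A minimal sketch of such a custom filter follows.  It is not the actual 
HADOOP-1531 API: the RowFilter interface here is a hypothetical stand-in with a 
single filter() method (true means "filter this row out"), row keys are plain 
Strings rather than Text, and the reversed-domain key layout 
("com.google.www/index") is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class DomainFilterSketch {
    /** Hypothetical stand-in for RowFilterInterface; true means "filter this row out". */
    interface RowFilter {
        boolean filter(String rowKey);
    }

    /** Keeps only rows whose key sits under the given reversed-domain prefix. */
    static class DomainRowFilter implements RowFilter {
        private final String prefix;

        DomainRowFilter(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public boolean filter(String rowKey) {
            // Plain prefix comparison, no regular expression involved.
            if (!rowKey.startsWith(prefix)) {
                return true;
            }
            if (rowKey.length() == prefix.length()) {
                return false;
            }
            // Require a separator so "com.googleplex" does not match "com.google".
            char next = rowKey.charAt(prefix.length());
            return next != '.' && next != '/';
        }
    }

    /** Toy scan loop standing in for the scanner: returns the keys that pass the filter. */
    static List<String> scan(List<String> sortedKeys, RowFilter rowFilter) {
        List<String> results = new ArrayList<>();
        for (String key : sortedKeys) {
            if (!rowFilter.filter(key)) {
                results.add(key);
            }
        }
        return results;
    }
}
```

Note that this toy scan still visits every key; the filter only decides which 
rows are returned.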

For start/end range, you could have a custom RowFilter that checks for an exact 
match on the end key. But this won't be as efficient as an explicit endRow 
parameter because:
A) when RowFilter is not null, HRegion#HScanner is always going to have a 
little more overhead even if the filter() implementation itself always just 
returns false.
B) The filter isn't currently designed to stop the scanner when a certain 
criterion is met. When it encounters the endRow, it will just loop through 
the rest of the rows, filtering them all out, until it reaches the end of the 
HRegion.
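
Point B can be illustrated with a toy comparison.  This is only a sketch under 
stated assumptions: the RowFilter interface and both loops are hypothetical 
stand-ins, not the real HScanner code; keys are Strings; the filter compares 
against the end key (rather than matching it exactly) and treats endRow as 
exclusive.

```java
import java.util.List;

class EndRowCostSketch {
    /** Hypothetical stand-in for RowFilterInterface; true means "filter this row out". */
    interface RowFilter {
        boolean filter(String rowKey);
    }

    /** Filters out every row at or past endRow (exclusive end is an assumption). */
    static class EndRowFilter implements RowFilter {
        private final String endRow;

        EndRowFilter(String endRow) {
            this.endRow = endRow;
        }

        @Override
        public boolean filter(String rowKey) {
            return rowKey.compareTo(endRow) >= 0;
        }
    }

    /** Filter-only scan: the loop has no way to stop, so every row is visited. */
    static int rowsVisitedWithFilter(List<String> sortedKeys, RowFilter rowFilter) {
        int visited = 0;
        for (String key : sortedKeys) {
            visited++;
            rowFilter.filter(key); // the result only decides whether the row is returned
        }
        return visited;
    }

    /** Explicit endRow: the loop breaks as soon as the end key is reached. */
    static int rowsVisitedWithEndRow(List<String> sortedKeys, String endRow) {
        int visited = 0;
        for (String key : sortedKeys) {
            visited++;
            if (key.compareTo(endRow) >= 0) {
                break;
            }
        }
        return visited;
    }
}
```

With keys a..e and endRow "c", the filter-only loop visits all five rows while 
the explicit endRow loop stops after three.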

I think start/page range has the same issues.  The only difference is that it 
requires scan-lifetime state to count the number of (unfiltered?) rows 
encountered.  It still requires a stop condition trigger.

If I add that stop condition trigger functionality to the RowFilterInterface 
and update HScanner to use it, we could have a number of built-in RowFilter 
implementations that deal with restrictions like those above.
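
A rough sketch of what a stop condition trigger could look like, using the 
start/page case as the example.  All names here (StoppableRowFilter, 
rowProcessed, filterAllRemaining, PageRowFilter, and the scan loop) are 
hypothetical and chosen for illustration; this is not the committed 
RowFilterInterface or HScanner code, and keys are Strings rather than Text.

```java
import java.util.ArrayList;
import java.util.List;

class StopConditionSketch {
    /** Hypothetical filter interface extended with a stop condition. */
    interface StoppableRowFilter {
        /** True means "filter this row out". */
        boolean filter(String rowKey);

        /** Called by the scan loop after each row so the filter can keep state. */
        void rowProcessed(boolean filtered, String rowKey);

        /** True once no further row can pass, so the scan may stop early. */
        boolean filterAllRemaining();
    }

    /** Accepts at most pageSize rows, then asks the scan to stop. */
    static class PageRowFilter implements StoppableRowFilter {
        private final int pageSize;
        private int accepted = 0; // scan-lifetime state

        PageRowFilter(int pageSize) {
            this.pageSize = pageSize;
        }

        @Override
        public boolean filter(String rowKey) {
            return filterAllRemaining();
        }

        @Override
        public void rowProcessed(boolean filtered, String rowKey) {
            if (!filtered) {
                accepted++;
            }
        }

        @Override
        public boolean filterAllRemaining() {
            return accepted >= pageSize;
        }
    }

    /** Toy scan loop that consults the stop condition before each row. */
    static List<String> scan(List<String> sortedKeys, String startRow, StoppableRowFilter rowFilter) {
        List<String> results = new ArrayList<>();
        for (String key : sortedKeys) {
            if (key.compareTo(startRow) < 0) {
                continue; // before the start key
            }
            if (rowFilter.filterAllRemaining()) {
                break; // stop condition reached, skip the rest of the rows
            }
            boolean filtered = rowFilter.filter(key);
            rowFilter.rowProcessed(filtered, key);
            if (!filtered) {
                results.add(key);
            }
        }
        return results;
    }
}
```

An end-key filter could use the same hook by returning true from 
filterAllRemaining() once it has seen a key at or past the end row.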

WRT simple restrictions like start/end/page parameters, there will still be a 
(perhaps small) trade-off between performance and generality, depending on 
whether we implement them independently or via RowFilterInterface.

> Add endRow parameter to HClient#obtainScanner
> ---------------------------------------------
>
>                 Key: HADOOP-1439
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1439
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>            Reporter: stack
>            Assignee: stack
>            Priority: Minor
>
> Currently the HClient#obtainScanner looks like this:
> {code}
> public synchronized HScannerInterface obtainScanner(Text[] columns, Text 
> startRow) throws IOException;
> {code}
> Add an overload that allows specification of endRow:
> {code}
> public synchronized HScannerInterface obtainScanner(Text[] columns, Text 
> startRow, Text endRow) throws IOException;
> {code}
> Use Case: Table contains the whole web.  Client just wants to scan google's 
> pages.  Currently, client could cut off the scanner as soon as the row key 
> leaves the google domain but cleaner if {{HScannerInterface#next()}} returns 
> false

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
