[
https://issues.apache.org/jira/browse/HADOOP-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508658
]
James Kennedy commented on HADOOP-1439:
---------------------------------------
Michael suggested that, once finished, HADOOP-1531 (RowFilters) may be used to
achieve the above functionality.
As the RowFilter impl stands right now, applying a regexp to each key
encountered may be an expensive way to do it.
In the above example, even if the endRow functionality works, how do you know
where the end row is? How do you know when you leave the google domain?
It seems to me that there may be several restrictions a user may want to apply
to row-keys:
1) Specify a range. Use start/end keys assuming you know what they are.
2) Specify a range, use a start key and a "page size". This is useful for
retrieving data in pages, e.g. displaying to UI as user clicks next/last page.
3) Specify a criterion, e.g. regular expressions or more basic string comparison.
Fortunately my RowFilterInterface design can be used to generalize the above.
In the Google example, I could create a custom RowFilter implementation that
can do domain name comparison more efficiently than general regular expression
matching. Pass that via the client as you would any other RowFilter impl.
The only thing to make sure of is that the custom impl is on the classpath of
the HRegionServer too.
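A rough sketch of what such a custom filter might look like. The interface
below is a simplified stand-in, not the actual RowFilterInterface, and it
assumes String row keys that begin with the domain and a filter() that returns
true when a row should be excluded. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a row filter; NOT the actual RowFilterInterface.
// Assumed convention: filter() returns true when the row should be excluded.
interface SketchRowFilter {
    boolean filter(String rowKey);
}

// Excludes every row whose key does not start with the given prefix.
// A plain prefix comparison avoids evaluating a regular expression per key.
class PrefixRowFilter implements SketchRowFilter {
    private final String prefix;

    PrefixRowFilter(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public boolean filter(String rowKey) {
        return !rowKey.startsWith(prefix);
    }
}

class PrefixFilterDemo {
    // Applies the filter to already-sorted keys, roughly as a scanner loop might.
    static List<String> scan(List<String> sortedKeys, SketchRowFilter f) {
        List<String> kept = new ArrayList<>();
        for (String key : sortedKeys) {
            if (!f.filter(key)) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed key layout for illustration only: domain first, then path.
        List<String> keys = Arrays.asList(
            "com.example/", "com.google/", "com.google/maps", "org.apache/");
        System.out.println(scan(keys, new PrefixRowFilter("com.google")));
    }
}
```

Note that this loop still visits every key, including those past the google
rows; the filter only decides what is returned.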
For start/end range, you could have a custom RowFilter that checks for an exact
match on the end key. But this won't be as efficient as an explicit endRow
parameter because:
A) when RowFilter is not null, HRegion#HScanner is always going to have a
little more overhead even if the filter() implementation itself always just
returns false.
B) The filter isn't currently designed to stop the scanner when a certain
criterion is met. When it encounters the endRow, it will just loop through
the rest of the rows, filtering them all out, until it reaches the end of the
HRegion.
I think start/page range has the same issues. The only difference is that it
requires scan-lifetime state to count the number of (unfiltered?) rows
encountered. It still requires a stop condition trigger.
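To illustrate, here is a minimal sketch of a page filter that keeps a counter
for the lifetime of one scan. The interface is a hypothetical stand-in (not
the real RowFilterInterface), row keys are plain Strings, and the count covers
rows that passed the filter, which is one possible reading of "unfiltered".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical filter contract (an assumption, not the real RowFilterInterface):
// filter() returns true when the row should be excluded.
interface CountingRowFilter {
    boolean filter(String rowKey);
}

// Lets the first pageSize rows through, then excludes everything after them.
// The counter is scan-lifetime state, so one instance serves exactly one scan.
class PageRowFilter implements CountingRowFilter {
    private final int pageSize;
    private int accepted = 0;

    PageRowFilter(int pageSize) {
        this.pageSize = pageSize;
    }

    @Override
    public boolean filter(String rowKey) {
        if (accepted >= pageSize) {
            return true;
        }
        accepted++;
        return false;
    }
}

// Rows returned to the caller plus the number of rows the loop had to visit.
class PageScanResult {
    final List<String> rows;
    final int examined;

    PageScanResult(List<String> rows, int examined) {
        this.rows = rows;
        this.examined = examined;
    }
}

class PageScanDemo {
    // Scans sorted keys from startRow onward; with no stop trigger it runs to the end.
    static PageScanResult scan(List<String> sortedKeys, String startRow, CountingRowFilter f) {
        List<String> page = new ArrayList<>();
        int examined = 0;
        for (String key : sortedKeys) {
            if (key.compareTo(startRow) < 0) {
                continue; // still before the start key
            }
            examined++;
            if (!f.filter(key)) {
                page.add(key);
            }
        }
        return new PageScanResult(page, examined);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e", "f");
        PageScanResult r = scan(keys, "b", new PageRowFilter(2));
        System.out.println(r.rows + " examined=" + r.examined);
    }
}
```

With a page size of 2 starting at "b", the page is complete after two rows, yet
the loop still examines all five rows from "b" to "f".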
If I add that stop condition trigger functionality to the RowFilterInterface
and update HScanner to use it, we could have a number of built-in RowFilter
implementations that deal with restrictions like those above.
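One possible shape for such a trigger, as a sketch only. The shouldStop()
method, the interface, and the class names are all hypothetical, row keys are
plain Strings, and the end key is treated as exclusive purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical filter contract with a stop trigger (not the real RowFilterInterface).
interface StoppableRowFilter {
    // Returns true when the row should be excluded.
    boolean filter(String rowKey);

    // Hypothetical trigger: true once no later row can pass the filter.
    boolean shouldStop();
}

// Excludes the end key and everything after it, and asks the scan to stop there.
class EndRowFilter implements StoppableRowFilter {
    private final String endRow;
    private boolean done = false;

    EndRowFilter(String endRow) {
        this.endRow = endRow;
    }

    @Override
    public boolean filter(String rowKey) {
        if (rowKey.compareTo(endRow) >= 0) {
            done = true; // keys are sorted, so nothing further can match
        }
        return done;
    }

    @Override
    public boolean shouldStop() {
        return done;
    }
}

// Rows returned to the caller plus the number of rows the loop had to visit.
class StopScanResult {
    final List<String> rows;
    final int examined;

    StopScanResult(List<String> rows, int examined) {
        this.rows = rows;
        this.examined = examined;
    }
}

class StopScanDemo {
    // Scanner loop that checks the stop trigger after each row.
    static StopScanResult scan(List<String> sortedKeys, StoppableRowFilter f) {
        List<String> rows = new ArrayList<>();
        int examined = 0;
        for (String key : sortedKeys) {
            examined++;
            if (!f.filter(key)) {
                rows.add(key);
            }
            if (f.shouldStop()) {
                break; // skip the remainder of the region
            }
        }
        return new StopScanResult(rows, examined);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e", "f");
        StopScanResult r = scan(keys, new EndRowFilter("d"));
        System.out.println(r.rows + " examined=" + r.examined);
    }
}
```

Here the loop visits four of the six keys and stops at "d" instead of running
to the end. The per-row filter call is still there, which is the overhead
mentioned in A).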
WRT simple restrictions like start/end/page parameters, there will still be a
(perhaps small) trade-off between performance and generality, depending on
whether we implement them independently or via RowFilterInterface.
> Add endRow parameter to HClient#obtainScanner
> ---------------------------------------------
>
> Key: HADOOP-1439
> URL: https://issues.apache.org/jira/browse/HADOOP-1439
> Project: Hadoop
> Issue Type: Improvement
> Components: contrib/hbase
> Reporter: stack
> Assignee: stack
> Priority: Minor
>
> Currently the HClient#obtainScanner looks like this:
> {code}
> public synchronized HScannerInterface obtainScanner(Text[] columns, Text
> startRow) throws IOException;
> {code}
> Add an overload that allows specification of endRow:
> {code}
> public synchronized HScannerInterface obtainScanner(Text[] columns, Text
> startRow, Text endRow) throws IOException;
> {code}
> Use Case: Table contains the whole web. Client just wants to scan google's
> pages. Currently, client could cut off the scanner as soon as the row key
> leaves the google domain but cleaner if {{HScannerInterface#next()}} returns
> false