Cedric Ho wrote:
Hi all,
I am using 0.18.0 and have successfully used data from an HBase table as
input to my map/reduce job.
I wonder how to specify a subset of records from a table instead of
taking all records as input, such as a range of row keys or maybe
specific values of certain columns.
You'll have to subclass the TableInputFormat.
There is an example in the javadoc on subclassing TIF:
http://hadoop.apache.org/hbase/docs/r0.18.0/api/org/apache/hadoop/hbase/mapred/TableInputFormatBase.html
(Sorry, the example is mangled in the rendered page. View the HTML source
to see the ungarbled code).
The example shows you how to set a filter. Filters can filter on rows
and values.
To work against a subset, you'd probably need to override getSplits in
your subclass. By default, it basically returns as many splits as there
are regions in your table, so it always covers the whole table. Filters
could stop unwanted rows from being returned, but maybe it's better if the
rows weren't considered in the first place; hence the need to override
getSplits.
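
For what it's worth, here is a rough sketch of the pruning a getSplits override would need to do. It is a plain-Java model only: RowRangeSplits, Range and prune are made-up names, not HBase classes, and it uses String keys compared with String.compareTo where HBase compares byte arrays. It assumes each region covers a half-open range [start, end) and that an empty key means "start of table" or "end of table". Regions outside the wanted range produce no split; regions that straddle it are clipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of split pruning. None of these names are HBase API.
final class RowRangeSplits {

    /** Half-open row-key range [start, end); "" means unbounded on that side. */
    static final class Range {
        final String start;
        final String end;

        Range(String start, String end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    // Later of two start keys. "" sorts before every other key, so a plain
    // comparison already treats it as "start of table".
    static String laterStart(String a, String b) {
        return a.compareTo(b) >= 0 ? a : b;
    }

    // Earlier of two end keys. "" means "end of table", so it loses to any
    // real key.
    static String earlierEnd(String a, String b) {
        if (a.isEmpty()) return b;
        if (b.isEmpty()) return a;
        return a.compareTo(b) <= 0 ? a : b;
    }

    /** One split per region that overlaps 'wanted', clipped to 'wanted'. */
    static List<Range> prune(List<Range> regions, Range wanted) {
        List<Range> splits = new ArrayList<>();
        for (Range region : regions) {
            String start = laterStart(region.start, wanted.start);
            String end = earlierEnd(region.end, wanted.end);
            // Keep the split only if the clipped range is non-empty.
            if (end.isEmpty() || start.compareTo(end) < 0) {
                splits.add(new Range(start, end));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Three regions covering the whole table.
        List<Range> regions = Arrays.asList(
            new Range("", "g"), new Range("g", "p"), new Range("p", ""));
        // Only rows in [k, r) are wanted, so the first region is skipped.
        System.out.println(prune(regions, new Range("k", "r")));
        // prints [[k,p), [p,r)]
    }
}
```

In a real subclass the region boundaries would come from the table and each surviving range would become one split handed back from getSplits; a filter can still be set on top of this to drop rows by value within the range.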
St.Ack