For example, I have an HBase table with 1 billion records. Each record has a
column named 'f1:testcol', and I want to feed only the records with
'f1:testcol'=0 into my map function. Supposing there are 1 million such
records, I would expect this to be much faster than reading all 1 billion
records into my map function and then doing the condition check there.
After searching this board and the HBase documentation, I tried implementing
my own subclass of TableInputFormat that sets a ColumnValueFilter in its
configure method:
public class TableInputFilterFormat extends TableInputFormat
    implements JobConfigurable {

  private final Log LOG = LogFactory.getLog(TableInputFilterFormat.class);
  public static final String FILTER_LIST = "hbase.mapred.tablefilters";

  public void configure(JobConf job) {
    // TableInputFormat reads the table name from the job's input path
    // and the columns to scan from COLUMN_LIST.
    Path[] tableNames = FileInputFormat.getInputPaths(job);
    String colArg = job.get(COLUMN_LIST);
    String[] colNames = colArg.split(" ");
    byte[][] m_cols = new byte[colNames.length][];
    for (int i = 0; i < m_cols.length; i++) {
      m_cols[i] = Bytes.toBytes(colNames[i]);
    }
    setInputColums(m_cols);  // (sic -- that is the method name in 0.18)

    // Only pass rows where f1:testcol equals "0" to the mappers.
    ColumnValueFilter filter = new ColumnValueFilter(
        Bytes.toBytes("f1:testcol"),
        ColumnValueFilter.CompareOp.EQUAL,
        Bytes.toBytes("0"));
    setRowFilter(filter);

    try {
      setHTable(new HTable(new HBaseConfiguration(job),
          tableNames[0].getName()));
    } catch (Exception e) {
      LOG.error(e);
    }
  }
}
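In case it matters, here is roughly how I wire the custom format into the job
driver. MyMapper, MyDriver, and the table name "mytable" are just placeholders
from my setup, and I am assuming TableInputFormat.COLUMN_LIST is the right key
for the column list:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(MyDriver.class);

    // Use the filtering subclass instead of the stock TableInputFormat.
    job.setInputFormat(TableInputFilterFormat.class);

    // configure() reads the table name from the input path...
    FileInputFormat.setInputPaths(job, new Path("mytable"));
    // ...and the columns to scan from COLUMN_LIST.
    job.set(TableInputFormat.COLUMN_LIST, "f1:testcol");

    job.setMapperClass(MyMapper.class);
    JobClient.runJob(job);
  }
}
```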
However, the M/R job with the RowFilter is much slower than the M/R job
without it. During the run many tasks fail with something like "Task
attempt_200812091733_0063_m_000019_1 failed to report status for 604
seconds. Killing!". Can a RowFilter really cut the records fed to the mappers
from 1 billion down to 1 million? If it cannot, is there any other
method to address this issue?
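As a stopgap for the timeout kills (not a fix for the underlying slowness),
I am considering raising mapred.task.timeout, since its default of 600000 ms
matches the 604-second kill message; the value below is just an example:

```xml
<!-- hadoop-site.xml (or set directly on the JobConf):
     raise the task timeout from the default 600000 ms -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
</property>
```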
I am using Hadoop 0.18.2 and HBase 0.18.1.
Thank you so much in advance!
--
View this message in context:
http://www.nabble.com/How-to-read-a-subset-of-records-based-on-a-column-value-in-a-M-R-job--tp20963771p20963771.html
Sent from the HBase User mailing list archive at Nabble.com.