For example, I have an HBase table with 1 billion records. Each record has a
column named 'f1:testcol', and I want to feed only the records with
'f1:testcol'=0 into my map function. Supposing there are 1 million such
records, I would expect this to be much faster than reading all 1 billion
records into my map function and then doing the condition check there.
After searching this board and the HBase documentation, I tried implementing
my own subclass of TableInputFormat that sets a ColumnValueFilter in its
configure method:
public class TableInputFilterFormat extends TableInputFormat
    implements JobConfigurable {

  private final Log LOG = LogFactory.getLog(TableInputFilterFormat.class);
  public static final String FILTER_LIST = "hbase.mapred.tablefilters";

  public void configure(JobConf job) {
    // TableInputFormat reads the table name from the job's input path
    // and the columns to scan from COLUMN_LIST.
    Path[] tableNames = FileInputFormat.getInputPaths(job);
    String colArg = job.get(COLUMN_LIST);
    String[] colNames = colArg.split(" ");
    byte[][] m_cols = new byte[colNames.length][];
    for (int i = 0; i < m_cols.length; i++) {
      m_cols[i] = Bytes.toBytes(colNames[i]);
    }
    setInputColums(m_cols);  // (sic -- that is the method name in 0.18)

    // Only pass rows where f1:testcol equals "0" to the mappers.
    ColumnValueFilter filter = new ColumnValueFilter(
        Bytes.toBytes("f1:testcol"),
        ColumnValueFilter.CompareOp.EQUAL,
        Bytes.toBytes("0"));
    setRowFilter(filter);

    try {
      setHTable(new HTable(new HBaseConfiguration(job),
          tableNames[0].getName()));
    } catch (Exception e) {
      LOG.error(e);
    }
  }
}
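In case it matters, here is roughly how I wire the custom format into the job
driver. MyMapper, MyDriver, and the table name "mytable" are just placeholders
from my setup, and I am assuming TableInputFormat.COLUMN_LIST is the right key
for the column list:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(MyDriver.class);

    // Use the filtering subclass instead of the stock TableInputFormat.
    job.setInputFormat(TableInputFilterFormat.class);

    // configure() reads the table name from the input path...
    FileInputFormat.setInputPaths(job, new Path("mytable"));
    // ...and the columns to scan from COLUMN_LIST.
    job.set(TableInputFormat.COLUMN_LIST, "f1:testcol");

    job.setMapperClass(MyMapper.class);
    JobClient.runJob(job);
  }
}
```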
However, the M/R job with the RowFilter is much slower than the M/R job
without it. During the run many tasks fail with something like "Task
attempt_200812091733_0063_m_000019_1 failed to report status for 604
seconds. Killing!". Can a RowFilter really cut the records fed to the mappers
from 1 billion down to 1 million? If it cannot, is there any other
method to address this issue?
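As a stopgap for the timeout kills (not a fix for the underlying slowness),
I am considering raising mapred.task.timeout, since its default of 600000 ms
matches the 604-second kill message; the value below is just an example:

```xml
<!-- hadoop-site.xml (or set directly on the JobConf):
     raise the task timeout from the default 600000 ms -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
</property>
```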
I am using Hadoop 0.18.2 and HBase 0.18.1.
Thank you so much in advance!
--
View this message in context:
http://www.nabble.com/How-to-read-a-subset-of-records-based-on-a-column-value-in-a-M-R-job--tp20963771p20963771.html
Sent from the HBase User mailing list archive at Nabble.com.