tigertail wrote:
Hi St. Ack,
Thanks for your input. I ran 32 map tasks (I have 8 boxes, each with 4 CPUs).
The 1M row keys are known beforehand and saved in a file; I just read each
key in a mapper and use table.getRow(key) to get the record.
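For what it's worth, the mapper is basically the following (a stripped-down
sketch against the 0.18 API; the table name "mytable" and the Text output are
placeholders, not my real code):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class GetByKeyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private HTable table;

  public void configure(JobConf job) {
    try {
      // "mytable" is a placeholder for the real table name
      table = new HTable(new HBaseConfiguration(job), "mytable");
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable offset, Text line,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // each input line holds one of the 1M known row keys
    RowResult row = table.getRow(Bytes.toBytes(line.toString()));
    if (row != null) {
      output.collect(line, new Text(row.toString()));
    }
  }
}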
So you used something like TextInputFormat, and your file was split into
32 pieces? You looked at each mapper's stats and each seems to process
only its 1/32nd part?
Are all eight boxes running a regionserver? How many regions are in your
table of 10M? When the MR job that did A2. below ran, was the 'getting'
distributed across the regions of the table, or were you banging on a
single region of the table the whole time?
Are you on hbase 0.18.0 or on hbase TRUNK?
On Q1 below, you should be able to just do gets on each individual
item. On Q3, you need to use one of the secondary indexing mechanisms
if you want to avoid scanning them all.
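A hand-rolled version of that kind of index, kept alongside the data table,
looks roughly like the sketch below (the table names, column names, and key
layout are made up for illustration; this is not a built-in HBase feature):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.util.Bytes;

public class ManualIndexSketch {
  // rowKey and value are kept as Strings just to keep the sketch short
  public static void writeRow(String rowKey, String value) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable data  = new HTable(conf, "mytable");        // assumed data table
    HTable index = new HTable(conf, "mytable_index");  // assumed index table

    // write the data row as usual
    BatchUpdate du = new BatchUpdate(rowKey);
    du.put("f1:testcol", Bytes.toBytes(value));
    data.commit(du);

    // index row key = column value + original row key, so all rows sharing
    // a 'f1:testcol' value sit next to each other in the index table and can
    // be read with a short scan starting at that value, instead of a
    // full-table pass over the data table
    BatchUpdate iu = new BatchUpdate(value + "|" + rowKey);
    iu.put("info:row", Bytes.toBytes(rowKey));
    index.commit(iu);
  }
}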
St.Ack
I also tried to increase the # of map tasks, but it did not improve the
performance. Actually, it got even worse: many tasks failed or were killed
with something like "no response in 600 seconds."
stack-3 wrote:
For A2. below, how many map tasks? How did you split the 1M you wanted
to fetch? How many of them ran concurrently?
St.Ack
tigertail wrote:
Hi, can anybody help? Hopefully the following makes my question clearer,
in case it was not clear in my last post.
A1. I created a table in HBase and inserted 10 million records into it.
A2. I ran an M/R program with a total of 10 million "get by rowkey"
operations to read the 10M records out, and it took about 3 hours to finish.
A3. I also ran an M/R program which used TableMap to read the 10M records
out, and it took just 12 minutes.
Now suppose I only need to read 1 million records whose row keys are known
beforehand (and let's assume the worst case, where the 1M records are evenly
distributed among the 10M records).
S1. I can use 1M "get by rowkey" operations, but that is slow.
S2. I can also simply use TableMap and only output the 1M records in the
map function, but that actually reads the whole table.
Q1. Is there a more efficient way to read the 1M records, WITHOUT PASSING
THROUGH THE WHOLE TABLE?
How about if I have 1 billion records in an HBase table and only need to
read 1 million of them, in the following two scenarios?
Q2. Suppose their row keys are known beforehand.
Q3. Or suppose these 1 million records share the same value in a column.
Any input would be greatly appreciated. Thank you so much!
tigertail wrote:
For example, I have an HBase table with 1 billion records. Each record has
a column named 'f1:testcol', and I want to get only the records with
'f1:testcol'=0 as the input to my map function. Supposing there are 1
million such records, I would expect this to be much faster than getting
all 1 billion records into my map function and then doing the condition
check.
Based on searching this board and the HBase documents, I tried to implement
my own subclass of TableInputFormat and set a ColumnValueFilter in the
configure method.
public class TableInputFilterFormat extends TableInputFormat
    implements JobConfigurable {

  private final Log LOG = LogFactory.getLog(TableInputFilterFormat.class);

  public static final String FILTER_LIST = "hbase.mapred.tablefilters";

  public void configure(JobConf job) {
    // the table name is passed in as the job's input path
    Path[] tableNames = FileInputFormat.getInputPaths(job);

    // the columns to scan come from the standard COLUMN_LIST property
    String colArg = job.get(COLUMN_LIST);
    String[] colNames = colArg.split(" ");
    byte[][] m_cols = new byte[colNames.length][];
    for (int i = 0; i < m_cols.length; i++) {
      m_cols[i] = Bytes.toBytes(colNames[i]);
    }
    setInputColums(m_cols);

    // only feed rows where 'f1:testcol' equals "0" to the mappers
    ColumnValueFilter filter = new ColumnValueFilter(
        Bytes.toBytes("f1:testcol"),
        ColumnValueFilter.CompareOp.EQUAL,
        Bytes.toBytes("0"));
    setRowFilter(filter);

    try {
      setHTable(new HTable(new HBaseConfiguration(job),
          tableNames[0].getName()));
    } catch (Exception e) {
      LOG.error(e);
    }
  }
}
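The driver side is wired up roughly as follows (a sketch only; MyTableMap,
the output path, and the table name are placeholders for my actual job
classes and settings):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.mapred.*;

public class FilteredScanDriver {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(new HBaseConfiguration(), FilteredScanDriver.class);
    job.setJobName("filtered-scan");

    // TableInputFormat takes the table name as the job's "input path"
    FileInputFormat.setInputPaths(job, new Path("mytable"));
    job.set(TableInputFormat.COLUMN_LIST, "f1:testcol");

    job.setInputFormat(TableInputFilterFormat.class);
    job.setMapperClass(MyTableMap.class);   // placeholder mapper class
    job.setNumReduceTasks(0);

    // output types must match whatever the mapper emits
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(RowResult.class);
    job.setOutputFormat(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/filtered-out"));

    JobClient.runJob(job);
  }
}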
However, the M/R job with the RowFilter is much slower than the M/R job
without it. During the run, many tasks fail with something like "Task
attempt_200812091733_0063_m_000019_1 failed to report status for 604
seconds. Killing!". I am wondering whether a RowFilter can really reduce
the records fed to the mappers from 1 billion to 1 million. If it cannot,
is there any other method to address this issue?
I am using Hadoop 0.18.2 and HBase 0.18.1.
Thank you so much in advance!