tigertail wrote:
Hi St. Ack,
Thanks for your input. I ran 32 map tasks (I have 8 boxes, each with 4 CPUs).
The 1M row keys are known beforehand and saved in a file; I just read each
key in a mapper and use table.getRow(key) to get the record.
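For what it's worth, the mapper is basically the following (a stripped-down
sketch against the 0.18 API; the table name "mytable" and the Text output are
placeholders, not my real code):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class GetByKeyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private HTable table;

  public void configure(JobConf job) {
    try {
      // "mytable" is a placeholder for the real table name
      table = new HTable(new HBaseConfiguration(job), "mytable");
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable offset, Text line,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // each input line holds one of the 1M known row keys
    RowResult row = table.getRow(Bytes.toBytes(line.toString()));
    if (row != null) {
      output.collect(line, new Text(row.toString()));
    }
  }
}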
So you used something like TextInputFormat, and your file was split into
32 pieces? You looked at each mapper's stats and each seems to process
only its 1/32nd part?
Are all eight boxes running a regionserver? How many regions are in your
table of 10M? When the MR job that did A2. below ran, was the 'getting'
distributed across the regions of the table, or were you banging on a
single region of the table the whole time?
Are you on hbase 0.18.0 or on hbase TRUNK?
On Q1 below, you should be able to just do gets on each individual
item. On Q3, you need to use one of the secondary indexing mechanisms
if you want to avoid scanning them all.
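A hand-rolled version of that kind of index, kept alongside the data table,
looks roughly like the sketch below (the table names, column names, and key
layout are made up for illustration; this is not a built-in HBase feature):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.util.Bytes;

public class ManualIndexSketch {
  // rowKey and value are kept as Strings just to keep the sketch short
  public static void writeRow(String rowKey, String value) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable data  = new HTable(conf, "mytable");        // assumed data table
    HTable index = new HTable(conf, "mytable_index");  // assumed index table

    // write the data row as usual
    BatchUpdate du = new BatchUpdate(rowKey);
    du.put("f1:testcol", Bytes.toBytes(value));
    data.commit(du);

    // index row key = column value + original row key, so all rows sharing
    // a 'f1:testcol' value sit next to each other in the index table and can
    // be read with a short scan starting at that value, instead of a
    // full-table pass over the data table
    BatchUpdate iu = new BatchUpdate(value + "|" + rowKey);
    iu.put("info:row", Bytes.toBytes(rowKey));
    index.commit(iu);
  }
}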
St.Ack
I also tried to increase the # of map tasks, but it did not improve the
performance. Actually, it got even worse: many tasks failed or were killed
with something like "no response in 600 seconds."
stack-3 wrote:
For A2. below, how many map tasks? How did you split the 1M you wanted
to fetch? How many of them ran concurrently?
St.Ack
tigertail wrote:
Hi, can anybody help? Hopefully the following makes my question clearer,
in case it was not clear in my last post.
A1. I created a table in HBase and inserted 10 million records into it.
A2. I ran an M/R program with a total of 10 million "get by rowkey"
operations to read the 10M records out, and it took about 3 hours to finish.
A3. I also ran an M/R program which used TableMap to read the 10M records
out, and it took just 12 minutes.
Now suppose I only need to read 1 million records whose row keys are known
beforehand (and let's assume the worst case, where the 1M records are evenly
distributed among the 10M records).
S1. I can use 1M "get by rowkey" operations, but that is slow.
S2. I can also simply use TableMap and only output the 1M records in the
map function, but that actually reads the whole table.
Q1. Is there a more efficient way to read the 1M records, WITHOUT PASSING
THROUGH THE WHOLE TABLE?
How about if I have 1 billion records in an HBase table and only need to
read 1 million of them, in the following two scenarios?
Q2. Suppose their row keys are known beforehand.
Q3. Or suppose these 1 million records share the same value in a column.
Any input would be greatly appreciated. Thank you so much!
tigertail wrote:
For example, I have an HBase table with 1 billion records. Each record has
a column named 'f1:testcol', and I want to get only the records with
'f1:testcol'=0 as the input to my map function. Supposing there are 1
million such records, I would expect this to be much faster than getting
all 1 billion records into my map function and then doing the condition
check.
Based on searching this board and the HBase documents, I tried to implement
my own subclass of TableInputFormat and set a ColumnValueFilter in the
configure method.
public class TableInputFilterFormat extends TableInputFormat
    implements JobConfigurable {

  private final Log LOG = LogFactory.getLog(TableInputFilterFormat.class);

  public static final String FILTER_LIST = "hbase.mapred.tablefilters";

  public void configure(JobConf job) {
    // the table name is passed in as the job's input path
    Path[] tableNames = FileInputFormat.getInputPaths(job);

    // the columns to scan come from the standard COLUMN_LIST property
    String colArg = job.get(COLUMN_LIST);
    String[] colNames = colArg.split(" ");
    byte[][] m_cols = new byte[colNames.length][];
    for (int i = 0; i < m_cols.length; i++) {
      m_cols[i] = Bytes.toBytes(colNames[i]);
    }
    setInputColums(m_cols);

    // only feed rows where 'f1:testcol' equals "0" to the mappers
    ColumnValueFilter filter = new ColumnValueFilter(
        Bytes.toBytes("f1:testcol"),
        ColumnValueFilter.CompareOp.EQUAL,
        Bytes.toBytes("0"));
    setRowFilter(filter);

    try {
      setHTable(new HTable(new HBaseConfiguration(job),
          tableNames[0].getName()));
    } catch (Exception e) {
      LOG.error(e);
    }
  }
}
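The driver side is wired up roughly as follows (a sketch only; MyTableMap,
the output path, and the table name are placeholders for my actual job
classes and settings):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.mapred.*;

public class FilteredScanDriver {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(new HBaseConfiguration(), FilteredScanDriver.class);
    job.setJobName("filtered-scan");

    // TableInputFormat takes the table name as the job's "input path"
    FileInputFormat.setInputPaths(job, new Path("mytable"));
    job.set(TableInputFormat.COLUMN_LIST, "f1:testcol");

    job.setInputFormat(TableInputFilterFormat.class);
    job.setMapperClass(MyTableMap.class);   // placeholder mapper class
    job.setNumReduceTasks(0);

    // output types must match whatever the mapper emits
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(RowResult.class);
    job.setOutputFormat(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/filtered-out"));

    JobClient.runJob(job);
  }
}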
However, the M/R job with the RowFilter is much slower than the M/R job
without it. During the run, many tasks fail with something like "Task
attempt_200812091733_0063_m_000019_1 failed to report status for 604
seconds. Killing!". I am wondering whether a RowFilter can really reduce
the records fed to the mappers from 1 billion to 1 million. If it cannot,
is there any other method to address this issue?
I am using Hadoop 0.18.2 and HBase 0.18.1.
Thank you so much in advance!