Thanks Erik. What I want is to quickly return a small subset of records, either by row key values or by a specific value in a column, without reading all records into the Mapper. So I actually have two questions :)

For the column-based search: for example, I have 1 billion people records in the table, the row key is the "name", and there is an "age" column. Now I want to find the records with age=30. How can I avoid reading every record into the mapper and then filtering the output? (The first sketch below shows the kind of read I have in mind.)

For searching by row key values: suppose I have 1 million people's names. Is there a more efficient way than running table.getRow(name) 1 million times, given that the "name" strings are randomly distributed (and hence it is useless to write a new getSplits)? (The second sketch below shows what I am doing now.)

>> Did you try to only put that column in there for the rows that you want to
>> get and use that as an input to the MR?

I am not sure I get you there. I can use TableInputFormatBase.setInputColums in my program to only return the "age" column, but I still need to read every row from the table into the mapper. Or is my understanding wrong? Can you give more details on your thought? Thanks again.
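To make the first question concrete, here is a minimal sketch of the kind of filtered read I am after, reusing the ColumnValueFilter from my original post (quoted below). This is against the 0.18-era client API, and I am assuming the HTable.getScanner(columns, startRow, filter) overload exists in that form; the "people" table and "f1:age" column are just the example names from above.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.filter.ColumnValueFilter;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class AgeScan {
  public static void main(String[] args) throws IOException {
    // "people" is the example table from above, with row key = name.
    HTable table = new HTable(new HBaseConfiguration(), "people");
    // Ask the region servers to hand back only rows whose f1:age cell
    // equals "30".
    ColumnValueFilter filter = new ColumnValueFilter(
        Bytes.toBytes("f1:age"),
        ColumnValueFilter.CompareOp.EQUAL,
        Bytes.toBytes("30"));
    Scanner scanner = table.getScanner(
        new byte[][] { Bytes.toBytes("f1:age") },
        HConstants.EMPTY_START_ROW, filter);
    try {
      RowResult row;
      while ((row = scanner.next()) != null) {
        System.out.println(Bytes.toString(row.getRow()));
      }
    } finally {
      scanner.close();
    }
  }
}

What I cannot tell is whether such a filter makes the region servers skip work, or whether they still read all 1 billion rows and the filter only cuts down what is sent back to the client.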
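And for the second question, this is roughly the shape of my current job: the 1M names sit one per line in a text file on HDFS, and each map call does one random read via table.getRow. KeyLookupMapper and the output types are just illustrative names; I am sketching against the 0.18-era mapred API.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class KeyLookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private HTable table;

  public void configure(JobConf job) {
    try {
      table = new HTable(new HBaseConfiguration(job), "people");
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable offset, Text name,
      OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    reporter.progress();  // try to avoid the "no response in 600 seconds" kills
    // One random read per known key -- this is the slow part.
    RowResult row = table.getRow(Bytes.toBytes(name.toString()));
    if (row != null && !row.isEmpty()) {
      out.collect(name, new Text(row.toString()));
    }
  }
}

This is the pattern behind the roughly 3 hours for 10M gets I mention further down in the thread, so for 1M randomly distributed names I am hoping there is something smarter than one single-row lookup per key.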
Erik Holstad wrote:
>
> Hi Tigertail!
> Not sure if I understand your original problem correctly, but it seemed to
> me that you wanted to just get the rows with the value 1 in a column,
> right?
>
> Did you try to only put that column in there for the rows that you want to
> get and use that as an input to the MR?
>
> I haven't timed my MR jobs with this approach so I'm not sure how it is
> handled internally, but maybe it is worth giving it a try.
>
> Regards Erik
>
> On Wed, Dec 17, 2008 at 8:37 PM, tigertail <[email protected]> wrote:
>
>>
>> Hi St. Ack,
>>
>> Thanks for your input. I ran 32 map tasks (I have 8 boxes, each with 4
>> CPUs). Suppose the 1M row keys are known beforehand and saved in a file;
>> I just read each key into a mapper and use table.getRow(key) to get the
>> record. I also tried to increase the number of map tasks, but it did not
>> improve the performance. Actually, it got even worse: many tasks failed
>> or were killed with something like "no response in 600 seconds."
>>
>>
>> stack-3 wrote:
>> >
>> > For A2 below, how many map tasks? How did you split the 1M you wanted
>> > to fetch? How many of them ran concurrently?
>> > St.Ack
>> >
>> >
>> > tigertail wrote:
>> >> Hi, can anybody help? Hopefully the following can help make my
>> >> question clear if it was not in my last post.
>> >>
>> >> A1. I created a table in HBase and then inserted 10 million records
>> >> into the table.
>> >> A2. I ran an M/R program with a total of 10 million "get by rowkey"
>> >> operations to read the 10M records out, and it took about 3 hours to
>> >> finish.
>> >> A3. I also ran an M/R program which used TableMap to read the 10M
>> >> records out, and it took just 12 minutes.
>> >>
>> >> Now suppose I only need to read 1 million records whose row keys are
>> >> known beforehand (and let's suppose, in the worst case, the 1M
>> >> records are evenly distributed among the 10M records).
>> >>
>> >> S1. I can use 1M "get by rowkey" calls. But it is slow.
>> >> S2. I can also simply use TableMap and only output the 1M records in
>> >> the map function, but it actually reads the whole table.
>> >>
>> >> Q1. Is there some more efficient way to read the 1M records, WITHOUT
>> >> PASSING THROUGH THE WHOLE TABLE?
>> >>
>> >> How about if I have 1 billion records in an HBase table and I only
>> >> need to read 1 million records in the following two scenarios?
>> >>
>> >> Q2. Suppose their row keys are known beforehand.
>> >> Q3. Or suppose these 1 million records have the same value in a
>> >> column.
>> >>
>> >> Any input would be greatly appreciated.
>> >> Thank you so much!
>> >>
>> >>
>> >> tigertail wrote:
>> >>
>> >>> For example, I have an HBase table with 1 billion records. Each
>> >>> record has a column named 'f1:testcol', and I want to get only the
>> >>> records with 'f1:testcol'=0 as the input to my map function. Suppose
>> >>> there are 1 million such records; I would expect this to be much
>> >>> faster than getting all 1 billion records into my map function and
>> >>> then doing the condition check.
>> >>>
>> >>> By searching on this board and the HBase documents, I tried to
>> >>> implement my own subclass of TableInputFormat and set a
>> >>> ColumnValueFilter in the configure method:
>> >>>
>> >>> public class TableInputFilterFormat extends TableInputFormat
>> >>>     implements JobConfigurable {
>> >>>   private final Log LOG =
>> >>>       LogFactory.getLog(TableInputFilterFormat.class);
>> >>>
>> >>>   public static final String FILTER_LIST = "hbase.mapred.tablefilters";
>> >>>
>> >>>   public void configure(JobConf job) {
>> >>>     Path[] tableNames = FileInputFormat.getInputPaths(job);
>> >>>
>> >>>     String colArg = job.get(COLUMN_LIST);
>> >>>     String[] colNames = colArg.split(" ");
>> >>>     byte[][] m_cols = new byte[colNames.length][];
>> >>>     for (int i = 0; i < m_cols.length; i++) {
>> >>>       m_cols[i] = Bytes.toBytes(colNames[i]);
>> >>>     }
>> >>>     setInputColums(m_cols);
>> >>>
>> >>>     ColumnValueFilter filter = new ColumnValueFilter(
>> >>>         Bytes.toBytes("f1:testcol"),
>> >>>         ColumnValueFilter.CompareOp.EQUAL,
>> >>>         Bytes.toBytes("0"));
>> >>>     setRowFilter(filter);
>> >>>
>> >>>     try {
>> >>>       setHTable(new HTable(new HBaseConfiguration(job),
>> >>>           tableNames[0].getName()));
>> >>>     } catch (Exception e) {
>> >>>       LOG.error(e);
>> >>>     }
>> >>>   }
>> >>> }
>> >>>
>> >>> However, the M/R job with the RowFilter is much slower than the M/R
>> >>> job without the RowFilter. During the process many tasks failed with
>> >>> something like "Task attempt_200812091733_0063_m_000019_1 failed to
>> >>> report status for 604 seconds. Killing!". I am wondering whether
>> >>> RowFilter can really decrease the record feeding from 1 billion to 1
>> >>> million. If it cannot, is there any other method to address this
>> >>> issue?
>> >>>
>> >>> I am using Hadoop 0.18.2 and HBase 0.18.1.
>> >>>
>> >>> Thank you so much in advance!
