Thanks Erik. What I want is to quickly return a small subset of records, either by row key values or by a specific value in a column, without reading all records into the Mapper. So I actually have two questions :)

For the column-based search: for example, I have 1 billion people records in the table, the row key is the "name", and there is an "age" column. Now I want to find the records with age=30. How can I avoid reading every record into the mapper and then filtering the output? (The first sketch below shows the kind of read I have in mind.)

For searching by row key values: suppose I have 1 million people's names. Is there a more efficient way than running table.getRow(name) 1 million times, given that the "name" strings are randomly distributed (and hence it is useless to write a new getSplits)? (The second sketch below shows what I am doing now.)

>> Did you try to only put that column in there for the rows that you want to
>> get and use that as an input to the MR?

I am not sure I get you there. I can use TableInputFormatBase.setInputColums in my program to only return the "age" column, but I still need to read every row from the table into the mapper. Or is my understanding wrong? Can you give more details on your thought? Thanks again.
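To make the first question concrete, here is a minimal sketch of the kind of filtered read I am after, reusing the ColumnValueFilter from my original post (quoted below). This is against the 0.18-era client API, and I am assuming the HTable.getScanner(columns, startRow, filter) overload exists in that form; the "people" table and "f1:age" column are just the example names from above.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.filter.ColumnValueFilter;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class AgeScan {
  public static void main(String[] args) throws IOException {
    // "people" is the example table from above, with row key = name.
    HTable table = new HTable(new HBaseConfiguration(), "people");
    // Ask the region servers to hand back only rows whose f1:age cell
    // equals "30".
    ColumnValueFilter filter = new ColumnValueFilter(
        Bytes.toBytes("f1:age"),
        ColumnValueFilter.CompareOp.EQUAL,
        Bytes.toBytes("30"));
    Scanner scanner = table.getScanner(
        new byte[][] { Bytes.toBytes("f1:age") },
        HConstants.EMPTY_START_ROW, filter);
    try {
      RowResult row;
      while ((row = scanner.next()) != null) {
        System.out.println(Bytes.toString(row.getRow()));
      }
    } finally {
      scanner.close();
    }
  }
}

What I cannot tell is whether such a filter makes the region servers skip work, or whether they still read all 1 billion rows and the filter only cuts down what is sent back to the client.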
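And for the second question, this is roughly the shape of my current job: the 1M names sit one per line in a text file on HDFS, and each map call does one random read via table.getRow. KeyLookupMapper and the output types are just illustrative names; I am sketching against the 0.18-era mapred API.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class KeyLookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private HTable table;

  public void configure(JobConf job) {
    try {
      table = new HTable(new HBaseConfiguration(job), "people");
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable offset, Text name,
      OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    reporter.progress();  // try to avoid the "no response in 600 seconds" kills
    // One random read per known key -- this is the slow part.
    RowResult row = table.getRow(Bytes.toBytes(name.toString()));
    if (row != null && !row.isEmpty()) {
      out.collect(name, new Text(row.toString()));
    }
  }
}

This is the pattern behind the roughly 3 hours for 10M gets I mention further down in the thread, so for 1M randomly distributed names I am hoping there is something smarter than one single-row lookup per key.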
Erik Holstad wrote:
>
> Hi Tigertail!
> Not sure if I understand your original problem correctly, but it seemed to
> me that you wanted to just get the rows with the value 1 in a column,
> right?
>
> Did you try to only put that column in there for the rows that you want to
> get and use that as an input to the MR?
>
> I haven't timed my MR jobs with this approach so I'm not sure how it is
> handled internally, but maybe it is worth giving it a try.
>
> Regards Erik
>
> On Wed, Dec 17, 2008 at 8:37 PM, tigertail <[email protected]> wrote:
>
>>
>> Hi St. Ack,
>>
>> Thanks for your input. I ran 32 map tasks (I have 8 boxes, each with 4
>> CPUs). Suppose the 1M row keys are known beforehand and saved in a file;
>> I just read each key into a mapper and use table.getRow(key) to get the
>> record. I also tried to increase the number of map tasks, but it did not
>> improve the performance. Actually, it got even worse: many tasks failed
>> or were killed with something like "no response in 600 seconds."
>>
>>
>> stack-3 wrote:
>> >
>> > For A2 below, how many map tasks? How did you split the 1M you wanted
>> > to fetch? How many of them ran concurrently?
>> > St.Ack
>> >
>> >
>> > tigertail wrote:
>> >> Hi, can anybody help? Hopefully the following can help make my
>> >> question clear if it was not in my last post.
>> >>
>> >> A1. I created a table in HBase and then inserted 10 million records
>> >> into the table.
>> >> A2. I ran an M/R program with a total of 10 million "get by rowkey"
>> >> operations to read the 10M records out, and it took about 3 hours to
>> >> finish.
>> >> A3. I also ran an M/R program which used TableMap to read the 10M
>> >> records out, and it took just 12 minutes.
>> >>
>> >> Now suppose I only need to read 1 million records whose row keys are
>> >> known beforehand (and let's suppose, in the worst case, the 1M
>> >> records are evenly distributed among the 10M records).
>> >>
>> >> S1. I can use 1M "get by rowkey" calls. But it is slow.
>> >> S2. I can also simply use TableMap and only output the 1M records in
>> >> the map function, but it actually reads the whole table.
>> >>
>> >> Q1. Is there some more efficient way to read the 1M records, WITHOUT
>> >> PASSING THROUGH THE WHOLE TABLE?
>> >>
>> >> How about if I have 1 billion records in an HBase table and I only
>> >> need to read 1 million records in the following two scenarios?
>> >>
>> >> Q2. Suppose their row keys are known beforehand.
>> >> Q3. Or suppose these 1 million records have the same value in a
>> >> column.
>> >>
>> >> Any input would be greatly appreciated.
>> >> Thank you so much!
>> >>
>> >>
>> >> tigertail wrote:
>> >>
>> >>> For example, I have an HBase table with 1 billion records. Each
>> >>> record has a column named 'f1:testcol', and I want to get only the
>> >>> records with 'f1:testcol'=0 as the input to my map function. Suppose
>> >>> there are 1 million such records; I would expect this to be much
>> >>> faster than getting all 1 billion records into my map function and
>> >>> then doing the condition check.
>> >>>
>> >>> By searching on this board and the HBase documents, I tried to
>> >>> implement my own subclass of TableInputFormat and set a
>> >>> ColumnValueFilter in the configure method:
>> >>>
>> >>> public class TableInputFilterFormat extends TableInputFormat
>> >>>     implements JobConfigurable {
>> >>>   private final Log LOG =
>> >>>       LogFactory.getLog(TableInputFilterFormat.class);
>> >>>
>> >>>   public static final String FILTER_LIST = "hbase.mapred.tablefilters";
>> >>>
>> >>>   public void configure(JobConf job) {
>> >>>     Path[] tableNames = FileInputFormat.getInputPaths(job);
>> >>>
>> >>>     String colArg = job.get(COLUMN_LIST);
>> >>>     String[] colNames = colArg.split(" ");
>> >>>     byte[][] m_cols = new byte[colNames.length][];
>> >>>     for (int i = 0; i < m_cols.length; i++) {
>> >>>       m_cols[i] = Bytes.toBytes(colNames[i]);
>> >>>     }
>> >>>     setInputColums(m_cols);
>> >>>
>> >>>     ColumnValueFilter filter = new ColumnValueFilter(
>> >>>         Bytes.toBytes("f1:testcol"),
>> >>>         ColumnValueFilter.CompareOp.EQUAL,
>> >>>         Bytes.toBytes("0"));
>> >>>     setRowFilter(filter);
>> >>>
>> >>>     try {
>> >>>       setHTable(new HTable(new HBaseConfiguration(job),
>> >>>           tableNames[0].getName()));
>> >>>     } catch (Exception e) {
>> >>>       LOG.error(e);
>> >>>     }
>> >>>   }
>> >>> }
>> >>>
>> >>> However, the M/R job with the RowFilter is much slower than the M/R
>> >>> job without the RowFilter. During the process many tasks failed with
>> >>> something like "Task attempt_200812091733_0063_m_000019_1 failed to
>> >>> report status for 604 seconds. Killing!". I am wondering whether
>> >>> RowFilter can really decrease the record feeding from 1 billion to 1
>> >>> million. If it cannot, is there any other method to address this
>> >>> issue?
>> >>>
>> >>> I am using Hadoop 0.18.2 and HBase 0.18.1.
>> >>>
>> >>> Thank you so much in advance!
