Hi St. Ack,

***************************************************************************
1.
Firstly I need to thank you for your last reply, which prompted me to
re-check my code, and I did find a stupid problem.

In the map function of my old code I called

                HTable table = new HTable(conf, this.tableName);
                RowResult rowResult = table.getRow(key);

which basically means that for each row I had to create a new "connection"
to the table. This is awkward!

In my new code I create only one such "connection", during the job
configuration phase:

        private HTable table;

        public void configure(JobConf job) {
                String tableName = job.get(TABLENAME);
                try {
                        setTable(job, tableName);
                } catch (Exception e) {
                        LOG.error(e);
                }
        }

        protected void setTable(final JobConf job, final String tableName)
                        throws Exception {
                this.table = new HTable(new HBaseConfiguration(job), tableName);
        }

and then in map() I just call

                RowResult rowResult = this.table.getRow(msgid);

With this revision the job now runs very stably and takes 110 minutes to
read 10M records.
So for Q1, I can read 1M records in about 11 minutes, which looks OK.
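For anyone else hitting this, here is the same fix in a self-contained form. The HBase classes are replaced by a made-up stand-in (CountingTable is purely illustrative), so the sketch runs without a cluster; it just counts how often the expensive constructor fires when the handle is opened once in configure() and reused per record:

```java
import java.util.HashMap;
import java.util.Map;

public class ReuseSketch {
    /** Stand-in for HTable: construction is the expensive step. */
    static class CountingTable {
        static int constructions = 0;
        private final Map<String, String> rows = new HashMap<>();
        CountingTable() {
            constructions++;               // each instance = one "connection"
            rows.put("msg1", "row1");
        }
        String getRow(String key) { return rows.get(key); }
    }

    private CountingTable table;

    /** Analogous to configure(JobConf): open the table once per task. */
    void configure() { this.table = new CountingTable(); }

    /** Analogous to map(): reuse the already-open table for every record. */
    String map(String key) { return table.getRow(key); }

    public static void main(String[] args) {
        ReuseSketch mapper = new ReuseSketch();
        mapper.configure();
        for (int i = 0; i < 1000; i++) {
            mapper.map("msg1");            // 1000 "records", 1 construction
        }
        System.out.println("constructions=" + CountingTable.constructions);
    }
}
```

With the old per-row pattern the counter would read 1000; opening the handle once in configure() leaves it at 1, which is exactly where the 110-minute runtime came from.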

***************************************************************************
2.

I use the default FileInputFormat, so yes, the file is split into 26 pieces
(not 32; I don't know why) and each mapper processed about 0.31 million
records (~1/32nd of the 10M).
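On the 26-vs-32 puzzle: in Hadoop 0.18, FileInputFormat picks a split size of max(minSize, min(goalSize, blockSize)), where goalSize is the input size divided by the requested map count. When the requested count is small (the default mapred.map.tasks is 2), every split is one HDFS block, so 26 maps would simply mean the file spans about 26 blocks. A rough sketch of the arithmetic (the file and block sizes below are made-up numbers, not yours):

```java
public class SplitSizeSketch {
    // Mirrors Hadoop 0.18's FileInputFormat.computeSplitSize():
    // max(minSize, min(goalSize, blockSize))
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;      // default HDFS block size
        long totalSize = 26 * blockSize;         // hypothetical ~1.6 GB file
        int requestedMaps = 2;                   // default mapred.map.tasks
        long goalSize = totalSize / requestedMaps;
        long splitSize = computeSplitSize(goalSize, 1, blockSize);
        long numSplits = (totalSize + splitSize - 1) / splitSize;
        // goalSize >= blockSize, so each split is one block: 26 splits
        System.out.println("splitSize=" + splitSize + " numSplits=" + numSplits);
    }
}
```

If you had actually set mapred.map.tasks to 32, goalSize would drop below the block size and you would get at least 32 splits, so 26 suggests the hint was not applied.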

Yes, all eight boxes are running a regionserver. There are 48 regions in my
10M-record table.

>> When your MR that did A2. below ran, was the 'getting' distributed across
>> the regions of the table or were you banging on single region of the
>> table the whole time? 
Where can I check that? I think it should go across all regions, though,
because I need to read all 10M records out.

I use Hadoop 0.18.2 and HBase 0.18.1. 
Thanks for the answer to Q3 too. That is what I will try soon: build a
Lucene index and see if searching based on the index can speed up
column-based reading.

-- 
View this message in context: 
http://www.nabble.com/How-to-read-a-subset-of-records-based-on-a-column-value-in-a-M-R-job--tp20963771p21081633.html
Sent from the HBase User mailing list archive at Nabble.com.
