Hi. I'm a little confused. Please correct me if I'm wrong, but the MR job itself is "scanning" all rows in the table. The job is spread across the region servers, into multiple threads. Each thread gets some part of the rows that are stored on a particular region server. So the MR job is finished when the threads have passed over all rows. Filtering only helps the MR job to filter out non-relevant rows, but those rows are still checked (passed to the filter) anyway. This does not help much: the job still passes over all rows in the table. Calling a scanner inside the MR job will not prevent the job from passing over all rows; it will simply make the job heavier (as I understand it). Is that correct, Michael?
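To make the distinction concrete, here is a minimal sketch in plain Java. It is not the HBase API: a TreeMap stands in for a table sorted by row key, and the class and method names (RangeScanSketch, filteredFullScan, rangeScan) are made up for illustration. It compares a filtered pass over every row with a pass bounded by a start and stop key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustration only: a TreeMap stands in for a table sorted by row key.
// None of these names belong to HBase.
public class RangeScanSketch {

    /** Rows that matched, plus the number of rows that had to be examined. */
    static final class ScanResult {
        final List<String> rows = new ArrayList<>();
        int visited = 0;
    }

    /** Full pass: every row is handed to the filter, including the ones it rejects. */
    static ScanResult filteredFullScan(SortedMap<String, String> table, String prefix) {
        ScanResult result = new ScanResult();
        for (String key : table.keySet()) {
            result.visited++;                 // the row is examined regardless
            if (key.startsWith(prefix)) {     // the "filter"
                result.rows.add(key);
            }
        }
        return result;
    }

    /** Bounded pass: only keys in [startRow, stopRow) are examined at all. */
    static ScanResult rangeScan(SortedMap<String, String> table, String startRow, String stopRow) {
        ScanResult result = new ScanResult();
        for (String key : table.subMap(startRow, stopRow).keySet()) {
            result.visited++;
            result.rows.add(key);
        }
        return result;
    }

    /** Six hypothetical row keys, three of which start with "b". */
    static SortedMap<String, String> sampleTable() {
        SortedMap<String, String> table = new TreeMap<>();
        for (String key : new String[] {"a1", "a2", "b1", "b2", "b3", "c1"}) {
            table.put(key, "value");
        }
        return table;
    }

    public static void main(String[] args) {
        SortedMap<String, String> table = sampleTable();
        ScanResult full = filteredFullScan(table, "b");
        ScanResult ranged = rangeScan(table, "b", "c");
        System.out.println("filter: matched " + full.rows + ", visited " + full.visited);
        System.out.println("range:  matched " + ranged.rows + ", visited " + ranged.visited);
    }
}
```

In this sketch both methods return the same three rows, but the filtered pass examines all six rows while the bounded pass examines only three. The second behaviour is what I would like to get from the MR job.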
So, my question is: how can I tell the MR job to pass over only some rows, and not all rows?

Thank you and best regards.
Slava.

On Wed, Mar 4, 2009 at 8:57 AM, stack <[email protected]> wrote:
> On Tue, Mar 3, 2009 at 6:17 PM, schubert zhang <[email protected]> wrote:
>
> > Yes, we can tell HBase API only scan rows start with a key.
> >
>
> Would the filtering feature help here?
>
> http://hadoop.apache.org/hbase/docs/r0.19.0/api/org/apache/hadoop/hbase/filter/package-summary.html#package_description
>
> Scanners can be passed a filter (Read the description section on the above
> url).
>
> > Can any expert share your ideas about:
> > 1. If the rowkey is not chronological, how can I only process the newly
> > added/updated rows?
> >
>
> We don't have a means of asking for versions before a timestamp, only older
> (Can you add timestamp to your row key if you need this?)
>
> > 2. How can I remove the old rows which are inserted three months ago?
> >
>
> See above.
>
> St.Ack
>
