Re: MR Job question

Slava Gorelik Wed, 04 Mar 2009 00:59:14 -0800

Hi. I'm a little bit confused.

Please correct me if I'm wrong, but an MR job itself "scans" all rows in
the table. The job is spread across the region servers and runs in multiple
threads. Each thread gets some part of the rows that are stored on a
particular region server, so the MR job finishes when the threads have
passed over all rows. Filtering only helps the MR job discard non-relevant
rows; those rows are still checked (passed to the filter), so it doesn't
help much: the job still passes over every row in the table. Calling a
scanner inside the MR job will not prevent the job from passing over all
rows; it will simply make the job heavier (as I understand it). Is that
correct, Michael?
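
To make the distinction above concrete, here is a toy sketch in plain Java. It does not use the HBase API; a sorted in-memory TreeMap stands in for a table ordered by row key, and the names RowScanSketch, filterScan, rangeScan and sampleTable are invented for illustration. It only shows the general idea that a filter examines every row, while a scan bounded by a start and stop key examines just the rows in that key range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: a TreeMap plays the role of a table sorted by row key.
public class RowScanSketch {
    // Number of rows the most recent scan had to examine.
    static int visited;

    // Filter-style scan: every row is examined, non-matching rows are dropped.
    static List<String> filterScan(TreeMap<String, String> table, String prefix) {
        visited = 0;
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : table.entrySet()) {
            visited++;
            if (e.getKey().startsWith(prefix)) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    // Range-style scan: only keys in [startRow, stopRow) are examined.
    static List<String> rangeScan(TreeMap<String, String> table,
                                  String startRow, String stopRow) {
        visited = 0;
        List<String> out = new ArrayList<>();
        for (String key : table.subMap(startRow, true, stopRow, false).keySet()) {
            visited++;
            out.add(key);
        }
        return out;
    }

    // Hypothetical sample data: 100 rows keyed row000 .. row099.
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> t = new TreeMap<>();
        for (int i = 0; i < 100; i++) {
            t.put(String.format("row%03d", i), "v" + i);
        }
        return t;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = sampleTable();
        List<String> byFilter = filterScan(table, "row01");
        System.out.println("filter: matched " + byFilter.size() + ", examined " + visited);
        List<String> byRange = rangeScan(table, "row010", "row020");
        System.out.println("range:  matched " + byRange.size() + ", examined " + visited);
    }
}
```

Both calls return the same ten keys, but the filter-style scan examines all 100 rows while the range-style scan examines only the ten in the key range. Whether an MR job over an HBase table can be restricted this way depends on how its input is configured, which is the question asked below.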


So, my question is: how can I tell the MR job to pass over only some of the
rows rather than all of them?

Thank You and Best Regards.
Slava.


On Wed, Mar 4, 2009 at 8:57 AM, stack <[email protected]> wrote:

> On Tue, Mar 3, 2009 at 6:17 PM, schubert zhang <[email protected]> wrote:
>
> > Yes, we can tell the HBase API to scan only rows that start with a given key.
> >
>
> Would the filtering feature help here?
>
>
> http://hadoop.apache.org/hbase/docs/r0.19.0/api/org/apache/hadoop/hbase/filter/package-summary.html#package_description
>
> Scanners can be passed a filter (Read the description section on the above
> url).
>
>
> > Can any expert share your ideas about:
> > 1. If the rowkey is not chronological, how can I only process the newly
> > added/updated rows?
>
>
> We don't have a means of asking for versions before a timestamp, only older.
> (Can you add a timestamp to your row key if you need this?)
>
>
> > 2. How can I remove the old rows which are inserted three months ago?
> >
>
> See above.
>
> St.Ack
>
