Re: MR Job question

schubert zhang Wed, 04 Mar 2009 07:27:27 -0800

Hi Slava, I mean you should write it yourself; the MapReduce code in HBase
is just an example. Please study how to write a MapReduce job.
You should implement the following yourself:
1. how to split the input dataset (InputSplit)
2. how to read each record of each split in each mapper (RecordReader)
3. your own InputFormat
4. the mapper and reducer classes
5. how to write each output record (RecordWriter)
6. your own OutputFormat
........
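
To illustrate item 1, here is a minimal sketch in plain Java. It uses no
Hadoop or HBase classes, and RowRangeSplits, Split and splitsFor are made-up
names for illustration only. It shows how a custom InputFormat could produce
splits only for the regions that overlap a startRow/endRow range. It assumes
the region start keys are sorted and that keys compare like ASCII strings; in
a real job this logic would belong in your InputFormat's getSplits().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: keep only the regions that overlap a row range. */
class RowRangeSplits {

    /** One unit of map work: the half-open key range [start, end). */
    static final class Split {
        final String start;
        final String end; // "" means "up to the end of the table"
        Split(String start, String end) { this.start = start; this.end = end; }
        @Override public String toString() { return "[" + start + ", " + end + ")"; }
    }

    /**
     * @param regionStartKeys sorted region start keys; the first is "" (table start)
     * @param startRow first row to process (inclusive); "" means table start
     * @param endRow row to stop at (exclusive); "" means table end
     */
    static List<Split> splitsFor(List<String> regionStartKeys,
                                 String startRow, String endRow) {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            String regionStart = regionStartKeys.get(i);
            // A region ends where the next one starts; the last one is open-ended.
            String regionEnd = (i + 1 < regionStartKeys.size())
                    ? regionStartKeys.get(i + 1) : "";
            // Skip regions that end at or before startRow.
            if (!regionEnd.isEmpty() && regionEnd.compareTo(startRow) <= 0) continue;
            // Skip regions that begin at or after endRow.
            if (!endRow.isEmpty() && regionStart.compareTo(endRow) >= 0) continue;
            // Clip the region to the requested range.
            String s = regionStart.compareTo(startRow) > 0 ? regionStart : startRow;
            String e;
            if (regionEnd.isEmpty()) {
                e = endRow;
            } else if (endRow.isEmpty()) {
                e = regionEnd;
            } else {
                e = regionEnd.compareTo(endRow) < 0 ? regionEnd : endRow;
            }
            splits.add(new Split(s, e));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Four regions: ["", g), [g, n), [n, t), [t, end of table)
        List<String> regions = Arrays.asList("", "g", "n", "t");
        // Prints [[h, n), [n, p)]: the first and last regions are never scanned.
        System.out.println(splitsFor(regions, "h", "p"));
    }
}
```

Each Split would then go to one mapper, and the RecordReader for that split
(item 2) would open a scanner from the split's start key and stop at its end
key, so regions outside the range are never touched.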



On Wed, Mar 4, 2009 at 8:45 PM, Slava Gorelik <[email protected]> wrote:

> How can you tell it that? There is no interface in the MR job definition that
> allows it. Every sample MR job in HBase works like this (this is the map from
> RowCounter):
>
> public void map(ImmutableBytesWritable row, RowResult value,
>    OutputCollector<ImmutableBytesWritable, RowResult> output,
>    @SuppressWarnings("unused") Reporter reporter)
>  throws IOException {
>    boolean content = false;
>    for (Map.Entry<byte [], Cell> e: value.entrySet()) {
>      Cell cell = e.getValue();
>      if (cell != null && cell.getValue().length > 0) {
>        content = true;
>        break;
>      }
>    }
>    if (!content) {
>      return;
>    }
>    // ... (rest of the method omitted)
> }
>
> You can't say which rows you want to get.
>
> Best Regards.
> Slava.
>
>
> On Wed, Mar 4, 2009 at 1:31 PM, schubert zhang <[email protected]> wrote:
>
> > In my job, I can tell the MR job the startRow and endRow, i.e. a row
> > range. The MR job then scans only the region(s) in that range, rather
> > than scanning from the beginning of the table or tablet/region to the end.
> >
> > So, Slava, you should modify the code of your MR job to do what you want.
> >
> > Schubert
> >
> > On Wed, Mar 4, 2009 at 4:58 PM, Slava Gorelik <[email protected]> wrote:
> >
> > > Hi. I'm a little confused.
> > >
> > > Please correct me if I'm wrong, but the MR job itself is "scanning" all
> > > rows in the table. The job is spread across the region servers, in
> > > multiple threads. Each thread gets some part of the rows that are stored
> > > on a particular region server. So the MR job is finished when all
> > > threads have passed over all rows. Filtering only helps the MR job
> > > filter out non-relevant rows, but those rows will still be checked
> > > (passed to the filter). This does not help much; the job still passes
> > > over all rows in the table. Calling a scanner inside the MR job will not
> > > prevent the job from passing over all rows; it will simply make the job
> > > heavier (as I understand it). Is that correct, Michael?
> > >
> > > So my question is: how can I tell the MR job to pass over only some rows
> > > and not all rows?
> > >
> > > Thank You and Best Regards.
> > > Slava.
> > >
> > >
> > > On Wed, Mar 4, 2009 at 8:57 AM, stack <[email protected]> wrote:
> > >
> > > > On Tue, Mar 3, 2009 at 6:17 PM, schubert zhang <[email protected]> wrote:
> > > >
> > > > > Yes, we can tell the HBase API to scan only rows starting with a key.
> > > > >
> > > >
> > > > Would the filtering feature help here?
> > > >
> > > >
> > > > http://hadoop.apache.org/hbase/docs/r0.19.0/api/org/apache/hadoop/hbase/filter/package-summary.html#package_description
> > > >
> > > > Scanners can be passed a filter (read the description section at the
> > > > above URL).
> > > >
> > > >
> > > > > Can any expert share ideas about:
> > > > > 1. If the rowkey is not chronological, how can I process only the
> > > > > newly added/updated rows?
> > > >
> > > >
> > > > We don't have a means of asking for versions before a timestamp, only
> > > > older. (Can you add a timestamp to your row key if you need this?)
> > > >
> > > >
> > > > > 2. How can I remove the old rows that were inserted three months
> > > > > ago?
> > > > >
> > > >
> > > > See above.
> > > >
> > > > St.Ack
> > > >
> > >
> >
>
