Re: MR Job question

Slava Gorelik Wed, 04 Mar 2009 07:37:25 -0800

Hi. I understand that I need to write code, but I don't have any direction
on how to do what I need. Do you have an example of creating an MR job that
passes over only a subset of rows?


Thank You and Best Regards.


On Wed, Mar 4, 2009 at 5:27 PM, schubert zhang <[email protected]> wrote:

> Hi Slava, I mean you should write it yourself; the MapReduce code in HBase
> is just an example. Please study how to code a MapReduce job.
> You should implement the following yourself:
> 1. how to split the input dataset (InputSplit)
> 2. how to read each record of each split in each mapper (RecordReader)
> 3. your own InputFormat
> 4. the mapper and reducer classes
> 5. how to write each output record (RecordWriter)
> 6. your own OutputFormat
> ........
>
>
> On Wed, Mar 4, 2009 at 8:45 PM, Slava Gorelik <[email protected]> wrote:
>
> > How can you tell the job that? There is no interface in the MR job
> > definition that allows it. Every sample MR job in HBase works like this
> > (this is the map method from RowCounter):
> >
> > public void map(ImmutableBytesWritable row, RowResult value,
> >    OutputCollector<ImmutableBytesWritable, RowResult> output,
> >    @SuppressWarnings("unused") Reporter reporter)
> >  throws IOException {
> >    boolean content = false;
> >    for (Map.Entry<byte [], Cell> e: value.entrySet()) {
> >      Cell cell = e.getValue();
> >      if (cell != null && cell.getValue().length > 0) {
> >        content = true;
> >        break;
> >      }
> >    }
> >    if (!content) {
> >      return;
> >    }
> >
> > You can't say which rows you want to get.
> >
> > Best Regards.
> > Slava.
> >
> >
> > On Wed, Mar 4, 2009 at 1:31 PM, schubert zhang <[email protected]> wrote:
> >
> > > In my job, I can give the MR job a startRow and an endRow, i.e. a row
> > > range. My MR job then scans only the region(s) in that range, and does
> > > not scan from the beginning of the table or tablet/region to the end.
> > >
> > > So, Slava, you should modify the code of your MR job to do what you want.
> > >
> > > Schubert
> > >
> > > On Wed, Mar 4, 2009 at 4:58 PM, Slava Gorelik <[email protected]> wrote:
> > >
> > > > Hi. I'm a little confused.
> > > >
> > > > Please correct me if I'm wrong, but the MR job itself is "scanning" all
> > > > rows in the table. The job is spread across the region servers, in
> > > > multiple threads. Each thread gets some of the rows that are stored on
> > > > a particular region server. So the MR job is finished when all threads
> > > > have passed over all rows. Filtering only helps the MR job filter out
> > > > non-relevant rows; those rows are still checked (passed to the filter),
> > > > so it doesn't help much: the job still passes over all rows in the
> > > > table. Calling a scanner inside the MR job will not prevent the job
> > > > from passing over all rows; it will simply make the job heavier (as I
> > > > understand it). Is that correct, Michael?
> > > >
> > > > So my question is: how can I tell the MR job to pass over only some
> > > > rows and not all rows?
> > > >
> > > > Thank You and Best Regards.
> > > > Slava.
> > > >
> > > >
> > > > On Wed, Mar 4, 2009 at 8:57 AM, stack <[email protected]> wrote:
> > > >
> > > > > On Tue, Mar 3, 2009 at 6:17 PM, schubert zhang <[email protected]> wrote:
> > > > >
> > > > > > Yes, we can tell the HBase API to scan only rows starting with a key.
> > > > > >
> > > > >
> > > > > Would the filtering feature help here?
> > > > >
> > > > >
> > > > >
> > > > > http://hadoop.apache.org/hbase/docs/r0.19.0/api/org/apache/hadoop/hbase/filter/package-summary.html#package_description
> > > > >
> > > > > Scanners can be passed a filter (read the description section at the
> > > > > above URL).
> > > > >
> > > > >
> > > > > > Can any expert share ideas about:
> > > > > > 1. If the row key is not chronological, how can I process only the
> > > > > > newly added/updated rows?
> > > > >
> > > > >
> > > > > We don't have a means of asking for versions after a timestamp, only
> > > > > older. (Can you add a timestamp to your row key if you need this?)
> > > > >
> > > > >
> > > > > > 2. How can I remove old rows that were inserted three months ago?
> > > > > >
> > > > >
> > > > > See above.
> > > > >
> > > > > St.Ack
> > > > >
> > > >
> > >
> >
>
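
The row-range idea discussed in the thread (give the job a startRow and an
endRow, then scan only the regions that overlap that range) can be sketched
in plain Java. This is only an illustration under stated assumptions, not
code from HBase: the class RowRangeSplits, its Range type, and the method
splitsFor are hypothetical names; row keys are modelled as Strings rather
than byte arrays; and an empty string is assumed to mean "unbounded" on that
side of a range. A custom InputFormat could use logic of this shape when
deciding which InputSplits to hand to the mappers.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: plain Java, no HBase classes. All names are hypothetical. */
final class RowRangeSplits {

    /** A half-open key range [start, end). "" means unbounded on that side (assumption). */
    static final class Range {
        final String start;
        final String end;

        Range(String start, String end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /** Larger of two start keys. "" sorts before every other key, so compareTo suffices. */
    static String maxStart(String a, String b) {
        return a.compareTo(b) >= 0 ? a : b;
    }

    /** Smaller of two end keys, treating "" as "no upper bound". */
    static String minEnd(String a, String b) {
        if (a.isEmpty()) {
            return b;
        }
        if (b.isEmpty()) {
            return a;
        }
        return a.compareTo(b) <= 0 ? a : b;
    }

    /**
     * Returns one split per region that overlaps [startRow, endRow),
     * clipped to the requested range. Regions outside the range are skipped,
     * so no mapper is ever created for them.
     */
    static List<Range> splitsFor(List<Range> regions, String startRow, String endRow) {
        List<Range> splits = new ArrayList<>();
        for (Range region : regions) {
            String s = maxStart(region.start, startRow);
            String e = minEnd(region.end, endRow);
            // Keep the split only if the clipped range is non-empty.
            if (e.isEmpty() || s.compareTo(e) < 0) {
                splits.add(new Range(s, e));
            }
        }
        return splits;
    }
}
```

With three regions ["", "g"), ["g", "p") and ["p", ""), a request for rows
"h" up to "k" yields a single clipped split over the middle region, so the
other two regions are never read. This differs from a filter, which still
has every row passed to it.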
