Hi. I understand that I need to write code, but I don't have any direction on how to do what I need. Do you have an example of creating an MR job that passes over only a subset of rows?
Thank You and Best Regards.

On Wed, Mar 4, 2009 at 5:27 PM, schubert zhang <[email protected]> wrote:

> Hi Slava, I mean you should write it yourself; the mapreduce code in HBase
> is just an example. Please study how to code a mapreduce job.
> You should implement yourself:
> 1. how to split the input dataset: InputSplit
> 2. how to read each record of each split in each mapper: RecordReader
> 3. your own InputFormat
> 4. mapper and reducer classes
> 5. how to write each output record: RecordWriter
> 6. your own OutputFormat
> ........
>
> On Wed, Mar 4, 2009 at 8:45 PM, Slava Gorelik <[email protected]> wrote:
>
> > How can you tell it that? There is no interface in the MR job definition
> > that allows that. Every sample MR job in HBase works like this (this is
> > the map from RowCounter):
> >
> >   public void map(ImmutableBytesWritable row, RowResult value,
> >       OutputCollector<ImmutableBytesWritable, RowResult> output,
> >       @SuppressWarnings("unused") Reporter reporter)
> >   throws IOException {
> >     boolean content = false;
> >     for (Map.Entry<byte [], Cell> e: value.entrySet()) {
> >       Cell cell = e.getValue();
> >       if (cell != null && cell.getValue().length > 0) {
> >         content = true;
> >         break;
> >       }
> >     }
> >     if (!content) {
> >       return;
> >     }
> >
> > You can't say which rows you want to get.
> >
> > Best Regards.
> > Slava.
> >
> > On Wed, Mar 4, 2009 at 1:31 PM, schubert zhang <[email protected]> wrote:
> >
> > > In my job, I can tell the MR job the startRow and endRow, i.e. a row
> > > range. Then my MR job scans only the region(s) in the range, and does
> > > not scan from the beginning of the table or tablet/region to the end.
> > >
> > > So, Slava, you should modify the code of your MR job to do what you want.
> > >
> > > Schubert
> > >
> > > On Wed, Mar 4, 2009 at 4:58 PM, Slava Gorelik <[email protected]> wrote:
> > >
> > > > Hi. I'm a little bit confused.
> > > >
> > > > Please correct me if I'm wrong, but the MR job itself is "scanning"
> > > > all rows in the table. The job is spread across the region servers,
> > > > in multiple threads. Each thread gets some part of the rows that are
> > > > placed on a particular region server. So the MR job is finished when
> > > > all threads have passed over all rows. Filtering will only help the
> > > > MR job filter out non-relevant rows, but those rows will still be
> > > > checked (passed to the filter). This does not help a lot; the job is
> > > > still passing over all rows in the table. Calling a scanner inside
> > > > the MR job will not prevent the job from passing over all rows; it
> > > > will simply make the job heavier (as I understand it). Is that
> > > > correct, Michael?
> > > >
> > > > So, my question is: how can I tell the MR job to pass over some rows
> > > > and not all rows?
> > > >
> > > > Thank You and Best Regards.
> > > > Slava.
> > > >
> > > > On Wed, Mar 4, 2009 at 8:57 AM, stack <[email protected]> wrote:
> > > >
> > > > > On Tue, Mar 3, 2009 at 6:17 PM, schubert zhang <[email protected]> wrote:
> > > > >
> > > > > > Yes, we can tell the HBase API to scan only rows that start with a key.
> > > > >
> > > > > Would the filtering feature help here?
> > > > >
> > > > > http://hadoop.apache.org/hbase/docs/r0.19.0/api/org/apache/hadoop/hbase/filter/package-summary.html#package_description
> > > > >
> > > > > Scanners can be passed a filter (read the description section at the
> > > > > above URL).
> > > > >
> > > > > > Can any expert share your ideas about:
> > > > > > 1. If the rowkey is not chronological, how can I process only the
> > > > > > newly added/updated rows?
> > > > >
> > > > > We don't have a means of asking for versions before a timestamp, only
> > > > > older (Can you add timestamp to your row key if you need this?)
> > > > >
> > > > > > 2. How can I remove the old rows which were inserted three months ago?
> > > > >
> > > > > See above.
> > > > >
> > > > > St.Ack
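
For readers following the startRow/endRow suggestion in the thread, here is a minimal sketch of the splitting step only: given the sorted start keys of a table's regions and a requested row range, it works out which regions overlap the range and clips each piece to it, so that each mapper would scan only its own sub-range. It is plain Java with no HBase dependency and is not the HBase API. The class `RowRangeSplits`, the `Split` type and the method `splitsFor` are hypothetical names made up for illustration, and row keys are modelled as Strings for readability, whereas HBase compares raw byte[] keys. A real job would wrap this logic in its own InputFormat and RecordReader, as listed in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class, not part of HBase or Hadoop.
final class RowRangeSplits {

    /** One unit of mapper work: scan rows in [start, end). An empty end means "to the end of the table". */
    static final class Split {
        public final String start;
        public final String end;

        public Split(String start, String end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /**
     * @param regionStartKeys sorted region start keys; the first is "" (start of table)
     * @param startRow        first row to process, inclusive
     * @param endRow          row to stop at, exclusive; "" means no upper bound
     */
    static List<Split> splitsFor(List<String> regionStartKeys, String startRow, String endRow) {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            String regionStart = regionStartKeys.get(i);
            // A region ends where the next one starts; the last region is unbounded.
            String regionEnd = (i + 1 < regionStartKeys.size()) ? regionStartKeys.get(i + 1) : "";

            // Skip regions that lie entirely before the requested range.
            if (!regionEnd.isEmpty() && regionEnd.compareTo(startRow) <= 0) {
                continue;
            }
            // Skip regions that lie entirely after the requested range.
            if (!endRow.isEmpty() && regionStart.compareTo(endRow) >= 0) {
                continue;
            }

            // Clip the region to the requested range.
            String from = regionStart.compareTo(startRow) > 0 ? regionStart : startRow;
            String to;
            if (regionEnd.isEmpty()) {
                to = endRow;
            } else if (endRow.isEmpty()) {
                to = regionEnd;
            } else {
                to = regionEnd.compareTo(endRow) < 0 ? regionEnd : endRow;
            }
            splits.add(new Split(from, to));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Four made-up regions: ["", row300), [row300, row600), [row600, row900), [row900, "").
        List<String> regions = Arrays.asList("", "row300", "row600", "row900");
        for (Split s : splitsFor(regions, "row250", "row700")) {
            System.out.println(s);
        }
    }
}
```

With the made-up keys in `main`, the range row250..row700 touches only the first three regions, so three splits are produced and the last region is never scanned. That is the behaviour described in the thread: the job covers only the region(s) in the range rather than the whole table.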
