It was committed late last night, so it's fixed in TRUNK. Another big issue got fixed as well, so there is a good chance we'll see a release candidate 2 soon.
Pavel, FYI, doing a row count is really non-trivial in HBase. A scan over all
rows may take more than an hour because it is not distributed (it reads one
row after the other), so MapReduce is well suited for that.

J-D

On Fri, Aug 1, 2008 at 9:40 AM, Yair Even-Zohar <[EMAIL PROTECTED]> wrote:
> Actually, there is a RowCounter under the mapred package. There is a bug
> in the 0.2.0 release candidate, but this was fixed yesterday. You may
> want to check the new one (see
> https://issues.apache.org/jira/browse/HBASE-791).
>
> I would have done so, but I probably have a bigger HDFS problem on my
> cluster :-)
>
> Thanks
> -Yair
>
> -----Original Message-----
> From: Pavel [mailto:[EMAIL PROTECTED]]
> Sent: Friday, August 01, 2008 8:37 AM
> To: [email protected]
> Subject: Re: help with reduce phase understanding
>
> Thank you a lot for your answer, Jean-Daniel. I think now I understand
> how that scenario works.
>
> I have another scenario (probably not doable with MapReduce though): I
> need to get the total row count for the whole table. I think I could use
> Reporter to increment a counter in the map phase, but how can I get the
> counter value saved into the 'results' table after all? Can you please
> advise how I can achieve that? Also, what is the preferred way to get a
> table's row count?
>
> Thank you for your help!
> Pavel
>
> 2008/8/1 Jean-Daniel Cryans <[EMAIL PROTECTED]>
>
> > Pavel,
> >
> > Since each map processes only one region, a row is stored in only one
> > region, and all intermediate values for a given key go to a single
> > reducer, there will be no stale data in this situation.
> >
> > J-D
> >
> > On Wed, Jul 30, 2008 at 10:09 AM, Pavel <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > I feel I lack understanding of the MapReduce approach and would like
> > > to ask some questions (mainly about its reduce part).
> > > Below is a reduce job that gets the count of values for a given row
> > > key and inserts the resulting value into another table using the
> > > same row key.
> > >
> > > What makes me doubt is that I cannot figure out how that code would
> > > work if several reducers are running. Is it possible that they will
> > > process values for the same row key and, as a consequence, write
> > > stale data into the table? Say reducerA has counted a total of 5
> > > messages while reducerB counted 3 messages; would that all end up
> > > with a value of 8 in the resulting table?
> > >
> > > Thank you.
> > > Pavel
> > >
> > > public class MessagesTableReduce extends TableReduce<Text, LongWritable> {
> > >
> > >     public void reduce(Text key, Iterator<LongWritable> values,
> > >             OutputCollector<Text, MapWritable> output, Reporter reporter)
> > >             throws IOException {
> > >
> > >         System.out.println("REDUCE: processing messages for author: "
> > >                 + key.toString());
> > >
> > >         int total = 0;
> > >         while (values.hasNext()) {
> > >             values.next();
> > >             total++;
> > >         }
> > >
> > >         MapWritable map = new MapWritable();
> > >         map.put(new Text("messages:sent"),
> > >                 new ImmutableBytesWritable(String.valueOf(total).getBytes()));
> > >         output.collect(key, map);
> > >     }
> > > }
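[Editor's note] The "no stale data" answer above rests on how Hadoop routes intermediate keys to reducers. Below is a minimal standalone sketch (no Hadoop dependency; class and key names are hypothetical) of the partition formula used by Hadoop's default HashPartitioner, illustrating why every value for a given row key reaches exactly one reducer regardless of which mapper emitted it:

```java
// Standalone sketch of Hadoop's default hash partitioning. No Hadoop
// dependency; the formula mirrors HashPartitioner:
// (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
public class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner.getPartition().
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        String key = "author-42"; // hypothetical row key

        // Two different map tasks emitting the same key compute the
        // same partition, so one reducer sees all values for that key.
        int fromMapperA = partitionFor(key, reducers);
        int fromMapperB = partitionFor(key, reducers);
        if (fromMapperA != fromMapperB) {
            throw new AssertionError("same key routed to different reducers");
        }
        System.out.println(key + " -> reducer " + fromMapperA);
    }
}
```

Because the partition depends only on the key, the scenario Pavel describes (reducerA counting 5 and reducerB counting 3 for the same key) cannot occur under the default partitioner.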

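[Editor's note] The reason a MapReduce row count beats a plain scan, as J-D explains, is that each map task counts one region in parallel and a reduce step sums the partial counts. The sketch below simulates that idea with in-memory lists (all names are hypothetical; this is not the actual RowCounter code):

```java
import java.util.Arrays;
import java.util.List;

// Standalone simulation of the idea behind a MapReduce row count:
// each "map" counts the rows of one region (these run in parallel on a
// real cluster), and a "reduce" sums the partial counts.
public class RowCountSketch {

    // Map phase: count the rows of a single region.
    static long countRegion(List<String> regionRows) {
        return regionRows.size();
    }

    // Reduce phase: sum the partial counts emitted by the mappers.
    static long sumCounts(long[] partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two hypothetical regions of the same table.
        List<String> region1 = Arrays.asList("row1", "row2", "row3");
        List<String> region2 = Arrays.asList("row4", "row5");

        long[] partials = { countRegion(region1), countRegion(region2) };
        System.out.println("total rows: " + sumCounts(partials)); // prints "total rows: 5"
    }
}
```

A sequential scan touches every row one after the other, while this scheme does one pass per region concurrently, which is why RowCounter in the mapred package is the preferred approach for large tables.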