Re: MR job for creating splits

Something Something Sat, 12 May 2012 23:19:47 -0700

Is there no way to find out inside a single reducer how many records were
created by all the Mappers?  I tried several ways but nothing works.  For
example, I tried this:


reporter.getCounter(Task.Counter.REDUCE_INPUT_RECORDS).getValue();

It's not working for me.  Should this have worked?  Am I just doing
something dumb?  I would rather not create another MR job just to count #
of lines.
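
One way to sidestep the global count, sketched here under assumptions: if x (rows per region) is fixed up front instead of being derived from the total, the single reducer only needs a local running count. The sketch below is plain Java with no Hadoop dependency. `FixedCountSplitter`, `splitKeys`, and `rowsPerRegion` are hypothetical names, and the input is assumed to arrive already sorted, as it would at a single reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FixedCountSplitter {

    // Returns the first key of every block of rowsPerRegion keys.
    // Assumption: sortedKeys is already in the order a single reducer would see.
    static List<String> splitKeys(Iterable<String> sortedKeys, long rowsPerRegion) {
        if (rowsPerRegion <= 0) {
            throw new IllegalArgumentException("rowsPerRegion must be positive");
        }
        List<String> startKeys = new ArrayList<>();
        long seen = 0;  // local running count; no cross-task counter needed
        for (String key : sortedKeys) {
            if (seen % rowsPerRegion == 0) {
                startKeys.add(key);  // this key would be used as a 'startkey'
            }
            seen++;
        }
        return startKeys;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e", "f", "g");
        System.out.println(splitKeys(keys, 3));  // prints [a, d, g]
    }
}
```

The trade-off is that the number of regions is no longer fixed at 300; it falls out of the row count and the chosen x.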


On Sat, May 12, 2012 at 7:07 PM, Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:

> I did a very similar approach and it worked fine for me.  Just spot check
> the regions after to make sure they look lexicographically sorted.  I used
> ImmutableBytesWritable as my key, and the default hadoop sorting for that
> turned out to sort lexicographically as required.  Our hbase rows varied in
> size, so instead of doing a count of the number of rows, we tallied up the
> KeyValue.getLength() for each KeyValue in a row until the size reached a
> certain limit.
>
> On Sat, May 12, 2012 at 7:21 PM, Something Something <
> mailinglist...@gmail.com> wrote:
>
> > Hello,
> >
> > This is really a MapReduce question, but the output from this will be used
> > to create regions for an HBase table.  Here's what I want to do:
> >
> > Take an input file that contains data about users.
> > Sort this file by a key (which consists of a few fields from the row)
> > After every x # of rows write the key.
> >
> >
> > Here's how I was going to structure my MapReduce:
> >
> > public Splitter {
> >
> >    static int counter;
> >
> >    private Mapper {
> >        map() {
> >            Build key by concatenating fields
> >            Write key
> >            increment counter;
> >        }
> >    }
> >
> >    //  # of reducers will be set to 1.  My understanding is that this will
> >    //  send the lines to reducer in sorted order one at a time - is this a
> >    //  correct assumption?
> >    private Reducer {
> >         static long i;
> >         reduce() {
> >             static long splitSize = counter / 300;  //  300 is region size
> >             if (i == 0 || i == splitSize) {
> >                 Write key;  // this will be used as a 'startkey'.
> >                  i = 0;
> >             }
> >             i++;
> >         }
> >    }
> > }
> >
> > To summarize, there are 2 questions:
> >
> > 1)  I am passing # of rows processed by Mapper to Reducer via a static
> > counter.  Would this work?  Is there a better way?
> > 2)  If I set # of reducers to 1, would the lines be sent to reducer in
> > sorted order one at a time?
> >
> > Thanks in advance for the help.
> >
>
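
The size-based variant Bryan describes in his reply (tally bytes per row and start a new region once a limit is reached) can be sketched the same way. This is a minimal illustration in plain Java, not his actual code: `SizeBasedSplitter`, `splitKeys`, and `maxBytesPerRegion` are hypothetical names, and the per-row sizes stand in for the summed KeyValue lengths of each row. The map is assumed to iterate in sorted key order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SizeBasedSplitter {

    // Emits a start key for the first row, then again each time the bytes
    // accumulated in the current region have reached maxBytesPerRegion.
    // Assumption: sortedRowSizes iterates in sorted key order.
    static List<String> splitKeys(Map<String, Integer> sortedRowSizes,
                                  long maxBytesPerRegion) {
        List<String> startKeys = new ArrayList<>();
        long bytesInRegion = 0;
        boolean first = true;
        for (Map.Entry<String, Integer> row : sortedRowSizes.entrySet()) {
            if (first || bytesInRegion >= maxBytesPerRegion) {
                startKeys.add(row.getKey());  // this row opens a new region
                bytesInRegion = 0;
                first = false;
            }
            bytesInRegion += row.getValue();
        }
        return startKeys;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, standing in for sorted reducer input.
        Map<String, Integer> rows = new LinkedHashMap<>();
        rows.put("row1", 40);
        rows.put("row2", 70);
        rows.put("row3", 10);
        rows.put("row4", 120);
        rows.put("row5", 5);
        System.out.println(splitKeys(rows, 100));  // prints [row1, row3, row5]
    }
}
```

Like the fixed row count, this needs no knowledge of the total number of records, only a running tally inside the one reducer.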
