Re: Hadoop Streaming

Moritz Krog Wed, 14 Jul 2010 02:20:46 -0700

Does that Perl script also work when I use multiple reducer tasks?

Anyway, this isn't really what I was looking for, because I intended to use
my own reducer. On top of that, I also need to run the intermediate data
through the reducer more than once. I was just hoping there was some way to
make streaming output the intermediate data as k -> list(v).
I could of course work in iterations, using the Perl reducer in the first
iteration and feeding its results into later iterations... but that sounds
like a lot of unnecessary work.
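
For reference, the k -> list(v) grouping can also be rebuilt inside a custom
Python reducer. Below is a minimal sketch, assuming the reducer receives
tab-separated key/value lines on stdin, already sorted by key (as in the
/bin/cat output quoted further down). The names parse and group_by_key are
made up for illustration and are not part of Hadoop.

```python
#!/usr/bin/env python
import sys
from itertools import groupby


def parse(lines, sep="\t"):
    """Yield (key, value) pairs from separator-delimited lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first separator only; the value may contain more.
        key, _, value = line.partition(sep)
        yield key, value


def group_by_key(lines, sep="\t"):
    """Yield (key, [values]) for each run of consecutive identical keys.

    This only works if the input is sorted by key, so that all lines
    for one key are adjacent.
    """
    for key, pairs in groupby(parse(lines, sep), key=lambda kv: kv[0]):
        yield key, [value for _, value in pairs]


if __name__ == "__main__":
    for key, values in group_by_key(sys.stdin):
        # values is a plain list, so it can be traversed more than once
        # here before anything is written out.
        print(key + "\t" + ",".join(values))
```

Because all records for a given key are sent to the same reduce task, this
approach should behave the same way with several reducers. The list for one
key is held in memory, which may matter for keys with very many values.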


On Wed, Jul 14, 2010 at 10:51 AM, Alex Kozlov <[email protected]> wrote:

> You can use the following Perl script as a reducer:
>
> ===
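> #!/usr/bin/perl
> use strict;
> use warnings;
>
> $, = "\t";   # output field separator: key <TAB> values
> $\ = "\n";   # output record separator: end every printed record
>
> my ($lastkey, @values);
>
> while (<>) {
>    chomp;    # drop the newline so it does not end up inside the joined list
>    my ($key, $value) = split(/\t/, $_, 2);
>    $value = "" unless defined($value);   # line without a tab
>    if (defined($lastkey) and $lastkey eq $key) {
>      push @values, $value;
>    } else {
>      # key changed: emit the previous key with its collected values
>      print $lastkey, join(",", @values) if defined($lastkey);
>      $lastkey = $key;
>      @values = ($value);
>    }
> }
>
> # emit the last group
> print $lastkey, join(",", @values) if defined($lastkey);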
> ===
>
> Alex K
>
>
> On Wed, Jul 14, 2010 at 1:17 AM, Moritz Krog <[email protected]> wrote:
>
> > First of all, thanks for the quick answer :)
> >
> > Is there any way to configure the job so that I get the key -> value
> > list? I specifically need exactly this behavior; it's crucial to what
> > I want to do with Hadoop.
> >
> >
> > On Wed, Jul 14, 2010 at 10:06 AM, Amareshwari Sri Ramadasu <
> > [email protected]> wrote:
> >
> > > In streaming, the combined values are given to the reducer as
> > > <key, value> pairs again, so you don't see a key with its list of values.
> > > I think it is done that way to be symmetrical with the mapper, though I
> > > don't know the exact reason.
> > >
> > > Thanks
> > > Amareshwari
> > >
> > > On 7/14/10 1:05 PM, "Moritz Krog" <[email protected]> wrote:
> > >
> > > Hi everyone,
> > >
> > > I'm pretty new to Hadoop and generally avoid Java wherever I can, so
> > > I'm getting started with Hadoop streaming and a Python mapper and
> > > reducer. From what I read in the MapReduce tutorial, mapper and reducer
> > > can be plugged into Hadoop via the "-mapper" and "-reducer" options on
> > > job start. I was wondering what the input for the reducer would look
> > > like, so I ran a Hadoop job using my own mapper but /bin/cat as reducer.
> > > As you can see, the output of the job is ordered, but the keys haven't
> > > been combined:
> > >
> > > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type': 'person'}   107488
> > > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type': 'person'}   95560
> > > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type': 'person'}   95562
> > >
> > > I would have expected something like:
> > >
> > > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type': 'person'}   95560, 95562, 107488
> > >
> > > My understanding from the tutorial was that this reduction is part of
> > > the shuffle and sort phase. Or do I need to use a combiner to get that
> > > done? Does Hadoop streaming even do this, or do I need to use a native
> > > Java class?
> > >
> > > Best,
> > > Moritz
> > >
> > >
> >
>
