Does that Perl script also work when I use multiple reducer tasks? In any case, it isn't really what I was looking for, because I intend to use my own reducer. On top of that, I need to run the intermediate data through the reducer more than once. I was hoping there was some way to make streaming output the intermediate data as k -> list(v). I could of course work in iterations, using the Perl reducer in the first iteration and feeding its results into the later ones, but that sounds like a lot of unnecessary work.
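
To make it concrete, this is roughly the k -> list(v) grouping I am after, written as a step inside a Python reducer. It is only a minimal sketch: it assumes streaming delivers tab-separated key/value lines that are already sorted by key (as the /bin/cat output suggests), and the names parse, group_by_key and main are placeholders made up for illustration.

```python
#!/usr/bin/env python
import sys
from itertools import groupby


def parse(lines, sep="\t"):
    """Yield (key, value) pairs from separator-delimited lines."""
    for line in lines:
        # Split on the first separator only; a line without one gets an empty value.
        key, _, value = line.rstrip("\n").partition(sep)
        yield key, value


def group_by_key(lines, sep="\t"):
    """Yield (key, [values]) for each run of consecutive lines with the same key.

    Assumption: the input is sorted by key, so equal keys are adjacent.
    """
    for key, pairs in groupby(parse(lines, sep), key=lambda kv: kv[0]):
        yield key, [value for _, value in pairs]


def main(stdin=sys.stdin, stdout=sys.stdout):
    for key, values in group_by_key(stdin):
        # The custom reduce logic would go here; this only writes k -> list(v).
        stdout.write("%s\t%s\n" % (key, ", ".join(values)))
```

In an actual reducer script, main() would be called at the bottom of the file, and the real reduce logic would replace the line that writes the joined values.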

On Wed, Jul 14, 2010 at 10:51 AM, Alex Kozlov <[email protected]> wrote:

> You can use the following Perl script as a reducer:
>
> ===
> #!/usr/bin/perl
>
> $, = "\t";  # output field separator: tab between key and value list
> $\ = "\n";  # output record separator: one key per output line
>
> # Input arrives sorted by key, so equal keys are on consecutive lines.
> while (<>) {
>   chomp;  # drop the newline so it does not end up inside the value list
>   my ($key, $value) = split(/\t/, $_, 2);
>   if (defined($lastkey) and $lastkey eq $key) {
>     push @values, $value;
>   } else {
>     # Key changed: emit the previous key with its collected values.
>     print $lastkey, join(",", @values) if defined($lastkey);
>     $lastkey = $key;
>     @values = ($value);
>   }
> }
>
> # Emit the final key.
> print $lastkey, join(",", @values) if defined($lastkey) and @values > 0;
> ===
>
> Alex K
>
>
> On Wed, Jul 14, 2010 at 1:17 AM, Moritz Krog <[email protected]> wrote:
>
> > First of all, thanks for the quick answer :)
> >
> > Is there any way to configure the job so that I get the key -> value
> > list? I specifically need exactly this behavior; it's crucial to what
> > I want to do with Hadoop.
> >
> >
> > On Wed, Jul 14, 2010 at 10:06 AM, Amareshwari Sri Ramadasu <[email protected]> wrote:
> >
> > > In streaming, the combined values are given to the reducer as
> > > <key, value> pairs again, so you don't see a key and a list of values.
> > > I think it is done that way to be symmetrical with the mapper, though
> > > I don't know the exact reason.
> > >
> > > Thanks
> > > Amareshwari
> > >
> > > On 7/14/10 1:05 PM, "Moritz Krog" <[email protected]> wrote:
> > >
> > > Hi everyone,
> > >
> > > I'm pretty new to Hadoop and generally avoid Java wherever I can, so
> > > I'm getting started with Hadoop streaming and a Python mapper and
> > > reducer. From what I read in the MapReduce tutorial, the mapper and
> > > reducer can be plugged into Hadoop via the "-mapper" and "-reducer"
> > > options on job start. I was wondering what the input for the reducer
> > > would look like, so I ran a Hadoop job using my own mapper but
> > > /bin/cat as the reducer. As you can see, the output of the job is
> > > ordered, but the keys haven't been combined:
> > >
> > > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type': 'person'} 107488
> > > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type': 'person'} 95560
> > > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type': 'person'} 95562
> > >
> > > I would have expected something like:
> > >
> > > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type': 'person'} 95560, 95562, 107488
> > >
> > > My understanding from the tutorial was that this reduction is part of
> > > the shuffle and sort phase. Or do I need to use a combiner to get that
> > > done? Does Hadoop streaming even do this, or do I need to use a native
> > > Java class?
> > >
> > > Best,
> > > Moritz
