Greetings all,

For some time now I've been pondering how to conveniently take advantage of 
multi-core computers, especially for many of our Perl programs. For a while I 
thought about trying to do *something* with MPI, but the overhead of wrapping 
Perl up in MPI was just kind of scary so that never went anywhere.

More recently I started using Hadoop, which turns out to support Perl 
surprisingly well. I say surprisingly because Hadoop is written in Java, most of 
the examples you see for it are in Java, and I think until recently you could 
only run Java programs on Hadoop. 

For example, one of the standard examples you see with Hadoop is a Java program 
that counts words in text. While the code works, it is (in my view) rather 
ghastly stuff. 

http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0

Thinking that I had to write code like WordCount.java  in order to use Hadoop 
really put me off the idea almost completely.

But it turns out that more recent versions of Hadoop support streaming: 
http://wiki.apache.org/hadoop/HadoopStreaming

That is, any programming language that reads from standard input and writes to 
standard output can be used with Hadoop. Of course that includes Perl, along 
with quite a few others. 

The trick with Hadoop appears to be figuring out what should be your "mapper" 
and what should be your "reducer". The mapper generates data that the reducer 
then operates on (i.e., performs a reduction operation upon). Hadoop is an open 
source implementation of Google's MapReduce framework, so that's where this 
terminology comes from. 
http://labs.google.com/papers/mapreduce.html

Now, it's fairly easy to write a simple wordcount program in Perl. Here's one 
that is broken into two parts (which turn out to be a mapper and a reducer).

Here's my mapper (called wc.pl)

#!/usr/bin/perl -w

# read in a file of text
# print it out word by word

while (<>) {
        foreach my $word (split) {
                print "$word\n";
        }
}
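
As far as I can tell from the streaming documentation, Hadoop treats everything 
up to the first tab on a line as the key and the rest as the value, and a line 
with no tab at all is simply a key with an empty value. That's why printing 
bare words like this works. If you wanted to follow the more usual key/value 
convention, the mapper could emit a count of 1 with each word, something like 
the sketch below (the reducer would then need to split that count back off):

#!/usr/bin/perl -w

# emit each word as a key with a count of 1 as the value,
# separated by a tab (the usual convention for streaming jobs)

while (<>) {
        foreach my $word (split) {
                print "$word\t1\n";
        }
}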


And here's my reducer (called c.pl)

#!/usr/bin/perl -w

# read in a stream of words, one per line
# store them in a hash and accumulate counts

while (<>) {
       chomp;
       $seen{$_}++;
}

foreach $key (keys %seen) {
       print "$seen{$key} $key\n";
}

Those of you familiar with Perl will see that wc.pl just takes an input stream 
and outputs it one string per line. c.pl takes in a stream of input (one string 
per line) and counts up the number of times each unique string occurs. 
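
One other thing worth knowing is that Hadoop sorts the mapper output by key 
before it reaches the reducer, so a reducer sees all the copies of a given word 
grouped together. That means a reducer doesn't really have to hold every word 
in memory the way c.pl does. A sketch of a version that just counts runs of 
identical lines (and so assumes sorted input, which Hadoop provides but a plain 
pipe does not) might look like:

#!/usr/bin/perl -w

# count runs of identical lines, relying on the input
# already being sorted by key (as Hadoop streaming arranges)

my $prev;
my $count = 0;

while (<>) {
        chomp;
        if (defined $prev && $_ ne $prev) {
                print "$count $prev\n";
                $count = 0;
        }
        $prev = $_;
        $count++;
}

print "$count $prev\n" if defined $prev;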

You can run this from the command line as ...

perl wc.pl input.txt | perl c.pl

It's this pipe construction that turns out to be the key to constructing Perl 
programs that run in Hadoop. The program to the left of the pipe is the mapper, 
and the program to the right is the reducer. 
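
Incidentally, to mirror at the command line what Hadoop actually does (including 
the sort between the two stages mentioned above), you can run:

perl wc.pl input.txt | sort | perl c.pl

c.pl doesn't need the sort (the hash takes care of grouping), but it's a handy 
way to sanity check a mapper/reducer pair that does depend on sorted input.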

Then, once you've got that set up you can invoke Hadoop as follows...

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+169.68-streaming.jar \
-input input -output output \
-mapper /home/cs/tpederse/wc.pl \
-reducer /home/cs/tpederse/c.pl

This will spawn multiple instances of wc.pl and c.pl, and also take care of 
dividing your data up amongst them. It's all pretty transparent, and it really 
does seem to run things in parallel. 
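
One caveat: the -mapper and -reducer paths above have to exist wherever the 
tasks actually run, which is fine on a single machine but not necessarily on a 
real cluster. Streaming also has a -file option that ships a script out with 
the job, so a more portable invocation would probably look something like this 
(an untested sketch on my part):

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+169.68-streaming.jar \
-input input -output output \
-mapper wc.pl -reducer c.pl \
-file /home/cs/tpederse/wc.pl \
-file /home/cs/tpederse/c.pl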

Now, there is a little bit you need to do in order to get your data into the 
Hadoop Distributed File System (HDFS), so I've put a very simple and probably 
not well documented example in a tar file called WordCount-Stream.tar that is 
available as a file in the Google Group. This is not meant to be considered any 
kind of release, just (maybe) a helping hand if anyone is interested in trying 
a few things with Hadoop. 
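
For what it's worth, the HDFS side of things boils down to a few hadoop fs 
commands. Roughly speaking (the directory names here are just examples, and 
the part-* file names in the output can vary):

# copy a local file into HDFS, into the directory named by -input above
hadoop fs -mkdir input
hadoop fs -put input.txt input

# ... run the streaming job ...

# then look at (or retrieve) the results
hadoop fs -cat output/part-00000
hadoop fs -get output local-output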

Finally, I'm using the Cloudera distribution of Hadoop (CDH2), which was fairly 
easy to install and configure (apt-get from their repository did most of the 
work). 

http://archive.cloudera.com/docs/cdh.html

In any case, this has been kind of fun, and it's kind of encouraging in that it 
seems like a relatively easy way to break up a very large Perl job (like 
counting words or ngrams) and run it on a multi-core machine without really 
having to modify the code in order to run some portions of it in parallel.

If anyone is interested in trying to do some of these same things I'm happy to 
try and answer questions, although most of what I know may already be revealed 
above. And if you know more and have used Perl and Hadoop (or any streaming 
language) I would be really curious to know more about what you've done.

Enjoy,
Ted
