Greetings all,

For some time now I've been pondering how to conveniently take advantage of 
multi-core computers, especially for many of our Perl programs. For a while I 
thought about trying to do *something* with MPI, but the overhead of wrapping 
Perl up in MPI was just kind of scary so that never went anywhere.

More recently I started using Hadoop, which turns out to support Perl 
surprisingly well. I say surprisingly because Hadoop is written in Java, most of 
the examples you see for it are in Java, and I think until recently you could 
only run Java programs on Hadoop. 

For example, one of the standard examples you see with Hadoop is a Java program 
that counts words in text. While the code works, it is (in my view) rather 
ghastly stuff. 

http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0

Thinking that I had to write code like WordCount.java  in order to use Hadoop 
really put me off the idea almost completely.

But it turns out that more recent versions of Hadoop support streaming: 
http://wiki.apache.org/hadoop/HadoopStreaming

That is, any programming language that reads from standard input and writes to 
standard output can be used with Hadoop. Of course that includes Perl, along 
with quite a few others. 

The trick with Hadoop appears to be figuring out what should be your "mapper" 
and what should be your "reducer". The mapper generates data that the reducer 
then operates on (i.e., performs a reduction operation upon). Hadoop is an open 
source implementation of Google's MapReduce framework, so that's where this 
terminology comes from. 
http://labs.google.com/papers/mapreduce.html

Now, it's fairly easy to write a simple wordcount program in Perl. Here's one 
that is broken into two parts (which turn out to be a mapper and a reducer).

Here's my mapper (called wc.pl)

#!/usr/bin/perl -w

# read in a file of text
# print it out word by word

while (<>) {
        foreach my $word (split) {
                print "$word\n";
        }
}
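
As far as I can tell from the streaming documentation, Hadoop treats everything 
up to the first tab on a line as the key and the rest as the value, and a line 
with no tab at all is simply a key with an empty value. That's why printing 
bare words like this works. If you wanted to follow the more usual key/value 
convention, the mapper could emit a count of 1 with each word, something like 
the sketch below (the reducer would then need to split that count back off):

#!/usr/bin/perl -w

# emit each word as a key with a count of 1 as the value,
# separated by a tab (the usual convention for streaming jobs)

while (<>) {
        foreach my $word (split) {
                print "$word\t1\n";
        }
}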


And here's my reducer (called c.pl)

#!/usr/bin/perl -w

# read in a stream of words, one per line
# store them in a hash and accumulate counts

while (<>) {
       chomp;
       $seen{$_}++;
}

foreach $key (keys %seen) {
       print "$seen{$key} $key\n";
}

Those of you familiar with Perl will see that wc.pl just takes an input stream 
and outputs it one string per line. c.pl takes in a stream of input (one string 
per line) and counts up the number of times each unique string occurs. 
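
One other thing worth knowing is that Hadoop sorts the mapper output by key 
before it reaches the reducer, so a reducer sees all the copies of a given word 
grouped together. That means a reducer doesn't really have to hold every word 
in memory the way c.pl does. A sketch of a version that just counts runs of 
identical lines (and so assumes sorted input, which Hadoop provides but a plain 
pipe does not) might look like:

#!/usr/bin/perl -w

# count runs of identical lines, relying on the input
# already being sorted by key (as Hadoop streaming arranges)

my $prev;
my $count = 0;

while (<>) {
        chomp;
        if (defined $prev && $_ ne $prev) {
                print "$count $prev\n";
                $count = 0;
        }
        $prev = $_;
        $count++;
}

print "$count $prev\n" if defined $prev;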

You can run this from the command line as ...

perl wc.pl input.txt | perl c.pl

It's this pipe construction that turns out to be the key to constructing Perl 
programs that run in Hadoop. The program to the left of the pipe is the mapper, 
and the program to the right is the reducer. 
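
Incidentally, to mirror at the command line what Hadoop actually does (including 
the sort between the two stages mentioned above), you can run:

perl wc.pl input.txt | sort | perl c.pl

c.pl doesn't need the sort (the hash takes care of grouping), but it's a handy 
way to sanity check a mapper/reducer pair that does depend on sorted input.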

Then, once you've got that set up you can invoke Hadoop as follows...

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+169.68-streaming.jar \
-input input -output output \
-mapper /home/cs/tpederse/wc.pl \
-reducer /home/cs/tpederse/c.pl

This will spawn multiple instances of wc.pl and c.pl, and also take care of 
dividing your data up amongst them. It's all pretty transparent, and it really 
does seem to run things in parallel. 
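
One caveat: the -mapper and -reducer paths above have to exist wherever the 
tasks actually run, which is fine on a single machine but not necessarily on a 
real cluster. Streaming also has a -file option that ships a script out with 
the job, so a more portable invocation would probably look something like this 
(an untested sketch on my part):

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+169.68-streaming.jar \
-input input -output output \
-mapper wc.pl -reducer c.pl \
-file /home/cs/tpederse/wc.pl \
-file /home/cs/tpederse/c.pl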

Now, there is a little bit you need to do in order to get your data into the 
Hadoop Distributed File System (HDFS), so I've put a very simple and probably 
not well documented example in a tar file called WordCount-Stream.tar that is 
available as a file in the Google Group. This is not meant to be considered any 
kind of release, just (maybe) a helping hand if anyone is interested in trying 
a few things with Hadoop. 
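
For what it's worth, the HDFS side of things boils down to a few hadoop fs 
commands. Roughly speaking (the directory names here are just examples, and 
the part-* file names in the output can vary):

# copy a local file into HDFS, into the directory named by -input above
hadoop fs -mkdir input
hadoop fs -put input.txt input

# ... run the streaming job ...

# then look at (or retrieve) the results
hadoop fs -cat output/part-00000
hadoop fs -get output local-output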

Finally, I'm using the Cloudera distribution of Hadoop (CDH2), which was fairly 
easy to install and configure (apt-get from their repository did most of the 
work). 

http://archive.cloudera.com/docs/cdh.html

In any case, this has been kind of fun, and it's kind of encouraging in that it 
seems like a relatively easy way to break up a very large Perl job (like 
counting words or ngrams) and run it on a multi-core machine without really 
having to modify the code in order to run some portions of it in parallel.

If anyone is interested in trying to do some of these same things I'm happy to 
try and answer questions, although most of what I know may already be revealed 
above. And if you know more and have used Perl and Hadoop (or any streaming 
language) I would be really curious to know more about what you've done.

Enjoy,
Ted
