Greetings all,

For some time now I've been pondering how to conveniently take advantage of multi-core computers, especially for many of our Perl programs. For a while I thought about trying to do *something* with MPI, but the overhead of wrapping Perl up in MPI was just kind of scary, so that never went anywhere.
More recently I started using Hadoop, which turns out to support Perl surprisingly well. I say surprising because Hadoop is written in Java, most of the examples you see for it are in Java, and I think until recently you could only run Java programs with Hadoop. For example, one of the standard examples you see with Hadoop is a Java program that counts words in text. While the code works, it is (in my view) rather ghastly stuff.

http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0

Thinking that I had to write code like WordCount.java in order to use Hadoop really put me off the idea almost completely. But it turns out that more recent versions of Hadoop support streaming:

http://wiki.apache.org/hadoop/HadoopStreaming

That is, any programming language that reads from standard input and writes to standard output can be used with Hadoop. Of course that includes Perl, and quite a few others.

The trick with Hadoop appears to be figuring out what should be your "mapper" and what should be your "reducer". The mapper generates data that the reducer then operates on (i.e., performs a reduction operation upon). Hadoop is an open source implementation of Google's MapReduce framework, so that's where this terminology comes from.

http://labs.google.com/papers/mapreduce.html

Now, it's fairly easy to write a simple wordcount program in Perl. Here's one that is broken into two parts (which turn out to be a mapper and a reducer).

Here's my mapper (called wc.pl):

   #!/usr/bin/perl -w

   # read in a file of text
   # print it out word by word

   while (<>) {
       @line = split;
       foreach $word (@line) {
           print "$word\n";
       }
   }

And here's my reducer (called c.pl):

   #!/usr/bin/perl -w

   # read in a stream of words, one per line
   # store them in a hash and accumulate counts

   while (<>) {
       chomp;
       $seen{$_}++;
   }

   foreach $key (keys %seen) {
       print "$seen{$key} $key\n";
   }

Those of you familiar with Perl will see that wc.pl just takes an input stream and outputs it one string per line, and that c.pl takes a stream of input (one string per line) and counts up the number of times each unique string occurs. You can run this from the command line as:

   perl wc.pl input.txt | perl c.pl

It's this pipe construction that turns out to be the key to constructing Perl programs that run in Hadoop. The program to the left of the pipe is the mapper, and the program to the right is the reducer. Then, once you've got that set up, you can invoke Hadoop as follows:

   hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+169.68-streaming.jar \
       -input input -output output \
       -mapper /home/cs/tpederse/wc.pl \
       -reducer /home/cs/tpederse/c.pl

This will spawn multiple instances of wc.pl and c.pl, and it will also take care of dividing your data up amongst them. It's all pretty transparent, and it really does seem to run things in parallel.

Now, there is a little bit you need to do in order to get your data into the Hadoop Distributed File System (HDFS), so I've put a very simple and probably not well documented example in a tar file called WordCount-Stream.tar that is available as a file in the Google Group. This is not meant to be considered any kind of release, just (maybe) a helping hand if anyone is interested in trying a few things with Hadoop.
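To give a rough idea of what that HDFS setup looks like (this is not exactly what's in the tar file, just the standard HDFS shell commands): the directory names "input" and "output" below are ones I've made up for illustration, and the streaming jar path will depend on your particular install.

   # copy a local text file into an HDFS directory called "input"
   hadoop fs -mkdir input
   hadoop fs -put input.txt input

   # if you've run the job before, remove the old output directory first,
   # since Hadoop will refuse to overwrite it
   hadoop fs -rmr output

   # run the streaming job (same command as above)
   hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+169.68-streaming.jar \
       -input input -output output \
       -mapper /home/cs/tpederse/wc.pl \
       -reducer /home/cs/tpederse/c.pl

   # look at the results, which end up in part-* files under "output"
   hadoop fs -cat output/part-00000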
Finally, I'm using the Cloudera distribution of Hadoop (CDH2), which was fairly easy to install and configure (apt-get from their repository did most of the work).

http://archive.cloudera.com/docs/cdh.html

In any case, this has been kind of fun, and it's encouraging in that it seems like a relatively easy way to break up a very large Perl job (like counting words or ngrams) and run it on a multi-core machine without really having to modify the program in order to have it run some portions in parallel. If anyone is interested in trying some of these same things I'm happy to try and answer questions, although most of what I know may already be revealed above. And if you know more and have used Perl with Hadoop (or any other language via streaming) I would be really curious to know more about what you've done.

Enjoy,
Ted