I told some folks here at work that I would give a talk on Mahout for
our reading group and decided that I would use it as an opportunity to
try Amazon's Elastic MapReduce (EMR).
I downloaded and untarred Hadoop 0.18.3, which is the version Amazon claims to be
running, so that I could try things out here first.
I can start up Hadoop and successfully run KMeans clustering on the
synthetic control data, using the instructions on the wiki and the
following command line:
bin/hadoop jar ~/Projects/EC2/mahout-0.1/examples/target/mahout-examples-0.1.job \
  org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
  input/testdata output org.apache.mahout.utils.EuclideanDistanceMeasure 80 55 0.5 10
I realize there's a shorter invocation, but I'm trying to figure out
what Amazon needs to run this, so I pulled the default arguments
from the KMeans job.
Now, on Amazon, you specify a jar file that gets run with "bin/hadoop jar",
along with the arguments that will be passed to that jar.
The trick is that the input and output data need to be in S3 buckets
and you need to specify the locations with S3 native URIs. I used the
command line interface to EMR to create a job like so:
elastic-mapreduce -v --create --name KMeans --num-instances 1 \
--jar s3n://mahout-code/mahout-examples-0.1.job \
--main-class org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
--arg s3n://mahout-input/testdata \
--arg s3n://mahout-output \
--arg org.apache.mahout.utils.EuclideanDistanceMeasure \
--arg 80 --arg 55 --arg 0.5 --arg 10
But this fails with the message "Steps completed with errors." It turns
out you can have the EMR infrastructure dump the logs for each step,
and looking at the stderr for step 1 I see:
java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output, expected: hdfs://domU-12-31-39-00-ED-51.compute-1.internal:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
        at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:44)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
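(For anyone trying to reproduce this: getting at those per-step logs is,
as far as I can tell, just a matter of giving the job flow a log bucket
when you create it. I believe the option is --log-uri, so something like
the following, with the bucket name being whatever you like:

elastic-mapreduce -v --create --name KMeans --num-instances 1 \
  --log-uri s3n://mahout-logs/ \
  ... same --jar, --main-class and --arg options as above)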
If I replace the s3n URI for the output with just mahout-output, the
code appears to run without incident (at least the log output looks
like that from my local run). Unfortunately, the HDFS instance the
output lands in disappears in a puff of smoke when the job finishes
running.
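(I suppose I could bolt a final step onto the job flow that copies the
results out of HDFS before the cluster goes away, maybe with distcp
along these lines, where both paths are guesses on my part, but that
feels like working around the problem rather than fixing it:

bin/hadoop distcp hdfs:///user/hadoop/mahout-output s3n://mahout-results/kmeans-output
)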
Now, I am by no means a Hadoop expert, but if the job can load data
from an s3n input URI, then it presumably has the right classes on hand
to do so (in fact, the jets3t jar appears in the .job file three
times!). So it seems like the Mahout KMeans job should be equally happy
with an s3n output URI, but I'm clearly misunderstanding something here.
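My guess (and it is only a guess) is that the exists() and delete()
calls are being made against the cluster's default HDFS FileSystem
rather than against the filesystem that the output URI actually names.
I would have expected something along these lines to pick up the right
FileSystem, assuming Path.getFileSystem does what I think it does:

Path outPath = new Path(output);
// Ask the Path itself for its FileSystem, so an s3n:// URI resolves to
// the S3 native filesystem rather than the job's default HDFS instance.
FileSystem fs = outPath.getFileSystem(conf);
if (fs.exists(outPath))
  fs.delete(outPath, true);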
One of the EMR samples is a Java DNA sequence matching program
(CloudBurst), which seems to work fine with an s3n URI for the
output. The setup for its output looks like the following:
Path oPath = new Path(outpath);
FileOutputFormat.setOutputPath(conf, oPath);
System.err.println(" Removing old results");
FileSystem.get(conf).delete(oPath);
where "conf" is of type org.apache.hadoop.mapred.JobConf. This is a
bit different than what happens in the KMeans job:
Path outPath = new Path(output);
client.setConf(conf);
FileSystem dfs = FileSystem.get(conf);
if (dfs.exists(outPath))
  dfs.delete(outPath, true);
Trying to use the CloudBurst idiom in the KMeans job produced no joy.
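For completeness, the modified block looked roughly like this
(reconstructed from memory, so the details may be off):

Path outPath = new Path(output);
client.setConf(conf);
FileOutputFormat.setOutputPath(conf, outPath);
// Remove any old results the way CloudBurst does, without the exists() check.
FileSystem.get(conf).delete(outPath);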
Any help would be greatly appreciated.
Steve Green
--
Stephen Green // [email protected]
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692