I told some folks here at work that I would give a talk on Mahout for
our reading group and decided that I would use it as an opportunity to
try Amazon's Elastic MapReduce (EMR).
I downloaded and untarred Hadoop 0.18.3, which is the version Amazon claims to be
running, so that I could try things out here first.
I can start up Hadoop and successfully run KMeans clustering on the
synthetic control data, using the instructions on the wiki and the
following command line:
bin/hadoop jar ~/Projects/EC2/mahout-0.1/examples/target/mahout-examples-0.1.job \
  org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
  input/testdata output org.apache.mahout.utils.EuclideanDistanceMeasure 80 55 0.5 10
I realize there's a shorter invocation, but I'm trying to figure out
what Amazon needs to run this, so I pulled the default arguments
from the KMeans job.
Now, on Amazon, you specify a jar file that gets run with "bin/hadoop jar",
along with the arguments that will be passed to that jar.
The trick is that the input and output data need to be in S3 buckets
and you need to specify the locations with S3 native URIs. I used the
command line interface to EMR to create a job like so:
elastic-mapreduce -v --create --name KMeans --num-instances 1 \
--jar s3n://mahout-code/mahout-examples-0.1.job \
--main-class org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
--arg s3n://mahout-input/testdata \
--arg s3n://mahout-output \
--arg org.apache.mahout.utils.EuclideanDistanceMeasure \
--arg 80 --arg 55 --arg 0.5 --arg 10
But this fails with the message "Steps completed with errors." It turns
out you can have the EMR infrastructure dump the logs for each step,
and looking at the stderr for step 1 I see:
java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output, expected: hdfs://domU-12-31-39-00-ED-51.compute-1.internal:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
        at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:77)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:44)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
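(For anyone trying to reproduce this: getting at those per-step logs is,
as far as I can tell, just a matter of giving the job flow a log bucket
when you create it. I believe the option is --log-uri, so something like
the following, with the bucket name being whatever you like:

elastic-mapreduce -v --create --name KMeans --num-instances 1 \
  --log-uri s3n://mahout-logs/ \
  ... same --jar, --main-class and --arg options as above)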
If I replace the s3n URI for the output with just mahout-output, the
code appears to run without incident (at least the log output looks
like that from my local run). Unfortunately, the HDFS instance the
output lands in disappears in a puff of smoke when the job finishes
running.
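(I suppose I could bolt a final step onto the job flow that copies the
results out of HDFS before the cluster goes away, maybe with distcp
along these lines, where both paths are guesses on my part, but that
feels like working around the problem rather than fixing it:

bin/hadoop distcp hdfs:///user/hadoop/mahout-output s3n://mahout-results/kmeans-output
)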
Now, I am by no means a Hadoop expert, but if the job can load data
from an s3n input URI, then it presumably has the right classes on hand
to do so (in fact, the jets3t jar appears in the .job file three
times!). So it seems like the Mahout KMeans job should be equally happy
with an s3n output URI, but I'm clearly misunderstanding something here.
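My guess (and it is only a guess) is that the exists() and delete()
calls are being made against the cluster's default HDFS FileSystem
rather than against the filesystem that the output URI actually names.
I would have expected something along these lines to pick up the right
FileSystem, assuming Path.getFileSystem does what I think it does:

Path outPath = new Path(output);
// Ask the Path itself for its FileSystem, so an s3n:// URI resolves to
// the S3 native filesystem rather than the job's default HDFS instance.
FileSystem fs = outPath.getFileSystem(conf);
if (fs.exists(outPath))
  fs.delete(outPath, true);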
One of the EMR samples is a Java DNA sequence matching program
(CloudBurst), which seems to work fine with an s3n URI for the
output. The setup for its output looks like the following:
Path oPath = new Path(outpath);
FileOutputFormat.setOutputPath(conf, oPath);
System.err.println(" Removing old results");
FileSystem.get(conf).delete(oPath);
where "conf" is of type org.apache.hadoop.mapred.JobConf. This is a
bit different than what happens in the KMeans job:
Path outPath = new Path(output);
client.setConf(conf);
FileSystem dfs = FileSystem.get(conf);
if (dfs.exists(outPath))
  dfs.delete(outPath, true);
Trying to use the CloudBurst idiom in the KMeans job produced no joy.
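For completeness, the modified block looked roughly like this
(reconstructed from memory, so the details may be off):

Path outPath = new Path(output);
client.setConf(conf);
FileOutputFormat.setOutputPath(conf, outPath);
// Remove any old results the way CloudBurst does, without the exists() check.
FileSystem.get(conf).delete(outPath);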
Any help would be greatly appreciated.
Steve Green
--
Stephen Green // [email protected]
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692