On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
Hi Stephen,
You are out on the bleeding edge with EMR.
Yeah, but the view is lovely from here!
I've been able to run the kmeans example directly on a small EC2
cluster that I started up myself (using the Hadoop src/contrib/ec2
scripts). I have not yet tried EMR (just got an account yesterday),
but I see that it requires you to have your data in S3 as opposed
to HDFS.
The job first runs the InputDriver to copy the raw test data into
Mahout Vector external representation after deleting any pre-
existing output files. It looks to me like the two delete()
snippets you show are pretty equivalent. If you have no pre-
existing output directory, the Mahout snippet won't attempt to
delete it.
I managed to figure that out :-) I'm pretty comfortable with the
ideas behind MapReduce, but being confronted with my first Job is a
bit more daunting than I expected.
I too am at a loss to explain what you are seeing. If you can post
more results I can try to help you read the tea leaves...
I noticed that the CloudBurst job just deleted the directory without
checking for existence and so I tried the same thing with Mahout:
java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output, expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
        at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
So no joy there.
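For reference, the two variants I mean look roughly like this (just a
sketch against the Hadoop 0.18 API with placeholder paths and names,
not the actual Mahout or CloudBurst code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path output = new Path("s3n://mahout-output");   // placeholder output location
    FileSystem fs = FileSystem.get(conf);            // the default filesystem (HDFS on EMR)

    // Guarded delete, the pattern the Mahout drivers use:
    if (fs.exists(output)) {
      fs.delete(output, true);
    }

    // Unconditional delete, the CloudBurst style I tried:
    fs.delete(output, true);
  }
}

Either way the delete goes through the same FileSystem handle, which
is presumably why dropping the exists() check made no difference.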
Should I see if I can isolate this as an s3n problem? I suppose I
could try running the Hadoop job locally with it reading and writing
the data from S3 and see if it suffers from the same problem. At
least then I could debug inside Hadoop.
Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n
problem it might have been fixed already. That doesn't help much
running on EMR, I guess.
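If I do go that route, a tiny standalone check would probably tell me
whether s3n itself misbehaves, independent of the Mahout drivers.
Something like this (a sketch; the bucket and credentials are
placeholders, and the fs.s3n.* properties are the ones you'd otherwise
set in hadoop-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3nSmokeTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The standard native-S3 credential properties.
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");      // placeholder
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");  // placeholder

    Path bucket = new Path("s3n://mahout-output/");            // placeholder bucket
    FileSystem fs = bucket.getFileSystem(conf);                // resolve the FS from the path
    System.out.println("Got filesystem " + fs.getUri() + " for " + bucket);

    // List whatever is there, then exercise the delete() call that blew up on EMR.
    if (fs.exists(bucket)) {
      for (FileStatus status : fs.listStatus(bucket)) {
        System.out.println(status.getPath());
      }
    }
    Path scratch = new Path(bucket, "s3n-smoke-test");         // placeholder scratch dir
    fs.mkdirs(scratch);
    fs.delete(scratch, true);
  }
}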
I'm also going to start a run on EMR that does away with the whole
exists/delete check and see if that works.
Following up to myself (my wife will tell you that I talk to myself!)
I removed a number of the exists/delete checks: in
CanopyClusteringJob, CanopyDriver, KMeansDriver, and ClusterDriver.
This allowed the jobs to progress, but they died the death a little
later with the following exception (and a few more; I can send the
whole log if you like):
java.lang.IllegalArgumentException: Wrong FS: s3n://mahoutput/canopies/part-00000, expected: hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
        at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
        at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
        at org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
Looking at the exception message there, I would almost swear that it
thinks the whole s3n path is the name of a FS that it doesn't know
about, but that might just be a bad message. This message repeats a
few times (retrying failed mappers, I guess?) and then the job fails.
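If I'm reading it right, though, the check isn't really about s3n
being unknown: FileSystem.checkPath() just compares the path's scheme
and authority against the filesystem instance it was called on, so an
HDFS handle will reject any s3n:// path. A minimal sketch of the
difference, assuming (and this is only my guess) that the drivers and
mappers get their handle from FileSystem.get(conf), i.e. the default
filesystem, rather than from the path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WrongFsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // on EMR, fs.default.name points at hdfs://<namenode>:9000
    Path canopies = new Path("s3n://mahoutput/canopies/part-00000");

    // Default filesystem: on the cluster this is DistributedFileSystem (HDFS).
    // Handing it an s3n:// path makes checkPath() throw "Wrong FS", as above.
    FileSystem defaultFs = FileSystem.get(conf);
    // defaultFs.open(canopies);   // would fail with the IllegalArgumentException

    // Resolving the filesystem from the path picks the s3n implementation instead.
    FileSystem pathFs = canopies.getFileSystem(conf);
    // pathFs.open(canopies);      // goes to S3 (given fs.s3n.* credentials are set)
  }
}

If that's what's happening, the failure would look the same no matter
how up-to-date the s3n code is.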
One thing that occurred to me: the Mahout examples job has the Hadoop
0.19.1 core jar in it. Could I be seeing some kind of version skew
between the Hadoop in the job file and the one on EMR? It worked fine
against a local 0.18.3, though, so maybe not.
I'm going to see if I can get the stock Mahout to run with s3n inputs
and outputs tomorrow and I'll let you all know how that goes.
Steve
--
Stephen Green // [email protected]
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692