On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:

> Hi Stephen,

> You are out on the bleeding edge with EMR.

Yeah, but the view is lovely from here!

> I've been able to run the kmeans example directly on a small EC2 cluster that I started up myself (using the Hadoop src/contrib/ec2 scripts). I have not yet tried EMR (just got an account yesterday), but I see that it requires you to have your data in S3 as opposed to HDFS.

> The job first runs the InputDriver to copy the raw test data into Mahout's external Vector representation after deleting any pre-existing output files. It looks to me like the two delete() snippets you show are pretty much equivalent. If you have no pre-existing output directory, the Mahout snippet won't attempt to delete it.

I managed to figure that out :-) I'm pretty comfortable with the ideas behind MapReduce, but being confronted with my first Job is a bit more daunting than I expected.
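For what it's worth, the equivalence you describe can be sketched outside Hadoop with plain java.nio (a sketch only, not Hadoop's FileSystem API; the class and method names here are just for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DeleteEquivalence {
    // Guarded delete, like the Mahout snippet: check for the path first.
    static void guardedDelete(Path p) throws IOException {
        if (Files.exists(p)) {
            Files.delete(p);
        }
    }

    // Unconditional delete, like the CloudBurst snippet: just try it.
    static void unconditionalDelete(Path p) throws IOException {
        Files.deleteIfExists(p);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("mahout-output", ".tmp");
        guardedDelete(out);        // removes the path
        guardedDelete(out);        // no-op: the guard sees it is already gone
        unconditionalDelete(out);  // also a no-op, and no exception either
        System.out.println("both patterns leave no output: " + !Files.exists(out));
    }
}
```

Either way the output path ends up absent, so the guard only matters if delete() itself objects to a missing path.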

> I too am at a loss to explain what you are seeing. If you can post more results I can try to help you read the tea leaves...

I noticed that the CloudBurst job just deleted the directory without checking for existence and so I tried the same thing with Mahout:

java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output, expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
        at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
        at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
        at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)

So no joy there.
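Reading the trace, it looks like Job.runJob is asking the cluster's default (HDFS) filesystem to delete an s3n: path, and checkPath rejects any path whose scheme doesn't match the filesystem's own URI. A minimal sketch of that check with java.net.URI (a sketch only, not the actual Hadoop code; the class and method names are mine):

```java
import java.net.URI;

public class WrongFsCheck {
    // Mimics what FileSystem.checkPath appears to do: a filesystem
    // rejects paths whose scheme differs from its own URI's scheme.
    static void checkPath(URI fsUri, URI path) {
        if (path.getScheme() != null && !path.getScheme().equals(fsUri.getScheme())) {
            throw new IllegalArgumentException(
                "Wrong FS: " + path + ", expected: " + fsUri);
        }
    }

    public static void main(String[] args) {
        URI hdfs = URI.create("hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000");
        URI output = URI.create("s3n://mahout-output");
        try {
            checkPath(hdfs, output);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If that's what's happening, one workaround might be for the job to get the filesystem from the output path itself (Path.getFileSystem(conf)) rather than the default FileSystem -- though I haven't tried that on EMR yet.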

Should I see if I can isolate this as an s3n problem? I suppose I could try running the Hadoop job locally with it reading and writing the data from S3 and see if it suffers from the same problem. At least then I could debug inside Hadoop.
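For that local isolation run, I think I can point the default filesystem at S3 in hadoop-site.xml -- something like the following, if I have the 0.18-era property names right (bucket name and credentials are placeholders):

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>s3n://your-bucket</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```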

Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n problem it might have been fixed already. That doesn't help much running on EMR, I guess.

I'm also going to start a run on EMR that does away with the whole exists/delete check and see if that works.

Thanks for the help, and I'll let you know how I get on.

Steve
--
Stephen Green                      //   [email protected]
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692


