This is a fairly uninformed observation, but the error seems to come from Hadoop itself. It seems to say that it understands hdfs: but not s3n:, and that makes sense to me. Do we expect Hadoop to understand how to read from S3? I would expect not. (Though the examples you point to seem to overcome this just fine?)
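If those examples really do work, it may be because Hadoop picks the FileSystem implementation from the URI scheme (via the fs.<scheme>.impl configuration property), and 0.18.x ships a NativeS3FileSystem registered under s3n:. Here's a minimal probe of that, assuming a made-up bucket and credentials:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3nProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical credentials; on EMR these are normally supplied for you.
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

    // The URI scheme, not fs.default.name, decides which FileSystem you get.
    Path p = new Path("s3n://some-bucket/some/key");
    FileSystem fs = p.getFileSystem(conf);
    System.out.println(fs.getClass().getName());
  }
}

If that prints org.apache.hadoop.fs.s3native.NativeS3FileSystem, then the scheme itself isn't the problem.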
When I have integrated code with stuff stored on S3, I have always had to write extra glue code to copy from S3 to a local file system, do work, then copy back.

On Tue, Apr 14, 2009 at 9:01 PM, Stephen Green <[email protected]> wrote:
>
> On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
>
>>
>> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
>>
>>> Hi Stephen,
>>>
>>> You are out on the bleeding edge with EMR.
>>
>> Yeah, but the view is lovely from here!
>>
>>> I've been able to run the kmeans example directly on a small EC2 cluster
>>> that I started up myself (using the Hadoop src/contrib/ec2 scripts). I have
>>> not yet tried EMR (just got an account yesterday), but I see that it
>>> requires you to have your data in S3 as opposed to HDFS.
>>>
>>> The job first runs the InputDriver to copy the raw test data into Mahout
>>> Vector external representation after deleting any pre-existing output files.
>>> It looks to me like the two delete() snippets you show are pretty
>>> equivalent. If you have no pre-existing output directory, the Mahout snippet
>>> won't attempt to delete it.
>>
>> I managed to figure that out :-) I'm pretty comfortable with the ideas
>> behind MapReduce, but being confronted with my first Job is a bit more
>> daunting than I expected.
>>
>>> I too am at a loss to explain what you are seeing. If you can post more
>>> results I can try to help you read the tea leaves...
>>
>> I noticed that the CloudBurst job just deleted the directory without
>> checking for existence and so I tried the same thing with Mahout:
>>
>> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
>> expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
>>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>>         at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>>         at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>>         at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
>>         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>>         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
>>
>> So no joy there.
>>
>> Should I see if I can isolate this as an s3n problem? I suppose I could
>> try running the Hadoop job locally with it reading and writing the data from
>> S3 and see if it suffers from the same problem. At least then I could debug
>> inside Hadoop.
>>
>> Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n
>> problem it might have been fixed already. That doesn't help much running on
>> EMR, I guess.
>>
>> I'm also going to start a run on EMR that does away with the whole
>> exists/delete check and see if that works.
>
> Following up to myself (my wife will tell you that I talk to myself!) I
> removed a number of the exists/delete checks: in CanopyClusteringJob,
> CanopyDriver, KMeansDriver, and ClusterDriver.
> This allowed the jobs to progress, but they died the death a little later
> with the following exception (and a few more, I can send the whole log if
> you like):
>
> java.lang.IllegalArgumentException: Wrong FS:
> s3n://mahoutput/canopies/part-00000, expected:
> hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>         at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>         at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>         at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>         at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
>         at org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>         at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>
> Looking at the exception message there, I would almost swear that it thinks
> the whole s3n path is the name of an FS that it doesn't know about, but that
> might just be a bad message. This message repeats a few times (retrying
> failed mappers, I guess?) and then the job fails.
>
> One thing that occurred to me: the Mahout examples job has the Hadoop
> 0.19.1 core jar in it. Could I be seeing some kind of version skew between
> the Hadoop in the job file and the one on EMR? Although it worked fine with
> a local 0.18.3, so maybe not.
>
> I'm going to see if I can get the stock Mahout to run with s3n inputs and
> outputs tomorrow and I'll let you all know how that goes.
>
> Steve
> --
> Stephen Green          // [email protected]
> Principal Investigator \\ http://blogs.sun.com/searchguy
> Aura Project           // Voice: +1 781-442-0926
> Sun Microsystems Labs  \\ Fax: +1 781-442-1692
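One more thought, looking at the two traces side by side: in both cases DistributedFileSystem.checkPath is the thing rejecting the s3n:// path, which is the signature of code that called FileSystem.get(conf) (returning the default FS, i.e. HDFS on the EMR cluster) and then handed that HDFS instance an s3n:// Path. I haven't verified this against the Mahout trunk, so treat it as a guess, but the usual fix is to let each Path pick its own FileSystem. A sketch against the 0.18 API (the helper names below are mine, not Mahout's):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class FsFromPath {

  // The suspect pattern, per the first trace:
  //   FileSystem fs = FileSystem.get(conf);       // default FS: HDFS on EMR
  //   fs.delete(new Path("s3n://mahout-output")); // -> Wrong FS

  /** Delete the output dir on whatever filesystem its URI names. */
  static void deleteIfExists(Path out, Configuration conf) throws IOException {
    FileSystem fs = out.getFileSystem(conf); // s3n -> NativeS3FileSystem
    if (fs.exists(out)) {
      fs.delete(out, true); // recursive
    }
  }

  /** Same idea for the reader in the second trace (cf. ClusterMapper.configure). */
  static SequenceFile.Reader openReader(Path part, Configuration conf)
      throws IOException {
    return new SequenceFile.Reader(part.getFileSystem(conf), part, conf);
  }
}

If that's what's happening, it would also explain why the same jobs run fine against a local 0.18.3, where the data really is on the default filesystem.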
