A bit more progress. I asked about this problem on Amazon's EMR forums. Here's the thread:

http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945

The answer from Amazon was:

This appears to be an issue with Mahout. This exception is fairly common and matches the pattern of "Wrong FS: s3n://*/, expected: hdfs://*:9000". This occurs when you try and use an S3N path with HDFS. Typically this occurs because the code asks for the wrong FileSystem.

This could happen because a developer used the wrong static method on Hadoop's FileSystem class:

http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/fs/FileSystem.html

If you call FileSystem.get(Configuration conf) you'll get an instance of the cluster's default file system, which in our case is HDFS. Instead, if you have a URI and want a reference to the FileSystem that URI points to, you should call the method FileSystem.get(URI uri, Configuration conf).
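
To make the distinction concrete, here's a minimal sketch of the two lookups (my own illustration, not code from the Amazon reply or from Mahout; it assumes it's running somewhere that S3N credentials are configured):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsLookupSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path outPath = new Path("s3n://mahout-output/");

            // Wrong for an S3N path on EMR: this returns the cluster's
            // *default* file system (HDFS there), so handing it an s3n://
            // path produces "Wrong FS: s3n://..., expected: hdfs://...:9000".
            FileSystem defaultFs = FileSystem.get(conf);

            // Right: resolve the file system from the path's own URI, so an
            // s3n:// path yields a NativeS3FileSystem instance.
            FileSystem pathFs = FileSystem.get(outPath.toUri(), conf);

            System.out.println("default fs: " + defaultFs.getUri());
            System.out.println("path fs:    " + pathFs.getUri());
        }
    }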


He offered a solution that involved using DistCp to copy data from S3 to HDFS and then back again, but since I have the Mahout source, I decided to pursue things a bit further. I went into the source and modified the places where the filesystem is fetched to do the following:

    FileSystem dfs = FileSystem.get(outPath.toUri(), conf);

(There were three places where I changed it, but I expect there are more lying around.) This is the idiom used by the CloudBurst example on EMR.
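
For reference, the pattern I ended up with in each of those places looks roughly like this (a paraphrase, not a verbatim copy of the Mahout source; the class and method names are just for illustration):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OutputPathSketch {
        // Resolve the FileSystem from the output path's own URI instead of
        // the cluster default, then clear the output directory if it's
        // already there, the way Job.runJob does before starting a run.
        public static void clearOutput(String output, Configuration conf)
                throws IOException {
            Path outPath = new Path(output);
            FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
            if (dfs.exists(outPath)) {
                dfs.delete(outPath, true);
            }
        }
    }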

Making this change fixes the exception I was getting, but now I'm hitting a different one:

java.lang.NullPointerException
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:310)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

(The line numbers in kmeans.Job are weird because I added logging.)

If the Hadoop on EMR really is 0.18.3, then the null pointer at that line is the store in NativeS3FileSystem. But there's another problem: I deleted the output path before I started the run, so the existence check should have come back false and dfs.delete should never have been called. I added a bit of logging to the KMeans job, and here's what it says about the output path:

2009-04-16 14:04:35,757 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): dfs: class org.apache.hadoop.fs.s3native.NativeS3FileSystem

So it got the right output file system type.

2009-04-16 14:04:35,758 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): s3n://mahout-output/ exists: true
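
For the record, the logging that produced those two lines was roughly this (a reconstruction rather than the exact lines I added; the class name is just a stand-in):

    import java.io.IOException;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OutputPathLogging {
        private static final Log LOG = LogFactory.getLog(OutputPathLogging.class);

        // Log which FileSystem implementation the output path resolves to,
        // and whether that file system thinks the path already exists.
        public static void inspect(Path outPath, Configuration conf)
                throws IOException {
            FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
            LOG.info("dfs: " + dfs.getClass());
            LOG.info(outPath + " exists: " + dfs.exists(outPath));
        }
    }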

Shouldn't dfs.exists(outPath) return false for a non-existent path? And didn't the store have to exist (i.e., be non-null) for it to figure that out? I guess this is really starting to verge into base Hadoop territory.

I'm rapidly getting to the point where I need to solve this one just to prove to myself that I can get it to run!

Steve
--
Stephen Green                      //   [email protected]
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692


