A bit more progress. I asked about this problem on Amazon's EMR
forums. Here's the thread:
http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945
The answer from Amazon was:
This appears to be an issue with Mahout. This exception is fairly
common and matches the pattern of "Wrong FS: s3n://*/, expected:
hdfs://*:9000". This occurs when you try and use an S3N path with
HDFS. Typically this occurs because the code asks for the wrong
FileSystem.
This could happen because a developer used the wrong static method
on Hadoop's FileSystem class:
http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/fs/FileSystem.html
If you call FileSystem.get(Configuration conf) you'll get an
instance of the cluster's default file system, which in our case is
HDFS. Instead, if you have a URI and want a reference to the
FileSystem that URI points to, you should call the method
FileSystem.get(URI uri, Configuration conf).
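(To make his distinction concrete, here's a tiny sketch of the two calls; the bucket name is made up:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhichFs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path outPath = new Path("s3n://some-bucket/output");   // made-up S3N path

    // The cluster's default file system -- HDFS on an EMR cluster --
    // no matter what scheme outPath uses.
    FileSystem defaultFs = FileSystem.get(conf);

    // The file system named by the URI's scheme: NativeS3FileSystem for s3n://.
    FileSystem pathFs = FileSystem.get(outPath.toUri(), conf);

    System.out.println(defaultFs.getClass().getName());
    System.out.println(pathFs.getClass().getName());
  }
}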
He offered a solution that involved using DistCp to copy data from S3
to HDFS and then back again, but since I have the Mahout source, I
decided to pursue things a bit further. I went into the source and
modified the places where the filesystem is fetched to do the following:
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
(There were three places where I changed it, but I expect there are more
lying around.) This is the idiom used by the CloudBurst example on EMR.
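For context, the patched spots look roughly like this (paraphrased from
memory, not a verbatim excerpt of the Mahout source):

Configuration conf = new Configuration();
Path outPath = new Path(output);               // e.g. "s3n://mahout-output/"

// Before: always resolves to the cluster default (HDFS on EMR)
// FileSystem dfs = FileSystem.get(conf);

// After: resolve the file system from the path's own URI
// (s3n:// -> NativeS3FileSystem)
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);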
Making this change fixes the exception that I was getting, but I'm now
getting a different exception:
java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:310)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
(The line numbers in kmeans.Job are weird because I added logging.)
If the Hadoop on EMR is really 0.18.3, then the null pointer here is
the store in the NativeS3FileSystem. But there's another problem: I
deleted the output path before I started the run, so the existence
check should have failed and dfs.delete never should have been
called. I added a bit of logging to the KMeans job and here's what it
says about the output path:
2009-04-16 14:04:35,757 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): dfs: class org.apache.hadoop.fs.s3native.NativeS3FileSystem

So it got the right output file system type.

2009-04-16 14:04:35,758 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): s3n://mahout-output/ exists: true
Shouldn't dfs.exists(outPath) have returned false for a non-existent
path? And didn't the store have to exist (i.e., be non-null) for it to
figure that out? I guess this is really starting to verge into base
Hadoop territory.
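For the record, the instrumentation that produced those two log lines is
just a couple of info calls around the existence check in Job.runJob;
roughly (names approximate, LOG being whatever logger the class already
uses):

FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
LOG.info("dfs: " + dfs.getClass());        // -> "dfs: class ...NativeS3FileSystem"

boolean exists = dfs.exists(outPath);
LOG.info(outPath + " exists: " + exists);  // -> "s3n://mahout-output/ exists: true"

if (exists) {
  // This is the call that blows up with the NullPointerException inside
  // NativeS3FileSystem.delete, even though the path was removed before the run.
  dfs.delete(outPath);
}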
I'm rapidly getting to the point where I need to solve this one just
to prove to myself that I can get it to run!
Steve
--
Stephen Green // [email protected]
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692