A bit more progress. I asked about this problem on Amazon's EMR
forums. Here's the thread:
http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945
The answer from Amazon was:
This appears to be an issue with Mahout. This exception is fairly
common and matches the pattern of "Wrong FS: s3n://*/, expected:
hdfs://*:9000". This occurs when you try and use an S3N path with
HDFS. Typically this occurs because the code asks for the wrong
FileSystem.
This could happen because a developer used the wrong static method
on Hadoop's FileSystem class:
http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/fs/FileSystem.html
If you call FileSystem.get(Configuration conf) you'll get an
instance of the cluster's default file system, which in our case is
HDFS. Instead, if you have a URI and want a reference to the
FileSystem that URI points to, you should call the method
FileSystem.get(URI uri, Configuration conf).
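(To make his distinction concrete, here's a tiny sketch of the two calls; the bucket name is made up:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhichFs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path outPath = new Path("s3n://some-bucket/output");   // made-up S3N path

    // The cluster's default file system -- HDFS on an EMR cluster --
    // no matter what scheme outPath uses.
    FileSystem defaultFs = FileSystem.get(conf);

    // The file system named by the URI's scheme: NativeS3FileSystem for s3n://.
    FileSystem pathFs = FileSystem.get(outPath.toUri(), conf);

    System.out.println(defaultFs.getClass().getName());
    System.out.println(pathFs.getClass().getName());
  }
}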
He offered a solution that involved using DistCp to copy data from S3
to HDFS and then back again, but since I have the Mahout source, I
decided to pursue things a bit further. I went into the source and
modified the places where the filesystem is fetched to do the following:
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
(There were three places where I changed it, but I expect there are more
lying around.) This is the idiom used by the CloudBurst example on EMR.
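For context, the patched spots look roughly like this (paraphrased from
memory, not a verbatim excerpt of the Mahout source):

Configuration conf = new Configuration();
Path outPath = new Path(output);               // e.g. "s3n://mahout-output/"

// Before: always resolves to the cluster default (HDFS on EMR)
// FileSystem dfs = FileSystem.get(conf);

// After: resolve the file system from the path's own URI
// (s3n:// -> NativeS3FileSystem)
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);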
Making this change fixes the exception that I was getting, but I'm now
getting a different exception:
java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:310)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
(The line numbers in kmeans.Job are weird because I added logging.)
If the Hadoop on EMR is really 0.18.3, then the null pointer here is
the store in the NativeS3FileSystem. But there's another problem: I
deleted the output path before I started the run, so the existence
check should have failed and dfs.delete never should have been
called. I added a bit of logging to the KMeans job and here's what it
says about the output path:
2009-04-16 14:04:35,757 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): dfs: class org.apache.hadoop.fs.s3native.NativeS3FileSystem

So it got the right output file system type.

2009-04-16 14:04:35,758 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): s3n://mahout-output/ exists: true
Shouldn't dfs.exists(outPath) have returned false for a non-existent
path? And didn't the store have to exist (i.e., be non-null) for it to
figure that out? I guess this is really starting to verge into base
Hadoop territory.
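For the record, the instrumentation that produced those two log lines is
just a couple of info calls around the existence check in Job.runJob;
roughly (names approximate, LOG being whatever logger the class already
uses):

FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
LOG.info("dfs: " + dfs.getClass());        // -> "dfs: class ...NativeS3FileSystem"

boolean exists = dfs.exists(outPath);
LOG.info(outPath + " exists: " + exists);  // -> "s3n://mahout-output/ exists: true"

if (exists) {
  // This is the call that blows up with the NullPointerException inside
  // NativeS3FileSystem.delete, even though the path was removed before the run.
  dfs.delete(outPath);
}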
I'm rapidly getting to the point where I need to solve this one just
to prove to myself that I can get it to run!
Steve
--
Stephen Green // [email protected]
Principal Investigator \\ http://blogs.sun.com/searchguy
Aura Project // Voice: +1 781-442-0926
Sun Microsystems Labs \\ Fax: +1 781-442-1692