This is a fairly uninformed observation, but the error seems to come from Hadoop itself. It seems to say that it understands hdfs: but not s3n:, and that makes sense to me. Do we expect Hadoop to understand how to read from S3? I would expect not. (Though the examples you point to seem to overcome this just fine?)
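If those examples really do work, it may be because Hadoop picks the FileSystem implementation from the URI scheme (via the fs.<scheme>.impl configuration property), and 0.18.x ships a NativeS3FileSystem registered under s3n:. Here's a minimal probe of that, assuming a made-up bucket and credentials:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3nProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical credentials; on EMR these are normally supplied for you.
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

    // The URI scheme, not fs.default.name, decides which FileSystem you get.
    Path p = new Path("s3n://some-bucket/some/key");
    FileSystem fs = p.getFileSystem(conf);
    System.out.println(fs.getClass().getName());
  }
}

If that prints org.apache.hadoop.fs.s3native.NativeS3FileSystem, then the scheme itself isn't the problem.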
When I have integrated code with stuff stored on S3, I have always had to write extra glue code to copy from S3 to a local file system, do work, then copy back.

On Tue, Apr 14, 2009 at 9:01 PM, Stephen Green <[email protected]> wrote:
>
> On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
>
>>
>> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
>>
>>> Hi Stephen,
>>>
>>> You are out on the bleeding edge with EMR.
>>
>> Yeah, but the view is lovely from here!
>>
>>> I've been able to run the kmeans example directly on a small EC2 cluster
>>> that I started up myself (using the Hadoop src/contrib/ec2 scripts). I have
>>> not yet tried EMR (just got an account yesterday), but I see that it
>>> requires you to have your data in S3 as opposed to HDFS.
>>>
>>> The job first runs the InputDriver to copy the raw test data into Mahout
>>> Vector external representation after deleting any pre-existing output files.
>>> It looks to me like the two delete() snippets you show are pretty
>>> equivalent. If you have no pre-existing output directory, the Mahout snippet
>>> won't attempt to delete it.
>>
>> I managed to figure that out :-) I'm pretty comfortable with the ideas
>> behind MapReduce, but being confronted with my first Job is a bit more
>> daunting than I expected.
>>
>>> I too am at a loss to explain what you are seeing. If you can post more
>>> results I can try to help you read the tea leaves...
>>
>> I noticed that the CloudBurst job just deleted the directory without
>> checking for existence and so I tried the same thing with Mahout:
>>
>> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
>> expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
>>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>>         at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>>         at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>>         at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
>>         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
>>         at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
>>
>> So no joy there.
>>
>> Should I see if I can isolate this as an s3n problem? I suppose I could
>> try running the Hadoop job locally with it reading and writing the data from
>> S3 and see if it suffers from the same problem. At least then I could debug
>> inside Hadoop.
>>
>> Of course, I'm doing all this in Hadoop 0.18.3, and if it is an s3n
>> problem it might have been fixed already. That doesn't help much running on
>> EMR, I guess.
>>
>> I'm also going to start a run on EMR that does away with the whole
>> exists/delete check and see if that works.
>
> Following up to myself (my wife will tell you that I talk to myself!) I
> removed a number of the exists/delete checks: in CanopyClusteringJob,
> CanopyDriver, KMeansDriver, and ClusterDriver.
> This allowed the jobs to progress, but they died the death a little later
> with the following exception (and a few more, I can send the whole log if
> you like):
>
> java.lang.IllegalArgumentException: Wrong FS:
> s3n://mahoutput/canopies/part-00000, expected:
> hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
>         at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
>         at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
>         at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
>         at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
>         at org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>         at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>
> Looking at the exception message there, I would almost swear that it thinks
> the whole s3n path is the name of an FS that it doesn't know about, but that
> might just be a bad message. This message repeats a few times (retrying
> failed mappers, I guess?) and then the job fails.
>
> One thing that occurred to me: the Mahout examples job has the Hadoop
> 0.19.1 core jar in it. Could I be seeing some kind of version skew between
> the Hadoop in the job file and the one on EMR? Although it worked fine with
> a local 0.18.3, so maybe not.
>
> I'm going to see if I can get the stock Mahout to run with s3n inputs and
> outputs tomorrow and I'll let you all know how that goes.
>
> Steve
> --
> Stephen Green          // [email protected]
> Principal Investigator \\ http://blogs.sun.com/searchguy
> Aura Project           // Voice: +1 781-442-0926
> Sun Microsystems Labs  \\ Fax: +1 781-442-1692
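One more thought, looking at the two traces side by side: in both cases DistributedFileSystem.checkPath is the thing rejecting the s3n:// path, which is the signature of code that called FileSystem.get(conf) (returning the default FS, i.e. HDFS on the EMR cluster) and then handed that HDFS instance an s3n:// Path. I haven't verified this against the Mahout trunk, so treat it as a guess, but the usual fix is to let each Path pick its own FileSystem. A sketch against the 0.18 API (the helper names below are mine, not Mahout's):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class FsFromPath {

  // The suspect pattern, per the first trace:
  //   FileSystem fs = FileSystem.get(conf);       // default FS: HDFS on EMR
  //   fs.delete(new Path("s3n://mahout-output")); // -> Wrong FS

  /** Delete the output dir on whatever filesystem its URI names. */
  static void deleteIfExists(Path out, Configuration conf) throws IOException {
    FileSystem fs = out.getFileSystem(conf); // s3n -> NativeS3FileSystem
    if (fs.exists(out)) {
      fs.delete(out, true); // recursive
    }
  }

  /** Same idea for the reader in the second trace (cf. ClusterMapper.configure). */
  static SequenceFile.Reader openReader(Path part, Configuration conf)
      throws IOException {
    return new SequenceFile.Reader(part.getFileSystem(conf), part, conf);
  }
}

If that's what's happening, it would also explain why the same jobs run fine against a local 0.18.3, where the data really is on the default filesystem.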
