Mahout on Elastic MapReduce (MAHOUT) edited by Stephen Green
      Page: 
http://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce
   Changes: 
http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=116559&originalVersion=5&revisedVersion=6






Content:
---------------------------------------------------------------------

This page details the set of steps that was necessary to get an example of 
k-Means clustering running on Amazon's Elastic MapReduce (EMR).  The aim here 
was simply to get something running, but it should provide a good head start if 
you want to run something else.  I started out on the [QuickStart] page and 
went from there.  Along the way, I encountered some problems and posted to the 
Amazon EMR forums to get some help.  The [resulting 
thread|http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945&tstart=15]
 might have some useful information if you're having trouble.

h1. Getting Started

   * Get yourself an EMR account.  If you're already using EC2, then you can do 
this from [Amazon's AWS Managment Console|https://console.aws.amazon.com/], 
which has a tab for running EMR.
   * Get the [ElasticFox|https://addons.mozilla.org/en-US/firefox/addon/11626] 
and [S3Fox|https://addons.mozilla.org/en-US/firefox/search?q=s3fox&cat=all] 
Firefox extensions.  These will make it easy to monitor running EMR instances, 
upload code and data, and download results.
   * Download the Ruby command line client for EMR.  You can do things from the 
GUI, but when you're in the midst of trying to get something running, the CLI 
client will make life a lot easier.
   * Have a look at [Common Problems Running Job 
Flows|http://developer.amazonwebservices.com/connect/thread.jspa?messageID=124694&#124694]
 and [Developing and Debugging Job 
Flows|http://developer.amazonwebservices.com/connect/message.jspa?messageID=124695#124695]
 in the EMR forum at Amazon.  They were tremendously useful.
   * Make sure that you're up to date with the Mahout source.  The fix for 
[Issue 118|http://issues.apache.org/jira/browse/MAHOUT-118] is required to get 
things running when you're sending output to an S3 bucket.
   * Build the Mahout core and examples.

Note that the Hadoop that's running on EMR is a modified version of Hadoop 
0.18.3 that includes some back-ported stuff from later versions of Hadoop.  The 
EMR GUI in the AWS Management Console provides a number of examples of using 
EMR, and you might want to try running one of these to get started.

One big gotcha that I discovered is that the S3N file system for Hadoop has a 
couple of weird cases that boil down to the following advice:  if you're naming 
a directory in an s3n URI, make sure that it ends in a slash and you should not 
try to use a top-level S3 bucket name as the place where your Mahout output 
will be going, you should always include a subdirectory.

h1. Uploading Code and Data

I decided that I would use separate S3 buckets for the Mahout code, the input 
for the clustering (I used the synthetic control data, you can find it easily 
from the [QuickStart] page), and the output of the clustering.  

I used S3Fox to make two buckets: {{mahout-code}} and {{mahout-input}} and then 
uploaded {{mahout-examples-0.1.job}} to the {{mahout-code}} bucket.  I copied 
the {{synthetic_control.data}} file to the {{mahout-input}} bucket.  You don't 
actually need to make an output bucket, as the Mahout examples will create one 
if the one you specify doesn't exist.

h1. Running k-means Clustering

EMR offers two modes for running MapReduce jobs.  The first is a "streaming" 
mode where you provide the source for single-step mapper and reducer functions 
(you can use languages other than Java for this).  The second mode is called 
"Custom Jar" and it gives you full control over the job steps that will run.  
This is the mode that we need to use to run Mahout.  

In order to run in Custom Jar mode, you need to look at the example that you 
want to run (in my case, 
{{org.apache.mahout.clustering.syntheticcontrol.kmeans.Job}}) and figure out 
the arguments that you need to provide to the job.  Essentially, you need to 
know the command line that you would give to bin/hadoop in order to run the 
job, including whatever parameters the job needs to run.  Certainly, you could 
provide a jar for EMR that has a {{Main-Class}} attribute and make sure that 
the default {{main}} for that class will do the right thing, but that won't 
work with the default Mahout examples, because they expect the input and output 
to be in HDFS and not S3.

h2. Using the GUI

The EMR GUI is an easy way to start up a Custom Jar run, but it doesn't have 
the full functionality of the CLI.  Basically, you tell the GUI where in S3 the 
jar file is using a Hadoop s3n URI like 
{{s3n://mahout-code/mahout-examples-0.1.job}}.  The GUI will check and make 
sure that the given file exists, which is a nice sanity check.  You can then 
provide the arguments for the job just as you would on the command line.  The 
arguments for the k-means job that I wanted to run were as follows:

{noformat}
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job 
s3n://mahout-input/synthetic_control.data 
s3n://mahout-output/kmeans/ 
org.apache.mahout.utils.EuclideanDistanceMeasure 80 55 0.5 10
{noformat}

You can see what the configuration screen looks like here: 
!mahout.png|thumbnail!

The main failing with the GUI mode is that you can only specify a single job to 
run, and you can't run another job in the same set of instances.  Recall that 
on AWS you pay for partial hours at the hourly rate, so if your job fails in 
the first 10 seconds, you pay for the full hour and if you try again, you're 
going to paying for another hour.

Because of this, I strongly suggest using the CLI once you're minmally familiar 
with EMR.

h2. Using the CLI

If you're in development mode, and trying things out, EMR allows you to set up 
a set of instances and leave them running.  Once you've done this, you can add 
job steps to the set of instances as you like.  This solves the "10 second 
failure" problem that I described above and lets you get full value for your 
EMR dollar.  Amazon has pretty good [documentation for the 
CLI|http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?CHAP_RunningaJob.html],
 which you'll need to read to figure out how to do things like set up your AWS 
credentials for the EMR CLI.

You can start up a job flow that will keep running using an invocation like the 
following:

{noformat}
./elastic-mapreduce --create --alive \
   --log-uri s3n://mahout-logs/ --key_pair aura \
   --num-instances 4 --name kmeans
{noformat}

Note that I've named the job flow {{kmeans}}, I've told it to use 4 small 
instances, and I'll have the logs stored in an S3 bucket at the end of the run. 
 This call returns the name of the job flow, and you'll need that for 
subsequent calls to add steps to the job flow. 

Let's list our job flows:

{noformat}
[stgr...@dhcp-ubur02-74-153 14:16:15 emr]$ ./elastic-mapreduce --list
j-3JB4UF7CQQ025     WAITING        ec2-174-129-90-97.compute-1.amazonaws.com    
kmeans
{noformat}

Everything's started up, and it's waiting for us to add a step to the job.  
When I started the job flow, I specified a key pair that I created earlier so 
that I can log into the master while the job flow is running:

{noformat}
[stgr...@dhcp-ubur02-74-153 14:14:01 emr]$ ssh 
[email protected]
Linux domU-12-31-39-00-84-13 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 
2008 i686
--------------------------------------------------------------------------------

Welcome to Amazon Elastic MapReduce running Hadoop 0.18.3 and Debian/Lenny.
 
Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
/mnt/var/log/hadoop/steps for diagnosing step failures.

The Hadoop UI can be accessed via the command: lynx http://localhost:9100/

--------------------------------------------------------------------------------
Last login: Fri May  1 18:12:13 2009 from 192.18.128.5
had...@domu-12-31-39-00-84-13:~$ 
{noformat}

Let's add a step to run the same job that I ran via the GUI:

{noformat}
./elastic-mapreduce -j j-3JB4UF7CQQ025 \
   --jar s3n://mahout-code/mahout-examples-0.1.job \
   --main-class org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
   --arg s3n://mahout-input/synthetic_control.data \
   --arg s3n://mahout-output/kmeans/ \
   --arg org.apache.mahout.utils.EuclideanDistanceMeasure \
   --arg 80 --arg 55 --arg 0.5 --arg 10
{noformat}

When you do this, the job flow goes into the {{RUNNING}} state for a while and 
then returns to {{WAITING}} once the step has finished.  You can use the CLI or 
the GUI to monitor the step while it runs.  Once you've finished with your job 
flow, you can shut it down the following way:

{noformat}
./elastic-mapreduce -j j-3JB4UF7CQQ025 --terminate
{noformat}

and go look in your S3 buckets to find your output and logs (if things didn't 
go very well!)










---------------------------------------------------------------------
CONFLUENCE INFORMATION
This message is automatically generated by Confluence

Unsubscribe or edit your notifications preferences
   http://cwiki.apache.org/confluence/users/viewnotifications.action

If you think it was sent incorrectly contact one of the administrators
   http://cwiki.apache.org/confluence/administrators.action

If you want more information on Confluence, or have a bug to report see
   http://www.atlassian.com/software/confluence


Reply via email to