Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Mahout on Amazon EC2 
(https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Amazon+EC2)


Edited by Timothy Potter:
---------------------------------------------------------------------
Amazon EC2 is a compute-on-demand platform from Amazon.com that lets users 
rent one or more host machines by the hour and run applications on them. 
Since Hadoop can run on EC2, it is also possible to run Mahout on EC2. The 
following sections detail how to create a Hadoop cluster from the ground up. 
Alternatively, you can use an existing Hadoop AMI; in that case, see [Use an 
Existing Hadoop AMI].

  
h1. Prerequisites

To run Mahout on EC2 you need to start up a Hadoop cluster on one or more 
instances of a Hadoop-0.20.2-compatible Amazon Machine Image (AMI). 
Unfortunately, no public AMIs currently support Hadoop-0.20.2, so you will 
have to create one. The following steps begin with a public Cloudera Ubuntu 
AMI that comes with Java installed. You could use any other AMI with Java 
installed, or you could use a clean AMI and install Java yourself. These 
instructions assume some familiarity with Amazon EC2 concepts and 
terminology; see the Amazon EC2 User Guide in References below.

# From the [AWS Management 
Console|https://console.aws.amazon.com/ec2/home#c=EC2&s=Home]/AMIs, start the 
following AMI (_ami-8759bfee_)
{code}
cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090623-x86_64.manifest.xml 
{code}
# From the AWS Console/Instances, select the instance and right-click 'Connect' 
to get the connect string, which contains your <instance public DNS name>
{code}
> ssh -i <gsg-keypair.pem> root@<instance public DNS name>
{code}
# In the root home directory evaluate:
{code}
# apt-get update
# apt-get upgrade    // optional, but probably advisable since the AMI is over a year old
# apt-get install python-setuptools
# easy_install "simplejson==2.0.9"
# easy_install "boto==1.8d"
# apt-get install ant
# apt-get install subversion
# apt-get install maven2
{code}
# Add the following to your .profile
{code}
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
export MAHOUT_HOME=/usr/local/mahout-0.4
export MAHOUT_VERSION=0.4-SNAPSHOT
export MAVEN_OPTS=-Xmx1024m
{code}
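Exports added to .profile only take effect in a new login shell (or after re-sourcing the file). As a quick sanity check, the sketch below warns about anything still unset; the exports are inlined here so the snippet stands alone, but on the instance you would run `. ~/.profile` first instead.

```shell
# Inlined from the .profile additions above so this snippet is self-contained;
# on the instance, run `. ~/.profile` first and drop these four lines.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
export MAHOUT_HOME=/usr/local/mahout-0.4

# Warn about any variable that did not make it into the environment.
for v in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR MAHOUT_HOME; do
  eval "val=\$$v"
  if [ -n "$val" ]; then
    echo "$v=$val"
  else
    echo "WARNING: $v is not set"
  fi
done
```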
# Upload the Hadoop distribution and configure it. This distribution is not 
available on the Hadoop site; you can download a beta version from [Cloudera's 
CDH3 distribution|http://archive.cloudera.com/cdh/3/]
{code}
> scp -i <gsg-keypair.pem> <where>/hadoop-0.20.2.tar.gz root@<instance public DNS name>:.

# tar -xzf hadoop-0.20.2.tar.gz
# mv hadoop-0.20.2 /usr/local/.
{code}
# Configure Hadoop for temporary single node operation
## add the following to $HADOOP_HOME/conf/hadoop-env.sh
{code}
# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
{code}
## add the following to $HADOOP_HOME/conf/core-site.xml and also 
$HADOOP_HOME/conf/mapred-site.xml
{code}
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
        <!-- set to 1 to reduce warnings when 
        running on a single node -->
  </property>
</configuration>
{code}
## set up authorized keys for localhost login w/o passwords and format your 
name node
{code}
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# $HADOOP_HOME/bin/hadoop namenode -format
{code}
# Check out and build Mahout from trunk. Alternatively, you can upload a Mahout 
release tarball and install it as we did with the Hadoop tarball (don't forget 
to update your .profile accordingly).
{code}
# svn co http://svn.apache.org/repos/asf/mahout/trunk mahout 
# cd mahout
# mvn clean install
# cd ..
# mv mahout /usr/local/mahout-0.4
{code}
# Run Hadoop, just to prove you can, and test Mahout by building the Reuters 
dataset on it. Finally, delete the files and shut it down.
{code}
# $HADOOP_HOME/bin/hadoop namenode -format
# $HADOOP_HOME/bin/start-all.sh
# jps     // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
# cd $MAHOUT_HOME
# ./examples/bin/build-reuters.sh

# $HADOOP_HOME/bin/stop-all.sh
# rm -rf /tmp/*                   // delete the Hadoop files
{code}
# Remove the single-host configuration you added to $HADOOP_HOME/conf/core-site.xml 
and $HADOOP_HOME/conf/mapred-site.xml in step 6(b) and verify that you are happy 
with the other conf file settings. The Hadoop startup scripts will not make any 
changes to them. In particular, increasing the Java heap size is required for 
many of the Mahout jobs.
{code}
   // $HADOOP_HOME/conf/mapred-site.xml
   <property>
     <name>mapred.child.java.opts</name>
     <value>-Xmx2000m</value>
   </property>
{code}
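Before bundling the AMI, it is worth confirming the heap setting actually landed in the file. The check below is a small sketch: it writes a throwaway copy of the property into a temp directory so it runs anywhere, but on the instance you would grep the real $HADOOP_HOME/conf/mapred-site.xml instead.

```shell
# Write a throwaway copy of the property so this check is self-contained;
# on the instance, grep $HADOOP_HOME/conf/mapred-site.xml directly.
conf_dir=$(mktemp -d)
cat > "$conf_dir/mapred-site.xml" <<'EOF'
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2000m</value>
  </property>
</configuration>
EOF

# Exactly one matching line means the larger child heap is in place.
matches=$(grep -c 'Xmx2000m' "$conf_dir/mapred-site.xml")
echo "Xmx2000m occurrences: $matches"
rm -rf "$conf_dir"
```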
# Bundle your image into a new AMI, upload it to S3 and register it so it can 
be launched multiple times to construct a Mahout-ready Hadoop cluster. (See 
Amazon's [Preparing And Creating 
AMIs|http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?PreparingAndCreatingAMIs.html]
 for details). 
{code}
// copy your AWS private key file and certificate file to /mnt on your instance (you don't want to leave these around in the AMI)
> scp -i <gsg-keypair.pem> <your AWS cert directory>/*.pem root@<instance public DNS name>:/mnt/.

# Note that ec2-bundle-vol may fail if EC2_HOME is set, so you may want to
# temporarily unset EC2_HOME before running the bundle command. However, the
# shell will need the correct value of EC2_HOME set again before running the
# ec2-register step.

# ec2-bundle-vol -k /mnt/pk*.pem -c /mnt/cert*.pem -u <your-AWS-user_id> -d /mnt -p mahout
# ec2-upload-bundle -b <your-s3-bucket> -m /mnt/mahout.manifest.xml -a <your-AWS-access_key> -s <your-AWS-secret_key>
# ec2-register -K /mnt/pk-*.pem -C /mnt/cert-*.pem <your-s3-bucket>/mahout.manifest.xml
{code}
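If ec2-register succeeds, it prints the id of the new AMI. As a further sanity check (assuming the same EC2 API tools and credentials used above), you can list the images your account owns and look for the mahout manifest:

```shell
> ec2-describe-images -o self
// expect an IMAGE line referencing <your-s3-bucket>/mahout.manifest.xml,
// along with the AMI id you will use to launch instances
```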
h1. Getting Started

# Now you can go back to your AWS Management Console and try launching a single 
instance of your image. Once it launches, make sure you can connect to it, and 
test it by re-running the test code. If you removed the single-host 
configuration added in step 6(b) above, you will need to re-add it before you 
can run this test. To test, run (again):
{code}
# $HADOOP_HOME/bin/hadoop namenode -format
# $HADOOP_HOME/bin/start-all.sh
# jps     // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
# cd $MAHOUT_HOME
# ./examples/bin/build-reuters.sh

# $HADOOP_HOME/bin/stop-all.sh
# rm -rf /tmp/*                   // delete the Hadoop files
{code}

# Now that you have a working Mahout-ready AMI, follow [Hadoop's 
instructions|http://wiki.apache.org/hadoop/AmazonEC2] to configure their 
scripts for your environment.
## edit bin/hadoop-ec2-env.sh, setting the following environment variables:
{code}
AWS_ACCOUNT_ID
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
S3_BUCKET
(and perhaps others depending upon your environment)
{code}
## edit bin/launch-hadoop-master and bin/launch-hadoop-slaves, setting:
{code}
AMI_IMAGE
{code}
## finally, launch your cluster and log in
{code}
> bin/hadoop-ec2 launch-cluster test-cluster 2
> bin/hadoop-ec2 login test-cluster
# ...  
# exit
> bin/hadoop-ec2 terminate-cluster test-cluster     // when you are done with it
{code}
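Once logged in to the master, two quick checks (using only commands that ship with the Hadoop distribution) confirm the cluster actually formed. The expected datanode count below assumes the 2-slave launch above:

```shell
# $HADOOP_HOME/bin/hadoop dfsadmin -report     // 'Datanodes available' should match the slave count (2 here)
# $HADOOP_HOME/bin/hadoop job -list            // an empty job list confirms the JobTracker is answering
```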

h1. Running the Examples
# Submit the Reuters test job
{code}
# cd $MAHOUT_HOME
# ./examples/bin/build-reuters.sh
// the warnings about configuration files do not seem to matter
{code}
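To see what the job actually produced, list its working directories in HDFS. The path below is an assumption (the script runs as root, so output should land under root's HDFS home directory; the exact subdirectory names depend on the build-reuters.sh version):

```shell
# $HADOOP_HOME/bin/hadoop fs -ls /user/root    // lists the directories created by the Reuters job
```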
# See the Mahout [Quickstart] page for more examples
h1. References

[Amazon EC2 User 
Guide|http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html]
[Hadoop's instructions|http://wiki.apache.org/hadoop/AmazonEC2]



h1. Recognition

Some of the information available here was made possible through the "Amazon 
Web Services Apache Projects Testing Program".
