[Lucene-hadoop Wiki] Update of "AmazonEC2" by TomWhite

Apache Wiki Sun, 25 Feb 2007 13:50:46 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for 
change notification.


The following page has been changed by TomWhite:
http://wiki.apache.org/lucene-hadoop/AmazonEC2

The comment on the change is:
Changes as per HADOOP-1033

------------------------------------------------------------------------------
+ [[TableOfContents]]
+ 
  = Running Hadoop on Amazon EC2 =
  
- [http://www.amazon.com/gp/browse.html?node=201590011 Amazon EC2] (Elastic 
Compute Cloud) is a computing service.  One allocates a set of hosts, and runs 
ones's application on them, then, when done, de-allocates the hosts.  Billing 
is hourly per host.  Thus EC2 permits one to deploy Hadoop on a cluster without 
having to own and operate that cluster, but rather renting it on an hourly 
basis.
+ [http://aws.amazon.com/ec2 Amazon EC2] (Elastic Compute Cloud) is a computing 
service.  One allocates a set of hosts, and runs one's application on them, 
then, when done, de-allocates the hosts.  Billing is hourly per host.  Thus EC2 
permits one to deploy Hadoop on a cluster without having to own and operate 
that cluster, but rather renting it on an hourly basis.
  
- This document assumes that you have already followed the steps in 
[http://docs.amazonwebservices.com/AmazonEC2/gsg/2006-06-26/ Amazon's Getting 
Started Guide].
+ If you run Hadoop on EC2 you might consider using AmazonS3 for accessing job 
data (data transfer to and from S3 from EC2 instances is free). Initial input 
can be read from S3 when a cluster is launched, and the final output can be 
written back to S3 before the cluster is decomissioned. Intermediate, temporary 
data, only needed between MapReduce passes, is more efficiently stored in 
Hadoop's DFS. See AmazonS3 for more details.
  
- There are now some [#AutomatedScripts scripts] available for running Hadoop 
on EC2.
+ This document assumes that you have already followed the steps in 
[http://docs.amazonwebservices.com/AmazonEC2/gsg/2007-01-03/ Amazon's Getting 
Started Guide]. In particular, you should have run through the sections 
"Setting up an Account", "Setting up the Tools" and the "Generating a Keypair" 
section of "Running an Instance".
  
- == Concepts ==
+ Note that the older, manual step-by-step guide to getting Hadoop running on 
EC2 can be found 
[http://wiki.apache.org/lucene-hadoop/AmazonEC2?action=recall&rev=10 here].
  
-  * '''Abstract Machine Image (AMI)''', or ''image''.  A bootable Linux image, 
with software pre-installed.
+ == Preliminaries ==
+ 
+ === Concepts ===
+ 
+  * '''Amazon Machine Image (AMI)''', or ''image''.  A bootable Linux image, 
with software pre-installed. There are some public Hadoop AMIs that have 
everything you need to run Hadoop in a cluster.
   * '''instance'''.  A host running an AMI.
  
- == Conventions ==
+ === Conventions ===
  
  In this document, commands lines that start with '#' are executed on an 
Amazon instance, while command lines starting with a '%' are executed on your 
workstation.
-  
- == Building an Image ==
  
- To use Hadoop it is easiest to build an image with most of the software you 
require already installed.  Amazon provides 
[http://developer.amazonwebservices.com/connect/entry.jspa?externalID=354&categoryID=87
 good documentation] for building images.  Follow the "Getting Started" guide 
there to learn how to install the EC2 command-line tools, etc.
+ === Security ===
  
- To build an image for Hadoop:
+ Clusters of Hadoop instances are created in a security group. Instances 
within the group have unfettered access to one another. Machines outside the 
group (such as your workstation), can only access instance on port 22 (for 
SSH), port 50030 (for the JobTracker's web interface, permitting one to view 
job status), and port 50060 (for the TaskTracker's web interface, for more 
detailed debugging).
  
-  1. Run an instance of the fedora base image.
+ Hadoop requires slave nodes to be able to establish SSH connections to the 
master node (and vice versa). This is achieved after the cluster has launched 
by copying the EC2 private key to all machines in the cluster.
  
-  1. Login to this instance (using ssh).
+ == Setting up ==
+  * Unpack the latest Hadoop distribution on your system (version 0.12.0 or 
later: until this is available use the latest 
[http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/lastSuccessfulBuild/artifact/trunk/build/hadoop-0.11.3-dev.tar.gz
 nightly build]).
+  * Edit all relevant variables in ''src/contrib/ec2/bin/hadoop-ec2-env.sh''.
+    * Amazon Web Services variables (`AWS_ACCOUNT_ID`, `AWS_ACCESS_KEY_ID`, 
`AWS_SECRET_ACCESS_KEY`)
+      * All need filling in - they can be found by logging in to 
http://aws.amazon.com/.
+      * `AWS_ACCOUNT_ID` is your 12 digit account number.
+    * Security variables (`EC2_KEYDIR`, `KEY_NAME`, `PRIVATE_KEY_PATH`, 
`SSH_OPTS`)
+      * The defaults should be OK if you followed Amazon Getting Started 
guide, except `PRIVATE_KEY_PATH` which needs changing if you don't store this 
with your other EC2 keys.
+    * AMI selection (`HADOOP_VERSION`, `S3_BUCKET`)
+      * These two variables control which AMI is used.
+      * To see which versions are publicly available type: {{{
+ % ec2-describe-images -x all | grep hadoop
+ }}}
+      * The default value for `S3_BUCKET` (`hadoop-ec2-images`) is for public 
images. You normally only need to change this if you want to use a private 
image you have built yourself.      
+    * Hadoop cluster variables (`GROUP`, `MASTER_HOST`, `NO_INSTANCES`)
+      * `GROUP` specifies the private group to run the cluster in. Typically 
the default value is fine.
+      * `MASTER_HOST` is the hostname of the master node in the cluster. You 
need to set this to be a hostname that you have DNS control over - it needs 
resetting every time a cluster is launched. Services such as 
[http://www.dyndns.com/services/dns/dyndns/ DynDNS] make this fairly easy.
+      * `NO_INSTANCES` sets the number of instances in your cluster. You need 
to set this. Currently Amazon limits the number of concurrent instances to 20.
  
-  1. Install [http://java.sun.com/javase/downloads/index.jsp Java].  Copy the 
link of the "Linux self-extracting file" from Sun's download page, then use 
'''wget''' to retrieve the JVM.  Unpack this in a well-known location, like 
/usr/local. {{{
+ == Running a job on a cluster ==
+  * Open a command prompt in ''src/contrib/ec2''.
+  * Launch a EC2 cluster and start Hadoop with the following command. During 
execution of this script you will be prompted to set up DNS. {{{
+ % bin/hadoop-ec2 run
+ }}}
+  * You will then be logged into the master node where you can start your job.
+    * For example, to test your cluster, try {{{
- # cd /usr/local
+ # cd /usr/local/hadoop-*
- # wget -O java.bin http://.../jdk-1_5_0_09-linux-i586.bin
- # sh java.bin
- # rm java.bin
+ # bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
+ }}}
+  * You can check progress of your job at `http://<MASTER_HOST>:50030/`.
+  * You can login to the master node from your workstation by typing: {{{
+ % bin/hadoop-ec2 login
+ }}}
+  * When you have finished, shutdown the cluster with {{{
+ % bin/hadoop-ec2 terminate
  }}}
  
-  1. Install rsync. {{{
- # yum install rsync
+ == Troubleshooting ==
+ Running Hadoop on EC2 involves a high level of configuration, so it can take 
a few goes to get the system working for your particular set up.
+ 
+ If you are having problems with the Hadoop EC2 `run` command then you can run 
the following in turn, which have the same effect but may help you to see where 
the problem is occurring: {{{
+ % bin/hadoop-ec2 launch-cluster
+ % bin/hadoop-ec2 start-hadoop
  }}}
  
+ Currently, the scripts don't have much in the way of error detection or 
handling. If a script produces an error, then you may need to use the Amazon 
EC2 tools for interacting with instances directly - for example, to shutdown an 
instance that is mis-configured.
-  1. (Optional) install other tools you might need. {{{
- # yum install emacs
- # yum install subversion
- }}}  To install [http://ant.apache.org/ Ant], copy the download URL from the 
website, and then: {{{
- # cd /usr/local
- # wget http://.../apache-ant-1.6.5-bin.tar.gz
- # tar xzf apache-ant-1.6.5-bin.tar.gz
- # rm apache-ant-1.6.5-bin.tar.gz
- }}}
  
+ Another technique for debugging is to manually run the scripts line-by-line 
until the error occurs. If you have feedback or suggestions, or need help then 
please use the Hadoop mailing lists.
-  1. Install Hadoop.{{{
- # cd /usr/local
- # wget http://.../hadoop-X.X.X.tar.gz
- # tar xzf hadoop-X.X.X.tar.gz
- # rm hadoop-X.X.X.tar.gz
- }}}
  
-  1. Configure Hadoop (described below).
+ == Building your own Hadoop image ==
+ The public images should be sufficient for most needs, however there are 
circumstances where you would like to build your own images, perhaps because an 
image with the version of Hadoop you want isn't available (an older version, 
the latest trunk version, or a patched version), or because you want to run 
extra software on your instances.
  
-  1. Edit shell config files to, e.g., add executables to your PATH, and 
perform other configurations that will make this system easy for you to use.  
For example, you may wish to add a non-root user account to run Hadoop's 
daemons.
+ === Design ===
  
+ Here is a high-level outline of how the scripts for creating a Hadoop AMI 
work. For details, please see the scripts' sources (linked to below). 
-  1. Start Hadoop, and test your configuration minimally. {{{
- # cd /usr/local/hadoop-X.X.X
- # bin/start-all.sh
- # bin/hadoop jar hadoop-0.7.2-examples.jar pi 10 10000000
- # bin/hadoop dfs -ls
- # bin/stop-all.sh
- }}}
  
-  1. Create a new image, using Amazon's instructions (bundle, upload & 
register).
+  1. The main script, 
[http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/contrib/ec2/bin/create-hadoop-image?view=co
 create-hadoop-image] starts a Fedora core Amazon AMI.
+  1. Once the Fedora instance has launched ''create-hadoop-image'' copies the 
environment variables file 
([http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/contrib/ec2/bin/hadoop-ec2-env.sh.template?view=co
 hadoop-ec2-env.sh]) and scripts to run on the Fedora instance 
([http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/contrib/ec2/bin/image/create-hadoop-image-remote?view=co
 create-hadoop-image-remote] and 
[http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/contrib/ec2/bin/image/hadoop-init?view=co
 hadoop-init]) then it logs into the Fedora instance and runs 
''create-hadoop-image-remote''.
+  1. The script ''create-hadoop-image-remote'' then installs Java, tools 
required to run Hadoop, and Hadoop itself. Then it configures Hadoop:
+    * In EC2, the local data volume is mounted as ''/mnt'', so logs are 
written under here.
+    * ''hadoop-init'' is installed as an init script to be run on instance 
start up. This takes advantage of an EC2 feature called 
[http://developer.amazonwebservices.com/connect/entry.jspa?externalID=531&categoryID=100
 parameterized launches]. For Hadoop, this allows the master hostname and the 
cluster size to be retrieved when the Hadoop instance starts - this information 
is used to finish the Hadoop configuration.
+  1. Finally, ''create-hadoop-image-remote'' bundles the machine as an AMI, 
and uploads it to S3. (Particular care has to be taken to ensure that no 
secrets, such as private keys, are bundled in the AMI. See 
[http://docs.amazonwebservices.com/AmazonEC2/dg/2006-10-01/public-ami-guidelines.html
 here] for more details.) The AMI is stored in a bucket named by the variable 
`$S3_BUCKET` and with the name `hadoop-$HADOOP_VERSION`.
+  1. Control then returns to ''create-hadoop-image'' which registers the image 
with EC2.
  
- == Configuring Hadoop ==
+ === Building a stock Hadoop image ===
  
- Hadoop is configured with a single master node and many slave nodes.  To 
facilliate re-deployment without re-configuration, one may register a name in 
DNS for the master host, then reset the address for this name each time the 
cluster is re-deployed.  Services such as 
[http://www.dyndns.com/services/dns/dyndns/ DynDNS] make this fairly easy.  In 
the following, we refer to the master as '''master.mydomain.com'''.  Please 
replace this with your actual master node's name.
- 
- In EC2, the local data volume is mounted as '''/mnt'''.
- 
- 
- === hadoop-env.sh ===
- 
- Specify the JVM location, the log directory and that rsync should be used to 
update slaves from the master.
- 
- {{{
- # Set Hadoop-specific environment variables here.
- 
- # The java implementation to use.  Required.
- export JAVA_HOME=/usr/local/jdk1.5.0_09
- 
- # Where log files are stored.  $HADOOP_HOME/logs by default.
- export HADOOP_LOG_DIR=/mnt/hadoop/logs
- 
- # host:path where hadoop code should be rsync'd from.  Unset by default.
- export HADOOP_MASTER=master.mydomain.com:/usr/local/hadoop-X.X.X
- 
- # Seconds to sleep between slave commands.  Unset by default.  This
- # can be useful in large clusters, where, e.g., slave rsyncs can
- # otherwise arrive faster than the master can service them.
- export HADOOP_SLAVE_SLEEP=1
- }}}
- 
- You must also create the log directory.
- 
- {{{
- % mkdir -p /mnt/hadoop/logs
- }}}
- 
- === hadoop-site.xml ===
- 
- All of Hadoop's local data is stored relative to '''hadoop.tmp.dir''', so we 
only need specify this, plus the name of the master node for DFS (the NameNode) 
and MapReduce (the JobTracker).
- 
- {{{
- <configuration>
- 
- <property>
-   <name>hadoop.tmp.dir</name>
-   <value>/mnt/hadoop</value>
- </property>
- 
- <property>
-   <name>fs.default.name</name>
-   <value>master.mydomain.com:50001</value>
- </property>
- 
- <property>
-   <name>mapred.job.tracker</name>
-   <value>master.mydomain.com:50002</value>
- </property>
- 
- </configuration>
- }}}
- 
- 
- === mapred-default.xml ===
- 
- This should vary with the size of your cluster.  Typically 
'''mapred.map.tasks''' should be 10x the number of instances, and 
'''mapred.reduce.tasks''' should be 3x the number of instances.  The following 
is thus configured for a 19-instance cluster.
- 
- {{{
- <configuration>
- 
- <property>
-   <name>mapred.map.tasks</name>
-   <value>190</value>
- </property>
- 
- <property>
-   <name>mapred.reduce.tasks</name>
-   <value>57</value>
- </property>
- 
- </configuration>
- }}}
- 
- == Security ==
- 
- To access your cluster, you must enable access from at least port 22, for 
ssh.  Generally it is also useful to open a few other ports, to view job 
progress.  Port 50030 is used for the JobTracker's web interface, permitting 
one to view job status, and port 50060 is used by the TaskTracker's web 
interface, for more detailed debugging.
- 
- {{{
- % ec2-add-group my-group
- % ec2-authorize my-group -p 22
- % ec2-authorize my-group -p 50030
- % ec2-authorize my-group -p 50060
- }}}
- 
- Instances within the cluster should have unfettered access to one another.  
This is enabled by the following, replacing the XXXXXXXXXXX with your Amazon 
web services user id.
- 
- {{{
- % ec2-authorize my-group -o my-group -u XXXXXXXXXXX
- }}}
- 
- == Launching your cluster ==
- 
- Start by allocating instances of your image.  Use '''ec2-describe-images''' 
to find the your image id, notated as ami-XXXXXXXX below. 
- 
- To run a 20-node cluster:
- 
- {{{
- % ec2-describe-images
- % ec2-run-instances ami-XXXXXXX -k gsg-keypair -g my-group -n 20
- }}}
- 
- Wait a few minutes for the instances to launch.
- 
- {{{
- % ec2-describe-instances
- }}}
- 
- Once instances are launched, register the first host listed in DNS with your 
master host name.
- 
- Create a slaves file containing the rest of the instances and copy it to the 
master.
- 
- {{{
- % ec2-describe-instances | grep INSTANCE | cut -f 4 | tail +2 > slaves
- % scp slaves master.mydomain.com:/usr/local/hadoop-X.X.X/conf/slaves
- }}}
- 
- Format the new cluster's filesystem.
- 
- {{{
- % ssh master.mydomain.com
- # /usr/local/hadoop-X.X.X/bin/hadoop namenode -format
- }}}
- 
- Start the cluster.
- 
- {{{
- # /usr/local/hadoop-X.X.X/bin/start-all.sh
- }}}
- 
- Visit your cluster's web ui at http://master.mydomain.com:50030/.
- 
- == Shutting down your cluster ==
- 
- {{{
- % ec2-terminate-instances `ec2-describe-instances | grep INSTANCE | cut -f 2`
- }}}
- 
- 
- == Future Work ==
- 
- Ideally Hadoop could directly access job data from 
[http://www.amazon.com/gp/browse.html?node=16427261 Amazon S3] (Simple Storage 
Service).  Initial input could be read from S3 when a cluster is launched, and 
the final output could be written back to S3 before the cluster is 
decomissioned.  Intermediate, temporary data, only needed between MapReduce 
passes, would be more efficiently stored in Hadoop's DFS. From Hadoop 0.10.1 
onwards there is an implementation of a Hadoop 
[http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/fs/FileSystem.html 
FileSystem] for S3. See ["AmazonS3"].
- 
- [[Anchor(AutomatedScripts)]]
- = Automated Scripts =
- 
- == Setting up ==
-  * Make sure you've followed the Amazon EC2 
[http://docs.amazonwebservices.com/AmazonEC2/gsg/2006-10-01/ Getting Started 
Guide], sections "Setting up an Account", "Setting up the Tools" and the 
"Generating a Keypair" section of "Running an Instance".
-  * Unpack the Hadoop distribution on your system (version 0.11.0 or later).
-  * Open a command prompt in ''src/contrib/ec2''.
-  * Edit all relevant variables in ''bin/hadoop-ec2-env.sh''.
+  * Edit all relevant variables in ''src/contrib/ec2/bin/hadoop-ec2-env.sh''.
-    * You need to get a Java download URL by visiting 
[http://java.sun.com/javase/downloads/index_jdk5.jsp here]. Make sure you get 
the JDK (not JRE) labelled "Linux self-extracting file". (The scripts have not 
been tested with Java 6 yet.)
- 
- == Creating an image ==
- You should only need to do this once for each version of Hadoop you want to 
create an image for. You can use the same image to run different job. (If you 
need to repeat this for the same Hadoop version number then you will need to 
run `ec2-deregister` to de-register the existing AMI.) 
+    * AMI selection (`HADOOP_VERSION`, `S3_BUCKET`)
+      * When creating an AMI, `HADOOP_VERSION` is used to select which version 
of Hadoop to download and install from 
http://www.apache.org/dist/lucene/hadoop/.
+      * Change `S3_BUCKET` to be a bucket you own that you want to store the 
Hadoop AMI in.
+    * Java variables
+      * `JAVA_BINARY_URL` is the download URL for a Sun JDK. Visit the 
[http://java.sun.com/javase/downloads/index_jdk5.jsp Sun Java 5 downloads page] 
and get the URL for the JDK (not JRE) labelled "Linux self-extracting file".
+      * `JAVA_VERSION` is the version number of the JDK to be installed.
+    * All other variables should be set as above.
   * Type {{{
- bin/create-hadoop-image
+ % bin/hadoop-ec2 create-image
  }}}
   * Accept the Java license terms.
   * The script will create a new image, then bundle, upload and register it. 
This may take some time (20 minutes or more). Be patient - don't assume it's 
crashed!
   * Terminate your instance using the command given by the script.
  
- == Running a job on a cluster ==
-  * During execution of this script you will be prompted to set up DNS (e.g. 
[http://www.dyndns.com/services/dns/dyndns/ DynDNS]). {{{
- bin/run-hadoop-cluster 
+ If you need to repeat this procedure to re-create an AMI then you will need 
to run `ec2-deregister` to de-register the existing AMI. You might also want to 
use `ec2-delete-bundle` command to remove the AMI from S3 if you no longer need 
it.
+ 
+ === Building a customized Hadoop image ===
+ 
+ If you want to build an image with a version of Hadoop that is not available 
from the Apache distribution site (e.g. trunk, or a patched version) then you 
will need to alter the ''create-hadoop-image-remote'' script to retrieve and 
install your required version of Hadoop. Similarly, if you wish to install 
other software on your image then the same script is the place to do it.
+ 
+ === Making an image public ===
+ Since there are already public Hadoop AMIs available you shouldn't need to do 
this. (At least consider discussing it on the developer mailing list first, 
please.) Furthermore, you should only do this if you are sure you have produced 
a secure AMI.
+ {{{
+ % ec2-modify-image-attribute AMI -l -a all
  }}}
+ where `AMI` is the ID of the AMI you want to publish.
-  * You will then be logged into the master node where you can start your job.
-    * For example, to test your cluster, try {{{
- cd /usr/local/hadoop-*
- bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
- }}}
-  * You can check progress of your job at `http://<MASTER_HOST>:50030/`.
-  * When you have finished logout of the master node by typing `exit`, then 
shutdown the cluster with {{{
- bin/terminate-hadoop-cluster
- }}}
  
+ See 
[http://developer.amazonwebservices.com/connect/entry.jspa?entryID=530&ref=featured
 Introduction to Sharing AMIs] for more details. 
+ 
+ == Resources ==
+ 
+  * Amazon EC2 [http://aws.amazon.com/ec2 Homepage], 
[http://docs.amazonwebservices.com/AmazonEC2/gsg/2007-01-03/ Getting Started 
Guide], [http://docs.amazonwebservices.com/AmazonEC2/dg/2007-01-03/ Developer 
Guide], 
[http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=100  
Articles and Tutorials].
+  * [http://aws.typepad.com/aws/ AWS blog]
+

[Lucene-hadoop Wiki] Update of "AmazonEC2" by TomWhite

Reply via email to