Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "AmazonEC2" page has been changed by SmithRogard:
http://wiki.apache.org/hadoop/AmazonEC2?action=diff&rev1=78&rev2=79

  <<TableOfContents>>
  
  = Running Hadoop on Amazon EC2 =
- 
  [[http://aws.amazon.com/ec2|Amazon EC2]] (Elastic Compute Cloud) is a 
computing service: you allocate a set of hosts, run your application on them, 
and de-allocate the hosts when you are done.  Billing is hourly per host.  
EC2 thus lets you deploy Hadoop on a cluster without having to own and 
operate the hardware; you rent it by the hour instead.
  
  If you run Hadoop on EC2 you might consider using AmazonS3 for accessing job 
data (data transfer between S3 and EC2 instances is free). Initial input 
can be read from S3 when a cluster is launched, and the final output can be 
written back to S3 before the cluster is decommissioned. Intermediate, 
temporary data, only needed between MapReduce passes, is more efficiently 
stored in Hadoop's DFS. See AmazonS3 for more details.
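  
  For example, input can be staged from S3 into HDFS with `distcp` before a 
job runs, and the results copied back afterwards. A minimal sketch, run on 
the master node; the bucket name is a placeholder and your AWS credentials 
are assumed to be configured in ''hadoop-site.xml'':
  {{{
  # bin/hadoop distcp s3://<your-bucket>/input input
  # bin/hadoop jar hadoop-*-examples.jar wordcount input output
  # bin/hadoop distcp output s3://<your-bucket>/output
  }}}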
@@ -17, +16 @@

  '''Note:''' Cloudera also provides its 
[[http://www.cloudera.com/hadoop-ec2|distribution for Hadoop]] as an EC2 AMI 
with single-command deployment and support for Hive/Pig out of the box.
  
  == Preliminaries ==
- 
  === Concepts ===
- 
   * '''Amazon Machine Image (AMI)''', or ''image''.  A bootable Linux image 
with software pre-installed. There are some public Hadoop AMIs that have 
everything you need to run Hadoop in a cluster.
   * '''instance'''.  A host running an AMI.
  
  === Conventions ===
- 
  In this document, command lines that start with '#' are executed on an 
Amazon instance, while command lines starting with a '%' are executed on your 
workstation.
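  
  For example, logging in from your workstation and then checking the 
hostname on the instance would look like this (the key file and public 
hostname are placeholders):
  {{{
  % ssh -i ~/.ec2/id_rsa-gsg-keypair root@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
  # hostname
  }}}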
  
  === Security ===
- 
  Clusters of Hadoop instances are created in a security group. Instances 
within the group have unfettered access to one another. Machines outside the 
group (such as your workstation) can only access instances on port 22 (for 
SSH), port 50030 (for the JobTracker's web interface, permitting one to view 
job status), and port 50060 (for the TaskTracker's web interface, for more 
detailed debugging).
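  
  The scripts create the group and open these ports for you, but if you ever 
need to do it by hand the standard EC2 command line tools will serve; a 
sketch, assuming a group named `hadoop-cluster`:
  {{{
  % ec2-add-group hadoop-cluster -d "Hadoop cluster"
  % ec2-authorize hadoop-cluster -p 22
  % ec2-authorize hadoop-cluster -p 50030
  % ec2-authorize hadoop-cluster -p 50060
  }}}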
  
  ('''Pre Hadoop 0.17''') These EC2 scripts require slave nodes to be able to 
establish SSH connections to the master node (and vice versa). This is achieved 
after the cluster has launched by copying the EC2 private key to all machines 
in the cluster.
@@ -36, +31 @@

  == Setting up ==
   * Unpack [[http://www.apache.org/dyn/closer.cgi/hadoop/core/|the latest 
Hadoop distribution]] on your system (version 0.12.0 or later).
   * Edit all relevant variables in ''src/contrib/ec2/bin/hadoop-ec2-env.sh'' 
(example settings are sketched after this list).
-    * Amazon Web Services variables (`AWS_ACCOUNT_ID`, `AWS_ACCESS_KEY_ID`, 
`AWS_SECRET_ACCESS_KEY`)
+   * Amazon Web Services variables (`AWS_ACCOUNT_ID`, `AWS_ACCESS_KEY_ID`, 
`AWS_SECRET_ACCESS_KEY`)
-      * All need filling in - they can be found by logging in to 
http://aws.amazon.com/.
+    * All need filling in - they can be found by logging in to 
http://aws.amazon.com/.
-      * `AWS_ACCOUNT_ID` is your 12 digit account number.
+    * `AWS_ACCOUNT_ID` is your 12-digit account number.
-    * Security variables (`EC2_KEYDIR`, `KEY_NAME`, `PRIVATE_KEY_PATH`, 
`SSH_OPTS`)
+   * Security variables (`EC2_KEYDIR`, `KEY_NAME`, `PRIVATE_KEY_PATH`, 
`SSH_OPTS`)
-      * The defaults should be OK if you followed Amazon Getting Started 
guide, except `PRIVATE_KEY_PATH` which needs changing if you don't store this 
with your other EC2 keys.
+    * The defaults should be OK if you followed Amazon's Getting Started 
Guide, except `PRIVATE_KEY_PATH`, which needs changing if you don't store 
this key with your other EC2 keys.
-    * AMI selection (`HADOOP_VERSION`, `S3_BUCKET`)
+   * AMI selection (`HADOOP_VERSION`, `S3_BUCKET`)
-      * These two variables control which AMI is used.
+    * These two variables control which AMI is used.
-      * To see which versions are publicly available type: {{{
+    * To see which versions are publicly available, type:
+    {{{
  % ec2-describe-images -x all | grep hadoop
  }}}
-      * The default value for `S3_BUCKET` (`hadoop-ec2-images`) is for public 
images. Images for Hadoop version 0.17.1 and later are in the `hadoop-images` 
bucket, so you should change this variable if you want to use one of these 
images. You also need to change this if you want to use a private image you 
have built yourself.      
+    * The default value for `S3_BUCKET` (`hadoop-ec2-images`) is for public 
images. Images for Hadoop version 0.17.1 and later are in the `hadoop-images` 
bucket, so you should change this variable if you want to use one of these 
images. You also need to change this if you want to use a private image you 
have built yourself.
-    * ('''Pre 0.17''') Hadoop cluster variables (`GROUP`, `MASTER_HOST`, 
`NO_INSTANCES`)
+   * ('''Pre 0.17''') Hadoop cluster variables (`GROUP`, `MASTER_HOST`, 
`NO_INSTANCES`)
-      * `GROUP` specifies the private group to run the cluster in. Typically 
the default value is fine.
+    * `GROUP` specifies the private group to run the cluster in. Typically the 
default value is fine.
-      * `MASTER_HOST` is the hostname of the master node in the cluster. You 
need to set this to be a hostname that you have DNS control over - it needs 
resetting every time a cluster is launched. Services such as 
[[http://www.dyndns.com/services/dns/dyndns/|DynDNS]] and 
[[http://developer.amazonwebservices.com/connect/thread.jspa?messageID=61609#61609|the
 like]] make this fairly easy.
+    * `MASTER_HOST` is the hostname of the master node in the cluster. You 
need to set this to be a hostname that you have DNS control over - it needs 
resetting every time a cluster is launched. Services such as 
[[http://www.dyndns.com/services/dns/dyndns/|DynDNS]] and 
[[http://developer.amazonwebservices.com/connect/thread.jspa?messageID=61609#61609|the
 like]] make this fairly easy.
-      * `NO_INSTANCES` sets the number of instances in your cluster. You need 
to set this. Currently Amazon limits the number of concurrent instances to 20.
+    * `NO_INSTANCES` sets the number of instances in your cluster. You need to 
set this. Currently Amazon limits the number of concurrent instances to 20.
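  
  As a rough illustration, a filled-in ''hadoop-ec2-env.sh'' covering the 
variables above might contain settings like these (every value is a 
placeholder, not a real credential):
  {{{
  AWS_ACCOUNT_ID=123456789012
  AWS_ACCESS_KEY_ID=<your-access-key-id>
  AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
  EC2_KEYDIR=~/.ec2
  KEY_NAME=gsg-keypair
  PRIVATE_KEY_PATH=~/.ec2/id_rsa-gsg-keypair
  HADOOP_VERSION=0.17.1
  S3_BUCKET=hadoop-images
  # Pre 0.17 only:
  GROUP=hadoop-cluster
  MASTER_HOST=hadoop-master.example.com
  NO_INSTANCES=2
  }}}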
  
  == Running a job on a cluster (Pre 0.17) ==
   * Open a command prompt in ''src/contrib/ec2''.
-  * Launch a EC2 cluster and start Hadoop with the following command. During 
execution of this script you will be prompted to set up DNS. {{{
+  * Launch an EC2 cluster and start Hadoop with the following command. During 
execution of this script you will be prompted to set up DNS.
+  {{{
  % bin/hadoop-ec2 run
  }}}
   * You will then be logged into the master node where you can start your job.
-    * For example, to test your cluster, try {{{
+   * For example, to test your cluster, try
+   {{{
  # cd /usr/local/hadoop-*
  # bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
  }}}
   * You can check progress of your job at `http://<MASTER_HOST>:50030/`.
-  * You can login to the master node from your workstation by typing: {{{
+  * You can login to the master node from your workstation by typing:
+  {{{
  % bin/hadoop-ec2 login
  }}}
   * When you have finished, shut down the cluster with the following:
-    * For Hadoop 0.14.0 and newer:{{{
+   * For Hadoop 0.14.0 and newer:
+   {{{
  % bin/hadoop-ec2 terminate-cluster
  }}}
-    * For Hadoop 0.13.1 and older: /!\ '''NB: this command will terminate 
''all'' your EC2 instances. See 
[[https://issues.apache.org/jira/browse/HADOOP-1504|HADOOP-1504]].'''{{{
+   * For Hadoop 0.13.1 and older: /!\ '''NB: this command will terminate 
''all'' your EC2 instances. See 
[[https://issues.apache.org/jira/browse/HADOOP-1504|HADOOP-1504]].'''
+   {{{
  % bin/hadoop-ec2 terminate
  }}}
  
  == Running a job on a cluster (0.17+) ==
   * Open a command prompt in ''src/contrib/ec2''.
-  * Launch a EC2 cluster and start Hadoop with the following command. You must 
supply a cluster name (test-cluster) and the number of slaves (2). After the 
cluster boots, the public DNS name will be printed to the console. {{{
+  * Launch an EC2 cluster and start Hadoop with the following command. You 
must supply a cluster name (test-cluster) and the number of slaves (2). After 
the cluster boots, the public DNS name will be printed to the console.
+  {{{
  % bin/hadoop-ec2 launch-cluster test-cluster 2
  }}}
-  * You can login to the master node from your workstation by typing: {{{
+  * You can login to the master node from your workstation by typing:
+  {{{
  % bin/hadoop-ec2 login test-cluster
  }}}
   * You will then be logged into the master node where you can start your job.
-    * For example, to test your cluster, try {{{
+   * For example, to test your cluster, try
+   {{{
  # cd /usr/local/hadoop-*
  # bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
  }}}
   * You can check progress of your job at `http://<MASTER_HOST>:50030/`, 
where MASTER_HOST is the hostname returned when the cluster started, above.
-  * When you have finished, shutdown the cluster with the following:{{{
+  * When you have finished, shut down the cluster with the following:
+  {{{
  % bin/hadoop-ec2 terminate-cluster test-cluster
  }}}
-  * Keep in mind that the master node is started first and configured, then 
all slaves nodes are booted simultaneously with boot parameters pointing to the 
master node. Even though the `lauch-cluster` command has returned, the whole 
cluster may not have yet 'booted'. You should monitor the cluster via port 
50030 to make sure all nodes are up. 
+  * Keep in mind that the master node is started and configured first, then 
all slave nodes are booted simultaneously with boot parameters pointing to 
the master node. Even though the `launch-cluster` command has returned, the 
whole cluster may not yet have 'booted'. You should monitor the cluster via 
port 50030 to make sure all nodes are up.
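  
  One way to wait, sketched on the assumption that `curl` is installed on 
your workstation, is to poll the JobTracker's status page until it responds 
(checking that the expected number of nodes is listed is still a manual 
step):
  {{{
  % until curl -sf http://<MASTER_HOST>:50030/ > /dev/null; do sleep 10; done
  }}}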
  
  <<Anchor(FromRemoteMachine)>>
+ 
  === Running a job on a cluster from a remote machine (0.17+) ===
  In some cases it's desirable to be able to submit a job to a Hadoop cluster 
running in EC2 from a machine that's outside EC2 (for example a personal 
workstation). Similarly, it's convenient to be able to browse/cat files in 
HDFS from a remote machine. One of the advantages of this setup is that it 
obviates the need to create custom AMIs that bundle stock Hadoop AMIs and 
user libraries/code. All the non-Hadoop code can be kept on the remote 
machine and made available to Hadoop at job submission time (in the form of 
jar files and other files that are copied into Hadoop's distributed cache). 
The only downside is the [[http://aws.amazon.com/ec2/#pricing|cost of copying 
these data sets]] into EC2 and the latency involved in doing so.
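  
  The details follow, but the general shape of such a submission, run from 
the workstation, is sketched here. The `-fs`, `-jt` and `-libjars` generic 
options are standard Hadoop options (the job driver must use `ToolRunner` 
for them to be honoured); the jar names, driver class and ports are 
placeholders, so check the ports against your cluster's configuration:
  {{{
  % bin/hadoop jar my-job.jar MyJobDriver \
      -fs hdfs://<master-host>:50001/ -jt <master-host>:50002 \
      -libjars my-deps.jar input output
  }}}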
  
@@ -102, +108 @@

  == Troubleshooting (Pre 0.17) ==
  Running Hadoop on EC2 involves a high level of configuration, so it can take 
a few goes to get the system working for your particular setup.
  
- If you are having problems with the Hadoop EC2 `run` command then you can run 
the following in turn, which have the same effect but may help you to see where 
the problem is occurring: {{{
+ If you are having problems with the Hadoop EC2 `run` command then you can 
run the following commands in turn, which together have the same effect but 
may help you to see where the problem is occurring:
+ 
+ {{{
  % bin/hadoop-ec2 launch-cluster
  % bin/hadoop-ec2 start-hadoop
  }}}
- 
  Currently, the scripts don't have much in the way of error detection or 
handling. If a script produces an error, then you may need to use the Amazon 
EC2 tools to interact with instances directly - for example, to shut down an 
instance that is misconfigured.
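  
  For instance, finding and then terminating a stuck instance by hand might 
look like this (the instance ID is a placeholder):
  {{{
  % ec2-describe-instances
  % ec2-terminate-instances i-xxxxxxxx
  }}}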
  
  Another technique for debugging is to manually run the scripts line by line 
until the error occurs. If you have feedback or suggestions, or need help, 
then please use the Hadoop mailing lists.
@@ -114, +121 @@

  == Troubleshooting (0.17) ==
  Running Hadoop on EC2 involves a high level of configuration, so it can take 
a few goes to get the system working for your particular setup.
  
- If you are having problems with the Hadoop EC2 `launch-cluster` command then 
you can run the following in turn, which have the same effect but may help you 
to see where the problem is occurring: {{{
+ If you are having problems with the Hadoop EC2 `launch-cluster` command 
then you can run the following commands in turn, which together have the same 
effect but may help you to see where the problem is occurring:
+ 
+ {{{
  % bin/hadoop-ec2 launch-master <cluster-name>
  % bin/hadoop-ec2 launch-slaves <cluster-name> <num slaves>
  }}}
- 
  Note that you can call the `launch-slaves` command as many times as 
necessary to grow your cluster. Shrinking a cluster is trickier and should be 
done by hand (after rebalancing file replication, etc.).
  
- To browse all your nodes via a web browser, starting at the 50030 status 
page, start the following command in a new shell window: {{{
+ To browse all your nodes via a web browser, starting at the 50030 status 
page, run the following command in a new shell window:
+ 
+ {{{
  % bin/hadoop-ec2 proxy <cluster-name>
  }}}
- 
  This command will start a SOCKS tunnel through your master node, and print 
out all the URLs you can reach from your web browser. For this to work, you 
must configure your browser to send requests over SOCKS to the local proxy on 
port 6666. The Firefox plugin FoxyProxy is great for this.
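  
  As a quick check that the tunnel is up before touching browser settings, 
you could fetch a status page through the proxy with curl, assuming your 
curl build has SOCKS support (the hostname is one of those printed by the 
proxy command):
  {{{
  % curl --socks5-hostname localhost:6666 http://<master-host>:50030/
  }}}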
  
  Currently, the scripts don't have much in the way of error detection or 
handling. If a script produces an error, then you may need to use the Amazon 
EC2 tools to interact with instances directly - for example, to shut down an 
instance that is misconfigured.
@@ -137, +146 @@

  The public images should be sufficient for most needs; however, there are 
circumstances in which you might like to build your own image, perhaps 
because an image with the version of Hadoop you want isn't available (an 
older version, the latest trunk version, or a patched version), or because 
you want to run extra software on your instances.
  
  === Design ===
- 
- Here is a high-level outline of how the scripts for creating a Hadoop AMI 
work. For details, please see the scripts' sources (linked to below). 
+ Here is a high-level outline of how the scripts for creating a Hadoop AMI 
work. For details, please see the scripts' sources (linked to below).
  
   1. The main script, 
[[http://svn.apache.org/viewvc/hadoop/core/trunk/src/contrib/ec2/bin/create-hadoop-image?view=co|create-hadoop-image]], 
starts a Fedora Core Amazon AMI.
   1. Once the Fedora instance has launched, ''create-hadoop-image'' copies 
the environment variables file 
([[http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/contrib/ec2/bin/hadoop-ec2-env.sh.template?view=co|hadoop-ec2-env.sh]]) 
and the scripts to run on the Fedora instance 
([[http://svn.apache.org/viewvc/hadoop/core/trunk/src/contrib/ec2/bin/image/create-hadoop-image-remote?view=co|create-hadoop-image-remote]] 
and 
[[http://svn.apache.org/viewvc/hadoop/core/trunk/src/contrib/ec2/bin/image/hadoop-init?view=co|hadoop-init]]), 
then logs into the Fedora instance and runs ''create-hadoop-image-remote''.
   1. The script ''create-hadoop-image-remote'' then installs Java, the tools 
required to run Hadoop, and Hadoop itself. It then configures Hadoop:
-    * In EC2, the local data volume is mounted as ''/mnt'', so logs are 
written under here.
+   * In EC2, the local data volume is mounted as ''/mnt'', so logs are 
written under that directory.
-    * ''hadoop-init'' is installed as an init script to be run on instance 
start up. This takes advantage of an EC2 feature called 
[[http://developer.amazonwebservices.com/connect/entry.jspa?externalID=531&categoryID=100|parameterized
 launches]]. For Hadoop, this allows the master hostname and the cluster size 
to be retrieved when the Hadoop instance starts - this information is used to 
finish the Hadoop configuration.
+   * ''hadoop-init'' is installed as an init script to be run on instance 
start-up. This takes advantage of an EC2 feature called 
[[http://developer.amazonwebservices.com/connect/entry.jspa?externalID=531&categoryID=100|parameterized 
launches]]. For Hadoop, this allows the master hostname and the cluster size 
to be retrieved when the Hadoop instance starts - this information is used to 
finish the Hadoop configuration. (A sketch of the mechanism follows this 
list.)
   1. Finally, ''create-hadoop-image-remote'' bundles the machine as an AMI 
and uploads it to S3. (Particular care has to be taken to ensure that no 
secrets, such as private keys, are bundled in the AMI. See 
[[http://docs.amazonwebservices.com/AmazonEC2/dg/2006-10-01/public-ami-guidelines.html|here]] 
for more details.) The AMI is stored in the bucket named by the variable 
`$S3_BUCKET`, under the name `hadoop-$HADOOP_VERSION`.
   1. Control then returns to ''create-hadoop-image'' which registers the image 
with EC2.
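  
  The parameterized-launch mechanism amounts to reading the instance's user 
data at boot from EC2's standard metadata address. A minimal sketch of what 
an init script like ''hadoop-init'' might do; the comma-separated parameter 
format here is an assumption, not necessarily the script's actual format:
  {{{
  USER_DATA=`wget -q -O - http://169.254.169.254/latest/user-data`
  MASTER_HOST=`echo "$USER_DATA" | cut -d, -f1`
  CLUSTER_SIZE=`echo "$USER_DATA" | cut -d, -f2`
  }}}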
  
  === Building a stock Hadoop image ===
- 
   * Edit all relevant variables in ''src/contrib/ec2/bin/hadoop-ec2-env.sh''.
-    * AMI selection (`HADOOP_VERSION`, `S3_BUCKET`)
+   * AMI selection (`HADOOP_VERSION`, `S3_BUCKET`)
-      * When creating an AMI, `HADOOP_VERSION` is used to select which version 
of Hadoop to download and install from 
http://www.apache.org/dist/lucene/hadoop/.
+    * When creating an AMI, `HADOOP_VERSION` is used to select which version 
of Hadoop to download and install from 
http://www.apache.org/dist/lucene/hadoop/.
-      * Change `S3_BUCKET` to be a bucket you own that you want to store the 
Hadoop AMI in.
+    * Change `S3_BUCKET` to be a bucket you own that you want to store the 
Hadoop AMI in.
-    * ('''0.17''') AMI size selection (`INSTANCE_TYPE`)
+   * ('''0.17''') AMI size selection (`INSTANCE_TYPE`)
-      * When creating an AMI, `INSTANCE_TYPE` denotes the instance size the 
image will be run on (small, large, or xlarge). Ultimately this decides if the 
image is `i386` or `x86_64`, so this value is also used on cluster startup.
+    * When creating an AMI, `INSTANCE_TYPE` denotes the instance size the 
image will be run on (small, large, or xlarge). Ultimately this determines 
whether the image is `i386` or `x86_64`, so this value is also used on 
cluster startup.
-    * Java variables
+   * Java variables
-      * `JAVA_BINARY_URL` is the download URL for a Sun JDK. Visit the 
[[http://java.sun.com/javase/downloads/index.jsp|Sun Java downloads page]], 
select a recent stable JDK, and get the URL for the JDK (not JRE) labelled 
"Linux self-extracting file".
+    * `JAVA_BINARY_URL` is the download URL for a Sun JDK. Visit the 
[[http://java.sun.com/javase/downloads/index.jsp|Sun Java downloads page]], 
select a recent stable JDK, and get the URL for the JDK (not JRE) labelled 
"Linux self-extracting file".
-      * `JAVA_VERSION` is the version number of the JDK to be installed.
+    * `JAVA_VERSION` is the version number of the JDK to be installed.
-    * All other variables should be set as above.
+   * All other variables should be set as above.
-  * Type {{{
+  * Type:
+  {{{
  % bin/hadoop-ec2 create-image
  }}}
   * Accept the Java license terms.
@@ -170, +178 @@

  If you need to repeat this procedure to re-create an AMI then you will need 
to run `ec2-deregister` to de-register the existing AMI. You might also want 
to use the `ec2-delete-bundle` command to remove the AMI from S3 if you no 
longer need it.
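  
  A sketch of that cleanup, with a placeholder AMI ID and the bucket and 
prefix conventions described above (check the `ec2-delete-bundle` options 
against your version of the AMI tools):
  {{{
  % ec2-deregister ami-xxxxxxxx
  % ec2-delete-bundle -b <your-bucket> -p hadoop-<version> \
      -a <your-access-key-id> -s <your-secret-access-key>
  }}}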
  
  === Building a customized Hadoop image ===
- 
  If you want to build an image with a version of Hadoop that is not available 
from the Apache distribution site (e.g. trunk, or a patched version) then you 
will need to alter the ''create-hadoop-image-remote'' script to retrieve and 
install your required version of Hadoop. Similarly, if you wish to install 
other software on your image then the same script is the place to do it.
  
  === Making an image public ===
  Since there are already public Hadoop AMIs available, you shouldn't need to 
do this. (Please at least consider discussing it on the developer mailing 
list first.) Furthermore, you should only do this if you are sure you have 
produced a secure AMI.
+ 
  {{{
  % ec2-modify-image-attribute AMI -l -a all
  }}}
  where `AMI` is the ID of the AMI you want to publish.
  
- See 
[[http://developer.amazonwebservices.com/connect/entry.jspa?entryID=530&ref=featured|Introduction
 to Sharing AMIs]] for more details. 
+ See 
[[http://developer.amazonwebservices.com/connect/entry.jspa?entryID=530&ref=featured|Introduction
 to Sharing AMIs]] for more details.
  
  == Resources ==
- 
   * Amazon EC2 [[http://aws.amazon.com/ec2/|Homepage]], 
[[http://docs.amazonwebservices.com/AmazonEC2/gsg/2007-01-03/|Getting Started 
Guide]], [[http://docs.amazonwebservices.com/AmazonEC2/dg/2007-01-03/|Developer 
Guide]], 
[[http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=100|Articles
 and Tutorials]].
   * [[http://aws.typepad.com/aws/|AWS blog]]
   * 
[[http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873&categoryID=112|Running
 Hadoop MapReduce on Amazon EC2 and Amazon S3]] by Tom White, Amazon Web 
Services Developer Connection, July 2007
   * 
[[http://www.manamplified.org/archives/2008/03/notes-on-using-ec2-s3.html|Notes 
on Using EC2 and S3]] Details on FoxyProxy setup, and other things to watch out 
for.
  
