Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by JoydeepSensarma:
http://wiki.apache.org/hadoop/AmazonEC2

------------------------------------------------------------------------------
   * Keep in mind that the master node is started and configured first; then all slave nodes are booted simultaneously with boot parameters pointing to the master node. Even though the `launch-cluster` command has returned, the whole cluster may not yet have booted. You should monitor the cluster via the master's web UI on port 50030 to make sure all nodes are up.
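The monitoring step above can be scripted. A minimal sketch, assuming the JobTracker web UI on port 50030 answers plain HTTP once the master is up; the helper names and the host name in the usage note are placeholders, not part of the stock EC2 scripts:

```shell
# Build the JobTracker web UI URL for a given master host.
jt_url() {
  # The JobTracker web UI listens on port 50030 of the master node.
  echo "http://$1:50030/"
}

# Poll the web UI until it answers, or give up after a bounded number
# of tries (default 30 tries, 10 seconds apart).
wait_for_jobtracker() {
  host=$1
  tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    if curl -s --max-time 5 -o /dev/null "$(jt_url "$host")"; then
      echo "JobTracker web UI is up at $(jt_url "$host")"
      return 0
    fi
    tries=$((tries - 1))
    sleep 10
  done
  echo "JobTracker web UI did not come up" >&2
  return 1
}

# Usage (substitute your master's public DNS name):
#   wait_for_jobtracker ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```

Note that an HTTP response only tells you the JobTracker is up; you still need to check the node count on the page itself to confirm all slaves have registered.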
  
  === Running a job on a cluster from a remote machine (0.17+) ===
  In some cases it's desirable to be able to submit a job to a Hadoop cluster running in EC2 from a machine outside EC2 (for example, a personal workstation). Similarly, it's convenient to be able to browse/cat files in HDFS from a remote machine. One advantage of this setup is that it obviates the need to create custom AMIs that bundle stock Hadoop AMIs with user libraries/code. All the non-Hadoop code can be kept on the remote machine and made available to Hadoop at job submission time (in the form of jar files and other files that are copied into Hadoop's distributed cache). The only downsides are the [http://aws.amazon.com/ec2/#pricing cost of copying these data sets] into EC2 and the latency involved in doing so.
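A remote submission along these lines might look as follows. This is a sketch, not a prescribed invocation: `myjob.jar`, `MyJobDriver`, `deps.jar`, and `lookup.dat` are placeholder names, and the `-libjars`/`-files` generic options are only honored when the driver parses its arguments via `ToolRunner`/`GenericOptionsParser`:

```shell
# Submit from the remote workstation: the job jar stays local, extra
# jars are shipped via -libjars, and auxiliary files via -files; both
# end up in Hadoop's distributed cache on the cluster side.
hadoop jar myjob.jar MyJobDriver \
  -libjars deps.jar \
  -files lookup.dat \
  input output
```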
  
  The recipe for doing this is well documented in [http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/ this Cloudera blog post] and involves configuring Hadoop to use an SSH tunnel through the master Hadoop node. In addition, this recipe only works when using EC2 scripts from versions of Hadoop that incorporate the fix for [https://issues.apache.org/jira/browse/HADOOP-5839 HADOOP-5839]. (Alternatively, users can apply the patches from that JIRA to older versions of Hadoop that do not have the fix.)
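The core of that recipe can be sketched in two steps: open a SOCKS proxy over SSH to the master, then point Hadoop's RPC layer at it via `org.apache.hadoop.net.SocksSocketFactory`. The host name and local port below are placeholders; consult the blog post for the full configuration:

```shell
# 1. Open a SOCKS proxy on local port 6666 tunneled through the EC2
#    master node (replace the hostname with your master's public DNS):
ssh -D 6666 -f -N root@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# 2. Then, in hadoop-site.xml on the remote machine, route Hadoop RPC
#    through that proxy by setting:
#      hadoop.rpc.socket.factory.class.default =
#          org.apache.hadoop.net.SocksSocketFactory
#      hadoop.socks.server = localhost:6666
```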
  
  == Troubleshooting (Pre 0.17) ==
  Running Hadoop on EC2 involves a high level of configuration, so it can take a few goes to get the system working for your particular setup.
