The following page has been changed by JoydeepSensarma:
http://wiki.apache.org/hadoop/Hive/HiveAws

------------------------------------------------------------------------------
  
This document explores the different ways of leveraging Hive on Amazon Web Services - namely [[http://aws.amazon.com/s3 S3]], [[http://aws.amazon.com/EC2 EC2]] and [[http://aws.amazon.com/elasticmapreduce/ Elastic Map-Reduce]].
  
Hadoop already has a long tradition of being run on EC2 and S3. These are well documented in the links below, which are a must read:
 * [[http://wiki.apache.org/hadoop/AmazonS3 Hadoop and S3]]
 * [[http://wiki.apache.org/hadoop/AmazonEC2 Hadoop and EC2]]
  
The second document also has pointers on how to get started using EC2 and S3. For people who are new to S3, there are a few helpful notes in the [#S3n00b S3 for n00bs section] below. The rest of the documentation assumes that the reader can launch a Hadoop cluster in EC2, copy files into and out of S3, and run some simple Hadoop jobs.
  
== Introduction to Hive and AWS ==
There are three separate questions to consider when running Hive on AWS:
 1. Where to run the [[http://wiki.apache.org/hadoop/Hive/LanguageManual/Cli Hive CLI]] from and where to store the metastore db (which contains table and schema definitions).
 1. How to define Hive tables over existing datasets (potentially those that are already in S3).
 1. How to dispatch Hive queries (which are all executed using one or more map-reduce programs) to a Hadoop cluster running in EC2.
  
We walk you through the choices involved here and show some practical case studies that contain detailed setup and configuration instructions.
  
== Running the Hive CLI ==
The Hive CLI environment is completely independent of Hadoop. The CLI takes in queries, compiles them into a plan consisting of map-reduce jobs and then submits them to a Hadoop cluster. For this reason the CLI can be run from any node that has a Hive distribution, a Java Runtime Environment and connectivity to the Hadoop cluster. The Hive CLI also needs access to table metadata. By default this is persisted by Hive via an embedded Derby database into a folder named metastore_db on the local file system (however, state can be persisted in any database, including remote MySQL instances).
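As an illustration, the sketch below shows the kind of hive-site.xml entries that would point the metastore at a remote MySQL instance instead of the embedded Derby database. The host name, database name and credentials are placeholders, and the MySQL JDBC driver jar is assumed to be on Hive's classpath.
{{{
<!-- hive-site.xml (sketch): use a remote MySQL metastore instead of embedded Derby -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
}}}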
  
There are two choices on where to run the Hive CLI from:
  
 1. Run the Hive CLI from within EC2 - the Hadoop master node being the obvious choice. There are several problems with this approach:
  * Lack of comprehensive AMIs that bundle different versions of Hive and Hadoop distributions (and the difficulty of doing so considering the large number of such combinations). [[http://www.cloudera.com/hadoop-ec2 Cloudera]] provides some AMIs that bundle Hive with Hadoop - although the choice of Hive and Hadoop versions may be restricted.
  * Any required map-reduce scripts may also need to be copied to the master/Hive node.
  * If the default Derby database is used, then one has to think about persisting state beyond the lifetime of one Hadoop cluster. S3 is an obvious choice, but the user must restore the Hive metadata when a new cluster is launched and back it up before the cluster is terminated.
  
 2. Run the Hive CLI remotely from outside EC2. In this case the user installs a Hive distribution on a personal workstation; the main trick with this option is connecting to the Hadoop cluster, both for submitting jobs and for reading and writing files to HDFS. The section on [[http://wiki.apache.org/hadoop/AmazonEC2#FromRemoteMachine Running jobs from a remote machine]] details how this can be done. [wiki:/HivingS3nRemotely Case Study 1] goes into the setup for this in more detail. This option solves the problems mentioned above:
  * Stock Hadoop AMIs can be used. The user can run any version of Hive on their workstation, launch a Hadoop cluster with the desired Hadoop version on EC2 and start running queries.
  * Map-reduce scripts are automatically pushed by Hive into Hadoop's distributed cache at job submission time and do not need to be copied to the Hadoop machines.
  * Hive metadata can be stored on local disk painlessly.
  
However, the one downside of Option 2 is that jar files are copied over to the Hadoop cluster for each map-reduce job. This can cause high latency in job submission as well as incur some AWS network transmission costs. Option 1 seems suitable for advanced users who have figured out a stable Hadoop and Hive (and potentially external libraries) configuration that works for them and can create a new AMI with it.
  
== Loading Data into Hive Tables ==
It is useful to go over the main storage choices for a Hadoop/EC2 environment:
  
 * S3 is an excellent place to store data for the long term. There are a couple of choices on how S3 can be used:
  * Data can be stored as files within S3 using tools like aws and s3curl, as detailed in the [#S3n00b S3 for n00bs section]. This suffers from the 5GB limit on object size in S3. But the nice thing is that there are probably scores of tools that can help in copying/replicating data to S3 in this manner. Hadoop is able to read/write such files using the S3N filesystem.
  * Alternatively, Hadoop provides a block-based file system that uses S3 as a backing store. This does not suffer from the 5GB maximum file size restriction. However, Hadoop utilities and libraries must be used to read/write such files.
  
 * HDFS instance on the local drives of the machines in the Hadoop cluster. Its lifetime is restricted to that of the Hadoop instance, hence it is not suitable for long-lived data. However, it provides much faster data access and hence is a good choice for intermediate/tmp data.
  
Considering these factors, the following makes sense in terms of Hive tables:
 1. For long-lived tables, use S3 based storage mechanisms.
 2. For intermediate data and tmp tables, use HDFS.
  
[wiki:/HivingS3nRemotely Case Study 1] shows you how to achieve such an arrangement using the S3N filesystem.
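As a rough sketch of this arrangement, the statements below define a long-lived table over files already sitting in S3 (via the S3N filesystem) and an intermediate table in the cluster's HDFS. The bucket name, key prefix, column layout and field delimiter are all placeholders - see the case study for a setup that is known to work.
{{{
-- long-lived table over existing files in S3 ('mybucket' and the key prefix are placeholders)
CREATE EXTERNAL TABLE page_views (dt STRING, userid STRING, page STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://mybucket/page_views/';

-- intermediate/tmp table kept in the cluster's HDFS instance
CREATE TABLE tmp_page_view_counts (page STRING, cnt BIGINT);

FROM page_views
INSERT OVERWRITE TABLE tmp_page_view_counts
SELECT page, COUNT(1)
GROUP BY page;
}}}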
  
If the user is running the Hive CLI from their personal workstation, they can also use Hive's 'load data local' commands as a convenient alternative (to dfs commands) for copying data from the local filesystems accessible from their workstation into tables defined over either HDFS or S3.
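For example, a session along the following lines copies a file from the workstation's local filesystem into a table (the table name and the local path are placeholders):
{{{
CREATE TABLE kv (key INT, value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- '/tmp/kv1.txt' is a file on the machine running the Hive CLI, not a file in HDFS or S3
LOAD DATA LOCAL INPATH '/tmp/kv1.txt' OVERWRITE INTO TABLE kv;
}}}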
== Submitting jobs to a Hadoop cluster ==
This applies particularly when the Hive CLI is run remotely. A single Hive CLI session can switch across different Hadoop clusters (especially as clusters are brought up and terminated). Only two configuration variables:
 * fs.default.name
 * mapred.job.tracker
need to be changed to point the CLI from one Hadoop cluster to another. Beware though that tables stored in a previous HDFS instance will not be accessible as the CLI switches from one cluster to another. Again, more details can be found in [wiki:/HivingS3nRemotely Case Study 1].
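For instance, after bringing up a new cluster, a running CLI session can be re-pointed at it with something like the following. The host name is a placeholder for the new Hadoop master, and the ports depend on how the cluster was configured.
{{{
set fs.default.name=hdfs://ec2-12-34-56-78.compute-1.amazonaws.com:50001;
set mapred.job.tracker=ec2-12-34-56-78.compute-1.amazonaws.com:50002;
}}}
Alternatively, the same two variables can be changed in the Hadoop configuration used by the CLI before it is started.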
+ 
== Case Studies ==
 1. [wiki:/HivingS3nRemotely Querying files in S3 using EC2, Hive and Hadoop]

== Appendix ==
  
[[Anchor(S3n00b)]]
=== S3 for n00bs ===
One of the things useful to understand is how S3 is normally used as a file system. Each S3 bucket can be considered the root of a file system. Files within this filesystem become objects stored in S3, where the path name of the file (path components joined with '/') becomes the S3 key within the bucket and the file contents become the value. Tools like [[https://addons.mozilla.org/en-US/firefox/addon/3247 S3Fox]] and the native S3 FileSystem in Hadoop (s3n) show a directory structure that's implied by the common prefixes found in the keys. Not all tools are able to create an empty directory; S3Fox is one that can (it does so by creating an empty key representing the directory). Other popular tools like [[http://timkay.com/aws/ aws]], [[http://s3tools.org/s3cmd s3cmd]] and [[http://developer.amazonwebservices.com/connect/entry.jspa?externalID=128 s3curl]] provide convenient ways of accessing S3 from the command line, but don't have the capability of creating empty directories.
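As a concrete illustration of this key/path mapping, consider the commands below. The bucket and key names are placeholders, and AWS credentials are assumed to be configured for the s3n filesystem (for example via the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties).
{{{
# copying a local file to s3n://mybucket/logs/2009-06-01/access.log creates an S3 object
# in bucket 'mybucket' whose key is 'logs/2009-06-01/access.log'
hadoop fs -put access.log s3n://mybucket/logs/2009-06-01/access.log

# listing the prefix then shows 'logs/2009-06-01/' as a directory implied by the common key prefix
hadoop fs -ls s3n://mybucket/logs/
}}}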
  
