The following page has been changed by JoydeepSensarma:
http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely

= Querying S3 files from your PC (using EC2, Hive and Hadoop) =

== Usage Scenario ==
The scenario being covered here goes as follows:
 * A user has data stored in S3 - for example, Apache log files archived in the 
cloud, or databases backed up into S3.
 * The user would like to declare tables over these data sets and issue SQL 
queries against them.
 * These SQL queries should be executed using compute resources provisioned 
from EC2. Ideally, the compute resources can be provisioned in proportion to 
the compute cost of the queries.
 * Results from such queries that need to be retained for the long term can be 
stored back in S3.

This tutorial walks through the steps required to accomplish this.

== Required Software ==
On the client side (PC), the following are required:
 * Any version of Hive that incorporates 
[[https://issues.apache.org/jira/browse/HIVE-467 HIVE-467]]. (As of this 
writing, the relevant patches are not committed. For convenience, a Hive 
distribution with this patch can be downloaded from 
[[http://jsensarma.com/downloads/hive-s3-ec2.tar.gz here]].)
 * A version of the Hadoop EC2 scripts ({{{src/contrib/ec2/bin}}}) with a fix for 
[[https://issues.apache.org/jira/browse/HADOOP-5839 HADOOP-5839]]. Again, since the 
relevant patches are not committed yet, a version of the Hadoop-19 EC2 scripts 
with the relevant patches applied can be downloaded from 
[[http://jsensarma.com/downloads/hadoop-0.19-ec2-remote.tar.gz]]. These scripts 
must be used to launch Hadoop clusters in EC2.
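Once the scripts are unpacked, a cluster can be launched with their standard driver. A sketch, assuming the usual {{{hadoop-ec2}}} entry point from {{{src/contrib/ec2/bin}}}; the cluster name {{{hive-cluster}}} and the size of two slave nodes are illustrative choices, not prescribed by this page:
{{{
# launch a master and two slaves under the cluster name "hive-cluster"
bin/hadoop-ec2 launch-cluster hive-cluster 2
}}}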

Hive requires a local Hadoop installation to run (specified using the environment 
variable {{{HADOOP_HOME}}}). This should be a version of Hadoop compatible with the 
one running on the EC2 clusters. This recipe has been tried with a Hadoop 
distribution created from branch-19.

It is assumed that the user can successfully launch Hive CLI ({{{bin/hive}}} 
from the Hive distribution) at this point.
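For example (the installation paths below are illustrative placeholders):
{{{
# point Hive at the local Hadoop installation, then start the CLI
export HADOOP_HOME=/path/to/hadoop-0.19
cd /path/to/hive
bin/hive
}}}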

== Hive Setup ==
A few Hadoop configuration variables are required to be specified for all Hive 
sessions. These can be set using the Hive CLI as follows:
{{{
set hadoop.socks.server=localhost:2600;
set hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory;
set hadoop.job.ugi=root,root;
set mapred.map.tasks=40;
set mapred.reduce.tasks=-1;
set fs.s3n.awsSecretAccessKey=2GAHKWG3+1wxcqyhpj5b1Ggqc0TIxj21DKkidjfz;
set fs.s3n.awsAccessKeyId=1B5JYHPQCXW13GWKHAG2;
}}}

The values assigned to the s3n keys are just examples and need to be filled in by 
the user as per their account details. Explanations for the rest of the values 
can be found in the [#ConfigHell Configuration Guide] section below.

Instead of specifying these settings each time the CLI is brought up, we 
can store them persistently in {{{hive-site.xml}}} in the {{{conf/}}} 
directory of the Hive installation (from where they will be picked up each time 
the CLI is launched).
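For instance, the two SOCKS-related settings translate into {{{hive-site.xml}}} entries like the following; the remaining {{{set}}} commands follow the same name/value pattern:
{{{
<configuration>
  <property>
    <name>hadoop.socks.server</name>
    <value>localhost:2600</value>
  </property>
  <property>
    <name>hadoop.rpc.socket.factory.class.default</name>
    <value>org.apache.hadoop.net.SocksSocketFactory</value>
  </property>
  <!-- hadoop.job.ugi, mapred.*, and fs.s3n.* properties follow the same pattern -->
</configuration>
}}}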

== Example Public Data Sets ==
Some example data files are provided in the S3 bucket {{{data.s3ndemo.hive}}}. 
We will use them for the SQL examples in this tutorial:
 * s3n://data.s3ndemo.hive/kv - key-value pairs in a text file
 * s3n://data.s3ndemo.hive/pkv - key-value pairs in directories that are 
partitioned by date
 * s3n://data.s3ndemo.hive/tpch/* - eight directories containing data 
corresponding to the eight tables used by the [[http://tpc.org/tpch/ TPCH 
benchmark]]. The data is generated for a scale 10 (approximately 10GB) database 
using the standard {{{dbgen}}} utility provided by TPCH.

== Setting up tables (DDL Statements) ==
In this example - we will use HDFS as the default table store for Hive. We will 
make Hive tables over the files in S3 using the {{{external tables}}} 
functionality in Hive. Executing DDL commands does not require a functioning 
Hadoop cluster (since we are just setting up metadata):

 * Declare a simple table containing key-value pairs:
{{{create external table kv (key int, values string)  location 
's3n://data.s3ndemo.hive/kv';}}}
 * Declare a partitioned table over a nested directory containing key-value 
pairs and associate table partitions with dirs:
{{{
create external table pkv (key int, values string) partitioned by (insertdate 
string);
alter table pkv add partition (insertdate='2008-01-01') location 
's3n://data.s3ndemo.hive/pkv/2008-01-01';
}}}
 * Declare a table over a TPCH table:
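A sketch of such a declaration for the TPCH {{{lineitem}}} table - assuming the data directory under {{{tpch/}}} is named after the table and the files are '|'-delimited as produced by {{{dbgen}}} (both are assumptions, not confirmed by this page):
{{{
-- assumes a dbgen-style '|'-delimited lineitem data set at the location below
create external table lineitem (
  l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int,
  l_quantity double, l_extendedprice double, l_discount double, l_tax double,
  l_returnflag string, l_linestatus string,
  l_shipdate string, l_commitdate string, l_receiptdate string,
  l_shipinstruct string, l_shipmode string, l_comment string)
row format delimited fields terminated by '|'
location 's3n://data.s3ndemo.hive/tpch/lineitem';
}}}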
  


== Appendix ==
[[Anchor(ConfigHell)]]
=== Configuration Guide ===
The socket-related options allow the Hive CLI to communicate with the Hadoop 
cluster over an SSH tunnel (that will be established later). 
{{{hadoop.job.ugi}}} is specified to avoid issues with permissions on HDFS. The 
{{{mapred.map.tasks}}} setting is a hack that works around 
[[https://issues.apache.org/jira/browse/HADOOP-5861 HADOOP-5861]] and may need 
to be set higher for large clusters. {{{mapred.reduce.tasks}}} is set to -1 to 
let Hive determine the number of reducers (see 
[[https://issues.apache.org/jira/browse/HIVE-490 HIVE-490]]).
