Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "Hive/HiveAws/HivingS3nRemotely" page has been changed by JamesByers.
The comment on this change is: clean up linebreaks and whitespace in code 
samples; remove extra link brackets; redact real EC2 IPs.
http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely?action=diff&rev1=9&rev2=10

--------------------------------------------------

  
  == Required Software ==
  On the client side (PC), the following are required:
-  * Any version of Hive that incorporates 
[[[https://issues.apache.org/jira/browse/HIVE-467|HIVE-467]]]. (As of this 
writing - the relevant patches are not committed. For convenience sake - a Hive 
distribution with this patch can be downloaded from 
[[[http://jsensarma.com/downloads/hive-s3-ec2.tar.gz|here]]].)
+  * Any version of Hive that incorporates [[https://issues.apache.org/jira/browse/HIVE-467|HIVE-467]]. (As of this writing - the relevant patches are not committed. For convenience's sake, a Hive distribution with this patch can be downloaded from [[http://jsensarma.com/downloads/hive-s3-ec2.tar.gz|here]].)
-  * A version of Hadoop ec2 scripts (src/contrib/ec2/bin) with a fix for 
[[[https://issues.apache.org/jira/browse/HADOOP-5839]]]. Again - since the 
relevant patches are not committed yet - a version of Hadoop-19 ec2 scripts 
with the relevant patches applied can be downloaded from 
[[[http://jsensarma.com/downloads/hadoop-0.19-ec2-remote.tar.gz|here]]]. These 
scripts must be used to launch hadoop clusters in EC2.
+  * A version of the Hadoop ec2 scripts (src/contrib/ec2/bin) with a fix for [[https://issues.apache.org/jira/browse/HADOOP-5839]]. Again - since these patches are not committed yet - a version of the Hadoop-0.19 ec2 scripts with them applied can be downloaded from [[http://jsensarma.com/downloads/hadoop-0.19-ec2-remote.tar.gz|here]]. These scripts must be used to launch hadoop clusters in EC2.
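
  Both downloads above are plain tarballs. Assuming they have been fetched to the client PC, they can be unpacked anywhere convenient (the directory layout inside each tarball may vary), for example:
  {{{
  $ tar xzf hive-s3-ec2.tar.gz
  $ tar xzf hadoop-0.19-ec2-remote.tar.gz
  }}}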
  
  Hive requires a local directory of Hadoop to run (specified using the environment variable HADOOP_HOME). This can be any version of Hadoop compatible with the one running on the EC2 cluster. This recipe has been tried with a hadoop distribution created from branch-19.
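
  For example, if the client-side Hadoop tree was unpacked to {{{~/hadoop-0.19}}} (an illustrative path - use wherever it actually lives on your PC), the environment could be set up as:
  {{{
  $ export HADOOP_HOME=~/hadoop-0.19
  $ export PATH=$HADOOP_HOME/bin:$PATH   # optional, just for convenience
  }}}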
  
@@ -22, +22 @@

  
  == Hive Setup ==
  A few Hadoop configuration variables must be specified for all Hive sessions. These can be set using the hive cli as follows:
+ {{{
- {{{hive> set hadoop.socks.server=localhost:2600; 
+ hive> set hadoop.socks.server=localhost:2600; 
  hive> set 
hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory;
  hive> set hadoop.job.ugi=root,root;
  hive> set mapred.map.tasks=40;
  hive> set mapred.reduce.tasks=-1;
  hive> set fs.s3n.awsSecretAccessKey=2GAHKWG3+1wxcqyhpj5b1Ggqc0TIxj21DKkidjfz;
- hive> set fs.s3n.awsAccessKeyId=1B5JYHPQCXW13GWKHAG2}}}
+ hive> set fs.s3n.awsAccessKeyId=1B5JYHPQCXW13GWKHAG2;
+ }}}
  
  '''The values assigned to the s3n keys above are just an example and need to be filled in by the user as per their account details.''' An explanation of the rest of the values can be found in the [[#ConfigHell|Configuration Guide]] section below.
  
@@ -38, +40 @@

  Some example data files are provided in the S3 bucket {{{data.s3ndemo.hive}}}. We will use them for the sql examples in this tutorial (a listing example follows the list below):
   * {{{s3n://data.s3ndemo.hive/kv}}} - Key Value pairs in a text file
   * {{{s3n://data.s3ndemo.hive/pkv}}} - Key Value pairs in directories that are partitioned by date
-  * {{{s3n://data.s3ndemo.hive/tpch/*}}} - Eight directories containing data 
corresponding to the eight tables used by [[[http://tpc.org/tpch/|TPCH 
benchmark]]]. The data is generated for a scale 10 (approx 10GB) database using 
the standard {{{dbgen}}} utility provided by TPCH.
+  * {{{s3n://data.s3ndemo.hive/tpch/*}}} - Eight directories containing data 
corresponding to the eight tables used by [[http://tpc.org/tpch/|TPCH 
benchmark]]. The data is generated for a scale 10 (approx 10GB) database using 
the standard {{{dbgen}}} utility provided by TPCH.
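
  These paths can also be browsed from the client with the Hadoop fs shell. A sketch (the AWS keys are placeholders for your own, and this assumes the bucket permissions allow listing):
  {{{
  $ bin/hadoop fs -D fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY_ID \
      -D fs.s3n.awsSecretAccessKey=YOUR_SECRET_ACCESS_KEY \
      -ls s3n://data.s3ndemo.hive/tpch/
  }}}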
  
  == Setting up tables (DDL Statements) ==
  In this example - we will use HDFS as the default table store for Hive. We 
will make Hive tables over the files in S3 using the {{{external tables}}} 
functionality in Hive. Executing DDL commands does not require a functioning 
Hadoop cluster (since we are just setting up metadata):
  
   * Declare a simple table containing key-value pairs:
+  . {{{
-  . {{{hive> create external table kv (key int, values string)  location 
's3n://data.s3ndemo.hive/kv';}}}
+ hive> create external table kv (key int, values string)  location 
's3n://data.s3ndemo.hive/kv';
+ }}}
   * Declare a partitioned table over a nested directory containing key-value 
pairs and associate table partitions with dirs:
   . {{{
  hive> create external table pkv (key int, values string) partitioned by 
(insertdate string);
@@ -53, +57 @@

   * Declare a table over a TPCH table:
   . {{{
  hive> create external table lineitem (
- l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int, l_quantity 
double,
+   l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int, l_quantity 
double,
- l_extendedprice double, l_discount double, l_tax double, l_returnflag string, 
+   l_extendedprice double, l_discount double, l_tax double, l_returnflag 
string, 
- l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate 
string,
+   l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate 
string,
- l_shipinstruct string, l_shipmode string, l_comment string) 
+   l_shipinstruct string, l_shipmode string, l_comment string) 
- row format delimited fields terminated by '|' 
+   row format delimited fields terminated by '|' 
- location 's3n://data.s3ndemo.hive/tpch/lineitem';
+   location 's3n://data.s3ndemo.hive/tpch/lineitem';
  }}}
    
- The TPCH DDL statements are slightly modified versions of the original TPCH 
statements (since Hive does not support all the data types used in TPCH). All 
the TPCH DDL statements for Hive can be be found 
[[[attachment:TpchDdlForHive.sql|here]]]
+ The TPCH DDL statements are slightly modified versions of the original TPCH statements (since Hive does not support all the data types used in TPCH). All the TPCH DDL statements for Hive can be found [[attachment:TpchDdlForHive.sql|here]].
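
  If the Hive CLI in use supports the {{{-f}}} option, the attached file can be downloaded and the whole set of DDL statements run in one go instead of being pasted individually, e.g.:
  {{{
  $ bin/hive -f TpchDdlForHive.sql
  }}}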
  
  == Executing Queries ==
  Hive can execute some queries without a Hadoop cluster. For example:
+ {{{
- {{{hive> select * from kv limit 10; }}}
+ hive> select * from kv limit 10;
+ }}}
  
  {{{select *}}} queries with limit clauses can be performed locally on the 
Hive CLI itself. If you are doing this - please note that:
   * {{{fs.default.name}}} should be set to {{{file:///}}} if the CLI is not configured to use a working Hadoop cluster (as shown below)
   * '''Please, please do not select all the rows from large data sets.''' This will cause a large amount of data to be downloaded from S3 to outside AWS and incur charges on the account hosting these data sets!
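
  For such local-only runs, the first setting is a one-liner in the same session as the settings above:
  {{{
  hive> set fs.default.name=file:///;
  }}}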
  
  Of course - the real fun is in doing some non-trivial queries using 
map-reduce. For this we will need a Hadoop cluster (finally!):
-  1. Start a Hadoop cluster on EC2 (using directions from 
[[[http://wiki.apache.org/hadoop/AmazonEC2|Hadoop-EC2 tutorial]] - but making 
sure to use a version of ec2 scripts with HADOOP-5839 applied! User is free to 
allocate any number of nodes they wish - although this tutorial was tried out 
with 10 nodes.
+  1. Start a Hadoop cluster on EC2 (using directions from the [[http://wiki.apache.org/hadoop/AmazonEC2|Hadoop-EC2 tutorial]] - but making sure to use a version of the ec2 scripts with HADOOP-5839 applied!). Users are free to allocate any number of nodes they wish - although this tutorial was tried out with 10 nodes.
   1. Note down the public hostname of the master node. For example, the public hostname may be something like:
-  . {{{ec2-75-101-170-131.compute-1.amazonaws.com}}}
+  . {{{ec2-12-34-56-78.compute-1.amazonaws.com}}}
   1.#3 Point the Hive CLI to use this Hadoop cluster by executing:
+  . {{{
-  . {{{hive> set 
fs.default.name=hdfs://ec2-75-101-170-131.compute-1.amazonaws.com:50001;
+ hive> set 
fs.default.name=hdfs://ec2-12-34-56-78.compute-1.amazonaws.com:50001;
- hive> set 
mapred.job.tracker=ec2-75-101-170-131.compute-1.amazonaws.com:50002;}}}
+ hive> set mapred.job.tracker=ec2-12-34-56-78.compute-1.amazonaws.com:50002;
+ }}}
   1.#4 Set up an ssh tunnel via port 2600 to the Hadoop master (the same port that {{{hadoop.socks.server}}} was pointed at earlier). This can be done by executing the following from another terminal/window:
-  . {{{linux> ssh -i <path to Hadoop private key path> -D 2600 
ec2-75-101-170-131.compute-1.amazonaws.com}}}
+  . {{{$ ssh -i <path to Hadoop private key> -D 2600 ec2-12-34-56-78.compute-1.amazonaws.com}}}
  
  Now we are all set up. The sample query from TPCH (1.sql) can be tried as follows:
  
  {{{
  hive> insert overwrite directory '/tmp/tpcresults-1.sql' 
-  select l_returnflag, l_linestatus, sum ( l_quantity ) as sum_qty, sum ( 
l_extendedprice ) as sum_base_price,
+   select l_returnflag, l_linestatus, sum ( l_quantity ) as sum_qty, sum ( 
l_extendedprice ) as sum_base_price,
-  sum ( l_extendedprice * ( 1 - l_discount )) as sub_disc_price, 
+   sum ( l_extendedprice * ( 1 - l_discount )) as sub_disc_price, 
-  sum ( l_extendedprice * ( 1 - l_discount ) * ( 1 + l_tax )) as sum_charge,
+   sum ( l_extendedprice * ( 1 - l_discount ) * ( 1 + l_tax )) as sum_charge,
-  avg ( l_quantity ) as avg_qty, avg ( l_extendedprice ) as avg_price, 
+   avg ( l_quantity ) as avg_qty, avg ( l_extendedprice ) as avg_price, 
-  avg ( l_discount ) as avg_disc, count ( 1 ) as count_order
+   avg ( l_discount ) as avg_disc, count ( 1 ) as count_order
-  from lineitem where l_shipdate <= to_date('1998-12-01') group by 
l_returnflag, l_linestatus ; 
+   from lineitem where l_shipdate <= to_date('1998-12-01') group by 
l_returnflag, l_linestatus; 
  }}}
  
  This launches one map-reduce job; on 10 nodes with default hadoop/hive settings it took about 10 minutes. The results in this case are stored in HDFS and can be obtained by doing a {{{dfs -cat /tmp/tpcresults-1.sql/*}}} - either from bin/hadoop or from the hive CLI (both shown below). The query above differs from the TPCH query in skipping the order by clause - since it is not currently implemented by Hive.
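
  Concretely, either of the following should print the result rows (the {{{dfs}}} command inside the CLI assumes a Hive build that supports it):
  {{{
  hive> dfs -cat /tmp/tpcresults-1.sql/*;
  }}}
  or, from the Hadoop installation:
  {{{
  $ bin/hadoop dfs -cat /tmp/tpcresults-1.sql/*
  }}}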
  
  == Storing results back in S3 ==
  The results could also have been stored as a file in S3 directly; for example, we could alter the previous insert clause to read:
- {{{hive> insert overwrite directory 's3n://jssarma/tpcresults-1.sql' }}} - 
where {{{jssarma}}} is a pre-existing bucket in S3.
+ {{{
+ hive> insert overwrite directory 's3n://target-bucket/tpcresults-1.sql'
+ }}}
+ where {{{target-bucket}}} is a pre-existing bucket in S3.
  
  As another alternative, one could have created an external table over S3 and 
stored the results directly in it, for example:
+ {{{
- {{{hive> create external table t1 (flag string, status string, double ...)
+ hive> create external table t1 (flag string, status string, double ...)
- location 's3n://jssarma/tpcresults-1.sql';
+   location 's3n://jssarma/tpcresults-1.sql';
- hive> insert overwrite table t1 select ...}}}
+ hive> insert overwrite table t1 select ...;
+ }}}
  
  Similarly, one could have stored the results back in a partition of a partitioned external table as well; a sketch follows.
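
  A minimal sketch of that, assuming a hypothetical table {{{t2}}} and reusing the {{{target-bucket}}} bucket from above (the column list and partition value are only illustrative):
  {{{
  hive> create external table t2 (flag string, status string, sum_qty double)
    partitioned by (ds string)
    location 's3n://target-bucket/tpcresults-partitioned';
  hive> insert overwrite table t2 partition (ds='1998-12-01')
    select l_returnflag, l_linestatus, sum(l_quantity)
    from lineitem group by l_returnflag, l_linestatus;
  }}}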
  
  == Using tmp tables in HDFS ==
  Currently, Hive does not have any explicit support tmp tables. But tables 
defined over HDFS in EC2 are like tmp tables since they only last for the 
duration of the Hadoop cluster. Since they are likely to be much faster than 
accessing S3 directly - they can be used to stage data that may be accessed 
repeatedly during a session. For example - for the TPCH dataset - one may want 
to do some analysis of customer attributes against order details - and it would 
be first beneficial to materialize the join of these data sets and then do 
repeated queries against it. Here's some example sql that would do the same:
+ {{{
- {{{hive> create table cust_order (nationkey string, acctbal double, 
mktsegment string, orderstatus string, totalprice double);
+ hive> create table cust_order (nationkey string, acctbal double, mktsegment 
string, orderstatus string, totalprice double);
- hive> from customer c left outer join orders o on (c.c_custkey = o.o_custkey) 
insert overwrite table cust_order select c.c_nationkey, c.c_acctbal, 
c.c_mktsegment, o.o_orderstatus, o.o_totalprice; }}}
+ hive> from customer c left outer join orders o on (c.c_custkey = o.o_custkey)
+   insert overwrite table cust_order
+   select c.c_nationkey, c.c_acctbal, c.c_mktsegment, o.o_orderstatus, 
o.o_totalprice;
+ }}}
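
  Subsequent queries can then hit the staged HDFS table instead of S3; for example (an illustrative aggregation, not one of the TPCH queries):
  {{{
  hive> select mktsegment, orderstatus, avg(totalprice), count(1)
    from cust_order group by mktsegment, orderstatus;
  }}}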
  
  == Appendix ==
  <<Anchor(ConfigHell)>>
  === Configuration Guide ===
- The socket related options allow Hive CLI to communicate with the Hadoop 
cluster using a ssh tunnel (that will be established later). The job.ugi is 
specified to avoid issues with permissions on HDFS. {{{mapred.map.tasks}}} 
specification is a hack that works around 
[[[https://issues.apache.org/jira/browse/HADOOP-5861|HADOOP-5861]]] and may 
need to be set higher for large clusters. {{{mapred.reduce.tasks}}} is 
specified to let Hive determine the number of reducers (see 
[[[https://issues.apache.org/jira/browse/HIVE-490|HIVE-490]]]).
+ The socket-related options allow the Hive CLI to communicate with the Hadoop cluster through an ssh tunnel (established later). The {{{hadoop.job.ugi}}} setting is specified to avoid issues with permissions on HDFS. The {{{mapred.map.tasks}}} setting is a hack that works around [[https://issues.apache.org/jira/browse/HADOOP-5861|HADOOP-5861]] and may need to be set higher for large clusters. {{{mapred.reduce.tasks}}} is set to -1 to let Hive determine the number of reducers (see [[https://issues.apache.org/jira/browse/HIVE-490|HIVE-490]]).
  
  === Links ===
   * [[../| Hive and AWS]] presents the general landscape and alternatives for running Hive queries in AWS.
