Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by JoydeepSensarma:
http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely

The comment on the change is:
add sample TPCH query ..

------------------------------------------------------------------------------
   * These SQL queries should be executed using compute resources provisioned 
from EC2. Ideally, the compute resources can be provisioned in proportion to 
the compute costs of the queries.
   * Results from such queries that need to be retained for the long term can 
be stored back in S3.
  
- This tutorial walks through the steps required to accomplish this.
+ This tutorial walks through the steps required to accomplish this. Please 
send email to the hive-users mailing list in case of any problems with this 
tutorial.
  
  == Required Software ==
  On the client side (PC), the following are required:
@@ -34, +34 @@

  
  The values assigned to the s3n keys are just examples and need to be filled 
in by the user as per their own account details. An explanation of the rest of 
the values can be found in the [#ConfigHell Configuration Guide] section below.
  
- Instead of specifying these command lines each time the CLI is bought up - we 
can store these persistently within {{{hive-site.xml}}} in the {{{conf/} 
directory of the Hive installation (from where they will be picked up each time 
the CLI is launched.
+ Instead of specifying these command lines each time the CLI is brought up, we 
can store them persistently in {{{hive-site.xml}}} in the {{{conf/}}} directory 
of the Hive installation, from where they will be picked up each time the CLI 
is launched.
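+ 
+ For example, a minimal {{{hive-site.xml}}} along the following lines could be 
used. This is only a sketch, assuming the s3n credential properties shown 
above; the key values are placeholders and must be replaced with your own AWS 
credentials:
+ {{{
+ <configuration>
+   <!-- placeholder values: substitute your own AWS credentials -->
+   <property>
+     <name>fs.s3n.awsAccessKeyId</name>
+     <value>YOUR_AWS_ACCESS_KEY_ID</value>
+   </property>
+   <property>
+     <name>fs.s3n.awsSecretAccessKey</name>
+     <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
+   </property>
+ </configuration>
+ }}}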
  
  == Example Public Data Sets ==
  Some example data files are provided in the S3 bucket 
{{{data.s3ndemo.hive}}}. We will use them for the SQL examples in this tutorial:
@@ -53, +53 @@

  alter table pkv add partition (insertdate='2008-01-01') location 
's3n://data.s3ndemo.hive/pkv/2008-01-01';
  }}}
   * Declare a table over a TPCH table:
+ {{{
+ create external table lineitem (
+   l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int,
+   l_quantity double, l_extendedprice double, l_discount double, l_tax double,
+   l_returnflag string, l_linestatus string,
+   l_shipdate string, l_commitdate string, l_receiptdate string,
+   l_shipinstruct string, l_shipmode string, l_comment string)
+ row format delimited fields terminated by '|'
+ location 's3n://data.s3ndemo.hive/tpch/lineitem';
+ }}}
    
+ The TPCH DDL statements are slightly modified versions of the original TPCH 
statements (since Hive does not support all the data types used in TPCH).
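+ 
+ As a quick, optional sanity check that the table has been declared the way we 
expect, the resulting schema can be inspected from the Hive CLI:
+ {{{
+ describe lineitem;
+ }}}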
+ 
+ == Executing Queries ==
+ Hive can execute some queries without a Hadoop cluster. For example:
+  * {{{ select * from kv limit 10; }}}
+ 
+ {{{select *}}} queries with limit clauses can be performed on the PC itself 
(see the example after this list). If you are doing this, please note that:
+  * {{{fs.default.name}}} should be set to {{{file:///}}}
+  * '''Please, please do not select all the rows from large data sets'''. This 
will cause a large amount of data to be downloaded from S3 to outside AWS and 
incur charges on the account hosting these data sets!
+ 
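+ Putting these together, a quick local-mode session on the PC might look like 
the following (a sketch; {{{kv}}} is the small example table declared earlier 
on this page):
+ {{{
+ set fs.default.name=file:///;
+ select * from kv limit 10;
+ }}}
+ 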
+ Of course - the real fun is in doing some non-trivial queries using 
map-reduce. For this we will need a Hadoop cluster (finally!):
+ 
+  1. Start a Hadoop cluster on EC2 (following the directions in the 
[http://wiki.apache.org/hadoop/AmazonEC2 Hadoop-EC2 tutorial], but making sure 
to use a version of the ec2 scripts with HADOOP-5839 applied!). The user is 
free to allocate any number of nodes they wish, although this tutorial was 
tried out with 10 nodes.
+  2. Note down the public hostname of the master node. For example, the 
public hostname may be something like 
{{{ec2-75-101-170-131.compute-1.amazonaws.com}}}.
+  3. Point the Hive CLI to use this Hadoop cluster by executing:
+   * {{{set 
fs.default.name=hdfs://ec2-75-101-170-131.compute-1.amazonaws.com:50001; }}}
+   * {{{set 
mapred.job.tracker=ec2-75-101-170-131.compute-1.amazonaws.com:50002;}}}
+  4. Set up an ssh tunnel via port 2600 to the Hadoop master (see the note on 
the related SOCKS settings after this list). This can be done by executing the 
following from another terminal/window:
+   * {{{ ssh -i <path to the Hadoop private key> -D 2600 
ec2-75-101-170-131.compute-1.amazonaws.com }}}
+ 
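+ The port 2600 used above is just the port of the SOCKS proxy that ssh opens; 
it has to match the SOCKS-related Hadoop settings described in the 
[#ConfigHell Configuration Guide] section below. Assuming the standard Hadoop 
SOCKS socket factory is what is configured there, those settings would look 
roughly like this sketch:
+ {{{
+ set hadoop.socks.server=localhost:2600;
+ set hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory;
+ }}}
+ 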
+ Now we are all set up. The sample query from TPCH (1.sql) can be tried as 
follows:
+ {{{
+ insert overwrite directory '/tmp/tpcresults/1-2.sql'
+ select l_returnflag, l_linestatus,
+   sum(l_quantity) as sum_qty,
+   sum(l_extendedprice) as sum_base_price,
+   sum(l_extendedprice * (1 - l_discount)) as sub_disc_price,
+   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
+   avg(l_quantity) as avg_qty,
+   avg(l_extendedprice) as avg_price,
+   avg(l_discount) as avg_disc,
+   count(1) as count_order
+ from lineitem
+ where l_shipdate <= to_date('1998-12-01')
+ group by l_returnflag, l_linestatus;
+ }}}
+ 
+ This launches one map-reduce job. On 10 nodes with default hadoop/hive 
settings, it took about 10 minutes.
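+ 
+ The results end up in the cluster's HDFS under {{{/tmp/tpcresults/1-2.sql}}}. 
To take a quick look at them from the Hive CLI, something along these lines can 
be used (results that need to be retained for the long term can instead be 
written to an S3 location that you own, as mentioned at the top of this page):
+ {{{
+ dfs -cat /tmp/tpcresults/1-2.sql/*;
+ }}}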
  
  
  == Appendix ==
