[Hadoop Wiki] Update of "Hive/Tutorial" by Ning Zhang

Apache Wiki Thu, 15 Apr 2010 12:45:08 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hive/Tutorial" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/Tutorial?action=diff&rev1=20&rev2=21

--------------------------------------------------

  }}}
  It is assumed that the array and map fields in the input.txt files are null 
fields for these examples.
  
+ == Simple Query ==
+ For all the active users, one can use the query of the following form:
+ 
+ {{{
+     INSERT OVERWRITE TABLE user_active
+     SELECT user.*
+     FROM user
+     WHERE user.active = 1;
+ }}}
+ Note that unlike SQL, we always insert the results into a table. We will 
illustrate later how the user can inspect these results and even dump them to a 
local file. You can also run the following query on Hive CLI:
+ 
+ {{{
+     SELECT user.*
+     FROM user
+     WHERE user.active = 1;
+ }}}
+ This will be internally rewritten to some temporary file and displayed to the 
Hive client side.
+ 
+ == Partition Based Query ==
+ What partitions to use in a query is determined automatically by the system 
on the basis of where clause conditions on partition columns. For example, in 
order to get all the page_views in the month of 03/2008 referred from domain 
xyz.com, one could write the following query:
+ 
+ {{{
+     INSERT OVERWRITE TABLE xyz_com_page_views
+     SELECT page_views.*
+     FROM page_views
+     WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' 
AND
+           page_views.referrer_url like '%xyz.com';
+ }}}
+ Note that page_views.date is used here because the table (above) was defined 
with PARTITIONED BY(date DATETIME, country STRING) ; if you name your partition 
something different, don't expect .date to do what you think!
+ 
+ == Joins ==
+ In order to get a demographic breakdown (by gender) of page_view of 
2008-03-03 one would need to join the page_view table and the user table on the 
userid column. This can be accomplished with a join as shown in the following 
query:
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_users
+     SELECT pv.*, u.gender, u.age
+     FROM user u JOIN page_view pv ON (pv.userid = u.id)
+     WHERE pv.date = '2008-03-03';
+ }}}
+ In order to do outer joins the user can qualify the join with LEFT OUTER, 
RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join 
(left preserved, right preserved or both sides preserved). For example, in 
order to do a full outer join in the query above, the corresponding syntax 
would look like the following query:
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_users
+     SELECT pv.*, u.gender, u.age
+     FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
+     WHERE pv.date = '2008-03-03';
+ }}}
+ In order check the existence of a key in another table, the user can use LEFT 
SEMI JOIN as illustrated by the following example.
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_users
+     SELECT u.*
+     FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
+     WHERE pv.date = '2008-03-03';
+ }}}
+ In order to join more than one tables, the user can use the following syntax:
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_friends
+     SELECT pv.*, u.gender, u.age, f.friends
+     FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON 
(u.id = f.uid)
+     WHERE pv.date = '2008-03-03';
+ }}}
+ Note that Hive only supports 
[[http://en.wikipedia.org/wiki/Join_(SQL)#Equi-join|equi-joins]]. Also it is 
best to put the largest table on the rightmost side of the join to get the best 
performance.
+ 
+ == Aggregations ==
+ In order to count the number of distinct users by gender one could write the 
following query:
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_gender_sum
+     SELECT pv_users.gender, count (DISTINCT pv_users.userid)
+     FROM pv_users
+     GROUP BY pv_users.gender;
+ }}}
+ Multiple aggregations can be done at the same time, however, no two 
aggregations can have different DISTINCT columns .e.g while the following is 
possible
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_gender_agg
+     SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(1), 
sum(DISTINCT pv_users.userid)
+     FROM pv_users
+     GROUP BY pv_users.gender;
+ }}}
+ however, the following query is not allowed
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_gender_agg
+     SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT 
pv_users.ip)
+     FROM pv_users
+     GROUP BY pv_users.gender;
+ }}}
+ == Multi Table/File Inserts ==
+ The output of the aggregations or simple selects can be further sent into 
multiple tables or even to hadoop dfs files (which can then be manipulated 
using hdfs utilities). e.g. if along with the gender breakdown, one needed to 
find the breakdown of unique page views by age, one could accomplish that with 
the following query:
+ 
+ {{{
+     FROM pv_users
+     INSERT OVERWRITE TABLE pv_gender_sum
+         SELECT pv_users.gender, count_distinct(pv_users.userid)
+         GROUP BY pv_users.gender
+ 
+     INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
+         SELECT pv_users.age, count_distinct(pv_users.userid)
+         GROUP BY pv_users.age;
+ }}}
+ The first insert clause sends the results of the first group by to a Hive 
table while the second one sends the results to a hadoop dfs files.
+ 
+ == Dynamic-partition Insert ==
+ 
- In the above examples, the user has to know which partition to insert into 
and only one partition can be inserted in one insert statement. If you want to 
load into multiple partitions, you have to use multi-insert statement as 
illustrated below. 
+ In the previous examples, the user has to know which partition to insert into 
and only one partition can be inserted in one insert statement. If you want to 
load into multiple partitions, you have to use multi-insert statement as 
illustrated below. 
  {{{
      FROM page_view_stg pvs
+     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US') 
-     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US') 
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, 
pvs.ip WHERE pvs.country = 'US'
+            SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, 
null, null, pvs.ip WHERE pvs.country = 'US'
+     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA') 
-     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA') 
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, 
pvs.ip WHERE pvs.country = 'CA'
+            SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, 
null, null, pvs.ip WHERE pvs.country = 'CA'
+     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK') 
-     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK') 
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, 
pvs.ip WHERE pvs.country = 'UK';    
+            SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, 
null, null, pvs.ip WHERE pvs.country = 'UK';    
  }}}
  
  In order to load data into all country partitions in a particular day, you 
have to add an insert statement for each country in the input data. This is 
very inconvenient and inefficient since you have to have the priori knowledge 
of the list of countries exist in the input data and create the partitions 
beforehand. If the list changed for another day, you have to modify your insert 
DML as well as the partition creation DDLs. 
  
- ''Dynamic-partition insert'' (or multi-partition insert) is designed to solve 
this problem by dynamically determining which partitions should be created and 
populated. This is a newly added feature that is only available from version 
0.6.0 (trunk now). In the dynamic partition insert, the input column values are 
evaluated to determine which partition this row should be inserted into. If 
that partition has not been created, it will create that partition 
automatically. Using this feature you need only one insert statement to create 
and populate all necessary partitions. Below is an example of loading data to 
all country partitions using one insert statement:
+ ''Dynamic-partition insert'' (or multi-partition insert) is designed to solve 
this problem by dynamically determining which partitions should be created and 
populated while scanning the input table. This is a newly added feature that is 
only available from version 0.6.0 (trunk now). In the dynamic partition insert, 
the input column values are evaluated to determine which partition this row 
should be inserted into. If that partition has not been created, it will create 
that partition automatically. Using this feature you need only one insert 
statement to create and populate all necessary partitions. Below is an example 
of loading data to all country partitions using one insert statement:
  
  {{{
- FROM page_view_stg pvs
+     FROM page_view_stg pvs
- INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
+     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country) 
- SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, 
pvs.ip, pvs.country
+            SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, 
null, null, pvs.ip, pvs.country
  }}}
  
  There are several syntactic differences from the multi-insert statement: 
@@ -350, +460 @@

    * Another situation we want to protect against dynamic partition insert is 
that the user may accidentally specify all partitions to be dynamic partitions 
without specifying one static partition, while the original intention is to 
just overwrite the sub-partitions of one root partition. We define another 
parameter hive.exec.dynamic.partition.mode=strict to prevent the all-dynamic 
partition case. In the strict mode, you have to specify at least one static 
partition. The default mode is strict. In addition, we have a parameter 
hive.exec.dynamic.partition=true/false to control whether to allow dynamic 
partition at all. The default value is false. 
    * Currently dynamic partition insert will not work with 
hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns 
off the merge parameters. The reason is that if either of the merge parameters 
is set to true, there will be a map reduce job for that particular partition to 
merge all files into one. In dynamic partition insert, we do not know the 
number of partitions at compile time thus no MapReduce job could be generated. 
There is a new JIRA HIVE-1307 filed for this task. 
  
- == Simple Query ==
- For all the active users, one can use the query of the following form:
- 
- {{{
-     INSERT OVERWRITE TABLE user_active
-     SELECT user.*
-     FROM user
-     WHERE user.active = 1;
- }}}
- Note that unlike SQL, we always insert the results into a table. We will 
illustrate later how the user can inspect these results and even dump them to a 
local file. You can also run the following query on Hive CLI:
- 
- {{{
-     SELECT user.*
-     FROM user
-     WHERE user.active = 1;
- }}}
- This will be internally rewritten to some temporary file and displayed to the 
Hive client side.
- 
- == Partition Based Query ==
- What partitions to use in a query is determined automatically by the system 
on the basis of where clause conditions on partition columns. For example, in 
order to get all the page_views in the month of 03/2008 referred from domain 
xyz.com, one could write the following query:
- 
- {{{
-     INSERT OVERWRITE TABLE xyz_com_page_views
-     SELECT page_views.*
-     FROM page_views
-     WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' 
AND
-           page_views.referrer_url like '%xyz.com';
- }}}
- Note that page_views.date is used here because the table (above) was defined 
with PARTITIONED BY(date DATETIME, country STRING) ; if you name your partition 
something different, don't expect .date to do what you think!
- 
- == Joins ==
- In order to get a demographic breakdown (by gender) of page_view of 
2008-03-03 one would need to join the page_view table and the user table on the 
userid column. This can be accomplished with a join as shown in the following 
query:
- 
- {{{
-     INSERT OVERWRITE TABLE pv_users
-     SELECT pv.*, u.gender, u.age
-     FROM user u JOIN page_view pv ON (pv.userid = u.id)
-     WHERE pv.date = '2008-03-03';
- }}}
- In order to do outer joins the user can qualify the join with LEFT OUTER, 
RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join 
(left preserved, right preserved or both sides preserved). For example, in 
order to do a full outer join in the query above, the corresponding syntax 
would look like the following query:
- 
- {{{
-     INSERT OVERWRITE TABLE pv_users
-     SELECT pv.*, u.gender, u.age
-     FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
-     WHERE pv.date = '2008-03-03';
- }}}
- In order check the existence of a key in another table, the user can use LEFT 
SEMI JOIN as illustrated by the following example.
- 
- {{{
-     INSERT OVERWRITE TABLE pv_users
-     SELECT u.*
-     FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
-     WHERE pv.date = '2008-03-03';
- }}}
- In order to join more than one tables, the user can use the following syntax:
- 
- {{{
-     INSERT OVERWRITE TABLE pv_friends
-     SELECT pv.*, u.gender, u.age, f.friends
-     FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON 
(u.id = f.uid)
-     WHERE pv.date = '2008-03-03';
- }}}
- Note that Hive only supports 
[[http://en.wikipedia.org/wiki/Join_(SQL)#Equi-join|equi-joins]]. Also it is 
best to put the largest table on the rightmost side of the join to get the best 
performance.
- 
- == Aggregations ==
- In order to count the number of distinct users by gender one could write the 
following query:
- 
- {{{
-     INSERT OVERWRITE TABLE pv_gender_sum
-     SELECT pv_users.gender, count (DISTINCT pv_users.userid)
-     FROM pv_users
-     GROUP BY pv_users.gender;
- }}}
- Multiple aggregations can be done at the same time, however, no two 
aggregations can have different DISTINCT columns .e.g while the following is 
possible
- 
- {{{
-     INSERT OVERWRITE TABLE pv_gender_agg
-     SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(1), 
sum(DISTINCT pv_users.userid)
-     FROM pv_users
-     GROUP BY pv_users.gender;
- }}}
- however, the following query is not allowed
- 
- {{{
-     INSERT OVERWRITE TABLE pv_gender_agg
-     SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT 
pv_users.ip)
-     FROM pv_users
-     GROUP BY pv_users.gender;
- }}}
- == Multi Table/File Inserts ==
- The output of the aggregations or simple selects can be further sent into 
multiple tables or even to hadoop dfs files (which can then be manipulated 
using hdfs utilities). e.g. if along with the gender breakdown, one needed to 
find the breakdown of unique page views by age, one could accomplish that with 
the following query:
- 
- {{{
-     FROM pv_users
-     INSERT OVERWRITE TABLE pv_gender_sum
-         SELECT pv_users.gender, count_distinct(pv_users.userid)
-         GROUP BY pv_users.gender
- 
-     INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
-         SELECT pv_users.age, count_distinct(pv_users.userid)
-         GROUP BY pv_users.age;
- }}}
- The first insert clause sends the results of the first group by to a Hive 
table while the second one sends the results to a hadoop dfs files.
  
  == Inserting into local files ==
  In certain situations you would want to write the output into a local file so 
that you could load it into an excel spreadsheet. This can be accomplished with 
the following command:

[Hadoop Wiki] Update of "Hive/Tutorial" by Ning Zhang

Reply via email to