Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hive/Tutorial" page has been changed by Ning Zhang. http://wiki.apache.org/hadoop/Hive/Tutorial?action=diff&rev1=20&rev2=21 -------------------------------------------------- }}} It is assumed that the array and map fields in the input.txt files are null fields for these examples. + == Simple Query == + For all the active users, one can use the query of the following form: + + {{{ + INSERT OVERWRITE TABLE user_active + SELECT user.* + FROM user + WHERE user.active = 1; + }}} + Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect these results and even dump them to a local file. You can also run the following query on Hive CLI: + + {{{ + SELECT user.* + FROM user + WHERE user.active = 1; + }}} + This will be internally rewritten to some temporary file and displayed to the Hive client side. + + == Partition Based Query == + What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns. For example, in order to get all the page_views in the month of 03/2008 referred from domain xyz.com, one could write the following query: + + {{{ + INSERT OVERWRITE TABLE xyz_com_page_views + SELECT page_views.* + FROM page_views + WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND + page_views.referrer_url like '%xyz.com'; + }}} + Note that page_views.date is used here because the table (above) was defined with PARTITIONED BY(date DATETIME, country STRING) ; if you name your partition something different, don't expect .date to do what you think! + + == Joins == + In order to get a demographic breakdown (by gender) of page_view of 2008-03-03 one would need to join the page_view table and the user table on the userid column. This can be accomplished with a join as shown in the following query: + + {{{ + INSERT OVERWRITE TABLE pv_users + SELECT pv.*, u.gender, u.age + FROM user u JOIN page_view pv ON (pv.userid = u.id) + WHERE pv.date = '2008-03-03'; + }}} + In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join (left preserved, right preserved or both sides preserved). For example, in order to do a full outer join in the query above, the corresponding syntax would look like the following query: + + {{{ + INSERT OVERWRITE TABLE pv_users + SELECT pv.*, u.gender, u.age + FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id) + WHERE pv.date = '2008-03-03'; + }}} + In order check the existence of a key in another table, the user can use LEFT SEMI JOIN as illustrated by the following example. + + {{{ + INSERT OVERWRITE TABLE pv_users + SELECT u.* + FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id) + WHERE pv.date = '2008-03-03'; + }}} + In order to join more than one tables, the user can use the following syntax: + + {{{ + INSERT OVERWRITE TABLE pv_friends + SELECT pv.*, u.gender, u.age, f.friends + FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid) + WHERE pv.date = '2008-03-03'; + }}} + Note that Hive only supports [[http://en.wikipedia.org/wiki/Join_(SQL)#Equi-join|equi-joins]]. Also it is best to put the largest table on the rightmost side of the join to get the best performance. + + == Aggregations == + In order to count the number of distinct users by gender one could write the following query: + + {{{ + INSERT OVERWRITE TABLE pv_gender_sum + SELECT pv_users.gender, count (DISTINCT pv_users.userid) + FROM pv_users + GROUP BY pv_users.gender; + }}} + Multiple aggregations can be done at the same time, however, no two aggregations can have different DISTINCT columns .e.g while the following is possible + + {{{ + INSERT OVERWRITE TABLE pv_gender_agg + SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(1), sum(DISTINCT pv_users.userid) + FROM pv_users + GROUP BY pv_users.gender; + }}} + however, the following query is not allowed + + {{{ + INSERT OVERWRITE TABLE pv_gender_agg + SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip) + FROM pv_users + GROUP BY pv_users.gender; + }}} + == Multi Table/File Inserts == + The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities). e.g. if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query: + + {{{ + FROM pv_users + INSERT OVERWRITE TABLE pv_gender_sum + SELECT pv_users.gender, count_distinct(pv_users.userid) + GROUP BY pv_users.gender + + INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum' + SELECT pv_users.age, count_distinct(pv_users.userid) + GROUP BY pv_users.age; + }}} + The first insert clause sends the results of the first group by to a Hive table while the second one sends the results to a hadoop dfs files. + + == Dynamic-partition Insert == + - In the above examples, the user has to know which partition to insert into and only one partition can be inserted in one insert statement. If you want to load into multiple partitions, you have to use multi-insert statement as illustrated below. + In the previous examples, the user has to know which partition to insert into and only one partition can be inserted in one insert statement. If you want to load into multiple partitions, you have to use multi-insert statement as illustrated below. {{{ FROM page_view_stg pvs + INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US') - INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US') SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US' + SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US' + INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA') - INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA') SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA' + SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA' + INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK') - INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK') SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK'; + SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK'; }}} In order to load data into all country partitions in a particular day, you have to add an insert statement for each country in the input data. This is very inconvenient and inefficient since you have to have the priori knowledge of the list of countries exist in the input data and create the partitions beforehand. If the list changed for another day, you have to modify your insert DML as well as the partition creation DDLs. - ''Dynamic-partition insert'' (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated. This is a newly added feature that is only available from version 0.6.0 (trunk now). In the dynamic partition insert, the input column values are evaluated to determine which partition this row should be inserted into. If that partition has not been created, it will create that partition automatically. Using this feature you need only one insert statement to create and populate all necessary partitions. Below is an example of loading data to all country partitions using one insert statement: + ''Dynamic-partition insert'' (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table. This is a newly added feature that is only available from version 0.6.0 (trunk now). In the dynamic partition insert, the input column values are evaluated to determine which partition this row should be inserted into. If that partition has not been created, it will create that partition automatically. Using this feature you need only one insert statement to create and populate all necessary partitions. Below is an example of loading data to all country partitions using one insert statement: {{{ - FROM page_view_stg pvs + FROM page_view_stg pvs - INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country) + INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country) - SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country + SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country }}} There are several syntactic differences from the multi-insert statement: @@ -350, +460 @@ * Another situation we want to protect against dynamic partition insert is that the user may accidentally specify all partitions to be dynamic partitions without specifying one static partition, while the original intention is to just overwrite the sub-partitions of one root partition. We define another parameter hive.exec.dynamic.partition.mode=strict to prevent the all-dynamic partition case. In the strict mode, you have to specify at least one static partition. The default mode is strict. In addition, we have a parameter hive.exec.dynamic.partition=true/false to control whether to allow dynamic partition at all. The default value is false. * Currently dynamic partition insert will not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. The reason is that if either of the merge parameters is set to true, there will be a map reduce job for that particular partition to merge all files into one. In dynamic partition insert, we do not know the number of partitions at compile time thus no MapReduce job could be generated. There is a new JIRA HIVE-1307 filed for this task. - == Simple Query == - For all the active users, one can use the query of the following form: - - {{{ - INSERT OVERWRITE TABLE user_active - SELECT user.* - FROM user - WHERE user.active = 1; - }}} - Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect these results and even dump them to a local file. You can also run the following query on Hive CLI: - - {{{ - SELECT user.* - FROM user - WHERE user.active = 1; - }}} - This will be internally rewritten to some temporary file and displayed to the Hive client side. - - == Partition Based Query == - What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns. For example, in order to get all the page_views in the month of 03/2008 referred from domain xyz.com, one could write the following query: - - {{{ - INSERT OVERWRITE TABLE xyz_com_page_views - SELECT page_views.* - FROM page_views - WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND - page_views.referrer_url like '%xyz.com'; - }}} - Note that page_views.date is used here because the table (above) was defined with PARTITIONED BY(date DATETIME, country STRING) ; if you name your partition something different, don't expect .date to do what you think! - - == Joins == - In order to get a demographic breakdown (by gender) of page_view of 2008-03-03 one would need to join the page_view table and the user table on the userid column. This can be accomplished with a join as shown in the following query: - - {{{ - INSERT OVERWRITE TABLE pv_users - SELECT pv.*, u.gender, u.age - FROM user u JOIN page_view pv ON (pv.userid = u.id) - WHERE pv.date = '2008-03-03'; - }}} - In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join (left preserved, right preserved or both sides preserved). For example, in order to do a full outer join in the query above, the corresponding syntax would look like the following query: - - {{{ - INSERT OVERWRITE TABLE pv_users - SELECT pv.*, u.gender, u.age - FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id) - WHERE pv.date = '2008-03-03'; - }}} - In order check the existence of a key in another table, the user can use LEFT SEMI JOIN as illustrated by the following example. - - {{{ - INSERT OVERWRITE TABLE pv_users - SELECT u.* - FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id) - WHERE pv.date = '2008-03-03'; - }}} - In order to join more than one tables, the user can use the following syntax: - - {{{ - INSERT OVERWRITE TABLE pv_friends - SELECT pv.*, u.gender, u.age, f.friends - FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid) - WHERE pv.date = '2008-03-03'; - }}} - Note that Hive only supports [[http://en.wikipedia.org/wiki/Join_(SQL)#Equi-join|equi-joins]]. Also it is best to put the largest table on the rightmost side of the join to get the best performance. - - == Aggregations == - In order to count the number of distinct users by gender one could write the following query: - - {{{ - INSERT OVERWRITE TABLE pv_gender_sum - SELECT pv_users.gender, count (DISTINCT pv_users.userid) - FROM pv_users - GROUP BY pv_users.gender; - }}} - Multiple aggregations can be done at the same time, however, no two aggregations can have different DISTINCT columns .e.g while the following is possible - - {{{ - INSERT OVERWRITE TABLE pv_gender_agg - SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(1), sum(DISTINCT pv_users.userid) - FROM pv_users - GROUP BY pv_users.gender; - }}} - however, the following query is not allowed - - {{{ - INSERT OVERWRITE TABLE pv_gender_agg - SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip) - FROM pv_users - GROUP BY pv_users.gender; - }}} - == Multi Table/File Inserts == - The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities). e.g. if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query: - - {{{ - FROM pv_users - INSERT OVERWRITE TABLE pv_gender_sum - SELECT pv_users.gender, count_distinct(pv_users.userid) - GROUP BY pv_users.gender - - INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum' - SELECT pv_users.age, count_distinct(pv_users.userid) - GROUP BY pv_users.age; - }}} - The first insert clause sends the results of the first group by to a Hive table while the second one sends the results to a hadoop dfs files. == Inserting into local files == In certain situations you would want to write the output into a local file so that you could load it into an excel spreadsheet. This can be accomplished with the following command:
