[Hadoop Wiki] Update of "Hive/Tutorial" by Ning Zhang

Apache Wiki Thu, 15 Apr 2010 12:30:31 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hive/Tutorial" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/Tutorial?action=diff&rev1=19&rev2=20

--------------------------------------------------

  }}}
  It is assumed that the array and map fields in the input.txt files are null 
fields for these examples.
  
+ In the above examples, the user has to know which partition to insert into 
and only one partition can be inserted in one insert statement. If you want to 
load into multiple partitions, you have to use multi-insert statement as 
illustrated below. 
+ {{{
+     FROM page_view_stg pvs
+     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US') 
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, 
pvs.ip WHERE pvs.country = 'US'
+     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA') 
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, 
pvs.ip WHERE pvs.country = 'CA'
+     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK') 
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, 
pvs.ip WHERE pvs.country = 'UK';    
+ }}}
+ 
+ In order to load data into all country partitions in a particular day, you 
have to add an insert statement for each country in the input data. This is 
very inconvenient and inefficient since you have to have the priori knowledge 
of the list of countries exist in the input data and create the partitions 
beforehand. If the list changed for another day, you have to modify your insert 
DML as well as the partition creation DDLs. 
+ 
+ ''Dynamic-partition insert'' (or multi-partition insert) is designed to solve 
this problem by dynamically determining which partitions should be created and 
populated. This is a newly added feature that is only available from version 
0.6.0 (trunk now). In the dynamic partition insert, the input column values are 
evaluated to determine which partition this row should be inserted into. If 
that partition has not been created, it will create that partition 
automatically. Using this feature you need only one insert statement to create 
and populate all necessary partitions. Below is an example of loading data to 
all country partitions using one insert statement:
+ 
+ {{{
+ FROM page_view_stg pvs
+ INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
+ SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, 
pvs.ip, pvs.country
+ }}}
+ 
+ There are several syntactic differences from the multi-insert statement: 
+   * country appears in the PARTITION clause, but it has no value associated 
with it. In this case, country is a ''dynamic partition column''. On the other 
hand, ds has a value associated with it, which means it is a ''static partition 
column''. If a column is dynamic partition column, its value will be coming 
from the input column. Currently we only allow dynamic partition columns to be 
the last column(s) in the partition clause because the partition column order 
indicates its hierarchical order (meaning dt is the root partition, and country 
is the child partition). You cannot specify a partition clause with (dt, 
country='US') because that means you need to update all partitions with any 
date and its country subpartition is 'US'. 
+   * An additional pvs.country column is added in the select statement. This 
is the corresponding input column for the dynamic partition column. Note that 
you do not need to add an input column for the static partition column because 
its value is already known in the PARTITION clause. 
+ 
+ Semantics of the dynamic partition insert statement:
+   * Since a Hive partition corresponds to a directory in HDFS, the partition 
value has to conform to the HDFS path format (URI in Java). Any character 
having a special meaning in URI (e.g., '%', ':', '/', '#') will be escaped with 
'%' followed by 2 bytes of its ASCII value.  
+   * If the input column is a type different than STRING, its value will be 
first converted to STRING to be used to construct the HDFS path. 
+   * If the input column value is NULL or empty string, the row will be put 
into a special partition, whose name is controlled by the hive parameter 
hive.exec.default.dynamic.partition.name. The default value is 
__HIVE_DEFAULT_PARTITION__. Basically this partition will contain all "bad" 
rows whose value are not valid partition names. The caveat of this approach is 
that the bad value will be lost and is replaced by __HIVE_DEFAULT_PARTITION__ 
if you select them Hive. JIRA HIVE-1309 is a solution to let user specify "bad 
file" to retain the input partition column values as well.
+   * Dynamic partition insert could potentially resource hog in that it could 
generate a large number of partitions in a short time. To get yourself buckled, 
we define two parameters hive.exec.max.dynamic.partitions (default value being 
1000) which is the total number of dynamic partitions could be created by one 
DML and hive.exec.max.dynamic.partition.pernode (default value being 100) which 
is the maximum dynamic partitions that can be created for each mapper or 
reducer. If one mapper or reducer created more than that the threshold, a fatal 
error will be raised from the mapper/reducer (through counter) and the whole 
job will be killed. If each mapper/reducer did not exceed the limit but the 
total number of dynamic partitions does, then an exception is raised at the end 
of the job before the intermediate data are moved to the final destination. 
+   * Another situation we want to protect against dynamic partition insert is 
that the user may accidentally specify all partitions to be dynamic partitions 
without specifying one static partition, while the original intention is to 
just overwrite the sub-partitions of one root partition. We define another 
parameter hive.exec.dynamic.partition.mode=strict to prevent the all-dynamic 
partition case. In the strict mode, you have to specify at least one static 
partition. The default mode is strict. In addition, we have a parameter 
hive.exec.dynamic.partition=true/false to control whether to allow dynamic 
partition at all. The default value is false. 
+   * Currently dynamic partition insert will not work with 
hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns 
off the merge parameters. The reason is that if either of the merge parameters 
is set to true, there will be a map reduce job for that particular partition to 
merge all files into one. In dynamic partition insert, we do not know the 
number of partitions at compile time thus no MapReduce job could be generated. 
There is a new JIRA HIVE-1307 filed for this task. 
+ 
  == Simple Query ==
  For all the active users, one can use the query of the following form:

[Hadoop Wiki] Update of "Hive/Tutorial" by Ning Zhang

Reply via email to