[Hadoop Wiki] Update of "Hive/Tutorial" by Ning Zhang

Apache Wiki Fri, 16 Apr 2010 21:24:53 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hive/Tutorial" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/Tutorial?action=diff&rev1=22&rev2=23

--------------------------------------------------

             SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, 
null, null, pvs.ip WHERE pvs.country = 'UK';    
  }}}
  
- In order to load data into all country partitions in a particular day, you 
have to add an insert statement for each country in the input data. This is 
very inconvenient and inefficient since you have to have the priori knowledge 
of the list of countries exist in the input data and create the partitions 
beforehand. If the list changed for another day, you have to modify your insert 
DML as well as the partition creation DDLs. 
+ In order to load data into all country partitions in a particular day, you 
have to add an insert statement for each country in the input data. This is 
very inconvenient since you have to have the priori knowledge of the list of 
countries exist in the input data and create the partitions beforehand. If the 
list changed for another day, you have to modify your insert DML as well as the 
partition creation DDLs. It is also inefficient since each insert statement may 
be turned into a MapReduce Job. 
  
- ''Dynamic-partition insert'' (or multi-partition insert) is designed to solve 
this problem by dynamically determining which partitions should be created and 
populated while scanning the input table. This is a newly added feature that is 
only available from version 0.6.0 (trunk now). In the dynamic partition insert, 
the input column values are evaluated to determine which partition this row 
should be inserted into. If that partition has not been created, it will create 
that partition automatically. Using this feature you need only one insert 
statement to create and populate all necessary partitions. Below is an example 
of loading data to all country partitions using one insert statement:
+ ''Dynamic-partition insert'' (or multi-partition insert) is designed to solve 
this problem by dynamically determining which partitions should be created and 
populated while scanning the input table. This is a newly added feature that is 
only available from version 0.6.0 (trunk now). In the dynamic partition insert, 
the input column values are evaluated to determine which partition this row 
should be inserted into. If that partition has not been created, it will create 
that partition automatically. Using this feature you need only one insert 
statement to create and populate all necessary partitions. In addition, since 
there is only one insert statement, there is only one corresponding MapReduce 
job. This significantly improves performance and reduce the Hadoop cluster 
workload comparing to the multiple insert case.
+ 
+ Below is an example of loading data to all country partitions using one 
insert statement:
  
  {{{
      FROM page_view_stg pvs
@@ -449, +451 @@

  }}}
  
  There are several syntactic differences from the multi-insert statement: 
-   * country appears in the PARTITION clause, but it has no value associated 
with it. In this case, country is a ''dynamic partition column''. On the other 
hand, ds has a value associated with it, which means it is a ''static partition 
column''. If a column is dynamic partition column, its value will be coming 
from the input column. Currently we only allow dynamic partition columns to be 
the last column(s) in the partition clause because the partition column order 
indicates its hierarchical order (meaning dt is the root partition, and country 
is the child partition). You cannot specify a partition clause with (dt, 
country='US') because that means you need to update all partitions with any 
date and its country subpartition is 'US'. 
+   * country appears in the PARTITION specification, but with no value 
associated. In this case, country is a ''dynamic partition column''. On the 
other hand, ds has a value associated with it, which means it is a ''static 
partition column''. If a column is dynamic partition column, its value will be 
coming from the input column. Currently we only allow dynamic partition columns 
to be the last column(s) in the partition clause because the partition column 
order indicates its hierarchical order (meaning dt is the root partition, and 
country is the child partition). You cannot specify a partition clause with 
(dt, country='US') because that means you need to update all partitions with 
any date and its country sub-partition is 'US'. 
    * An additional pvs.country column is added in the select statement. This 
is the corresponding input column for the dynamic partition column. Note that 
you do not need to add an input column for the static partition column because 
its value is already known in the PARTITION clause. 
  
  Semantics of the dynamic partition insert statement:

[Hadoop Wiki] Update of "Hive/Tutorial" by Ning Zhang

Reply via email to