[Hadoop Wiki] Trivial Update of "Hive/Tutorial" by NeilConway

Apache Wiki Wed, 01 Apr 2009 17:41:15 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The following page has been changed by NeilConway:
http://wiki.apache.org/hadoop/Hive/Tutorial

------------------------------------------------------------------------------
  || BIGINT || count(1), count(DISTINCT col [, col]...)|| count(1) returns the 
number of members in the group, whereas the count(DISTINCT col) gets the count 
of distinct values of the columns in the group||
  || DOUBLE || sum(col), sum(DISTINCT col) || returns the sum of the elements 
in the group or the sum of the distinct values of the column in the group||
  || DOUBLE || avg(col), avg(DISTINCT col) || returns the average of the 
elements in the group or the average of the distinct values of the column in 
the group||
- || DOUBLE || min(col) || returns the minimum of the column in the group||
+ || DOUBLE || min(col) || returns the minimum value of the column in the 
group||
- || DOUBLE || max(col) || returns the maximum value of the column n the group||
+ || DOUBLE || max(col) || returns the maximum value of the column in the 
group||
  
  == Language capabilities ==
  Hive query language provides the basic SQL like operations. These operations 
work on tables or partitions. These operations are:
@@ -223, +223 @@

              LINES TERMINATED BY '12' 
      STORED AS SEQUENCEFILE; 
  }}}
- In the example above, the table is bucketed(clustered by) userid and within 
each bucket the data is sorted in the increasing order of viewTime. Such an 
organization allows the user to do efficient sampling on the clustered column - 
in this case userid. The sorting property allows internal operators to take 
advantage of the better-known data structure while evaluating queries, also 
increasing efficiency.
+ In the example above, the table is bucketed (clustered by) userid and within 
each bucket the data is sorted in the increasing order of viewTime. Such an 
organization allows the user to do efficient sampling on the clustered column - 
in this case userid. The sorting property allows internal operators to take 
advantage of the better-known data structure while evaluating queries, also 
increasing efficiency.
  {{{    
      CREATE TABLE page_view(viewTime INT, userid BIGINT,
                      page_url STRING, referrer_url STRING, 
@@ -341, +341 @@

  (Note that page_views.date is used here because the table (above) was defined 
with PARTITIONED BY(date DATETIME, country STRING) ; if you name your partition 
something different, don't expect .date to do what you think!) 
  
  == Joins ==
- In order to get a demographic breakdown(by gender) of page_view of 2008-03-03 
one would need to join the page_view table and the user table on the userid 
column. This can be accomplished with a join as shown in the following query: 
+ In order to get a demographic breakdown (by gender) of page_view of 
2008-03-03 one would need to join the page_view table and the user table on the 
userid column. This can be accomplished with a join as shown in the following 
query: 
  {{{     
      INSERT OVERWRITE TABLE pv_users 
      SELECT pv.*, u.gender, u.age

[Hadoop Wiki] Trivial Update of "Hive/Tutorial" by NeilConway

Reply via email to