GroupBy" by ZhengShao

Apache Wiki Wed, 21 Jan 2009 15:11:21 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The following page has been changed by ZhengShao:
http://wiki.apache.org/hadoop/Hive/HiveQL/GroupBy

New page:

== Group By Syntax ==
{{{

}}}

== Simple Examples ==
In order to count the number of distinct users by gender one could write the 
following query:
{{{
  INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count (DISTINCT pv_users.userid)
  FROM pv_users
  GROUP BY pv_users.gender;
}}}

Multiple aggregations can be done at the same time, however, no two 
aggregations can have different DISTINCT columns .e.g while the following is 
possible

{{{
  INSERT OVERWRITE TABLE pv_gender_agg
  SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(1), 
sum(DISTINCT pv_users.userid)
  FROM pv_users
  GROUP BY pv_users.gender;
}}}

However, the following query is not allowed.  We don't allow multiple DISTINCT 
expressions in the same query.

{{{
  INSERT OVERWRITE TABLE pv_gender_agg
  SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT 
pv_users.ip)
  FROM pv_users
  GROUP BY pv_users.gender;
}}}


== Advanced Features ==

=== Multi-Group-By Inserts ===
The output of the aggregations or simple selects can be further sent into 
multiple tables or even to hadoop dfs files (which can then be manipulated 
using hdfs utilitites). e.g. if along with the gender breakdown, one needed to 
find the breakdown of unique page views by age, one could accomplish that with 
the following query:

{{{
  FROM pv_users 
  INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count_distinct(pv_users.userid) 
    GROUP BY pv_users.gender 
  INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
    SELECT pv_users.age, count_distinct(pv_users.userid) 
    GROUP BY pv_users.age; 
}}}


=== Map-side Aggregation for Group By ===
''hive.map.aggr'' controls how we do aggregations.  The default is false.  If 
it is set to true, Hive will do the first-level aggregation directly in the map 
task.
This usually provides better efficiency, but may require more memory to run 
successfully.

{{{
  set hive.map.aggr=true;
  SELECT COUNT(1) FROM table2;
}}}

[Hadoop Wiki] Trivial Update of "Hive/HiveQL/GroupBy" by ZhengShao

Reply via email to