SortBy" by ZhengShao

Apache Wiki Thu, 22 Jan 2009 14:52:59 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The following page has been changed by ZhengShao:
http://wiki.apache.org/hadoop/Hive/LanguageManual/SortBy

------------------------------------------------------------------------------
  sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
  query: SELECT expression (',' expression)* FROM src sortBy
  }}}
+ 
+ Hive uses the columns in ''SORT BY'' to sort the rows before feeding the rows 
to a single reducer.  The sort order will be dependent on the column types.  If 
the column is of numeric type, then the sort order is also in numeric order.  
If the column is of string type, then the sort order will be lexicographical 
order.
+ 
  
  == Difference between Sort By and Order By ==
  Most database systems supports ''ORDER BY'', which Hive does not support 
directly.
@@ -46, +49 @@

  SELECT key, value FROM src SORT BY key ASC, value DESC;
  }}}
  
+ This sometimes will make the reducer a performance bottleneck.  A lot of 
cases the user only wants to see the top N rows where N is a small number.  In 
this case, we can use LIMIT clause.  We don't have an example here but users 
are encouraged to provide one.
+ 
+ == Syntax of Cluster By and Distribute By ==
+ 
+ ''Cluster By'' and ''Distribute By'' are used mainly with the 
[wiki:Self:Hive/LanguageManual/Transform Transform/Map-Reduce Scripts]. But, it 
is sometimes useful in SELECT statements if there is a need to partition and 
sort the output of a query for subsequent queries.
+ 
+ ''Cluster By'' is a short-cut for both ''Distribute By'' and ''Sort By''.
+ 
+ Hive uses the columns in ''Distribute By'' to distribute the rows among 
reducers.  All rows with the same ''Distribute By'' columns will go to the same 
reducer.
+ 
+ Instead of specifying ''Cluster By'', the user can specify ''Distribute By'' 
and ''Sort By'', so the partition columns and sort columns can be different. 
The usual case is that the partition columns are a prefix of sort columns, but 
that is not required.
+ 
+ 
+ {{{
+ SELECT col1, col2 FROM t1 CLUSTER BY col1
+ }}}
+ 
+ {{{
+ SELECT col1, col2 FROM t1 DISTRIBUTE BY col1
+ 
+ SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC
+ }}}
+ 
+ {{{
+   FROM (
+     FROM pv_users
+     MAP ( pv_users.userid, pv_users.date )
+     USING 'map_script'
+     AS c1, c2, c3
+     DISTRIBUTE BY c2
+     SORT BY c2, c1) map_output
+   INSERT OVERWRITE TABLE pv_users_reduced
+     REDUCE ( map_output.c1, map_output.c2, map_output.c3 )
+     USING 'reduce_script'
+     AS date, count;
+ }}}
+

[Hadoop Wiki] Trivial Update of "Hive/LanguageManual/SortBy" by ZhengShao

Reply via email to