Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by ZhengShao: http://wiki.apache.org/hadoop/Hive/LanguageManual/SortBy ------------------------------------------------------------------------------ sortBy: SORT BY colName colOrder? (',' colName colOrder?)* query: SELECT expression (',' expression)* FROM src sortBy }}} + + Hive uses the columns in ''SORT BY'' to sort the rows before feeding the rows to a single reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order. + == Difference between Sort By and Order By == Most database systems supports ''ORDER BY'', which Hive does not support directly. @@ -46, +49 @@ SELECT key, value FROM src SORT BY key ASC, value DESC; }}} + This sometimes will make the reducer a performance bottleneck. A lot of cases the user only wants to see the top N rows where N is a small number. In this case, we can use LIMIT clause. We don't have an example here but users are encouraged to provide one. + + == Syntax of Cluster By and Distribute By == + + ''Cluster By'' and ''Distribute By'' are used mainly with the [wiki:Self:Hive/LanguageManual/Transform Transform/Map-Reduce Scripts]. But, it is sometimes useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries. + + ''Cluster By'' is a short-cut for both ''Distribute By'' and ''Sort By''. + + Hive uses the columns in ''Distribute By'' to distribute the rows among reducers. All rows with the same ''Distribute By'' columns will go to the same reducer. + + Instead of specifying ''Cluster By'', the user can specify ''Distribute By'' and ''Sort By'', so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required. + + + {{{ + SELECT col1, col2 FROM t1 CLUSTER BY col1 + }}} + + {{{ + SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 + + SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC + }}} + + {{{ + FROM ( + FROM pv_users + MAP ( pv_users.userid, pv_users.date ) + USING 'map_script' + AS c1, c2, c3 + DISTRIBUTE BY c2 + SORT BY c2, c1) map_output + INSERT OVERWRITE TABLE pv_users_reduced + REDUCE ( map_output.c1, map_output.c2, map_output.c3 ) + USING 'reduce_script' + AS date, count; + }}} +
