Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by RaghothamMurthy: http://wiki.apache.org/hadoop/Hive/LanguageManual/Select New page: == Select Syntax == {{{ SELECT [DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [ CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] ] [LIMIT number] }}} * table_reference indicates the input to the query. It can be a regular table, a [wiki:Self:Hive/LanguageManual/Joins join construct] or a [wiki:Self:Hive/LanguageManual/SubQuery subquery]. * Simple query. For example, the following query retrieves all columns and all rows from table t1. {{{ SELECT * FROM t1 }}} * Where clause - The where condition is a [wiki:Self:Hive/LanguageManual/Types boolean] [wiki:Self:Hive/LanguageManual/Expressions expression]. For example, the following query returns only those sales records which have an amount greater than 10 from the US region. Hive does not support IN, EXISTS or subqueries in the WHERE clause. {{{ SELECT * FROM sales WHERE amount > 10 AND region = "US" }}} * Partition based queries. In general, a SELECT query scans the entire table (other than for [wiki:Self:Hive/LanguageManual/Sampling sampling]). If a table created using the [wiki:Self:Hive/LanguageManual/DDL PARTITIONED BY] clause, a query can do '''input pruning''' and scan only a fraction of the table relevant to the query. For example, if table page_views is partitioned on column date, the following query retrieves rows for just one day 2008-03-31. {{{ SELECT page_views.* FROM page_views WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' }}} * [wiki:Self:Hive/LanguageManual/GroupBy Group By] {{{ SELECT col1, COUNT(col2), sum(col3) FROM t1 GROUP BY col1 }}} * [wiki:Self:Hive/LanguageManual/ClusterBy Cluster By] {{{ SELECT col1, col2 FROM t1 CLUSTER BY col1 }}} * [wiki:Self:Hive/LanguageManual/ClusterBy Distribute By and Sort By] {{{ SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1, col2 }}} * Order By - Hive currently does not support ORDER BY. A similar effect can be gotten by using SORT BY and setting number of reducers to 1. The following query does ORDER BY col1. Note however that this query can take a long time if the size of t1 is large since there is only one reducer. {{{ SET mapred.reduce.tasks = 1 SELECT * FROM t1 SORT BY col1 }}} * Having - Hive currently does not support HAVING clause. A similar effect can be gotten by using a subquery. For example, {{{ SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10 }}} can be rewritten as {{{ SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10 }}} * Limit indicates the number of rows to be returned. The rows returned are chosen at random. The following query returns 5 rows from t1 at random. {{{ SELECT * FROM t1 LIMIT 5 }}} * Top k queries. The following query returns the top 5 sales records wrt amount. {{{ SET mapred.reduce.tasks = 1 SELECT * FROM sales SORT BY amount DESC LIMIT 5 }}}
