Select" by RaghothamMurthy

Apache Wiki Wed, 21 Jan 2009 17:08:44 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The following page has been changed by RaghothamMurthy:
http://wiki.apache.org/hadoop/Hive/LanguageManual/Select

New page:
== Select Syntax ==
{{{
SELECT [DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition] 
[GROUP BY col_list]
[   CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]
}}}

 * table_reference indicates the input to the query. It can be a regular table, 
a [wiki:Self:Hive/LanguageManual/Joins join construct] or a 
[wiki:Self:Hive/LanguageManual/SubQuery subquery].
 * Simple query. For example, the following query retrieves all columns and all 
rows from table t1.
{{{
SELECT * FROM t1
}}}
 * Where clause - The where condition is a [wiki:Self:Hive/LanguageManual/Types 
boolean] [wiki:Self:Hive/LanguageManual/Expressions expression]. For example, 
the following query returns only those sales records which have an amount 
greater than 10 from the US region. Hive does not support IN, EXISTS or 
subqueries in the WHERE clause.
{{{
SELECT * FROM sales WHERE amount > 10 AND region = "US"
}}}

 * Partition based queries. In general, a SELECT query scans the entire table 
(other than for [wiki:Self:Hive/LanguageManual/Sampling sampling]). If a table 
created using the [wiki:Self:Hive/LanguageManual/DDL PARTITIONED BY] clause, a 
query can do '''input pruning''' and scan only a fraction of the table relevant 
to the query. For example, if table page_views is partitioned on column date, 
the following query retrieves rows for just one day 2008-03-31.
{{{
    SELECT page_views.* 
    FROM page_views 
    WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'
}}}

 * [wiki:Self:Hive/LanguageManual/GroupBy Group By]
{{{
SELECT col1, COUNT(col2), sum(col3) FROM t1 GROUP BY col1
}}}

 * [wiki:Self:Hive/LanguageManual/ClusterBy Cluster By]
{{{
SELECT col1, col2 FROM t1 CLUSTER BY col1
}}}

 * [wiki:Self:Hive/LanguageManual/ClusterBy Distribute By and Sort By]
{{{
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1

SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1, col2
}}}

 * Order By - Hive currently does not support ORDER BY. A similar effect can be 
gotten by using SORT BY and setting number of reducers to 1. The following 
query does ORDER BY col1. Note however that this query can take a long time if 
the size of t1 is large since there is only one reducer.
{{{
SET mapred.reduce.tasks = 1
SELECT * FROM t1 SORT BY col1
}}}

  * Having - Hive currently does not support HAVING clause. A similar effect 
can be gotten by using a subquery. For example, 
{{{
SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10
}}}
can be rewritten as
{{{
SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 
WHERE t2.col2sum > 10
}}}

 * Limit indicates the number of rows to be returned. The rows returned are 
chosen at random. The following query returns 5 rows from t1 at random.
{{{
SELECT * FROM t1 LIMIT 5
}}}

 * Top k queries. The following query returns the top 5 sales records wrt 
amount.
{{{
SET mapred.reduce.tasks = 1
SELECT * FROM sales SORT BY amount DESC LIMIT 5
}}}

[Hadoop Wiki] Trivial Update of "Hive/LanguageManual/Select" by RaghothamMurthy

Reply via email to