http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_analytic_functions.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_analytic_functions.xml b/docs/topics/impala_analytic_functions.xml new file mode 100644 index 0000000..293a512 --- /dev/null +++ b/docs/topics/impala_analytic_functions.xml @@ -0,0 +1,1742 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.0.0" id="analytic_functions"> + + <title>Impala Analytic Functions</title> + + <titlealts> + + <navtitle>Analytic Functions</navtitle> + + </titlealts> + + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="Aggregate Functions"/> + <data name="Category" value="Analytic Functions"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">analytic functions</indexterm> + + <indexterm audience="Cloudera">window functions</indexterm> + Analytic functions (also known as window functions) are a special category of built-in functions. Like + aggregate functions, they examine the contents of multiple input rows to compute each output value. However, + rather than being limited to one result value per <codeph>GROUP BY</codeph> group, they operate on + <term>windows</term> where the input rows are ordered and grouped using flexible conditions expressed through + an <codeph>OVER()</codeph> clause. + </p> + + <p conref="../shared/impala_common.xml#common/added_in_20"/> + +<!-- + <p> + Analytic functions produce one output value for each input row, like scalar functions such as + <codeph>length()</codeph> or + <codeph>substr()</codeph>. + </p> +--> + + <p> + Some functions, such as <codeph>LAG()</codeph> and <codeph>RANK()</codeph>, can only be used in this analytic + context. Some aggregate functions do double duty: when you call the aggregation functions such as + <codeph>MAX()</codeph>, <codeph>SUM()</codeph>, <codeph>AVG()</codeph>, and so on with an + <codeph>OVER()</codeph> clause, they produce an output value for each row, based on computations across other + rows in the window. + </p> + + <p> + Although analytic functions often compute the same value you would see from an aggregate function in a + <codeph>GROUP BY</codeph> query, the analytic functions produce a value for each row in the result set rather + than a single value for each group. This flexibility lets you include additional columns in the + <codeph>SELECT</codeph> list, offering more opportunities for organizing and filtering the result set. + </p> + + <p> + Analytic function calls are only allowed in the <codeph>SELECT</codeph> list and in the outermost + <codeph>ORDER BY</codeph> clause of the query. During query processing, analytic functions are evaluated + after other query stages such as joins, <codeph>WHERE</codeph>, and <codeph>GROUP BY</codeph>, + </p> + +<!-- Oracle doesn't show examples until it gets to the actual functions, so let's follow that lead. + <p> + The following example shows a very simple call to <codeph>MAX()</codeph> in + an analytic context, and a similar query using a <codeph>GROUP BY</codeph> clause. + </p> +--> + +<!-- +This basic query could be represented either as an analytic +function call or an aggregation function call in a <codeph>GROUP BY</codeph> query. +For more elaborate kinds of computations, the flexibility of the analytic window +makes that the preferred option. +--> + +<!-- TK: construct sample data and fill in query results. --> + +<!-- Other DB docs don't necessarily include examples up at this level, only for the individual functions. + So maybe take these placeholders out entirely. + +<codeblock>SELECT year, month, max(degrees_c) OVER (PARTITION BY year) FROM historical_temps; +SELECT year, month, max(degrees_c) FROM historical_temps GROUP BY year; +</codeblock> +--> + + <p> + The rows that are part of each partition are analyzed by computations across an ordered or unordered set of + rows. For example, <codeph>COUNT()</codeph> and <codeph>SUM()</codeph> might be applied to all the rows in + the partition, in which case the order of analysis does not matter. The <codeph>ORDER BY</codeph> clause + might be used inside the <codeph>OVER()</codeph> clause to defines the ordering that applies to functions + such as <codeph>LAG()</codeph> and <codeph>FIRST_VALUE()</codeph>. + </p> + +<!-- TK: output needed here also. --> + +<!-- +<codeblock>SELECT year, month, max(degrees_c) OVER (PARTITION BY year ORDER BY MONTH DESC) FROM historical_temps; +</codeblock> +--> + + <p> + Analytic functions are frequently used in fields such as finance and science to provide trend, outlier, and + bucketed analysis for large data sets. You might also see the term <q>window functions</q> in database + literature, referring to the sequence of rows (the <q>window</q>) that the function call applies to, + particularly when the <codeph>OVER</codeph> clause includes a <codeph>ROWS</codeph> or <codeph>RANGE</codeph> + keyword. + </p> + + <p> + The following sections describe the analytic query clauses and the pure analytic functions provided by + Impala. For usage information about aggregate functions in an analytic context, see + <xref href="impala_aggregate_functions.xml#aggregate_functions"/>. + </p> + + <p outputclass="toc inpage"/> + + </conbody> + + <concept id="over"> + + <title>OVER Clause</title> + + <conbody> + + <p> + The <codeph>OVER</codeph> clause is required for calls to pure analytic functions such as + <codeph>LEAD()</codeph>, <codeph>RANK()</codeph>, and <codeph>FIRST_VALUE()</codeph>. When you include an + <codeph>OVER</codeph> clause with calls to aggregate functions such as <codeph>MAX()</codeph>, + <codeph>COUNT()</codeph>, or <codeph>SUM()</codeph>, they operate as analytic functions. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>function(<varname>args</varname>) OVER([<varname>partition_by_clause</varname>] [<varname>order_by_clause</varname> [<varname>window_clause</varname>]]) + +partition_by_clause ::= PARTITION BY <varname>expr</varname> [, <varname>expr</varname> ...] +order_by_clause ::= ORDER BY <varname>expr</varname> [ASC | DESC] [NULLS FIRST | NULLS LAST] [, <varname>expr</varname> [ASC | DESC] [NULLS FIRST | NULLS LAST] ...] +window_clause: See <xref href="#window_clause">Window Clause</xref> +</codeblock> + + <p> + <b>PARTITION BY clause:</b> + </p> + + <p> + The <codeph>PARTITION BY</codeph> clause acts much like the <codeph>GROUP BY</codeph> clause in the + outermost block of a query. It divides the rows into groups containing identical values in one or more + columns. These logical groups are known as <term>partitions</term>. Throughout the discussion of analytic + functions, <q>partitions</q> refers to the groups produced by the <codeph>PARTITION BY</codeph> clause, not + to partitioned tables. + </p> + + <p> + The sequence of results from an analytic function <q>resets</q> for each new partition in the result set. + That is, the set of preceding or following rows considered by the analytic function always come from a + single partition. Any <codeph>MAX()</codeph>, <codeph>SUM()</codeph>, <codeph>ROW_NUMBER()</codeph>, and so + on apply to each partition independently. Omit the <codeph>PARTITION BY</codeph> clause to apply the + analytic operation to all the rows in the table. + </p> + + <p> + <b>ORDER BY clause:</b> + </p> + + <p> + The <codeph>ORDER BY</codeph> clause works much like the <codeph>ORDER BY</codeph> clause in the outermost + block of a query. It defines the order in which rows are evaluated for the entire input set, or for each + group produced by a <codeph>PARTITION BY</codeph> clause. You can order by one or multiple expressions, and + for each expression optionally choose ascending or descending order and whether nulls come first or last in + the sort order. Because this <codeph>ORDER BY</codeph> clause only defines the order in which rows are + evaluated, if you want the results to be output in a specific order, also include an <codeph>ORDER + BY</codeph> clause in the outer block of the query. + </p> + + <p> + When the <codeph>ORDER BY</codeph> clause is omitted, the analytic function applies to all items in the + group produced by the <codeph>PARTITION BY</codeph> clause. When the <codeph>ORDER BY</codeph> clause is + included, the analysis can apply to all or a subset of the items in the group, depending on the optional + window clause. + </p> + + <p> + The order in which the rows are analyzed is only defined for those columns specified in <codeph>ORDER + BY</codeph> clauses. + </p> + + <p> + One difference between the analytic and outer uses of the <codeph>ORDER BY</codeph> clause: inside the + <codeph>OVER</codeph> clause, <codeph>ORDER BY 1</codeph> or other integer value is interpreted as a + constant sort value (effectively a no-op) rather than referring to column 1. + </p> + + <p> + <b>Window clause:</b> + </p> + + <p> + The window clause is only allowed in combination with an <codeph>ORDER BY</codeph> clause. If the + <codeph>ORDER BY</codeph> clause is specified but the window clause is not, the default window is + <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>. See + <xref href="impala_analytic_functions.xml#window_clause"/> for full details. + </p> + +<!-- + <p conref="/Content/impala_common_xi44078.xml#common/usage_notes_blurb"/> +--> + + <p conref="../shared/impala_common.xml#common/hbase_blurb"/> + + <p> + Because HBase tables are optimized for single-row lookups rather than full scans, analytic functions using + the <codeph>OVER()</codeph> clause are not recommended for HBase tables. Although such queries work, their + performance is lower than on comparable tables using HDFS data files. + </p> + + <p conref="../shared/impala_common.xml#common/parquet_blurb"/> + + <p> + Analytic functions are very efficient for Parquet tables. The data that is examined during evaluation of + the <codeph>OVER()</codeph> clause comes from a specified set of columns, and the values for each column + are arranged sequentially within each data file. + </p> + + <p conref="../shared/impala_common.xml#common/text_blurb"/> + + <p> + Analytic functions are convenient to use with text tables for exploratory business intelligence. When the + volume of data is substantial, prefer to use Parquet tables for performance-critical analytic queries. + </p> + + <p conref="../shared/impala_common.xml#common/added_in_20"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example shows how to synthesize a numeric sequence corresponding to all the rows in a table. + The new table has the same columns as the old one, plus an additional column <codeph>ID</codeph> containing + the integers 1, 2, 3, and so on, corresponding to the order of a <codeph>TIMESTAMP</codeph> column in the + original table. + </p> + +<!-- TK: synthesize some data and fill in output here. --> + +<codeblock>CREATE TABLE events_with_id AS + SELECT + row_number() OVER (ORDER BY date_and_time) AS id, + c1, c2, c3, c4 + FROM events; +</codeblock> + + <p> + The following example shows how to determine the number of rows containing each value for a column. Unlike + a corresponding <codeph>GROUP BY</codeph> query, this one can analyze a single column and still return all + values (not just the distinct ones) from the other columns. + </p> + +<!-- TK: verify the 'unbounded' shortcut syntax. --> + +<codeblock>SELECT x, y, z, + count() OVER (PARTITION BY x) AS how_many_x +FROM t1; +</codeblock> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p> + You cannot directly combine the <codeph>DISTINCT</codeph> operator with analytic function calls. You can + put the analytic function call in a <codeph>WITH</codeph> clause or an inline view, and apply the + <codeph>DISTINCT</codeph> operator to its result set. + </p> + +<codeblock>WITH t1 AS (SELECT x, sum(x) OVER (PARTITION BY x) AS total FROM t1) + SELECT DISTINCT x, total FROM t1; +</codeblock> + + </conbody> + + </concept> + + <concept id="window_clause"> + + <title>Window Clause</title> + + <conbody> + + <p> + Certain analytic functions accept an optional <term>window clause</term>, which makes the function analyze + only certain rows <q>around</q> the current row rather than all rows in the partition. For example, you can + get a moving average by specifying some number of preceding and following rows, or a running count or + running total by specifying all rows up to the current position. This clause can result in different + analytic results for rows within the same partition. + </p> + + <p> + The window clause is supported with the <codeph>AVG()</codeph>, <codeph>COUNT()</codeph>, + <codeph>FIRST_VALUE()</codeph>, <codeph>LAST_VALUE()</codeph>, and <codeph>SUM()</codeph> functions. +<!-- To do: fill in this factoid under MAX and MIN also. --> + For <codeph>MAX()</codeph> and <codeph>MIN()</codeph>, the window clause only allowed if the start bound is + <codeph>UNBOUNDED PRECEDING</codeph> + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>ROWS BETWEEN [ { <varname>m</varname> | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | <varname>n</varname> } FOLLOWING] ] +RANGE BETWEEN [ {<varname>m</varname> | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | <varname>n</varname> } FOLLOWING] ]</codeblock> + + <p> + <codeph>ROWS BETWEEN</codeph> defines the size of the window in terms of the indexes of the rows in the + result set. The size of the window is predictable based on the clauses the position within the result set. + </p> + + <p> + <codeph>RANGE BETWEEN</codeph> does not currently support numeric arguments to define a variable-size + sliding window. +<!-- +Currently, it effectively works the same as the +equivalent <codeph>ROWS BETWEEN</codeph> clause. +--> + </p> + +<!-- +<p> +<codeph>RANGE BETWEEN</codeph> defines the size of the window based on arithmetic comparisons +of the values in the result set. +The size of the window varies depending on the order and distribution of values. +</p> +--> + +<!-- <p conref="/Content/impala_common_xi44078.xml#common/restrictions_blurb"/> --> + + <p> + Currently, Impala supports only some combinations of arguments to the <codeph>RANGE</codeph> clause: + </p> + + <ul> + <li> + <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> (the default when <codeph>ORDER + BY</codeph> is specified and the window clause is omitted) + </li> + + <li> + <codeph>RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING</codeph> + </li> + + <li> + <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING</codeph> + </li> + </ul> + + <p> + When <codeph>RANGE</codeph> is used, <codeph>CURRENT ROW</codeph> includes not just the current row but all + rows that are tied with the current row based on the <codeph>ORDER BY</codeph> expressions. + </p> + + <p conref="../shared/impala_common.xml#common/added_in_20"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following examples show financial data for a fictional stock symbol <codeph>JDR</codeph>. The closing + price moves up and down each day. + </p> + +<codeblock>create table stock_ticker (stock_symbol string, closing_price decimal(8,2), closing_date timestamp); +...load some data... +select * from stock_ticker order by stock_symbol, closing_date ++--------------+---------------+---------------------+ +| stock_symbol | closing_price | closing_date | ++--------------+---------------+---------------------+ +| JDR | 12.86 | 2014-10-02 00:00:00 | +| JDR | 12.89 | 2014-10-03 00:00:00 | +| JDR | 12.94 | 2014-10-04 00:00:00 | +| JDR | 12.55 | 2014-10-05 00:00:00 | +| JDR | 14.03 | 2014-10-06 00:00:00 | +| JDR | 14.75 | 2014-10-07 00:00:00 | +| JDR | 13.98 | 2014-10-08 00:00:00 | ++--------------+---------------+---------------------+ +</codeblock> + + <p> + The queries use analytic functions with window clauses to compute moving averages of the closing price. For + example, <codeph>ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING</codeph> produces an average of the value from a + 3-day span, producing a different value for each row. The first row, which has no preceding row, only gets + averaged with the row following it. If the table contained more than one stock symbol, the + <codeph>PARTITION BY</codeph> clause would limit the window for the moving average to only consider the + prices for a single stock. + </p> + +<codeblock>select stock_symbol, closing_date, closing_price, + avg(closing_price) over (partition by stock_symbol order by closing_date + rows between 1 preceding and 1 following) as moving_average + from stock_ticker; ++--------------+---------------------+---------------+----------------+ +| stock_symbol | closing_date | closing_price | moving_average | ++--------------+---------------------+---------------+----------------+ +| JDR | 2014-10-02 00:00:00 | 12.86 | 12.87 | +| JDR | 2014-10-03 00:00:00 | 12.89 | 12.89 | +| JDR | 2014-10-04 00:00:00 | 12.94 | 12.79 | +| JDR | 2014-10-05 00:00:00 | 12.55 | 13.17 | +| JDR | 2014-10-06 00:00:00 | 14.03 | 13.77 | +| JDR | 2014-10-07 00:00:00 | 14.75 | 14.25 | +| JDR | 2014-10-08 00:00:00 | 13.98 | 14.36 | ++--------------+---------------------+---------------+----------------+ +</codeblock> + + <p> + The clause <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> produces a cumulative moving + average, from the earliest data up to the value for each day. + </p> + +<codeblock>select stock_symbol, closing_date, closing_price, + avg(closing_price) over (partition by stock_symbol order by closing_date + rows between unbounded preceding and current row) as moving_average + from stock_ticker; ++--------------+---------------------+---------------+----------------+ +| stock_symbol | closing_date | closing_price | moving_average | ++--------------+---------------------+---------------+----------------+ +| JDR | 2014-10-02 00:00:00 | 12.86 | 12.86 | +| JDR | 2014-10-03 00:00:00 | 12.89 | 12.87 | +| JDR | 2014-10-04 00:00:00 | 12.94 | 12.89 | +| JDR | 2014-10-05 00:00:00 | 12.55 | 12.81 | +| JDR | 2014-10-06 00:00:00 | 14.03 | 13.05 | +| JDR | 2014-10-07 00:00:00 | 14.75 | 13.33 | +| JDR | 2014-10-08 00:00:00 | 13.98 | 13.42 | ++--------------+---------------------+---------------+----------------+ +</codeblock> + +<!-- Matt suggests not always true depending on data. Hiding until I can try myself. +<p> +The clause <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> would produce the same +output as above. Because <codeph>RANGE</codeph> currently does not support numeric offsets while +<codeph>ROWS</codeph> does, currently the <codeph>ROWS</codeph> syntax is more flexible. +</p> +--> + + </conbody> + + </concept> + + <concept id="avg_analytic"> + + <title>AVG() Function - Analytic Context</title> + + <conbody> + + <p> + You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic + function. See <xref href="impala_avg.xml#avg"/> for details and examples. + </p> + + </conbody> + + </concept> + + <concept id="count_analytic"> + + <title>COUNT() Function - Analytic Context</title> + + <conbody> + + <p> + You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic + function. See <xref href="impala_count.xml#count"/> for details and examples. + </p> + + </conbody> + + </concept> + + <concept rev="2.3.0" id="cume_dist"> + + <title>CUME_DIST() Function (CDH 5.5 or higher only)</title> + + <conbody> + + <p> + Returns the cumulative distribution of a value. The value for each row in the result set is greater than 0 + and less than or equal to 1. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>CUME_DIST (<varname>expr</varname>) + OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>) +</codeblock> + + <p> + The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The + window clause is not allowed. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Within each partition of the result set, the <codeph>CUME_DIST()</codeph> value represents an ascending + sequence that ends at 1. Each value represents the proportion of rows in the partition whose values are + less than or equal to the value in the current row. + </p> + + <p> + If the sequence of input values contains ties, the <codeph>CUME_DIST()</codeph> results are identical for the + tied values. + </p> + + <p> + Impala only supports the <codeph>CUME_DIST()</codeph> function in an analytic context, not as a regular + aggregate function. + </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + This example uses a table with 9 rows. The <codeph>CUME_DIST()</codeph> + function evaluates the entire table because there is no <codeph>PARTITION BY</codeph> clause, + with the rows ordered by the weight of the animal. + the sequence of values shows that 1/9 of the values are less than or equal to the lightest + animal (mouse), 2/9 of the values are less than or equal to the second-lightest animal, + and so on up to the heaviest animal (elephant), where 9/9 of the rows are less than or + equal to its weight. + </p> + +<codeblock>create table animals (name string, kind string, kilos decimal(9,3)); +insert into animals values + ('Elephant', 'Mammal', 4000), ('Giraffe', 'Mammal', 1200), ('Mouse', 'Mammal', 0.020), + ('Condor', 'Bird', 15), ('Horse', 'Mammal', 500), ('Owl', 'Bird', 2.5), + ('Ostrich', 'Bird', 145), ('Polar bear', 'Mammal', 700), ('Housecat', 'Mammal', 5); + +select name, cume_dist() over (order by kilos) from animals; ++------------+-----------------------+ +| name | cume_dist() OVER(...) | ++------------+-----------------------+ +| Elephant | 1 | +| Giraffe | 0.8888888888888888 | +| Polar bear | 0.7777777777777778 | +| Horse | 0.6666666666666666 | +| Ostrich | 0.5555555555555556 | +| Condor | 0.4444444444444444 | +| Housecat | 0.3333333333333333 | +| Owl | 0.2222222222222222 | +| Mouse | 0.1111111111111111 | ++------------+-----------------------+ +</codeblock> + + <p> + Using a <codeph>PARTITION BY</codeph> clause produces a separate sequence for each partition + group, in this case one for mammals and one for birds. Because there are 3 birds and 6 mammals, + the sequence illustrates how 1/3 of the <q>Bird</q> rows have a <codeph>kilos</codeph> value that is less than or equal to + the lightest bird, 1/6 of the <q>Mammal</q> rows have a <codeph>kilos</codeph> value that is less than or equal to + the lightest mammal, and so on until both the heaviest bird and heaviest mammal have a <codeph>CUME_DIST()</codeph> + value of 1. + </p> + +<codeblock>select name, kind, cume_dist() over (partition by kind order by kilos) from animals ++------------+--------+-----------------------+ +| name | kind | cume_dist() OVER(...) | ++------------+--------+-----------------------+ +| Ostrich | Bird | 1 | +| Condor | Bird | 0.6666666666666666 | +| Owl | Bird | 0.3333333333333333 | +| Elephant | Mammal | 1 | +| Giraffe | Mammal | 0.8333333333333334 | +| Polar bear | Mammal | 0.6666666666666666 | +| Horse | Mammal | 0.5 | +| Housecat | Mammal | 0.3333333333333333 | +| Mouse | Mammal | 0.1666666666666667 | ++------------+--------+-----------------------+ +</codeblock> + + <p> + We can reverse the ordering within each partition group by using an <codeph>ORDER BY ... DESC</codeph> + clause within the <codeph>OVER()</codeph> clause. Now the lightest (smallest value of <codeph>kilos</codeph>) + animal of each kind has a <codeph>CUME_DIST()</codeph> value of 1. + </p> + +<codeblock>select name, kind, cume_dist() over (partition by kind order by kilos desc) from animals ++------------+--------+-----------------------+ +| name | kind | cume_dist() OVER(...) | ++------------+--------+-----------------------+ +| Owl | Bird | 1 | +| Condor | Bird | 0.6666666666666666 | +| Ostrich | Bird | 0.3333333333333333 | +| Mouse | Mammal | 1 | +| Housecat | Mammal | 0.8333333333333334 | +| Horse | Mammal | 0.6666666666666666 | +| Polar bear | Mammal | 0.5 | +| Giraffe | Mammal | 0.3333333333333333 | +| Elephant | Mammal | 0.1666666666666667 | ++------------+--------+-----------------------+ +</codeblock> + + <p> + The following example manufactures some rows with identical values in the <codeph>kilos</codeph> column, + to demonstrate how the results look in case of tie values. For simplicity, it only shows the <codeph>CUME_DIST()</codeph> + sequence for the <q>Bird</q> rows. Now with 3 rows all with a value of 15, all of those rows have the same + <codeph>CUME_DIST()</codeph> value. 4/5 of the rows have a value for <codeph>kilos</codeph> that is less than or + equal to 15. + </p> + +<codeblock>insert into animals values ('California Condor', 'Bird', 15), ('Andean Condor', 'Bird', 15) + +select name, kind, cume_dist() over (order by kilos) from animals where kind = 'Bird'; ++-------------------+------+-----------------------+ +| name | kind | cume_dist() OVER(...) | ++-------------------+------+-----------------------+ +| Ostrich | Bird | 1 | +| Condor | Bird | 0.8 | +| California Condor | Bird | 0.8 | +| Andean Condor | Bird | 0.8 | +| Owl | Bird | 0.2 | ++-------------------+------+-----------------------+ +</codeblock> + + <p> + The following example shows how to use an <codeph>ORDER BY</codeph> clause in the outer block + to order the result set in case of ties. Here, all the <q>Bird</q> rows are together, then in descending order + by the result of the <codeph>CUME_DIST()</codeph> function, and all tied <codeph>CUME_DIST()</codeph> + values are ordered by the animal name. + </p> + +<codeblock>select name, kind, cume_dist() over (partition by kind order by kilos) as ordering + from animals +where + kind = 'Bird' +order by kind, ordering desc, name; ++-------------------+------+----------+ +| name | kind | ordering | ++-------------------+------+----------+ +| Ostrich | Bird | 1 | +| Andean Condor | Bird | 0.8 | +| California Condor | Bird | 0.8 | +| Condor | Bird | 0.8 | +| Owl | Bird | 0.2 | ++-------------------+------+----------+ +</codeblock> + + </conbody> + + </concept> + + <concept rev="2.0.0" id="dense_rank"> + + <title>DENSE_RANK() Function</title> + + <conbody> + + <p> + Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers + for duplicate values of the <codeph>ORDER BY</codeph> expressions. After generating duplicate output values + for the <q>tied</q> input values, the function continues the sequence with the next higher integer. + Therefore, the sequence contains duplicates but no gaps when the input contains duplicates. Starts the + sequence over for each group produced by the <codeph>PARTITIONED BY</codeph> clause. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>DENSE_RANK() OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock> + + <p> + The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The + window clause is not allowed. + </p> + +<!-- Can make the text for ROW_NUMBER, RANK, and DENSE_RANK identical + so it can be conref'ed in all 3 places. --> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Often used for top-N and bottom-N queries. For example, it could produce a <q>top 10</q> report including + all the items with the 10 highest values, even if several items tied for 1st place. + </p> + + <p> + Similar to <codeph>ROW_NUMBER</codeph> and <codeph>RANK</codeph>. These functions differ in how they treat + duplicate combinations of values. + </p> + + <p conref="../shared/impala_common.xml#common/added_in_20"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example demonstrates how the <codeph>DENSE_RANK()</codeph> function identifies where each + value <q>places</q> in the result set, producing the same result for duplicate values, but with a strict + sequence from 1 to the number of groups. For example, when results are ordered by the <codeph>X</codeph> + column, both <codeph>1</codeph> values are tied for first; both <codeph>2</codeph> values are tied for + second; and so on. + </p> + +<codeblock>select x, dense_rank() over(order by x) as rank, property from int_t; ++----+------+----------+ +| x | rank | property | ++----+------+----------+ +| 1 | 1 | square | +| 1 | 1 | odd | +| 2 | 2 | even | +| 2 | 2 | prime | +| 3 | 3 | prime | +| 3 | 3 | odd | +| 4 | 4 | even | +| 4 | 4 | square | +| 5 | 5 | odd | +| 5 | 5 | prime | +| 6 | 6 | even | +| 6 | 6 | perfect | +| 7 | 7 | lucky | +| 7 | 7 | lucky | +| 7 | 7 | lucky | +| 7 | 7 | odd | +| 7 | 7 | prime | +| 8 | 8 | even | +| 9 | 9 | square | +| 9 | 9 | odd | +| 10 | 10 | round | +| 10 | 10 | even | ++----+------+----------+ +</codeblock> + + <p> + The following examples show how the <codeph>DENSE_RANK()</codeph> function is affected by the + <codeph>PARTITION</codeph> property within the <codeph>ORDER BY</codeph> clause. + </p> + + <p> + Partitioning by the <codeph>PROPERTY</codeph> column groups all the even, odd, and so on values together, + and <codeph>DENSE_RANK()</codeph> returns the place of each value within the group, producing several + ascending sequences. + </p> + +<codeblock>select x, dense_rank() over(partition by property order by x) as rank, property from int_t; ++----+------+----------+ +| x | rank | property | ++----+------+----------+ +| 2 | 1 | even | +| 4 | 2 | even | +| 6 | 3 | even | +| 8 | 4 | even | +| 10 | 5 | even | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 1 | 1 | odd | +| 3 | 2 | odd | +| 5 | 3 | odd | +| 7 | 4 | odd | +| 9 | 5 | odd | +| 6 | 1 | perfect | +| 2 | 1 | prime | +| 3 | 2 | prime | +| 5 | 3 | prime | +| 7 | 4 | prime | +| 10 | 1 | round | +| 1 | 1 | square | +| 4 | 2 | square | +| 9 | 3 | square | ++----+------+----------+ +</codeblock> + + <p> + Partitioning by the <codeph>X</codeph> column groups all the duplicate numbers together and returns the + place each each value within the group; because each value occurs only 1 or 2 times, + <codeph>DENSE_RANK()</codeph> designates each <codeph>X</codeph> value as either first or second within its + group. + </p> + +<codeblock>select x, dense_rank() over(partition by x order by property) as rank, property from int_t; ++----+------+----------+ +| x | rank | property | ++----+------+----------+ +| 1 | 1 | odd | +| 1 | 2 | square | +| 2 | 1 | even | +| 2 | 2 | prime | +| 3 | 1 | odd | +| 3 | 2 | prime | +| 4 | 1 | even | +| 4 | 2 | square | +| 5 | 1 | odd | +| 5 | 2 | prime | +| 6 | 1 | even | +| 6 | 2 | perfect | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 7 | 2 | odd | +| 7 | 3 | prime | +| 8 | 1 | even | +| 9 | 1 | odd | +| 9 | 2 | square | +| 10 | 1 | even | +| 10 | 2 | round | ++----+------+----------+ +</codeblock> + + <p> + The following example shows how <codeph>DENSE_RANK()</codeph> produces a continuous sequence while still + allowing for ties. In this case, Croesus and Midas both have the second largest fortune, while Crassus has + the third largest. (In <xref href="impala_analytic_functions.xml#rank"/>, you see a similar query with the + <codeph>RANK()</codeph> function that shows that while Crassus has the third largest fortune, he is the + fourth richest person.) + </p> + +<codeblock>select dense_rank() over (order by net_worth desc) as placement, name, net_worth from wealth order by placement, name; ++-----------+---------+---------------+ +| placement | name | net_worth | ++-----------+---------+---------------+ +| 1 | Solomon | 2000000000.00 | +| 2 | Croesus | 1000000000.00 | +| 2 | Midas | 1000000000.00 | +| 3 | Crassus | 500000000.00 | +| 4 | Scrooge | 80000000.00 | ++-----------+---------+---------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_analytic_functions.xml#rank"/>, <xref href="impala_analytic_functions.xml#row_number"/> + </p> + + </conbody> + + </concept> + + <concept rev="2.0.0" id="first_value"> + + <title>FIRST_VALUE() Function</title> + + <conbody> + + <p> + Returns the expression value from the first row in the window. The return value is <codeph>NULL</codeph> if + the input expression is <codeph>NULL</codeph>. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>FIRST_VALUE(<varname>expr</varname>) OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname> [<varname>window_clause</varname>])</codeblock> + + <p> + The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The + window clause is optional. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + If any duplicate values occur in the tuples evaluated by the <codeph>ORDER BY</codeph> clause, the result + of this function is not deterministic. Consider adding additional <codeph>ORDER BY</codeph> columns to + ensure consistent ordering. + </p> + + <p conref="../shared/impala_common.xml#common/added_in_20"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example shows a table with a wide variety of country-appropriate greetings. For consistency, + we want to standardize on a single greeting for each country. The <codeph>FIRST_VALUE()</codeph> function + helps to produce a mail merge report where every person from the same country is addressed with the same + greeting. + </p> + +<codeblock>select name, country, greeting from mail_merge ++---------+---------+--------------+ +| name | country | greeting | ++---------+---------+--------------+ +| Pete | USA | Hello | +| John | USA | Hi | +| Boris | Germany | Guten tag | +| Michael | Germany | Guten morgen | +| Bjorn | Sweden | Hej | +| Mats | Sweden | Tja | ++---------+---------+--------------+ + +select country, name, + first_value(greeting) + over (partition by country order by name, greeting) as greeting + from mail_merge; ++---------+---------+-----------+ +| country | name | greeting | ++---------+---------+-----------+ +| Germany | Boris | Guten tag | +| Germany | Michael | Guten tag | +| Sweden | Bjorn | Hej | +| Sweden | Mats | Hej | +| USA | John | Hi | +| USA | Pete | Hi | ++---------+---------+-----------+ +</codeblock> + + <p> + Changing the order in which the names are evaluated changes which greeting is applied to each group. + </p> + +<codeblock>select country, name, + first_value(greeting) + over (partition by country order by name desc, greeting) as greeting + from mail_merge; ++---------+---------+--------------+ +| country | name | greeting | ++---------+---------+--------------+ +| Germany | Michael | Guten morgen | +| Germany | Boris | Guten morgen | +| Sweden | Mats | Tja | +| Sweden | Bjorn | Tja | +| USA | Pete | Hello | +| USA | John | Hello | ++---------+---------+--------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_analytic_functions.xml#last_value"/> + </p> + + </conbody> + + </concept> + + <concept rev="2.0.0" id="lag"> + + <title>LAG() Function</title> + + <conbody> + + <p> + This function returns the value of an expression using column values from a preceding row. You specify an + integer offset, which designates a row position some number of rows previous to the current row. Any column + references in the expression argument refer to column values from that prior row. Typically, the table + contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>LAG (<varname>expr</varname> [, <varname>offset</varname>] [, <varname>default</varname>]) + OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock> + + <p> + The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The + window clause is not allowed. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Sometimes used an an alternative to doing a self-join. + </p> + + <p conref="../shared/impala_common.xml#common/added_in_20"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example uses the same stock data created in <xref href="#window_clause"/>. For each day, the + query prints the closing price alongside the previous day's closing price. The first row for each stock + symbol has no previous row, so that <codeph>LAG()</codeph> value is <codeph>NULL</codeph>. + </p> + +<codeblock>select stock_symbol, closing_date, closing_price, + lag(closing_price,1) over (partition by stock_symbol order by closing_date) as "yesterday closing" + from stock_ticker + order by closing_date; ++--------------+---------------------+---------------+-------------------+ +| stock_symbol | closing_date | closing_price | yesterday closing | ++--------------+---------------------+---------------+-------------------+ +| JDR | 2014-09-13 00:00:00 | 12.86 | NULL | +| JDR | 2014-09-14 00:00:00 | 12.89 | 12.86 | +| JDR | 2014-09-15 00:00:00 | 12.94 | 12.89 | +| JDR | 2014-09-16 00:00:00 | 12.55 | 12.94 | +| JDR | 2014-09-17 00:00:00 | 14.03 | 12.55 | +| JDR | 2014-09-18 00:00:00 | 14.75 | 14.03 | +| JDR | 2014-09-19 00:00:00 | 13.98 | 14.75 | ++--------------+---------------------+---------------+-------------------+ +</codeblock> + + <p> + The following example does an arithmetic operation between the current row and a value from the previous + row, to produce a delta value for each day. This example also demonstrates how <codeph>ORDER BY</codeph> + works independently in the different parts of the query. The <codeph>ORDER BY closing_date</codeph> in the + <codeph>OVER</codeph> clause makes the query analyze the rows in chronological order. Then the outer query + block uses <codeph>ORDER BY closing_date DESC</codeph> to present the results with the most recent date + first. + </p> + +<codeblock>select stock_symbol, closing_date, closing_price, + cast( + closing_price - lag(closing_price,1) over + (partition by stock_symbol order by closing_date) + as decimal(8,2) + ) + as "change from yesterday" + from stock_ticker + order by closing_date desc; ++--------------+---------------------+---------------+-----------------------+ +| stock_symbol | closing_date | closing_price | change from yesterday | ++--------------+---------------------+---------------+-----------------------+ +| JDR | 2014-09-19 00:00:00 | 13.98 | -0.76 | +| JDR | 2014-09-18 00:00:00 | 14.75 | 0.72 | +| JDR | 2014-09-17 00:00:00 | 14.03 | 1.47 | +| JDR | 2014-09-16 00:00:00 | 12.55 | -0.38 | +| JDR | 2014-09-15 00:00:00 | 12.94 | 0.04 | +| JDR | 2014-09-14 00:00:00 | 12.89 | 0.03 | +| JDR | 2014-09-13 00:00:00 | 12.86 | NULL | ++--------------+---------------------+---------------+-----------------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + This function is the converse of <xref href="impala_analytic_functions.xml#lead"/>. + </p> + + </conbody> + + </concept> + + <concept rev="2.0.0" id="last_value"> + + <title>LAST_VALUE() Function</title> + + <conbody> + + <p> + Returns the expression value from the last row in the window. This same value is repeated for all result + rows for the group. The return value is <codeph>NULL</codeph> if the input expression is + <codeph>NULL</codeph>. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>LAST_VALUE(<varname>expr</varname>) OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname> [<varname>window_clause</varname>])</codeblock> + + <p> + The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The + window clause is optional. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + If any duplicate values occur in the tuples evaluated by the <codeph>ORDER BY</codeph> clause, the result + of this function is not deterministic. Consider adding additional <codeph>ORDER BY</codeph> columns to + ensure consistent ordering. + </p> + + <p conref="../shared/impala_common.xml#common/added_in_20"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example uses the same <codeph>MAIL_MERGE</codeph> table as in the example for + <xref href="impala_analytic_functions.xml#first_value"/>. Because the default window when <codeph>ORDER + BY</codeph> is used is <codeph>BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>, the query requires the + <codeph>UNBOUNDED FOLLOWING</codeph> to look ahead to subsequent rows and find the last value for each + country. + </p> + +<codeblock>select country, name, + last_value(greeting) over ( + partition by country order by name, greeting + rows between unbounded preceding and unbounded following + ) as greeting + from mail_merge ++---------+---------+--------------+ +| country | name | greeting | ++---------+---------+--------------+ +| Germany | Boris | Guten morgen | +| Germany | Michael | Guten morgen | +| Sweden | Bjorn | Tja | +| Sweden | Mats | Tja | +| USA | John | Hello | +| USA | Pete | Hello | ++---------+---------+--------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_analytic_functions.xml#first_value"/> + </p> + + </conbody> + + </concept> + + <concept rev="2.0.0" id="lead"> + + <title>LEAD() Function</title> + + <conbody> + + <p> + This function returns the value of an expression using column values from a following row. You specify an + integer offset, which designates a row position some number of rows after to the current row. Any column + references in the expression argument refer to column values from that later row. Typically, the table + contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>LEAD (<varname>expr</varname> [, <varname>offset</varname>] [, <varname>default</varname>]) + OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock> + + <p> + The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The + window clause is not allowed. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Sometimes used an an alternative to doing a self-join. + </p> + + <p conref="../shared/impala_common.xml#common/added_in_20"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example uses the same stock data created in <xref href="#window_clause"/>. The query analyzes + the closing price for a stock symbol, and for each day evaluates if the closing price for the following day + is higher or lower. + </p> + +<codeblock>select stock_symbol, closing_date, closing_price, + case + (lead(closing_price,1) + over (partition by stock_symbol order by closing_date) + - closing_price) > 0 + when true then "higher" + when false then "flat or lower" + end as "trending" +from stock_ticker + order by closing_date; ++--------------+---------------------+---------------+---------------+ +| stock_symbol | closing_date | closing_price | trending | ++--------------+---------------------+---------------+---------------+ +| JDR | 2014-09-13 00:00:00 | 12.86 | higher | +| JDR | 2014-09-14 00:00:00 | 12.89 | higher | +| JDR | 2014-09-15 00:00:00 | 12.94 | flat or lower | +| JDR | 2014-09-16 00:00:00 | 12.55 | higher | +| JDR | 2014-09-17 00:00:00 | 14.03 | higher | +| JDR | 2014-09-18 00:00:00 | 14.75 | flat or lower | +| JDR | 2014-09-19 00:00:00 | 13.98 | NULL | ++--------------+---------------------+---------------+---------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + This function is the converse of <xref href="impala_analytic_functions.xml#lag"/>. + </p> + + </conbody> + + </concept> + + <concept id="max_analytic"> + + <title>MAX() Function - Analytic Context</title> + + <conbody> + + <p> + You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic + function. See <xref href="impala_max.xml#max"/> for details and examples. + </p> + + </conbody> + + </concept> + + <concept id="min_analytic"> + + <title>MIN() Function - Analytic Context</title> + + <conbody> + + <p> + You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic + function. See <xref href="impala_min.xml#min"/> for details and examples. + </p> + + </conbody> + + </concept> + + <concept audience="Cloudera" rev="2.x.x" id="nth_value"> + + <title>NTH_VALUE() Function</title> + + <conbody> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + </conbody> + + </concept> + + <concept rev="2.3.0" id="ntile"> + + <title>NTILE() Function (CDH 5.5 or higher only)</title> + + <conbody> + + <p> + Returns the <q>bucket number</q> associated with each row, between 1 and the value of an expression. For + example, creating 100 buckets puts the lowest 1% of values in the first bucket, while creating 10 buckets + puts the lowest 10% of values in the first bucket. Each partition can have a different number of buckets. +<!-- What's the syntax or data distribution that would create a different number of buckets per partition? --> + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>NTILE (<varname>expr</varname> [, <varname>offset</varname> ...] + OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock> + + <p> + The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The + window clause is not allowed. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + The <q>ntile</q> name is derived from the practice of dividing result sets into fourths (quartile), tenths + (decile), and so on. The <codeph>NTILE()</codeph> function divides the result set based on an arbitrary + percentile value. + </p> + + <p> + The number of buckets must be a positive integer. + </p> + + <p> + The number of items in each bucket is identical or almost so, varying by at most 1. If the number of items + does not divide evenly between the buckets, the remaining N items are divided evenly among the first N + buckets. + </p> + + <p> + If the number of buckets N is greater than the number of input rows in the partition, then the first N + buckets each contain one item, and the remaining buckets are empty. + </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example shows divides groups of animals into 4 buckets based on their weight. The + <codeph>ORDER BY ... DESC</codeph> clause in the <codeph>OVER()</codeph> clause means that the heaviest 25% + are in the first group, and the lightest 25% are in the fourth group. (The <codeph>ORDER BY</codeph> in the + outermost part of the query shows how you can order the final result set independently from the order in + which the rows are evaluated by the <codeph>OVER()</codeph> clause.) Because there are 9 rows in the group, + divided into 4 buckets, the first bucket receives the extra item. + </p> + +<codeblock>create table animals (name string, kind string, kilos decimal(9,3)); + +insert into animals values + ('Elephant', 'Mammal', 4000), ('Giraffe', 'Mammal', 1200), ('Mouse', 'Mammal', 0.020), + ('Condor', 'Bird', 15), ('Horse', 'Mammal', 500), ('Owl', 'Bird', 2.5), + ('Ostrich', 'Bird', 145), ('Polar bear', 'Mammal', 700), ('Housecat', 'Mammal', 5); + +select name, ntile(4) over (order by kilos desc) as quarter + from animals +order by quarter desc; ++------------+---------+ +| name | quarter | ++------------+---------+ +| Owl | 4 | +| Mouse | 4 | +| Condor | 3 | +| Housecat | 3 | +| Horse | 2 | +| Ostrich | 2 | +| Elephant | 1 | +| Giraffe | 1 | +| Polar bear | 1 | ++------------+---------+ +</codeblock> + + <p> + The following examples show how the <codeph>PARTITION</codeph> clause works for the + <codeph>NTILE()</codeph> function. Here, we divide each kind of animal (mammal or bird) into 2 buckets, + the heavier half and the lighter half. + </p> + +<codeblock>select name, kind, ntile(2) over (partition by kind order by kilos desc) as half + from animals +order by kind; ++------------+--------+------+ +| name | kind | half | ++------------+--------+------+ +| Ostrich | Bird | 1 | +| Condor | Bird | 1 | +| Owl | Bird | 2 | +| Elephant | Mammal | 1 | +| Giraffe | Mammal | 1 | +| Polar bear | Mammal | 1 | +| Horse | Mammal | 2 | +| Housecat | Mammal | 2 | +| Mouse | Mammal | 2 | ++------------+--------+------+ +</codeblock> + + <p> + Again, the result set can be ordered independently + from the analytic evaluation. This next example lists all the animals heaviest to lightest, + showing that elephant and giraffe are in the <q>top half</q> of mammals by weight, while + housecat and mouse are in the <q>bottom half</q>. + </p> + +<codeblock>select name, kind, ntile(2) over (partition by kind order by kilos desc) as half + from animals +order by kilos desc; ++------------+--------+------+ +| name | kind | half | ++------------+--------+------+ +| Elephant | Mammal | 1 | +| Giraffe | Mammal | 1 | +| Polar bear | Mammal | 1 | +| Horse | Mammal | 2 | +| Ostrich | Bird | 1 | +| Condor | Bird | 1 | +| Housecat | Mammal | 2 | +| Owl | Bird | 2 | +| Mouse | Mammal | 2 | ++------------+--------+------+ +</codeblock> + + </conbody> + + </concept> + + <concept rev="2.3.0" id="percent_rank"> + + <title>PERCENT_RANK() Function (CDH 5.5 or higher only)</title> + + <conbody> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>PERCENT_RANK (<varname>expr</varname>) + OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>) +</codeblock> + + <p> + Calculates the rank, expressed as a percentage, of each row within a group of rows. + If <codeph>rank</codeph> is the value for that same row from the <codeph>RANK()</codeph> function (from 1 to the total number of rows in the partition group), + then the <codeph>PERCENT_RANK()</codeph> value is calculated as <codeph>(<varname>rank</varname> - 1) / (<varname>rows_in_group</varname> - 1)</codeph> . + If there is only a single item in the partition group, its <codeph>PERCENT_RANK()</codeph> value is 0. + </p> + + <p> + The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The + window clause is not allowed. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + This function is similar to the <codeph>RANK</codeph> and <codeph>CUME_DIST()</codeph> functions: it returns an ascending sequence representing the position of each + row within the rows of the same partition group. The actual numeric sequence is calculated differently, + and the handling of duplicate (tied) values is different. + </p> + + <p> + The return values range from 0 to 1 inclusive. + The first row in each partition group always has the value 0. + A <codeph>NULL</codeph> value is considered the lowest possible value. + In the case of duplicate input values, all the corresponding rows in the result set + have an identical value: the lowest <codeph>PERCENT_RANK()</codeph> value of those + tied rows. (In contrast to <codeph>CUME_DIST()</codeph>, where all tied rows have + the highest <codeph>CUME_DIST()</codeph> value.) + </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example uses the same <codeph>ANIMALS</codeph> table as the examples for <codeph>CUME_DIST()</codeph> + and <codeph>NTILE()</codeph>, with a few additional rows to illustrate the results where some values are + <codeph>NULL</codeph> or there is only a single row in a partition group. + </p> + +<codeblock>insert into animals values ('Komodo dragon', 'Reptile', 70); +insert into animals values ('Unicorn', 'Mythical', NULL); +insert into animals values ('Fire-breathing dragon', 'Mythical', NULL); +</codeblock> + + <p> + As with <codeph>CUME_DIST()</codeph>, there is an ascending sequence for each kind of animal. + For example, the <q>Birds</q> and <q>Mammals</q> rows each have a <codeph>PERCENT_RANK()</codeph> sequence + that ranges from 0 to 1. + The <q>Reptile</q> row has a <codeph>PERCENT_RANK()</codeph> of 0 because that partition group contains only a single item. + Both <q>Mythical</q> animals have a <codeph>PERCENT_RANK()</codeph> of 0 because + a <codeph>NULL</codeph> is considered the lowest value within its partition group. + </p> + +<codeblock>select name, kind, percent_rank() over (partition by kind order by kilos) from animals; ++-----------------------+----------+--------------------------+ +| name | kind | percent_rank() OVER(...) | ++-----------------------+----------+--------------------------+ +| Mouse | Mammal | 0 | +| Housecat | Mammal | 0.2 | +| Horse | Mammal | 0.4 | +| Polar bear | Mammal | 0.6 | +| Giraffe | Mammal | 0.8 | +| Elephant | Mammal | 1 | +| Komodo dragon | Reptile | 0 | +| Owl | Bird | 0 | +| California Condor | Bird | 0.25 | +| Andean Condor | Bird | 0.25 | +| Condor | Bird | 0.25 | +| Ostrich | Bird | 1 | +| Fire-breathing dragon | Mythical | 0 | +| Unicorn | Mythical | 0 | ++-----------------------+----------+--------------------------+ +</codeblock> + </conbody> + + </concept> + + <concept rev="2.0.0" id="rank"> + + <title>RANK() Function</title> + + <conbody> + + <p> + Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers + for duplicate values of the <codeph>ORDER BY</codeph> expressions. After generating duplicate output values + for the <q>tied</q> input values, the function increments the sequence by the number of tied values. + Therefore, the sequence contains both duplicates and gaps when the input contains duplicates. Starts the + sequence over for each group produced by the <codeph>PARTITIONED BY</codeph> clause. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>RANK() OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock> + + <p> + The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The + window clause is not allowed. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + +<!-- Make a little tutorial to show these 3 functions side-by-side and illustrate their difference. --> + + <p> + Often used for top-N and bottom-N queries. For example, it could produce a <q>top 10</q> report including + several items that were tied for 10th place. + </p> + + <p> + Similar to <codeph>ROW_NUMBER</codeph> and <codeph>DENSE_RANK</codeph>. These functions differ in how they + treat duplicate combinations of values. + </p> + + <p conref="../shared/impala_common.xml#common/added_in_20"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example demonstrates how the <codeph>RANK()</codeph> function identifies where each value + <q>places</q> in the result set, producing the same result for duplicate values, and skipping values in the + sequence to account for the number of duplicates. For example, when results are ordered by the + <codeph>X</codeph> column, both <codeph>1</codeph> values are tied for first; both <codeph>2</codeph> + values are tied for third; and so on. + </p> + +<codeblock>select x, rank() over(order by x) as rank, property from int_t; ++----+------+----------+ +| x | rank | property | ++----+------+----------+ +| 1 | 1 | square | +| 1 | 1 | odd | +| 2 | 3 | even | +| 2 | 3 | prime | +| 3 | 5 | prime | +| 3 | 5 | odd | +| 4 | 7 | even | +| 4 | 7 | square | +| 5 | 9 | odd | +| 5 | 9 | prime | +| 6 | 11 | even | +| 6 | 11 | perfect | +| 7 | 13 | lucky | +| 7 | 13 | lucky | +| 7 | 13 | lucky | +| 7 | 13 | odd | +| 7 | 13 | prime | +| 8 | 18 | even | +| 9 | 19 | square | +| 9 | 19 | odd | +| 10 | 21 | round | +| 10 | 21 | even | ++----+------+----------+ +</codeblock> + + <p> + The following examples show how the <codeph>RANK()</codeph> function is affected by the + <codeph>PARTITION</codeph> property within the <codeph>ORDER BY</codeph> clause. + </p> + + <p> + Partitioning by the <codeph>PROPERTY</codeph> column groups all the even, odd, and so on values together, + and <codeph>RANK()</codeph> returns the place of each value within the group, producing several ascending + sequences. + </p> + +<codeblock>select x, rank() over(partition by property order by x) as rank, property from int_t; ++----+------+----------+ +| x | rank | property | ++----+------+----------+ +| 2 | 1 | even | +| 4 | 2 | even | +| 6 | 3 | even | +| 8 | 4 | even | +| 10 | 5 | even | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 1 | 1 | odd | +| 3 | 2 | odd | +| 5 | 3 | odd | +| 7 | 4 | odd | +| 9 | 5 | odd | +| 6 | 1 | perfect | +| 2 | 1 | prime | +| 3 | 2 | prime | +| 5 | 3 | prime | +| 7 | 4 | prime | +| 10 | 1 | round | +| 1 | 1 | square | +| 4 | 2 | square | +| 9 | 3 | square | ++----+------+----------+ +</codeblock> + + <p> + Partitioning by the <codeph>X</codeph> column groups all the duplicate numbers together and returns the + place each each value within the group; because each value occurs only 1 or 2 times, + <codeph>RANK()</codeph> designates each <codeph>X</codeph> value as either first or second within its + group. + </p> + +<codeblock>select x, rank() over(partition by x order by property) as rank, property from int_t; ++----+------+----------+ +| x | rank | property | ++----+------+----------+ +| 1 | 1 | odd | +| 1 | 2 | square | +| 2 | 1 | even | +| 2 | 2 | prime | +| 3 | 1 | odd | +| 3 | 2 | prime | +| 4 | 1 | even | +| 4 | 2 | square | +| 5 | 1 | odd | +| 5 | 2 | prime | +| 6 | 1 | even | +| 6 | 2 | perfect | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 7 | 4 | odd | +| 7 | 5 | prime | +| 8 | 1 | even | +| 9 | 1 | odd | +| 9 | 2 | square | +| 10 | 1 | even | +| 10 | 2 | round | ++----+------+----------+ +</codeblock> + + <p> + The following example shows how a magazine might prepare a list of history's wealthiest people. Croesus and + Midas are tied for second, then Crassus is fourth. + </p> + +<codeblock>select rank() over (order by net_worth desc) as rank, name, net_worth from wealth order by rank, name; ++------+---------+---------------+ +| rank | name | net_worth | ++------+---------+---------------+ +| 1 | Solomon | 2000000000.00 | +| 2 | Croesus | 1000000000.00 | +| 2 | Midas | 1000000000.00 | +| 4 | Crassus | 500000000.00 | +| 5 | Scrooge | 80000000.00 | ++------+---------+---------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_analytic_functions.xml#dense_rank"/>, + <xref href="impala_analytic_functions.xml#row_number"/> + </p> + + </conbody> + + </concept> + + <concept rev="2.0.0" id="row_number"> + + <title>ROW_NUMBER() Function</title> + + <conbody> + + <p> + Returns an ascending sequence of integers, starting with 1. Starts the sequence over for each group + produced by the <codeph>PARTITIONED BY</codeph> clause. The output sequence includes different values for + duplicate input values. Therefore, the sequence never contains any duplicates or gaps, regardless of + duplicate input values. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>ROW_NUMBER() OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock> + + <p> + The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The + window clause is not allowed. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Often used for top-N and bottom-N queries where the input values are known to be unique, or precisely N + rows are needed regardless of duplicate values. + </p> + + <p> + Because its result value is different for each row in the result set (when used without a <codeph>PARTITION + BY</codeph> clause), <codeph>ROW_NUMBER()</codeph> can be used to synthesize unique numeric ID values, for + example for result sets involving unique values or tuples. + </p> + + <p> + Similar to <codeph>RANK</codeph> and <codeph>DENSE_RANK</codeph>. These functions differ in how they treat + duplicate combinations of values. + </p> + + <p conref="../shared/impala_common.xml#common/added_in_20"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example demonstrates how <codeph>ROW_NUMBER()</codeph> produces a continuous numeric + sequence, even though some values of <codeph>X</codeph> are repeated. + </p> + +<codeblock>select x, row_number() over(order by x, property) as row_number, property from int_t; ++----+------------+----------+ +| x | row_number | property | ++----+------------+----------+ +| 1 | 1 | odd | +| 1 | 2 | square | +| 2 | 3 | even | +| 2 | 4 | prime | +| 3 | 5 | odd | +| 3 | 6 | prime | +| 4 | 7 | even | +| 4 | 8 | square | +| 5 | 9 | odd | +| 5 | 10 | prime | +| 6 | 11 | even | +| 6 | 12 | perfect | +| 7 | 13 | lucky | +| 7 | 14 | lucky | +| 7 | 15 | lucky | +| 7 | 16 | odd | +| 7 | 17 | prime | +| 8 | 18 | even | +| 9 | 19 | odd | +| 9 | 20 | square | +| 10 | 21 | even | +| 10 | 22 | round | ++----+------------+----------+ +</codeblock> + + <p> + The following example shows how a financial institution might assign customer IDs to some of history's + wealthiest figures. Although two of the people have identical net worth figures, unique IDs are required + for this purpose. <codeph>ROW_NUMBER()</codeph> produces a sequence of five different values for the five + input rows. + </p> + +<codeblock>select row_number() over (order by net_worth desc) as account_id, name, net_worth + from wealth order by account_id, name; ++------------+---------+---------------+ +| account_id | name | net_worth | ++------------+---------+---------------+ +| 1 | Solomon | 2000000000.00 | +| 2 | Croesus | 1000000000.00 | +| 3 | Midas | 1000000000.00 | +| 4 | Crassus | 500000000.00 | +| 5 | Scrooge | 80000000.00 | ++------------+---------+---------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_analytic_functions.xml#rank"/>, <xref href="impala_analytic_functions.xml#dense_rank"/> + </p> + + </conbody> + + </concept> + + <concept id="sum_analytic"> + + <title>SUM() Function - Analytic Context</title> + + <conbody> + + <p> + You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic + function. See <xref href="impala_sum.xml#sum"/> for details and examples. + </p> + + </conbody> + + </concept> + +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_appx_count_distinct.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_appx_count_distinct.xml b/docs/topics/impala_appx_count_distinct.xml new file mode 100644 index 0000000..31a9679 --- /dev/null +++ b/docs/topics/impala_appx_count_distinct.xml @@ -0,0 +1,77 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.0.0" id="appx_count_distinct"> + + <title>APPX_COUNT_DISTINCT Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">APPX_COUNT_DISTINCT query option</indexterm> + Allows multiple <codeph>COUNT(DISTINCT)</codeph> operations within a single query, by internally rewriting + each <codeph>COUNT(DISTINCT)</codeph> to use the <codeph>NDV()</codeph> function. The resulting count is + approximate rather than precise. + </p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + + <p conref="../shared/impala_common.xml#common/default_false_0"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following examples show how the <codeph>APPX_COUNT_DISTINCT</codeph> lets you work around the restriction + where a query can only evaluate <codeph>COUNT(DISTINCT <varname>col_name</varname>)</codeph> for a single + column. By default, you can count the distinct values of one column or another, but not both in a single + query: + </p> + +<codeblock>[localhost:21000] > select count(distinct x) from int_t; ++-------------------+ +| count(distinct x) | ++-------------------+ +| 10 | ++-------------------+ +[localhost:21000] > select count(distinct property) from int_t; ++--------------------------+ +| count(distinct property) | ++--------------------------+ +| 7 | ++--------------------------+ +[localhost:21000] > select count(distinct x), count(distinct property) from int_t; +ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters +as count(DISTINCT x); deviating function: count(DISTINCT property) +</codeblock> + + <p> + When you enable the <codeph>APPX_COUNT_DISTINCT</codeph> query option, now the query with multiple + <codeph>COUNT(DISTINCT)</codeph> works. The reason this behavior requires a query option is that each + <codeph>COUNT(DISTINCT)</codeph> is rewritten internally to use the <codeph>NDV()</codeph> function instead, + which provides an approximate result rather than a precise count. + </p> + +<codeblock>[localhost:21000] > set APPX_COUNT_DISTINCT=true; +[localhost:21000] > select count(distinct x), count(distinct property) from int_t; ++-------------------+--------------------------+ +| count(distinct x) | count(distinct property) | ++-------------------+--------------------------+ +| 10 | 7 | ++-------------------+--------------------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_count.xml#count"/>, + <xref href="impala_distinct.xml#distinct"/>, + <xref href="impala_ndv.xml#ndv"/> + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_appx_median.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_appx_median.xml b/docs/topics/impala_appx_median.xml new file mode 100644 index 0000000..d874ead --- /dev/null +++ b/docs/topics/impala_appx_median.xml @@ -0,0 +1,122 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.2.1" id="appx_median"> + + <title>APPX_MEDIAN Function</title> + <titlealts><navtitle>APPX_MEDIAN</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="Aggregate Functions"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">appx_median() function</indexterm> + An aggregate function that returns a value that is approximately the median (midpoint) of values in the set + of input values. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>APPX_MEDIAN([DISTINCT | ALL] <varname>expression</varname>) +</codeblock> + + <p> + This function works with any input type, because the only requirement is that the type supports less-than and + greater-than comparison operators. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Because the return value represents the estimated midpoint, it might not reflect the precise midpoint value, + especially if the cardinality of the input values is very high. If the cardinality is low (up to + approximately 20,000), the result is more accurate because the sampling considers all or almost all of the + different values. + </p> + + <p conref="../shared/impala_common.xml#common/return_type_same_except_string"/> + + <p> + The return value is always the same as one of the input values, not an <q>in-between</q> value produced by + averaging. + </p> + +<!-- <p conref="/Content/impala_common_xi44078.xml#common/restrictions_sliding_window"/> --> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p conref="../shared/impala_common.xml#common/analytic_not_allowed_caveat"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example uses a table of a million random floating-point numbers ranging up to approximately + 50,000. The average is approximately 25,000. Because of the random distribution, we would expect the median + to be close to this same number. Computing the precise median is a more intensive operation than computing + the average, because it requires keeping track of every distinct value and how many times each occurs. The + <codeph>APPX_MEDIAN()</codeph> function uses a sampling algorithm to return an approximate result, which in + this case is close to the expected value. To make sure that the value is not substantially out of range due + to a skewed distribution, subsequent queries confirm that there are approximately 500,000 values higher than + the <codeph>APPX_MEDIAN()</codeph> value, and approximately 500,000 values lower than the + <codeph>APPX_MEDIAN()</codeph> value. + </p> + +<codeblock>[localhost:21000] > select min(x), max(x), avg(x) from million_numbers; ++-------------------+-------------------+-------------------+ +| min(x) | max(x) | avg(x) | ++-------------------+-------------------+-------------------+ +| 4.725693727250069 | 49994.56852674231 | 24945.38563793553 | ++-------------------+-------------------+-------------------+ +[localhost:21000] > select appx_median(x) from million_numbers; ++----------------+ +| appx_median(x) | ++----------------+ +| 24721.6 | ++----------------+ +[localhost:21000] > select count(x) as higher from million_numbers where x > (select appx_median(x) from million_numbers); ++--------+ +| higher | ++--------+ +| 502013 | ++--------+ +[localhost:21000] > select count(x) as lower from million_numbers where x < (select appx_median(x) from million_numbers); ++--------+ +| lower | ++--------+ +| 497987 | ++--------+ +</codeblock> + + <p> + The following example computes the approximate median using a subset of the values from the table, and then + confirms that the result is a reasonable estimate for the midpoint. + </p> + +<codeblock>[localhost:21000] > select appx_median(x) from million_numbers where x between 1000 and 5000; ++-------------------+ +| appx_median(x) | ++-------------------+ +| 3013.107787358159 | ++-------------------+ +[localhost:21000] > select count(x) as higher from million_numbers where x between 1000 and 5000 and x > 3013.107787358159; ++--------+ +| higher | ++--------+ +| 37692 | ++--------+ +[localhost:21000] > select count(x) as lower from million_numbers where x between 1000 and 5000 and x < 3013.107787358159; ++-------+ +| lower | ++-------+ +| 37089 | ++-------+ +</codeblock> + </conbody> +</concept>
