[19/22] incubator-impala git commit: First try at porting over the source files necessary for the Impala SQL Reference.

jrussell Tue, 26 Jul 2016 16:05:54 -0700

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_analytic_functions.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_analytic_functions.xml 
b/docs/topics/impala_analytic_functions.xml
new file mode 100644
index 0000000..293a512
--- /dev/null
+++ b/docs/topics/impala_analytic_functions.xml
@@ -0,0 +1,1742 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="2.0.0" id="analytic_functions">
+
+  <title>Impala Analytic Functions</title>
+
+  <titlealts>
+
+    <navtitle>Analytic Functions</navtitle>
+
+  </titlealts>
+
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Impala Functions"/>
+      <data name="Category" value="Aggregate Functions"/>
+      <data name="Category" value="Analytic Functions"/>
+      <data name="Category" value="Data Analysts"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Querying"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">analytic functions</indexterm>
+
+      <indexterm audience="Cloudera">window functions</indexterm>
+      Analytic functions (also known as window functions) are a special 
category of built-in functions. Like
+      aggregate functions, they examine the contents of multiple input rows to 
compute each output value. However,
+      rather than being limited to one result value per <codeph>GROUP 
BY</codeph> group, they operate on
+      <term>windows</term> where the input rows are ordered and grouped using 
flexible conditions expressed through
+      an <codeph>OVER()</codeph> clause.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+<!--
+    <p>
+      Analytic functions produce one output value for each input row, like 
scalar functions such as
+      <codeph>length()</codeph> or
+      <codeph>substr()</codeph>.
+    </p>
+-->
+
+    <p>
+      Some functions, such as <codeph>LAG()</codeph> and 
<codeph>RANK()</codeph>, can only be used in this analytic
+      context. Some aggregate functions do double duty: when you call the 
aggregation functions such as
+      <codeph>MAX()</codeph>, <codeph>SUM()</codeph>, <codeph>AVG()</codeph>, 
and so on with an
+      <codeph>OVER()</codeph> clause, they produce an output value for each 
row, based on computations across other
+      rows in the window.
+    </p>
+
+    <p>
+      Although analytic functions often compute the same value you would see 
from an aggregate function in a
+      <codeph>GROUP BY</codeph> query, the analytic functions produce a value 
for each row in the result set rather
+      than a single value for each group. This flexibility lets you include 
additional columns in the
+      <codeph>SELECT</codeph> list, offering more opportunities for organizing 
and filtering the result set.
+    </p>
+
+    <p>
+      Analytic function calls are only allowed in the <codeph>SELECT</codeph> 
list and in the outermost
+      <codeph>ORDER BY</codeph> clause of the query. During query processing, 
analytic functions are evaluated
+      after other query stages such as joins, <codeph>WHERE</codeph>, and 
<codeph>GROUP BY</codeph>,
+    </p>
+
+<!-- Oracle doesn't show examples until it gets to the actual functions, so 
let's follow that lead.
+    <p>
+      The following example shows a very simple call to <codeph>MAX()</codeph> 
in
+      an analytic context, and a similar query using a <codeph>GROUP 
BY</codeph> clause.
+    </p>
+-->
+
+<!--
+This basic query could be represented either as an analytic
+function call or an aggregation function call in a <codeph>GROUP BY</codeph> 
query.
+For more elaborate kinds of computations, the flexibility of the analytic 
window
+makes that the preferred option.
+-->
+
+<!-- TK: construct sample data and fill in query results. -->
+
+<!-- Other DB docs don't necessarily include examples up at this level, only 
for the individual functions.
+     So maybe take these placeholders out entirely.
+
+<codeblock>SELECT year, month, max(degrees_c) OVER (PARTITION BY year) FROM 
historical_temps;
+SELECT year, month, max(degrees_c) FROM historical_temps GROUP BY year;
+</codeblock>
+-->
+
+    <p>
+      The rows that are part of each partition are analyzed by computations 
across an ordered or unordered set of
+      rows. For example, <codeph>COUNT()</codeph> and <codeph>SUM()</codeph> 
might be applied to all the rows in
+      the partition, in which case the order of analysis does not matter. The 
<codeph>ORDER BY</codeph> clause
+      might be used inside the <codeph>OVER()</codeph> clause to defines the 
ordering that applies to functions
+      such as <codeph>LAG()</codeph> and <codeph>FIRST_VALUE()</codeph>.
+    </p>
+
+<!-- TK: output needed here also. -->
+
+<!--
+<codeblock>SELECT year, month, max(degrees_c) OVER (PARTITION BY year ORDER BY 
MONTH DESC) FROM historical_temps;
+</codeblock>
+-->
+
+    <p>
+      Analytic functions are frequently used in fields such as finance and 
science to provide trend, outlier, and
+      bucketed analysis for large data sets. You might also see the term 
<q>window functions</q> in database
+      literature, referring to the sequence of rows (the <q>window</q>) that 
the function call applies to,
+      particularly when the <codeph>OVER</codeph> clause includes a 
<codeph>ROWS</codeph> or <codeph>RANGE</codeph>
+      keyword.
+    </p>
+
+    <p>
+      The following sections describe the analytic query clauses and the pure 
analytic functions provided by
+      Impala. For usage information about aggregate functions in an analytic 
context, see
+      <xref href="impala_aggregate_functions.xml#aggregate_functions"/>.
+    </p>
+
+    <p outputclass="toc inpage"/>
+
+  </conbody>
+
+  <concept id="over">
+
+    <title>OVER Clause</title>
+
+    <conbody>
+
+      <p>
+        The <codeph>OVER</codeph> clause is required for calls to pure 
analytic functions such as
+        <codeph>LEAD()</codeph>, <codeph>RANK()</codeph>, and 
<codeph>FIRST_VALUE()</codeph>. When you include an
+        <codeph>OVER</codeph> clause with calls to aggregate functions such as 
<codeph>MAX()</codeph>,
+        <codeph>COUNT()</codeph>, or <codeph>SUM()</codeph>, they operate as 
analytic functions.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>function(<varname>args</varname>) 
OVER([<varname>partition_by_clause</varname>] 
[<varname>order_by_clause</varname> [<varname>window_clause</varname>]])
+
+partition_by_clause ::= PARTITION BY <varname>expr</varname> [, 
<varname>expr</varname> ...]
+order_by_clause ::= ORDER BY <varname>expr</varname>  [ASC | DESC] [NULLS 
FIRST | NULLS LAST] [, <varname>expr</varname> [ASC | DESC] [NULLS FIRST | 
NULLS LAST] ...]
+window_clause: See <xref href="#window_clause">Window Clause</xref>
+</codeblock>
+
+      <p>
+        <b>PARTITION BY clause:</b>
+      </p>
+
+      <p>
+        The <codeph>PARTITION BY</codeph> clause acts much like the 
<codeph>GROUP BY</codeph> clause in the
+        outermost block of a query. It divides the rows into groups containing 
identical values in one or more
+        columns. These logical groups are known as <term>partitions</term>. 
Throughout the discussion of analytic
+        functions, <q>partitions</q> refers to the groups produced by the 
<codeph>PARTITION BY</codeph> clause, not
+        to partitioned tables.
+      </p>
+
+      <p>
+        The sequence of results from an analytic function <q>resets</q> for 
each new partition in the result set.
+        That is, the set of preceding or following rows considered by the 
analytic function always come from a
+        single partition. Any <codeph>MAX()</codeph>, <codeph>SUM()</codeph>, 
<codeph>ROW_NUMBER()</codeph>, and so
+        on apply to each partition independently. Omit the <codeph>PARTITION 
BY</codeph> clause to apply the
+        analytic operation to all the rows in the table.
+      </p>
+
+      <p>
+        <b>ORDER BY clause:</b>
+      </p>
+
+      <p>
+        The <codeph>ORDER BY</codeph> clause works much like the <codeph>ORDER 
BY</codeph> clause in the outermost
+        block of a query. It defines the order in which rows are evaluated for 
the entire input set, or for each
+        group produced by a <codeph>PARTITION BY</codeph> clause. You can 
order by one or multiple expressions, and
+        for each expression optionally choose ascending or descending order 
and whether nulls come first or last in
+        the sort order. Because this <codeph>ORDER BY</codeph> clause only 
defines the order in which rows are
+        evaluated, if you want the results to be output in a specific order, 
also include an <codeph>ORDER
+        BY</codeph> clause in the outer block of the query.
+      </p>
+
+      <p>
+        When the <codeph>ORDER BY</codeph> clause is omitted, the analytic 
function applies to all items in the
+        group produced by the <codeph>PARTITION BY</codeph> clause. When the 
<codeph>ORDER BY</codeph> clause is
+        included, the analysis can apply to all or a subset of the items in 
the group, depending on the optional
+        window clause.
+      </p>
+
+      <p>
+        The order in which the rows are analyzed is only defined for those 
columns specified in <codeph>ORDER
+        BY</codeph> clauses.
+      </p>
+
+      <p>
+        One difference between the analytic and outer uses of the 
<codeph>ORDER BY</codeph> clause: inside the
+        <codeph>OVER</codeph> clause, <codeph>ORDER BY 1</codeph> or other 
integer value is interpreted as a
+        constant sort value (effectively a no-op) rather than referring to 
column 1.
+      </p>
+
+      <p>
+        <b>Window clause:</b>
+      </p>
+
+      <p>
+        The window clause is only allowed in combination with an <codeph>ORDER 
BY</codeph> clause. If the
+        <codeph>ORDER BY</codeph> clause is specified but the window clause is 
not, the default window is
+        <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>. See
+        <xref href="impala_analytic_functions.xml#window_clause"/> for full 
details.
+      </p>
+
+<!--
+      <p conref="/Content/impala_common_xi44078.xml#common/usage_notes_blurb"/>
+-->
+
+      <p conref="../shared/impala_common.xml#common/hbase_blurb"/>
+
+      <p>
+        Because HBase tables are optimized for single-row lookups rather than 
full scans, analytic functions using
+        the <codeph>OVER()</codeph> clause are not recommended for HBase 
tables. Although such queries work, their
+        performance is lower than on comparable tables using HDFS data files.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/parquet_blurb"/>
+
+      <p>
+        Analytic functions are very efficient for Parquet tables. The data 
that is examined during evaluation of
+        the <codeph>OVER()</codeph> clause comes from a specified set of 
columns, and the values for each column
+        are arranged sequentially within each data file.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/text_blurb"/>
+
+      <p>
+        Analytic functions are convenient to use with text tables for 
exploratory business intelligence. When the
+        volume of data is substantial, prefer to use Parquet tables for 
performance-critical analytic queries.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        The following example shows how to synthesize a numeric sequence 
corresponding to all the rows in a table.
+        The new table has the same columns as the old one, plus an additional 
column <codeph>ID</codeph> containing
+        the integers 1, 2, 3, and so on, corresponding to the order of a 
<codeph>TIMESTAMP</codeph> column in the
+        original table.
+      </p>
+
+<!-- TK: synthesize some data and fill in output here. -->
+
+<codeblock>CREATE TABLE events_with_id AS
+  SELECT
+    row_number() OVER (ORDER BY date_and_time) AS id,
+    c1, c2, c3, c4
+  FROM events;
+</codeblock>
+
+      <p>
+        The following example shows how to determine the number of rows 
containing each value for a column. Unlike
+        a corresponding <codeph>GROUP BY</codeph> query, this one can analyze 
a single column and still return all
+        values (not just the distinct ones) from the other columns.
+      </p>
+
+<!-- TK: verify the 'unbounded' shortcut syntax. -->
+
+<codeblock>SELECT x, y, z,
+  count() OVER (PARTITION BY x) AS how_many_x
+FROM t1;
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
+
+      <p>
+        You cannot directly combine the <codeph>DISTINCT</codeph> operator 
with analytic function calls. You can
+        put the analytic function call in a <codeph>WITH</codeph> clause or an 
inline view, and apply the
+        <codeph>DISTINCT</codeph> operator to its result set.
+      </p>
+
+<codeblock>WITH t1 AS (SELECT x, sum(x) OVER (PARTITION BY x) AS total FROM t1)
+  SELECT DISTINCT x, total FROM t1;
+</codeblock>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="window_clause">
+
+    <title>Window Clause</title>
+
+    <conbody>
+
+      <p>
+        Certain analytic functions accept an optional <term>window 
clause</term>, which makes the function analyze
+        only certain rows <q>around</q> the current row rather than all rows 
in the partition. For example, you can
+        get a moving average by specifying some number of preceding and 
following rows, or a running count or
+        running total by specifying all rows up to the current position. This 
clause can result in different
+        analytic results for rows within the same partition.
+      </p>
+
+      <p>
+        The window clause is supported with the <codeph>AVG()</codeph>, 
<codeph>COUNT()</codeph>,
+        <codeph>FIRST_VALUE()</codeph>, <codeph>LAST_VALUE()</codeph>, and 
<codeph>SUM()</codeph> functions.
+<!-- To do: fill in this factoid under MAX and MIN also. -->
+        For <codeph>MAX()</codeph> and <codeph>MIN()</codeph>, the window 
clause only allowed if the start bound is
+        <codeph>UNBOUNDED PRECEDING</codeph>
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>ROWS BETWEEN [ { <varname>m</varname> | UNBOUNDED } PRECEDING | 
CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | <varname>n</varname> } 
FOLLOWING] ]
+RANGE BETWEEN [ {<varname>m</varname> | UNBOUNDED } PRECEDING | CURRENT ROW] [ 
AND [CURRENT ROW | { UNBOUNDED | <varname>n</varname> } FOLLOWING] ]</codeblock>
+
+      <p>
+        <codeph>ROWS BETWEEN</codeph> defines the size of the window in terms 
of the indexes of the rows in the
+        result set. The size of the window is predictable based on the clauses 
the position within the result set.
+      </p>
+
+      <p>
+        <codeph>RANGE BETWEEN</codeph> does not currently support numeric 
arguments to define a variable-size
+        sliding window.
+<!--
+Currently, it effectively works the same as the
+equivalent <codeph>ROWS BETWEEN</codeph> clause.
+-->
+      </p>
+
+<!--
+<p>
+<codeph>RANGE BETWEEN</codeph> defines the size of the window based on 
arithmetic comparisons
+of the values in the result set.
+The size of the window varies depending on the order and distribution of 
values.
+</p>
+-->
+
+<!--      <p 
conref="/Content/impala_common_xi44078.xml#common/restrictions_blurb"/> -->
+
+      <p>
+        Currently, Impala supports only some combinations of arguments to the 
<codeph>RANGE</codeph> clause:
+      </p>
+
+      <ul>
+        <li>
+          <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> 
(the default when <codeph>ORDER
+          BY</codeph> is specified and the window clause is omitted)
+        </li>
+
+        <li>
+          <codeph>RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING</codeph>
+        </li>
+
+        <li>
+          <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
FOLLOWING</codeph>
+        </li>
+      </ul>
+
+      <p>
+        When <codeph>RANGE</codeph> is used, <codeph>CURRENT ROW</codeph> 
includes not just the current row but all
+        rows that are tied with the current row based on the <codeph>ORDER 
BY</codeph> expressions.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        The following examples show financial data for a fictional stock 
symbol <codeph>JDR</codeph>. The closing
+        price moves up and down each day.
+      </p>
+
+<codeblock>create table stock_ticker (stock_symbol string, closing_price 
decimal(8,2), closing_date timestamp);
+...load some data...
+select * from stock_ticker order by stock_symbol, closing_date
++--------------+---------------+---------------------+
+| stock_symbol | closing_price | closing_date        |
++--------------+---------------+---------------------+
+| JDR          | 12.86         | 2014-10-02 00:00:00 |
+| JDR          | 12.89         | 2014-10-03 00:00:00 |
+| JDR          | 12.94         | 2014-10-04 00:00:00 |
+| JDR          | 12.55         | 2014-10-05 00:00:00 |
+| JDR          | 14.03         | 2014-10-06 00:00:00 |
+| JDR          | 14.75         | 2014-10-07 00:00:00 |
+| JDR          | 13.98         | 2014-10-08 00:00:00 |
++--------------+---------------+---------------------+
+</codeblock>
+
+      <p>
+        The queries use analytic functions with window clauses to compute 
moving averages of the closing price. For
+        example, <codeph>ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING</codeph> 
produces an average of the value from a
+        3-day span, producing a different value for each row. The first row, 
which has no preceding row, only gets
+        averaged with the row following it. If the table contained more than 
one stock symbol, the
+        <codeph>PARTITION BY</codeph> clause would limit the window for the 
moving average to only consider the
+        prices for a single stock.
+      </p>
+
+<codeblock>select stock_symbol, closing_date, closing_price,
+  avg(closing_price) over (partition by stock_symbol order by closing_date
+    rows between 1 preceding and 1 following) as moving_average
+  from stock_ticker;
++--------------+---------------------+---------------+----------------+
+| stock_symbol | closing_date        | closing_price | moving_average |
++--------------+---------------------+---------------+----------------+
+| JDR          | 2014-10-02 00:00:00 | 12.86         | 12.87          |
+| JDR          | 2014-10-03 00:00:00 | 12.89         | 12.89          |
+| JDR          | 2014-10-04 00:00:00 | 12.94         | 12.79          |
+| JDR          | 2014-10-05 00:00:00 | 12.55         | 13.17          |
+| JDR          | 2014-10-06 00:00:00 | 14.03         | 13.77          |
+| JDR          | 2014-10-07 00:00:00 | 14.75         | 14.25          |
+| JDR          | 2014-10-08 00:00:00 | 13.98         | 14.36          |
++--------------+---------------------+---------------+----------------+
+</codeblock>
+
+      <p>
+        The clause <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT 
ROW</codeph> produces a cumulative moving
+        average, from the earliest data up to the value for each day.
+      </p>
+
+<codeblock>select stock_symbol, closing_date, closing_price,
+  avg(closing_price) over (partition by stock_symbol order by closing_date
+    rows between unbounded preceding and current row) as moving_average
+  from stock_ticker;
++--------------+---------------------+---------------+----------------+
+| stock_symbol | closing_date        | closing_price | moving_average |
++--------------+---------------------+---------------+----------------+
+| JDR          | 2014-10-02 00:00:00 | 12.86         | 12.86          |
+| JDR          | 2014-10-03 00:00:00 | 12.89         | 12.87          |
+| JDR          | 2014-10-04 00:00:00 | 12.94         | 12.89          |
+| JDR          | 2014-10-05 00:00:00 | 12.55         | 12.81          |
+| JDR          | 2014-10-06 00:00:00 | 14.03         | 13.05          |
+| JDR          | 2014-10-07 00:00:00 | 14.75         | 13.33          |
+| JDR          | 2014-10-08 00:00:00 | 13.98         | 13.42          |
++--------------+---------------------+---------------+----------------+
+</codeblock>
+
+<!-- Matt suggests not always true depending on data. Hiding until I can try 
myself.
+<p>
+The clause <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> 
would produce the same
+output as above. Because <codeph>RANGE</codeph> currently does not support 
numeric offsets while
+<codeph>ROWS</codeph> does, currently the <codeph>ROWS</codeph> syntax is more 
flexible.
+</p>
+-->
+
+    </conbody>
+
+  </concept>
+
+  <concept id="avg_analytic">
+
+    <title>AVG() Function - Analytic Context</title>
+
+    <conbody>
+
+      <p>
+        You can include an <codeph>OVER</codeph> clause with a call to this 
function to use it as an analytic
+        function. See <xref href="impala_avg.xml#avg"/> for details and 
examples.
+      </p>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="count_analytic">
+
+    <title>COUNT() Function - Analytic Context</title>
+
+    <conbody>
+
+      <p>
+        You can include an <codeph>OVER</codeph> clause with a call to this 
function to use it as an analytic
+        function. See <xref href="impala_count.xml#count"/> for details and 
examples.
+      </p>
+
+    </conbody>
+
+  </concept>
+
+  <concept rev="2.3.0" id="cume_dist">
+
+    <title>CUME_DIST() Function (CDH 5.5 or higher only)</title>
+
+    <conbody>
+
+      <p>
+        Returns the cumulative distribution of a value. The value for each row 
in the result set is greater than 0
+        and less than or equal to 1.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>CUME_DIST (<varname>expr</varname>)
+  OVER ([<varname>partition_by_clause</varname>] 
<varname>order_by_clause</varname>)
+</codeblock>
+
+      <p>
+        The <codeph>ORDER BY</codeph> clause is required. The 
<codeph>PARTITION BY</codeph> clause is optional. The
+        window clause is not allowed.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        Within each partition of the result set, the 
<codeph>CUME_DIST()</codeph> value represents an ascending
+        sequence that ends at 1. Each value represents the proportion of rows 
in the partition whose values are
+        less than or equal to the value in the current row.
+      </p>
+
+      <p>
+        If the sequence of input values contains ties, the 
<codeph>CUME_DIST()</codeph> results are identical for the
+        tied values.
+      </p>
+
+      <p>
+        Impala only supports the <codeph>CUME_DIST()</codeph> function in an 
analytic context, not as a regular
+        aggregate function.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        This example uses a table with 9 rows. The <codeph>CUME_DIST()</codeph>
+        function evaluates the entire table because there is no 
<codeph>PARTITION BY</codeph> clause,
+        with the rows ordered by the weight of the animal.
+        the sequence of values shows that 1/9 of the values are less than or 
equal to the lightest
+        animal (mouse), 2/9 of the values are less than or equal to the 
second-lightest animal,
+        and so on up to the heaviest animal (elephant), where 9/9 of the rows 
are less than or
+        equal to its weight.
+      </p>
+
+<codeblock>create table animals (name string, kind string, kilos decimal(9,3));
+insert into animals values
+  ('Elephant', 'Mammal', 4000), ('Giraffe', 'Mammal', 1200), ('Mouse', 
'Mammal', 0.020),
+  ('Condor', 'Bird', 15), ('Horse', 'Mammal', 500), ('Owl', 'Bird', 2.5),
+  ('Ostrich', 'Bird', 145), ('Polar bear', 'Mammal', 700), ('Housecat', 
'Mammal', 5);
+
+select name, cume_dist() over (order by kilos) from animals;
++------------+-----------------------+
+| name       | cume_dist() OVER(...) |
++------------+-----------------------+
+| Elephant   | 1                     |
+| Giraffe    | 0.8888888888888888    |
+| Polar bear | 0.7777777777777778    |
+| Horse      | 0.6666666666666666    |
+| Ostrich    | 0.5555555555555556    |
+| Condor     | 0.4444444444444444    |
+| Housecat   | 0.3333333333333333    |
+| Owl        | 0.2222222222222222    |
+| Mouse      | 0.1111111111111111    |
++------------+-----------------------+
+</codeblock>
+
+      <p>
+        Using a <codeph>PARTITION BY</codeph> clause produces a separate 
sequence for each partition
+        group, in this case one for mammals and one for birds. Because there 
are 3 birds and 6 mammals,
+        the sequence illustrates how 1/3 of the <q>Bird</q> rows have a 
<codeph>kilos</codeph> value that is less than or equal to
+        the lightest bird, 1/6 of the <q>Mammal</q> rows have a 
<codeph>kilos</codeph> value that is less than or equal to
+        the lightest mammal, and so on until both the heaviest bird and 
heaviest mammal have a <codeph>CUME_DIST()</codeph>
+        value of 1.
+      </p>
+
+<codeblock>select name, kind, cume_dist() over (partition by kind order by 
kilos) from animals
++------------+--------+-----------------------+
+| name       | kind   | cume_dist() OVER(...) |
++------------+--------+-----------------------+
+| Ostrich    | Bird   | 1                     |
+| Condor     | Bird   | 0.6666666666666666    |
+| Owl        | Bird   | 0.3333333333333333    |
+| Elephant   | Mammal | 1                     |
+| Giraffe    | Mammal | 0.8333333333333334    |
+| Polar bear | Mammal | 0.6666666666666666    |
+| Horse      | Mammal | 0.5                   |
+| Housecat   | Mammal | 0.3333333333333333    |
+| Mouse      | Mammal | 0.1666666666666667    |
++------------+--------+-----------------------+
+</codeblock>
+
+      <p>
+        We can reverse the ordering within each partition group by using an 
<codeph>ORDER BY ... DESC</codeph>
+        clause within the <codeph>OVER()</codeph> clause. Now the lightest 
(smallest value of <codeph>kilos</codeph>)
+        animal of each kind has a <codeph>CUME_DIST()</codeph> value of 1.
+      </p>
+
+<codeblock>select name, kind, cume_dist() over (partition by kind order by 
kilos desc) from animals
++------------+--------+-----------------------+
+| name       | kind   | cume_dist() OVER(...) |
++------------+--------+-----------------------+
+| Owl        | Bird   | 1                     |
+| Condor     | Bird   | 0.6666666666666666    |
+| Ostrich    | Bird   | 0.3333333333333333    |
+| Mouse      | Mammal | 1                     |
+| Housecat   | Mammal | 0.8333333333333334    |
+| Horse      | Mammal | 0.6666666666666666    |
+| Polar bear | Mammal | 0.5                   |
+| Giraffe    | Mammal | 0.3333333333333333    |
+| Elephant   | Mammal | 0.1666666666666667    |
++------------+--------+-----------------------+
+</codeblock>
+
+      <p>
+        The following example manufactures some rows with identical values in 
the <codeph>kilos</codeph> column,
+        to demonstrate how the results look in case of tie values. For 
simplicity, it only shows the <codeph>CUME_DIST()</codeph>
+        sequence for the <q>Bird</q> rows. Now with 3 rows all with a value of 
15, all of those rows have the same
+        <codeph>CUME_DIST()</codeph> value. 4/5 of the rows have a value for 
<codeph>kilos</codeph> that is less than or
+        equal to 15.
+      </p>
+
+<codeblock>insert into animals values ('California Condor', 'Bird', 15), 
('Andean Condor', 'Bird', 15)
+
+select name, kind, cume_dist() over (order by kilos) from animals where kind = 
'Bird';
++-------------------+------+-----------------------+
+| name              | kind | cume_dist() OVER(...) |
++-------------------+------+-----------------------+
+| Ostrich           | Bird | 1                     |
+| Condor            | Bird | 0.8                   |
+| California Condor | Bird | 0.8                   |
+| Andean Condor     | Bird | 0.8                   |
+| Owl               | Bird | 0.2                   |
++-------------------+------+-----------------------+
+</codeblock>
+
+      <p>
+        The following example shows how to use an <codeph>ORDER BY</codeph> 
clause in the outer block
+        to order the result set in case of ties. Here, all the <q>Bird</q> 
rows are together, then in descending order
+        by the result of the <codeph>CUME_DIST()</codeph> function, and all 
tied <codeph>CUME_DIST()</codeph>
+        values are ordered by the animal name.
+      </p>
+
+<codeblock>select name, kind, cume_dist() over (partition by kind order by 
kilos) as ordering
+  from animals
+where
+  kind = 'Bird'
+order by kind, ordering desc, name;
++-------------------+------+----------+
+| name              | kind | ordering |
++-------------------+------+----------+
+| Ostrich           | Bird | 1        |
+| Andean Condor     | Bird | 0.8      |
+| California Condor | Bird | 0.8      |
+| Condor            | Bird | 0.8      |
+| Owl               | Bird | 0.2      |
++-------------------+------+----------+
+</codeblock>
+
+    </conbody>
+
+  </concept>
+
+  <concept rev="2.0.0" id="dense_rank">
+
+    <title>DENSE_RANK() Function</title>
+
+    <conbody>
+
+      <p>
+        Returns an ascending sequence of integers, starting with 1. The output 
sequence produces duplicate integers
+        for duplicate values of the <codeph>ORDER BY</codeph> expressions. 
After generating duplicate output values
+        for the <q>tied</q> input values, the function continues the sequence 
with the next higher integer.
+        Therefore, the sequence contains duplicates but no gaps when the input 
contains duplicates. Starts the
+        sequence over for each group produced by the <codeph>PARTITIONED 
BY</codeph> clause.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>DENSE_RANK() OVER([<varname>partition_by_clause</varname>] 
<varname>order_by_clause</varname>)</codeblock>
+
+      <p>
+        The <codeph>PARTITION BY</codeph> clause is optional. The 
<codeph>ORDER BY</codeph> clause is required. The
+        window clause is not allowed.
+      </p>
+
+<!-- Can make the text for ROW_NUMBER, RANK, and DENSE_RANK identical
+     so it can be conref'ed in all 3 places. -->
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        Often used for top-N and bottom-N queries. For example, it could 
produce a <q>top 10</q> report including
+        all the items with the 10 highest values, even if several items tied 
for 1st place.
+      </p>
+
+      <p>
+        Similar to <codeph>ROW_NUMBER</codeph> and <codeph>RANK</codeph>. 
These functions differ in how they treat
+        duplicate combinations of values.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        The following example demonstrates how the 
<codeph>DENSE_RANK()</codeph> function identifies where each
+        value <q>places</q> in the result set, producing the same result for 
duplicate values, but with a strict
+        sequence from 1 to the number of groups. For example, when results are 
ordered by the <codeph>X</codeph>
+        column, both <codeph>1</codeph> values are tied for first; both 
<codeph>2</codeph> values are tied for
+        second; and so on.
+      </p>
+
+<codeblock>select x, dense_rank() over(order by x) as rank, property from 
int_t;
++----+------+----------+
+| x  | rank | property |
++----+------+----------+
+| 1  | 1    | square   |
+| 1  | 1    | odd      |
+| 2  | 2    | even     |
+| 2  | 2    | prime    |
+| 3  | 3    | prime    |
+| 3  | 3    | odd      |
+| 4  | 4    | even     |
+| 4  | 4    | square   |
+| 5  | 5    | odd      |
+| 5  | 5    | prime    |
+| 6  | 6    | even     |
+| 6  | 6    | perfect  |
+| 7  | 7    | lucky    |
+| 7  | 7    | lucky    |
+| 7  | 7    | lucky    |
+| 7  | 7    | odd      |
+| 7  | 7    | prime    |
+| 8  | 8    | even     |
+| 9  | 9    | square   |
+| 9  | 9    | odd      |
+| 10 | 10   | round    |
+| 10 | 10   | even     |
++----+------+----------+
+</codeblock>
+
+      <p>
+        The following examples show how the <codeph>DENSE_RANK()</codeph> 
function is affected by the
+        <codeph>PARTITION</codeph> property within the <codeph>ORDER 
BY</codeph> clause.
+      </p>
+
+      <p>
+        Partitioning by the <codeph>PROPERTY</codeph> column groups all the 
even, odd, and so on values together,
+        and <codeph>DENSE_RANK()</codeph> returns the place of each value 
within the group, producing several
+        ascending sequences.
+      </p>
+
+<codeblock>select x, dense_rank() over(partition by property order by x) as 
rank, property from int_t;
++----+------+----------+
+| x  | rank | property |
++----+------+----------+
+| 2  | 1    | even     |
+| 4  | 2    | even     |
+| 6  | 3    | even     |
+| 8  | 4    | even     |
+| 10 | 5    | even     |
+| 7  | 1    | lucky    |
+| 7  | 1    | lucky    |
+| 7  | 1    | lucky    |
+| 1  | 1    | odd      |
+| 3  | 2    | odd      |
+| 5  | 3    | odd      |
+| 7  | 4    | odd      |
+| 9  | 5    | odd      |
+| 6  | 1    | perfect  |
+| 2  | 1    | prime    |
+| 3  | 2    | prime    |
+| 5  | 3    | prime    |
+| 7  | 4    | prime    |
+| 10 | 1    | round    |
+| 1  | 1    | square   |
+| 4  | 2    | square   |
+| 9  | 3    | square   |
++----+------+----------+
+</codeblock>
+
+      <p>
+        Partitioning by the <codeph>X</codeph> column groups all the duplicate 
numbers together and returns the
+        place each each value within the group; because each value occurs only 
1 or 2 times,
+        <codeph>DENSE_RANK()</codeph> designates each <codeph>X</codeph> value 
as either first or second within its
+        group.
+      </p>
+
+<codeblock>select x, dense_rank() over(partition by x order by property) as 
rank, property from int_t;
++----+------+----------+
+| x  | rank | property |
++----+------+----------+
+| 1  | 1    | odd      |
+| 1  | 2    | square   |
+| 2  | 1    | even     |
+| 2  | 2    | prime    |
+| 3  | 1    | odd      |
+| 3  | 2    | prime    |
+| 4  | 1    | even     |
+| 4  | 2    | square   |
+| 5  | 1    | odd      |
+| 5  | 2    | prime    |
+| 6  | 1    | even     |
+| 6  | 2    | perfect  |
+| 7  | 1    | lucky    |
+| 7  | 1    | lucky    |
+| 7  | 1    | lucky    |
+| 7  | 2    | odd      |
+| 7  | 3    | prime    |
+| 8  | 1    | even     |
+| 9  | 1    | odd      |
+| 9  | 2    | square   |
+| 10 | 1    | even     |
+| 10 | 2    | round    |
++----+------+----------+
+</codeblock>
+
+      <p>
+        The following example shows how <codeph>DENSE_RANK()</codeph> produces 
a continuous sequence while still
+        allowing for ties. In this case, Croesus and Midas both have the 
second largest fortune, while Crassus has
+        the third largest. (In <xref 
href="impala_analytic_functions.xml#rank"/>, you see a similar query with the
+        <codeph>RANK()</codeph> function that shows that while Crassus has the 
third largest fortune, he is the
+        fourth richest person.)
+      </p>
+
+<codeblock>select dense_rank() over (order by net_worth desc) as placement, 
name, net_worth from wealth order by placement, name;
++-----------+---------+---------------+
+| placement | name    | net_worth     |
++-----------+---------+---------------+
+| 1         | Solomon | 2000000000.00 |
+| 2         | Croesus | 1000000000.00 |
+| 2         | Midas   | 1000000000.00 |
+| 3         | Crassus | 500000000.00  |
+| 4         | Scrooge | 80000000.00   |
++-----------+---------+---------------+
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/related_info"/>
+
+      <p>
+        <xref href="impala_analytic_functions.xml#rank"/>, <xref 
href="impala_analytic_functions.xml#row_number"/>
+      </p>
+
+    </conbody>
+
+  </concept>
+
+  <concept rev="2.0.0" id="first_value">
+
+    <title>FIRST_VALUE() Function</title>
+
+    <conbody>
+
+      <p>
+        Returns the expression value from the first row in the window. The 
return value is <codeph>NULL</codeph> if
+        the input expression is <codeph>NULL</codeph>.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>FIRST_VALUE(<varname>expr</varname>) 
OVER([<varname>partition_by_clause</varname>] 
<varname>order_by_clause</varname> 
[<varname>window_clause</varname>])</codeblock>
+
+      <p>
+        The <codeph>PARTITION BY</codeph> clause is optional. The 
<codeph>ORDER BY</codeph> clause is required. The
+        window clause is optional.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        If any duplicate values occur in the tuples evaluated by the 
<codeph>ORDER BY</codeph> clause, the result
+        of this function is not deterministic. Consider adding additional 
<codeph>ORDER BY</codeph> columns to
+        ensure consistent ordering.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        The following example shows a table with a wide variety of 
country-appropriate greetings. For consistency,
+        we want to standardize on a single greeting for each country. The 
<codeph>FIRST_VALUE()</codeph> function
+        helps to produce a mail merge report where every person from the same 
country is addressed with the same
+        greeting.
+      </p>
+
+<codeblock>select name, country, greeting from mail_merge
++---------+---------+--------------+
+| name    | country | greeting     |
++---------+---------+--------------+
+| Pete    | USA     | Hello        |
+| John    | USA     | Hi           |
+| Boris   | Germany | Guten tag    |
+| Michael | Germany | Guten morgen |
+| Bjorn   | Sweden  | Hej          |
+| Mats    | Sweden  | Tja          |
++---------+---------+--------------+
+
+select country, name,
+  first_value(greeting)
+    over (partition by country order by name, greeting) as greeting
+  from mail_merge;
++---------+---------+-----------+
+| country | name    | greeting  |
++---------+---------+-----------+
+| Germany | Boris   | Guten tag |
+| Germany | Michael | Guten tag |
+| Sweden  | Bjorn   | Hej       |
+| Sweden  | Mats    | Hej       |
+| USA     | John    | Hi        |
+| USA     | Pete    | Hi        |
++---------+---------+-----------+
+</codeblock>
+
+      <p>
+        Changing the order in which the names are evaluated changes which 
greeting is applied to each group.
+      </p>
+
+<codeblock>select country, name,
+  first_value(greeting)
+    over (partition by country order by name desc, greeting) as greeting
+  from mail_merge;
++---------+---------+--------------+
+| country | name    | greeting     |
++---------+---------+--------------+
+| Germany | Michael | Guten morgen |
+| Germany | Boris   | Guten morgen |
+| Sweden  | Mats    | Tja          |
+| Sweden  | Bjorn   | Tja          |
+| USA     | Pete    | Hello        |
+| USA     | John    | Hello        |
++---------+---------+--------------+
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/related_info"/>
+
+      <p>
+        <xref href="impala_analytic_functions.xml#last_value"/>
+      </p>
+
+    </conbody>
+
+  </concept>
+
+  <concept rev="2.0.0" id="lag">
+
+    <title>LAG() Function</title>
+
+    <conbody>
+
+      <p>
+        This function returns the value of an expression using column values 
from a preceding row. You specify an
+        integer offset, which designates a row position some number of rows 
previous to the current row. Any column
+        references in the expression argument refer to column values from that 
prior row. Typically, the table
+        contains a time sequence or numeric sequence column that clearly 
distinguishes the ordering of the rows.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>LAG (<varname>expr</varname> [, <varname>offset</varname>] [, 
<varname>default</varname>])
+  OVER ([<varname>partition_by_clause</varname>] 
<varname>order_by_clause</varname>)</codeblock>
+
+      <p>
+        The <codeph>ORDER BY</codeph> clause is required. The 
<codeph>PARTITION BY</codeph> clause is optional. The
+        window clause is not allowed.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        Sometimes used an an alternative to doing a self-join.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        The following example uses the same stock data created in <xref 
href="#window_clause"/>. For each day, the
+        query prints the closing price alongside the previous day's closing 
price. The first row for each stock
+        symbol has no previous row, so that <codeph>LAG()</codeph> value is 
<codeph>NULL</codeph>.
+      </p>
+
+<codeblock>select stock_symbol, closing_date, closing_price,
+    lag(closing_price,1) over (partition by stock_symbol order by 
closing_date) as "yesterday closing"
+  from stock_ticker
+    order by closing_date;
++--------------+---------------------+---------------+-------------------+
+| stock_symbol | closing_date        | closing_price | yesterday closing |
++--------------+---------------------+---------------+-------------------+
+| JDR          | 2014-09-13 00:00:00 | 12.86         | NULL              |
+| JDR          | 2014-09-14 00:00:00 | 12.89         | 12.86             |
+| JDR          | 2014-09-15 00:00:00 | 12.94         | 12.89             |
+| JDR          | 2014-09-16 00:00:00 | 12.55         | 12.94             |
+| JDR          | 2014-09-17 00:00:00 | 14.03         | 12.55             |
+| JDR          | 2014-09-18 00:00:00 | 14.75         | 14.03             |
+| JDR          | 2014-09-19 00:00:00 | 13.98         | 14.75             |
++--------------+---------------------+---------------+-------------------+
+</codeblock>
+
+      <p>
+        The following example does an arithmetic operation between the current 
row and a value from the previous
+        row, to produce a delta value for each day. This example also 
demonstrates how <codeph>ORDER BY</codeph>
+        works independently in the different parts of the query. The 
<codeph>ORDER BY closing_date</codeph> in the
+        <codeph>OVER</codeph> clause makes the query analyze the rows in 
chronological order. Then the outer query
+        block uses <codeph>ORDER BY closing_date DESC</codeph> to present the 
results with the most recent date
+        first.
+      </p>
+
+<codeblock>select stock_symbol, closing_date, closing_price,
+    cast(
+      closing_price - lag(closing_price,1) over
+        (partition by stock_symbol order by closing_date)
+      as decimal(8,2)
+    )
+    as "change from yesterday"
+  from stock_ticker
+    order by closing_date desc;
++--------------+---------------------+---------------+-----------------------+
+| stock_symbol | closing_date        | closing_price | change from yesterday |
++--------------+---------------------+---------------+-----------------------+
+| JDR          | 2014-09-19 00:00:00 | 13.98         | -0.76                 |
+| JDR          | 2014-09-18 00:00:00 | 14.75         | 0.72                  |
+| JDR          | 2014-09-17 00:00:00 | 14.03         | 1.47                  |
+| JDR          | 2014-09-16 00:00:00 | 12.55         | -0.38                 |
+| JDR          | 2014-09-15 00:00:00 | 12.94         | 0.04                  |
+| JDR          | 2014-09-14 00:00:00 | 12.89         | 0.03                  |
+| JDR          | 2014-09-13 00:00:00 | 12.86         | NULL                  |
++--------------+---------------------+---------------+-----------------------+
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/related_info"/>
+
+      <p>
+        This function is the converse of <xref 
href="impala_analytic_functions.xml#lead"/>.
+      </p>
+
+    </conbody>
+
+  </concept>
+
+  <concept rev="2.0.0" id="last_value">
+
+    <title>LAST_VALUE() Function</title>
+
+    <conbody>
+
+      <p>
+        Returns the expression value from the last row in the window. This 
same value is repeated for all result
+        rows for the group. The return value is <codeph>NULL</codeph> if the 
input expression is
+        <codeph>NULL</codeph>.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>LAST_VALUE(<varname>expr</varname>) 
OVER([<varname>partition_by_clause</varname>] 
<varname>order_by_clause</varname> 
[<varname>window_clause</varname>])</codeblock>
+
+      <p>
+        The <codeph>PARTITION BY</codeph> clause is optional. The 
<codeph>ORDER BY</codeph> clause is required. The
+        window clause is optional.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        If any duplicate values occur in the tuples evaluated by the 
<codeph>ORDER BY</codeph> clause, the result
+        of this function is not deterministic. Consider adding additional 
<codeph>ORDER BY</codeph> columns to
+        ensure consistent ordering.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        The following example uses the same <codeph>MAIL_MERGE</codeph> table 
as in the example for
+        <xref href="impala_analytic_functions.xml#first_value"/>. Because the 
default window when <codeph>ORDER
+        BY</codeph> is used is <codeph>BETWEEN UNBOUNDED PRECEDING AND CURRENT 
ROW</codeph>, the query requires the
+        <codeph>UNBOUNDED FOLLOWING</codeph> to look ahead to subsequent rows 
and find the last value for each
+        country.
+      </p>
+
+<codeblock>select country, name,
+  last_value(greeting) over (
+    partition by country order by name, greeting
+    rows between unbounded preceding and unbounded following
+  ) as greeting
+  from mail_merge
++---------+---------+--------------+
+| country | name    | greeting     |
++---------+---------+--------------+
+| Germany | Boris   | Guten morgen |
+| Germany | Michael | Guten morgen |
+| Sweden  | Bjorn   | Tja          |
+| Sweden  | Mats    | Tja          |
+| USA     | John    | Hello        |
+| USA     | Pete    | Hello        |
++---------+---------+--------------+
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/related_info"/>
+
+      <p>
+        <xref href="impala_analytic_functions.xml#first_value"/>
+      </p>
+
+    </conbody>
+
+  </concept>
+
+  <concept rev="2.0.0" id="lead">
+
+    <title>LEAD() Function</title>
+
+    <conbody>
+
+      <p>
+        This function returns the value of an expression using column values 
from a following row. You specify an
+        integer offset, which designates a row position some number of rows 
after to the current row. Any column
+        references in the expression argument refer to column values from that 
later row. Typically, the table
+        contains a time sequence or numeric sequence column that clearly 
distinguishes the ordering of the rows.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>LEAD (<varname>expr</varname> [, <varname>offset</varname>] [, 
<varname>default</varname>])
+  OVER ([<varname>partition_by_clause</varname>] 
<varname>order_by_clause</varname>)</codeblock>
+
+      <p>
+        The <codeph>ORDER BY</codeph> clause is required. The 
<codeph>PARTITION BY</codeph> clause is optional. The
+        window clause is not allowed.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        Sometimes used an an alternative to doing a self-join.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        The following example uses the same stock data created in <xref 
href="#window_clause"/>. The query analyzes
+        the closing price for a stock symbol, and for each day evaluates if 
the closing price for the following day
+        is higher or lower.
+      </p>
+
+<codeblock>select stock_symbol, closing_date, closing_price,
+  case
+    (lead(closing_price,1)
+      over (partition by stock_symbol order by closing_date)
+        - closing_price) &gt; 0
+    when true then "higher"
+    when false then "flat or lower"
+  end as "trending"
+from stock_ticker
+  order by closing_date;
++--------------+---------------------+---------------+---------------+
+| stock_symbol | closing_date        | closing_price | trending      |
++--------------+---------------------+---------------+---------------+
+| JDR          | 2014-09-13 00:00:00 | 12.86         | higher        |
+| JDR          | 2014-09-14 00:00:00 | 12.89         | higher        |
+| JDR          | 2014-09-15 00:00:00 | 12.94         | flat or lower |
+| JDR          | 2014-09-16 00:00:00 | 12.55         | higher        |
+| JDR          | 2014-09-17 00:00:00 | 14.03         | higher        |
+| JDR          | 2014-09-18 00:00:00 | 14.75         | flat or lower |
+| JDR          | 2014-09-19 00:00:00 | 13.98         | NULL          |
++--------------+---------------------+---------------+---------------+
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/related_info"/>
+
+      <p>
+        This function is the converse of <xref 
href="impala_analytic_functions.xml#lag"/>.
+      </p>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="max_analytic">
+
+    <title>MAX() Function - Analytic Context</title>
+
+    <conbody>
+
+      <p>
+        You can include an <codeph>OVER</codeph> clause with a call to this 
function to use it as an analytic
+        function. See <xref href="impala_max.xml#max"/> for details and 
examples.
+      </p>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="min_analytic">
+
+    <title>MIN() Function - Analytic Context</title>
+
+    <conbody>
+
+      <p>
+        You can include an <codeph>OVER</codeph> clause with a call to this 
function to use it as an analytic
+        function. See <xref href="impala_min.xml#min"/> for details and 
examples.
+      </p>
+
+    </conbody>
+
+  </concept>
+
+  <concept audience="Cloudera" rev="2.x.x" id="nth_value">
+
+    <title>NTH_VALUE() Function</title>
+
+    <conbody>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+    </conbody>
+
+  </concept>
+
+  <concept rev="2.3.0" id="ntile">
+
+    <title>NTILE() Function (CDH 5.5 or higher only)</title>
+
+    <conbody>
+
+      <p>
+        Returns the <q>bucket number</q> associated with each row, between 1 
and the value of an expression. For
+        example, creating 100 buckets puts the lowest 1% of values in the 
first bucket, while creating 10 buckets
+        puts the lowest 10% of values in the first bucket. Each partition can 
have a different number of buckets.
+<!-- What's the syntax or data distribution that would create a different 
number of buckets per partition? -->
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>NTILE (<varname>expr</varname> [, <varname>offset</varname> ...]
+  OVER ([<varname>partition_by_clause</varname>] 
<varname>order_by_clause</varname>)</codeblock>
+
+      <p>
+        The <codeph>ORDER BY</codeph> clause is required. The 
<codeph>PARTITION BY</codeph> clause is optional. The
+        window clause is not allowed.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        The <q>ntile</q> name is derived from the practice of dividing result 
sets into fourths (quartile), tenths
+        (decile), and so on. The <codeph>NTILE()</codeph> function divides the 
result set based on an arbitrary
+        percentile value.
+      </p>
+
+      <p>
+        The number of buckets must be a positive integer.
+      </p>
+
+      <p>
+        The number of items in each bucket is identical or almost so, varying 
by at most 1. If the number of items
+        does not divide evenly between the buckets, the remaining N items are 
divided evenly among the first N
+        buckets.
+      </p>
+
+      <p>
+        If the number of buckets N is greater than the number of input rows in 
the partition, then the first N
+        buckets each contain one item, and the remaining buckets are empty.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        The following example shows divides groups of animals into 4 buckets 
based on their weight. The
+        <codeph>ORDER BY ... DESC</codeph> clause in the 
<codeph>OVER()</codeph> clause means that the heaviest 25%
+        are in the first group, and the lightest 25% are in the fourth group. 
(The <codeph>ORDER BY</codeph> in the
+        outermost part of the query shows how you can order the final result 
set independently from the order in
+        which the rows are evaluated by the <codeph>OVER()</codeph> clause.) 
Because there are 9 rows in the group,
+        divided into 4 buckets, the first bucket receives the extra item.
+      </p>
+
+<codeblock>create table animals (name string, kind string, kilos decimal(9,3));
+
+insert into animals values
+  ('Elephant', 'Mammal', 4000), ('Giraffe', 'Mammal', 1200), ('Mouse', 
'Mammal', 0.020),
+  ('Condor', 'Bird', 15), ('Horse', 'Mammal', 500), ('Owl', 'Bird', 2.5),
+  ('Ostrich', 'Bird', 145), ('Polar bear', 'Mammal', 700), ('Housecat', 
'Mammal', 5);
+
+select name, ntile(4) over (order by kilos desc) as quarter
+  from animals
+order by quarter desc;
++------------+---------+
+| name       | quarter |
++------------+---------+
+| Owl        | 4       |
+| Mouse      | 4       |
+| Condor     | 3       |
+| Housecat   | 3       |
+| Horse      | 2       |
+| Ostrich    | 2       |
+| Elephant   | 1       |
+| Giraffe    | 1       |
+| Polar bear | 1       |
++------------+---------+
+</codeblock>
+
+      <p>
+        The following examples show how the <codeph>PARTITION</codeph> clause 
works for the
+        <codeph>NTILE()</codeph> function. Here, we divide each kind of animal 
(mammal or bird) into 2 buckets,
+        the heavier half and the lighter half.
+      </p>
+
+<codeblock>select name, kind, ntile(2) over (partition by kind order by kilos 
desc) as half
+  from animals
+order by kind;
++------------+--------+------+
+| name       | kind   | half |
++------------+--------+------+
+| Ostrich    | Bird   | 1    |
+| Condor     | Bird   | 1    |
+| Owl        | Bird   | 2    |
+| Elephant   | Mammal | 1    |
+| Giraffe    | Mammal | 1    |
+| Polar bear | Mammal | 1    |
+| Horse      | Mammal | 2    |
+| Housecat   | Mammal | 2    |
+| Mouse      | Mammal | 2    |
++------------+--------+------+
+</codeblock>
+
+      <p>
+        Again, the result set can be ordered independently
+        from the analytic evaluation. This next example lists all the animals 
heaviest to lightest,
+        showing that elephant and giraffe are in the <q>top half</q> of 
mammals by weight, while
+        housecat and mouse are in the <q>bottom half</q>.
+      </p>
+
+<codeblock>select name, kind, ntile(2) over (partition by kind order by kilos 
desc) as half
+  from animals
+order by kilos desc;
++------------+--------+------+
+| name       | kind   | half |
++------------+--------+------+
+| Elephant   | Mammal | 1    |
+| Giraffe    | Mammal | 1    |
+| Polar bear | Mammal | 1    |
+| Horse      | Mammal | 2    |
+| Ostrich    | Bird   | 1    |
+| Condor     | Bird   | 1    |
+| Housecat   | Mammal | 2    |
+| Owl        | Bird   | 2    |
+| Mouse      | Mammal | 2    |
++------------+--------+------+
+</codeblock>
+
+    </conbody>
+
+  </concept>
+
+  <concept rev="2.3.0" id="percent_rank">
+
+    <title>PERCENT_RANK() Function (CDH 5.5 or higher only)</title>
+
+    <conbody>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>PERCENT_RANK (<varname>expr</varname>)
+  OVER ([<varname>partition_by_clause</varname>] 
<varname>order_by_clause</varname>)
+</codeblock>
+
+      <p>
+      Calculates the rank, expressed as a percentage, of each row within a 
group of rows.
+      If <codeph>rank</codeph> is the value for that same row from the 
<codeph>RANK()</codeph> function (from 1 to the total number of rows in the 
partition group),
+      then the <codeph>PERCENT_RANK()</codeph> value is calculated as 
<codeph>(<varname>rank</varname> - 1) / (<varname>rows_in_group</varname> - 
1)</codeph> .
+      If there is only a single item in the partition group, its 
<codeph>PERCENT_RANK()</codeph> value is 0.
+      </p>
+
+      <p>
+        The <codeph>ORDER BY</codeph> clause is required. The 
<codeph>PARTITION BY</codeph> clause is optional. The
+        window clause is not allowed.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        This function is similar to the <codeph>RANK</codeph> and 
<codeph>CUME_DIST()</codeph> functions: it returns an ascending sequence 
representing the position of each
+        row within the rows of the same partition group. The actual numeric 
sequence is calculated differently,
+        and the handling of duplicate (tied) values is different.
+      </p>
+
+      <p>
+        The return values range from 0 to 1 inclusive.
+        The first row in each partition group always has the value 0.
+        A <codeph>NULL</codeph> value is considered the lowest possible value.
+        In the case of duplicate input values, all the corresponding rows in 
the result set
+        have an identical value: the lowest <codeph>PERCENT_RANK()</codeph> 
value of those
+        tied rows. (In contrast to <codeph>CUME_DIST()</codeph>, where all 
tied rows have
+        the highest <codeph>CUME_DIST()</codeph> value.)
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        The following example uses the same <codeph>ANIMALS</codeph> table as 
the examples for <codeph>CUME_DIST()</codeph>
+        and <codeph>NTILE()</codeph>, with a few additional rows to illustrate 
the results where some values are
+        <codeph>NULL</codeph> or there is only a single row in a partition 
group.
+      </p>
+
+<codeblock>insert into animals values ('Komodo dragon', 'Reptile', 70);
+insert into animals values ('Unicorn', 'Mythical', NULL);
+insert into animals values ('Fire-breathing dragon', 'Mythical', NULL);
+</codeblock>
+
+      <p>
+        As with <codeph>CUME_DIST()</codeph>, there is an ascending sequence 
for each kind of animal.
+        For example, the <q>Birds</q> and <q>Mammals</q> rows each have a 
<codeph>PERCENT_RANK()</codeph> sequence
+        that ranges from 0 to 1.
+        The <q>Reptile</q> row has a <codeph>PERCENT_RANK()</codeph> of 0 
because that partition group contains only a single item.
+        Both <q>Mythical</q> animals have a <codeph>PERCENT_RANK()</codeph> of 
0 because
+        a <codeph>NULL</codeph> is considered the lowest value within its 
partition group.
+      </p>
+
+<codeblock>select name, kind, percent_rank() over (partition by kind order by 
kilos) from animals;
++-----------------------+----------+--------------------------+
+| name                  | kind     | percent_rank() OVER(...) |
++-----------------------+----------+--------------------------+
+| Mouse                 | Mammal   | 0                        |
+| Housecat              | Mammal   | 0.2                      |
+| Horse                 | Mammal   | 0.4                      |
+| Polar bear            | Mammal   | 0.6                      |
+| Giraffe               | Mammal   | 0.8                      |
+| Elephant              | Mammal   | 1                        |
+| Komodo dragon         | Reptile  | 0                        |
+| Owl                   | Bird     | 0                        |
+| California Condor     | Bird     | 0.25                     |
+| Andean Condor         | Bird     | 0.25                     |
+| Condor                | Bird     | 0.25                     |
+| Ostrich               | Bird     | 1                        |
+| Fire-breathing dragon | Mythical | 0                        |
+| Unicorn               | Mythical | 0                        |
++-----------------------+----------+--------------------------+
+</codeblock>
+    </conbody>
+
+  </concept>
+
+  <concept rev="2.0.0" id="rank">
+
+    <title>RANK() Function</title>
+
+    <conbody>
+
+      <p>
+        Returns an ascending sequence of integers, starting with 1. The output 
sequence produces duplicate integers
+        for duplicate values of the <codeph>ORDER BY</codeph> expressions. 
After generating duplicate output values
+        for the <q>tied</q> input values, the function increments the sequence 
by the number of tied values.
+        Therefore, the sequence contains both duplicates and gaps when the 
input contains duplicates. Starts the
+        sequence over for each group produced by the <codeph>PARTITIONED 
BY</codeph> clause.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>RANK() OVER([<varname>partition_by_clause</varname>] 
<varname>order_by_clause</varname>)</codeblock>
+
+      <p>
+        The <codeph>PARTITION BY</codeph> clause is optional. The 
<codeph>ORDER BY</codeph> clause is required. The
+        window clause is not allowed.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+<!-- Make a little tutorial to show these 3 functions side-by-side and 
illustrate their difference. -->
+
+      <p>
+        Often used for top-N and bottom-N queries. For example, it could 
produce a <q>top 10</q> report including
+        several items that were tied for 10th place.
+      </p>
+
+      <p>
+        Similar to <codeph>ROW_NUMBER</codeph> and 
<codeph>DENSE_RANK</codeph>. These functions differ in how they
+        treat duplicate combinations of values.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        The following example demonstrates how the <codeph>RANK()</codeph> 
function identifies where each value
+        <q>places</q> in the result set, producing the same result for 
duplicate values, and skipping values in the
+        sequence to account for the number of duplicates. For example, when 
results are ordered by the
+        <codeph>X</codeph> column, both <codeph>1</codeph> values are tied for 
first; both <codeph>2</codeph>
+        values are tied for third; and so on.
+      </p>
+
+<codeblock>select x, rank() over(order by x) as rank, property from int_t;
++----+------+----------+
+| x  | rank | property |
++----+------+----------+
+| 1  | 1    | square   |
+| 1  | 1    | odd      |
+| 2  | 3    | even     |
+| 2  | 3    | prime    |
+| 3  | 5    | prime    |
+| 3  | 5    | odd      |
+| 4  | 7    | even     |
+| 4  | 7    | square   |
+| 5  | 9    | odd      |
+| 5  | 9    | prime    |
+| 6  | 11   | even     |
+| 6  | 11   | perfect  |
+| 7  | 13   | lucky    |
+| 7  | 13   | lucky    |
+| 7  | 13   | lucky    |
+| 7  | 13   | odd      |
+| 7  | 13   | prime    |
+| 8  | 18   | even     |
+| 9  | 19   | square   |
+| 9  | 19   | odd      |
+| 10 | 21   | round    |
+| 10 | 21   | even     |
++----+------+----------+
+</codeblock>
+
+      <p>
+        The following examples show how the <codeph>RANK()</codeph> function 
is affected by the
+        <codeph>PARTITION</codeph> property within the <codeph>ORDER 
BY</codeph> clause.
+      </p>
+
+      <p>
+        Partitioning by the <codeph>PROPERTY</codeph> column groups all the 
even, odd, and so on values together,
+        and <codeph>RANK()</codeph> returns the place of each value within the 
group, producing several ascending
+        sequences.
+      </p>
+
+<codeblock>select x, rank() over(partition by property order by x) as rank, 
property from int_t;
++----+------+----------+
+| x  | rank | property |
++----+------+----------+
+| 2  | 1    | even     |
+| 4  | 2    | even     |
+| 6  | 3    | even     |
+| 8  | 4    | even     |
+| 10 | 5    | even     |
+| 7  | 1    | lucky    |
+| 7  | 1    | lucky    |
+| 7  | 1    | lucky    |
+| 1  | 1    | odd      |
+| 3  | 2    | odd      |
+| 5  | 3    | odd      |
+| 7  | 4    | odd      |
+| 9  | 5    | odd      |
+| 6  | 1    | perfect  |
+| 2  | 1    | prime    |
+| 3  | 2    | prime    |
+| 5  | 3    | prime    |
+| 7  | 4    | prime    |
+| 10 | 1    | round    |
+| 1  | 1    | square   |
+| 4  | 2    | square   |
+| 9  | 3    | square   |
++----+------+----------+
+</codeblock>
+
+      <p>
+        Partitioning by the <codeph>X</codeph> column groups all the duplicate 
numbers together and returns the
+        place each each value within the group; because each value occurs only 
1 or 2 times,
+        <codeph>RANK()</codeph> designates each <codeph>X</codeph> value as 
either first or second within its
+        group.
+      </p>
+
+<codeblock>select x, rank() over(partition by x order by property) as rank, 
property from int_t;
++----+------+----------+
+| x  | rank | property |
++----+------+----------+
+| 1  | 1    | odd      |
+| 1  | 2    | square   |
+| 2  | 1    | even     |
+| 2  | 2    | prime    |
+| 3  | 1    | odd      |
+| 3  | 2    | prime    |
+| 4  | 1    | even     |
+| 4  | 2    | square   |
+| 5  | 1    | odd      |
+| 5  | 2    | prime    |
+| 6  | 1    | even     |
+| 6  | 2    | perfect  |
+| 7  | 1    | lucky    |
+| 7  | 1    | lucky    |
+| 7  | 1    | lucky    |
+| 7  | 4    | odd      |
+| 7  | 5    | prime    |
+| 8  | 1    | even     |
+| 9  | 1    | odd      |
+| 9  | 2    | square   |
+| 10 | 1    | even     |
+| 10 | 2    | round    |
++----+------+----------+
+</codeblock>
+
+      <p>
+        The following example shows how a magazine might prepare a list of 
history's wealthiest people. Croesus and
+        Midas are tied for second, then Crassus is fourth.
+      </p>
+
+<codeblock>select rank() over (order by net_worth desc) as rank, name, 
net_worth from wealth order by rank, name;
++------+---------+---------------+
+| rank | name    | net_worth     |
++------+---------+---------------+
+| 1    | Solomon | 2000000000.00 |
+| 2    | Croesus | 1000000000.00 |
+| 2    | Midas   | 1000000000.00 |
+| 4    | Crassus | 500000000.00  |
+| 5    | Scrooge | 80000000.00   |
++------+---------+---------------+
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/related_info"/>
+
+      <p>
+        <xref href="impala_analytic_functions.xml#dense_rank"/>,
+        <xref href="impala_analytic_functions.xml#row_number"/>
+      </p>
+
+    </conbody>
+
+  </concept>
+
+  <concept rev="2.0.0" id="row_number">
+
+    <title>ROW_NUMBER() Function</title>
+
+    <conbody>
+
+      <p>
+        Returns an ascending sequence of integers, starting with 1. Starts the 
sequence over for each group
+        produced by the <codeph>PARTITIONED BY</codeph> clause. The output 
sequence includes different values for
+        duplicate input values. Therefore, the sequence never contains any 
duplicates or gaps, regardless of
+        duplicate input values.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>ROW_NUMBER() OVER([<varname>partition_by_clause</varname>] 
<varname>order_by_clause</varname>)</codeblock>
+
+      <p>
+        The <codeph>ORDER BY</codeph> clause is required. The 
<codeph>PARTITION BY</codeph> clause is optional. The
+        window clause is not allowed.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        Often used for top-N and bottom-N queries where the input values are 
known to be unique, or precisely N
+        rows are needed regardless of duplicate values.
+      </p>
+
+      <p>
+        Because its result value is different for each row in the result set 
(when used without a <codeph>PARTITION
+        BY</codeph> clause), <codeph>ROW_NUMBER()</codeph> can be used to 
synthesize unique numeric ID values, for
+        example for result sets involving unique values or tuples.
+      </p>
+
+      <p>
+        Similar to <codeph>RANK</codeph> and <codeph>DENSE_RANK</codeph>. 
These functions differ in how they treat
+        duplicate combinations of values.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        The following example demonstrates how <codeph>ROW_NUMBER()</codeph> 
produces a continuous numeric
+        sequence, even though some values of <codeph>X</codeph> are repeated.
+      </p>
+
+<codeblock>select x, row_number() over(order by x, property) as row_number, 
property from int_t;
++----+------------+----------+
+| x  | row_number | property |
++----+------------+----------+
+| 1  | 1          | odd      |
+| 1  | 2          | square   |
+| 2  | 3          | even     |
+| 2  | 4          | prime    |
+| 3  | 5          | odd      |
+| 3  | 6          | prime    |
+| 4  | 7          | even     |
+| 4  | 8          | square   |
+| 5  | 9          | odd      |
+| 5  | 10         | prime    |
+| 6  | 11         | even     |
+| 6  | 12         | perfect  |
+| 7  | 13         | lucky    |
+| 7  | 14         | lucky    |
+| 7  | 15         | lucky    |
+| 7  | 16         | odd      |
+| 7  | 17         | prime    |
+| 8  | 18         | even     |
+| 9  | 19         | odd      |
+| 9  | 20         | square   |
+| 10 | 21         | even     |
+| 10 | 22         | round    |
++----+------------+----------+
+</codeblock>
+
+      <p>
+        The following example shows how a financial institution might assign 
customer IDs to some of history's
+        wealthiest figures. Although two of the people have identical net 
worth figures, unique IDs are required
+        for this purpose. <codeph>ROW_NUMBER()</codeph> produces a sequence of 
five different values for the five
+        input rows.
+      </p>
+
+<codeblock>select row_number() over (order by net_worth desc) as account_id, 
name, net_worth
+  from wealth order by account_id, name;
++------------+---------+---------------+
+| account_id | name    | net_worth     |
++------------+---------+---------------+
+| 1          | Solomon | 2000000000.00 |
+| 2          | Croesus | 1000000000.00 |
+| 3          | Midas   | 1000000000.00 |
+| 4          | Crassus | 500000000.00  |
+| 5          | Scrooge | 80000000.00   |
++------------+---------+---------------+
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/related_info"/>
+
+      <p>
+        <xref href="impala_analytic_functions.xml#rank"/>, <xref 
href="impala_analytic_functions.xml#dense_rank"/>
+      </p>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="sum_analytic">
+
+    <title>SUM() Function - Analytic Context</title>
+
+    <conbody>
+
+      <p>
+        You can include an <codeph>OVER</codeph> clause with a call to this 
function to use it as an analytic
+        function. See <xref href="impala_sum.xml#sum"/> for details and 
examples.
+      </p>
+
+    </conbody>
+
+  </concept>
+
+</concept>


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_appx_count_distinct.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_appx_count_distinct.xml 
b/docs/topics/impala_appx_count_distinct.xml
new file mode 100644
index 0000000..31a9679
--- /dev/null
+++ b/docs/topics/impala_appx_count_distinct.xml
@@ -0,0 +1,77 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="2.0.0" id="appx_count_distinct">
+
+  <title>APPX_COUNT_DISTINCT Query Option</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Query Options"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">APPX_COUNT_DISTINCT query 
option</indexterm>
+      Allows multiple <codeph>COUNT(DISTINCT)</codeph> operations within a 
single query, by internally rewriting
+      each <codeph>COUNT(DISTINCT)</codeph> to use the <codeph>NDV()</codeph> 
function. The resulting count is
+      approximate rather than precise.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/type_boolean"/>
+
+    <p conref="../shared/impala_common.xml#common/default_false_0"/>
+
+    <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+    <p>
+      The following examples show how the <codeph>APPX_COUNT_DISTINCT</codeph> 
lets you work around the restriction
+      where a query can only evaluate <codeph>COUNT(DISTINCT 
<varname>col_name</varname>)</codeph> for a single
+      column. By default, you can count the distinct values of one column or 
another, but not both in a single
+      query:
+    </p>
+
+<codeblock>[localhost:21000] &gt; select count(distinct x) from int_t;
++-------------------+
+| count(distinct x) |
++-------------------+
+| 10                |
++-------------------+
+[localhost:21000] &gt; select count(distinct property) from int_t;
++--------------------------+
+| count(distinct property) |
++--------------------------+
+| 7                        |
++--------------------------+
+[localhost:21000] &gt; select count(distinct x), count(distinct property) from 
int_t;
+ERROR: AnalysisException: all DISTINCT aggregate functions need to have the 
same set of parameters
+as count(DISTINCT x); deviating function: count(DISTINCT property)
+</codeblock>
+
+    <p>
+      When you enable the <codeph>APPX_COUNT_DISTINCT</codeph> query option, 
now the query with multiple
+      <codeph>COUNT(DISTINCT)</codeph> works. The reason this behavior 
requires a query option is that each
+      <codeph>COUNT(DISTINCT)</codeph> is rewritten internally to use the 
<codeph>NDV()</codeph> function instead,
+      which provides an approximate result rather than a precise count.
+    </p>
+
+<codeblock>[localhost:21000] &gt; set APPX_COUNT_DISTINCT=true;
+[localhost:21000] &gt; select count(distinct x), count(distinct property) from 
int_t;
++-------------------+--------------------------+
+| count(distinct x) | count(distinct property) |
++-------------------+--------------------------+
+| 10                | 7                        |
++-------------------+--------------------------+
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      <xref href="impala_count.xml#count"/>,
+      <xref href="impala_distinct.xml#distinct"/>,
+      <xref href="impala_ndv.xml#ndv"/>
+    </p>
+
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_appx_median.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_appx_median.xml 
b/docs/topics/impala_appx_median.xml
new file mode 100644
index 0000000..d874ead
--- /dev/null
+++ b/docs/topics/impala_appx_median.xml
@@ -0,0 +1,122 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="1.2.1" id="appx_median">
+
+  <title>APPX_MEDIAN Function</title>
+  <titlealts><navtitle>APPX_MEDIAN</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Impala Functions"/>
+      <data name="Category" value="Aggregate Functions"/>
+      <data name="Category" value="Querying"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">appx_median() function</indexterm>
+      An aggregate function that returns a value that is approximately the 
median (midpoint) of values in the set
+      of input values.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>APPX_MEDIAN([DISTINCT | ALL] <varname>expression</varname>)
+</codeblock>
+
+    <p>
+      This function works with any input type, because the only requirement is 
that the type supports less-than and
+      greater-than comparison operators.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+    <p>
+      Because the return value represents the estimated midpoint, it might not 
reflect the precise midpoint value,
+      especially if the cardinality of the input values is very high. If the 
cardinality is low (up to
+      approximately 20,000), the result is more accurate because the sampling 
considers all or almost all of the
+      different values.
+    </p>
+
+    <p 
conref="../shared/impala_common.xml#common/return_type_same_except_string"/>
+
+    <p>
+      The return value is always the same as one of the input values, not an 
<q>in-between</q> value produced by
+      averaging.
+    </p>
+
+<!-- <p 
conref="/Content/impala_common_xi44078.xml#common/restrictions_sliding_window"/>
 -->
+
+    <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
+
+    <p 
conref="../shared/impala_common.xml#common/analytic_not_allowed_caveat"/>
+
+    <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+    <p>
+      The following example uses a table of a million random floating-point 
numbers ranging up to approximately
+      50,000. The average is approximately 25,000. Because of the random 
distribution, we would expect the median
+      to be close to this same number. Computing the precise median is a more 
intensive operation than computing
+      the average, because it requires keeping track of every distinct value 
and how many times each occurs. The
+      <codeph>APPX_MEDIAN()</codeph> function uses a sampling algorithm to 
return an approximate result, which in
+      this case is close to the expected value. To make sure that the value is 
not substantially out of range due
+      to a skewed distribution, subsequent queries confirm that there are 
approximately 500,000 values higher than
+      the <codeph>APPX_MEDIAN()</codeph> value, and approximately 500,000 
values lower than the
+      <codeph>APPX_MEDIAN()</codeph> value.
+    </p>
+
+<codeblock>[localhost:21000] &gt; select min(x), max(x), avg(x) from 
million_numbers;
++-------------------+-------------------+-------------------+
+| min(x)            | max(x)            | avg(x)            |
++-------------------+-------------------+-------------------+
+| 4.725693727250069 | 49994.56852674231 | 24945.38563793553 |
++-------------------+-------------------+-------------------+
+[localhost:21000] &gt; select appx_median(x) from million_numbers;
++----------------+
+| appx_median(x) |
++----------------+
+| 24721.6        |
++----------------+
+[localhost:21000] &gt; select count(x) as higher from million_numbers where x 
&gt; (select appx_median(x) from million_numbers);
++--------+
+| higher |
++--------+
+| 502013 |
++--------+
+[localhost:21000] &gt; select count(x) as lower from million_numbers where x 
&lt; (select appx_median(x) from million_numbers);
++--------+
+| lower  |
++--------+
+| 497987 |
++--------+
+</codeblock>
+
+    <p>
+      The following example computes the approximate median using a subset of 
the values from the table, and then
+      confirms that the result is a reasonable estimate for the midpoint.
+    </p>
+
+<codeblock>[localhost:21000] &gt; select appx_median(x) from million_numbers 
where x between 1000 and 5000;
++-------------------+
+| appx_median(x)    |
++-------------------+
+| 3013.107787358159 |
++-------------------+
+[localhost:21000] &gt; select count(x) as higher from million_numbers where x 
between 1000 and 5000 and x &gt; 3013.107787358159;
++--------+
+| higher |
++--------+
+| 37692  |
++--------+
+[localhost:21000] &gt; select count(x) as lower from million_numbers where x 
between 1000 and 5000 and x &lt; 3013.107787358159;
++-------+
+| lower |
++-------+
+| 37089 |
++-------+
+</codeblock>
+  </conbody>
+</concept>

[19/22] incubator-impala git commit: First try at porting over the source files necessary for the Impala SQL Reference.

Reply via email to