[07/22] incubator-impala git commit: First try at porting over the source files necessary for the Impala SQL Reference.

jrussell Tue, 26 Jul 2016 16:05:13 -0700

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_operators.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_operators.xml b/docs/topics/impala_operators.xml
new file mode 100644
index 0000000..da3dab3
--- /dev/null
+++ b/docs/topics/impala_operators.xml
@@ -0,0 +1,1262 @@
+<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD 
DITA Concept//EN" "concept.dtd">
+<concept id="operators">
+
+  <title>SQL Operators</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Data Analysts"/>
+      <data name="Category" value="Developers"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">operators</indexterm>
+      SQL operators are a class of comparison functions that are widely used 
within the <codeph>WHERE</codeph>
+      clauses of <codeph>SELECT</codeph> statements.
+    </p>
+
+    <p outputclass="toc inpage"/>
+  </conbody>
+
+  <concept rev="1.4.0" id="arithmetic_operators">
+
+    <title>Arithmetic Operators</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">arithmetic operators</indexterm>
+        The arithmetic operators use expressions with a left-hand argument, 
the operator, and then (in most cases)
+        a right-hand argument.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock><varname>left_hand_arg</varname> <varname>binary_operator</varname> 
<varname>right_hand_arg</varname>
+<varname>unary_operator</varname> <varname>single_arg</varname>
+</codeblock>
+
+      <ul>
+        <li>
+          <codeph>+</codeph> and <codeph>-</codeph>: Can be used either as 
unary or binary operators.
+          <ul>
+            <li>
+              <p>
+                With unary notation, such as <codeph>+5</codeph>, 
<codeph>-2.5</codeph>, or
+                <codeph>-<varname>col_name</varname></codeph>, they multiply 
their single numeric argument by
+                <codeph>+1</codeph> or <codeph>-1</codeph>. Therefore, unary 
<codeph>+</codeph> returns its
+                argument unchanged, while unary <codeph>-</codeph> flips the 
sign of its argument. Although you can
+                double up these operators in expressions such as 
<codeph>++5</codeph> (always positive) or
+                <codeph>-+2</codeph> or <codeph>+-2</codeph> (both always 
negative), you cannot double the unary
+                minus operator because <codeph>--</codeph> is interpreted as 
the start of a comment. (You can use a
+                double unary minus operator if you separate the 
<codeph>-</codeph> characters, for example with a
+                space or parentheses.)
+              </p>
+            </li>
+
+            <li>
+              <p>
+                With binary notation, such as <codeph>2+2</codeph>, 
<codeph>5-2.5</codeph>, or
+                <codeph><varname>col1</varname> + 
<varname>col2</varname></codeph>, they add or subtract
+                respectively the right-hand argument to (or from) the 
left-hand argument. Both arguments must be of
+                numeric types.
+              </p>
+            </li>
+          </ul>
+        </li>
+
+        <li>
+          <p>
+            <codeph>*</codeph> and <codeph>/</codeph>: Multiplication and 
division respectively. Both arguments
+            must be of numeric types.
+          </p>
+          <p>
+            When multiplying, the shorter argument is promoted if necessary 
(such as <codeph>SMALLINT</codeph> to
+            <codeph>INT</codeph> or <codeph>BIGINT</codeph>, or 
<codeph>FLOAT</codeph> to <codeph>DOUBLE</codeph>),
+            and then the result is promoted again to the next larger type. 
Thus, multiplying a
+            <codeph>TINYINT</codeph> and an <codeph>INT</codeph> produces a 
<codeph>BIGINT</codeph> result.
+            Multiplying a <codeph>FLOAT</codeph> and a <codeph>FLOAT</codeph> 
produces a <codeph>DOUBLE</codeph>
+            result. Multiplying a <codeph>FLOAT</codeph> and a 
<codeph>DOUBLE</codeph> or a <codeph>DOUBLE</codeph>
+            and a <codeph>DOUBLE</codeph> produces a 
<codeph>DECIMAL(38,17)</codeph>, because
+            <codeph>DECIMAL</codeph> values can represent much larger and more 
precise values than
+            <codeph>DOUBLE</codeph>.
+          </p>
+          <p>
+            When dividing, Impala always treats the arguments and result as 
<codeph>DOUBLE</codeph> values to avoid
+            losing precision. If you need to insert the results of a division 
operation into a
+            <codeph>FLOAT</codeph> column, use the <codeph>CAST()</codeph> 
function to convert the result to the
+            correct type.
+          </p>
+        </li>
+
+        <li>
+          <p>
+            <codeph>%</codeph>: Modulo operator. Returns the remainder of the 
left-hand argument divided by the
+            right-hand argument. Both arguments must be of one of the integer 
types.
+          </p>
+        </li>
+
+        <li>
+          <p>
+            <codeph>&amp;</codeph>, <codeph>|</codeph>, <codeph>~</codeph>, 
and <codeph>^</codeph>: Bitwise operators that return the
+            logical AND, logical OR, <codeph>NOT</codeph>, or logical XOR 
(exclusive OR) of their argument values. Both arguments must be
+            of one of the integer types. If the arguments are of different 
type, the argument with the smaller type
+            is implicitly extended to match the argument with the longer type.
+          </p>
+        </li>
+      </ul>
+
+      <p>
+        You can chain a sequence of arithmetic expressions, optionally 
grouping them with parentheses.
+      </p>
+
+      <p>
+        The arithmetic operators generally do not have equivalent calling 
conventions using functional notation.
+        For example, prior to Impala 2.2.0 / CDH 5.4.0, there is no 
<codeph>MOD()</codeph> function equivalent to the <codeph>%</codeph> modulo
+        operator. Conversely, there are some arithmetic functions that do not 
have a corresponding operator. For
+        example, for exponentiation you use the <codeph>POW()</codeph> 
function, but there is no
+        <codeph>**</codeph> exponentiation operator. See <xref 
href="impala_math_functions.xml#math_functions"/>
+        for the arithmetic functions you can use.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+      <p 
conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/>
+      
+      <p 
conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/>
+
+      <p 
conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/>
+
+      <p rev="2.3.0">
+        The following example shows how to do an arithmetic operation using a 
numeric field of a <codeph>STRUCT</codeph> type
+        that is an item within an <codeph>ARRAY</codeph> column. Once the 
scalar numeric value <codeph>R_NATIONKEY</codeph>
+        is extracted, it can be used in an arithmetic expression, such as 
multiplying by 10:
+      </p>
+
+<codeblock rev="2.3.0">
+-- The SMALLINT is a field within an array of structs.
+describe region;
++-------------+-------------------------+---------+
+| name        | type                    | comment |
++-------------+-------------------------+---------+
+| r_regionkey | smallint                |         |
+| r_name      | string                  |         |
+| r_comment   | string                  |         |
+| r_nations   | array&lt;struct&lt;           |         |
+|             |   n_nationkey:smallint, |         |
+|             |   n_name:string,        |         |
+|             |   n_comment:string      |         |
+|             | &gt;&gt;                      |         |
++-------------+-------------------------+---------+
+
+-- When we refer to the scalar value using dot notation,
+-- we can use arithmetic and comparison operators on it
+-- like any other number.
+select r_name, nation.item.n_name, nation.item.n_nationkey * 10
+  from region, region.r_nations as nation
+where nation.item.n_nationkey &lt; 5;
++-------------+-------------+------------------------------+
+| r_name      | item.n_name | nation.item.n_nationkey * 10 |
++-------------+-------------+------------------------------+
+| AMERICA     | CANADA      | 30                           |
+| AMERICA     | BRAZIL      | 20                           |
+| AMERICA     | ARGENTINA   | 10                           |
+| MIDDLE EAST | EGYPT       | 40                           |
+| AFRICA      | ALGERIA     | 0                            |
++-------------+-------------+------------------------------+
+</codeblock>
+
+    </conbody>
+  </concept>
+
+  <concept id="between">
+
+    <title>BETWEEN Operator</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">BETWEEN operator</indexterm>
+        In a <codeph>WHERE</codeph> clause, compares an expression to both a 
lower and upper bound. The comparison
+        is successful is the expression is greater than or equal to the lower 
bound, and less than or equal to the
+        upper bound. If the bound values are switched, so the lower bound is 
greater than the upper bound, does not
+        match any values.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock><varname>expression</varname> BETWEEN 
<varname>lower_bound</varname> AND <varname>upper_bound</varname></codeblock>
+
+      <p>
+        <b>Data types:</b> Typically used with numeric data types. Works with 
any data type, although not very
+        practical for <codeph>BOOLEAN</codeph> values. (<codeph>BETWEEN false 
AND true</codeph> will match all
+        <codeph>BOOLEAN</codeph> values.) Use <codeph>CAST()</codeph> if 
necessary to ensure the lower and upper
+        bound values are compatible types. Call string or date/time functions 
if necessary to extract or transform
+        the relevant portion to compare, especially if the value can be 
transformed into a number.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        Be careful when using short string operands. A longer string that 
starts with the upper bound value will
+        not be included, because it is considered greater than the upper 
bound. For example, <codeph>BETWEEN 'A'
+        and 'M'</codeph> would not match the string value 
<codeph>'Midway'</codeph>. Use functions such as
+        <codeph>upper()</codeph>, <codeph>lower()</codeph>, 
<codeph>substr()</codeph>, <codeph>trim()</codeph>, and
+        so on if necessary to ensure the comparison works as expected.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+      <p 
conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+<codeblock>-- Retrieve data for January through June, inclusive.
+select c1 from t1 where month <b>between 1 and 6</b>;
+
+-- Retrieve data for names beginning with 'A' through 'M' inclusive.
+-- Only test the first letter to ensure all the values starting with 'M' are 
matched.
+-- Do a case-insensitive comparison to match names with various capitalization 
conventions.
+select last_name from customers where upper(substr(last_name,1,1)) <b>between 
'A' and 'M'</b>;
+
+-- Retrieve data for only the first week of each month.
+select count(distinct visitor_id)) from web_traffic where 
dayofmonth(when_viewed) <b>between 1 and 7</b>;</codeblock>
+
+      <p rev="2.3.0">
+        The following example shows how to do a <codeph>BETWEEN</codeph> 
comparison using a numeric field of a <codeph>STRUCT</codeph> type
+        that is an item within an <codeph>ARRAY</codeph> column. Once the 
scalar numeric value <codeph>R_NATIONKEY</codeph>
+        is extracted, it can be used in a comparison operator:
+      </p>
+
+<codeblock rev="2.3.0">
+-- The SMALLINT is a field within an array of structs.
+describe region;
++-------------+-------------------------+---------+
+| name        | type                    | comment |
++-------------+-------------------------+---------+
+| r_regionkey | smallint                |         |
+| r_name      | string                  |         |
+| r_comment   | string                  |         |
+| r_nations   | array&lt;struct&lt;           |         |
+|             |   n_nationkey:smallint, |         |
+|             |   n_name:string,        |         |
+|             |   n_comment:string      |         |
+|             | &gt;&gt;                      |         |
++-------------+-------------------------+---------+
+
+-- When we refer to the scalar value using dot notation,
+-- we can use arithmetic and comparison operators on it
+-- like any other number.
+select r_name, nation.item.n_name, nation.item.n_nationkey
+from region, region.r_nations as nation
+where nation.item.n_nationkey between 3 and 5
++-------------+-------------+------------------+
+| r_name      | item.n_name | item.n_nationkey |
++-------------+-------------+------------------+
+| AMERICA     | CANADA      | 3                |
+| MIDDLE EAST | EGYPT       | 4                |
+| AFRICA      | ETHIOPIA    | 5                |
++-------------+-------------+------------------+
+</codeblock>
+
+    </conbody>
+  </concept>
+
+  <concept id="comparison_operators">
+
+    <title>Comparison Operators</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">comparison operators</indexterm>
+        Impala supports the familiar comparison operators for checking 
equality and sort order for the column data
+        types:
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock><varname>left_hand_expression</varname> 
<varname>comparison_operator</varname> 
<varname>right_hand_expression</varname></codeblock>
+
+      <ul>
+        <li>
+          <codeph>=</codeph>, <codeph>!=</codeph>, <codeph>&lt;&gt;</codeph>: 
apply to all types.
+        </li>
+
+        <li>
+          <codeph>&lt;</codeph>, <codeph>&lt;=</codeph>, 
<codeph>&gt;</codeph>, <codeph>&gt;=</codeph>: apply to
+          all types; for <codeph>BOOLEAN</codeph>, <codeph>TRUE</codeph> is 
considered greater than
+          <codeph>FALSE</codeph>.
+        </li>
+      </ul>
+
+      <p>
+        <b>Alternatives:</b>
+      </p>
+
+      <p>
+        The <codeph>IN</codeph> and <codeph>BETWEEN</codeph> operators provide 
shorthand notation for expressing
+        combinations of equality, less than, and greater than comparisons with 
a single operator.
+      </p>
+
+      <p>
+        Because comparing any value to <codeph>NULL</codeph> produces 
<codeph>NULL</codeph> rather than
+        <codeph>TRUE</codeph> or <codeph>FALSE</codeph>, use the <codeph>IS 
NULL</codeph> and <codeph>IS NOT
+        NULL</codeph> operators to check if a value is <codeph>NULL</codeph> 
or not.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+      <p 
conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/>
+
+      <p rev="2.3.0">
+        The following example shows how to do an arithmetic operation using a 
numeric field of a <codeph>STRUCT</codeph> type
+        that is an item within an <codeph>ARRAY</codeph> column. Once the 
scalar numeric value <codeph>R_NATIONKEY</codeph>
+        is extracted, it can be used with a comparison operator such as 
<codeph>&lt;</codeph>:
+      </p>
+
+<codeblock rev="2.3.0">
+-- The SMALLINT is a field within an array of structs.
+describe region;
++-------------+-------------------------+---------+
+| name        | type                    | comment |
++-------------+-------------------------+---------+
+| r_regionkey | smallint                |         |
+| r_name      | string                  |         |
+| r_comment   | string                  |         |
+| r_nations   | array&lt;struct&lt;           |         |
+|             |   n_nationkey:smallint, |         |
+|             |   n_name:string,        |         |
+|             |   n_comment:string      |         |
+|             | &gt;&gt;                      |         |
++-------------+-------------------------+---------+
+
+-- When we refer to the scalar value using dot notation,
+-- we can use arithmetic and comparison operators on it
+-- like any other number.
+select r_name, nation.item.n_name, nation.item.n_nationkey
+from region, region.r_nations as nation
+where nation.item.n_nationkey &lt; 5
++-------------+-------------+------------------+
+| r_name      | item.n_name | item.n_nationkey |
++-------------+-------------+------------------+
+| AMERICA     | CANADA      | 3                |
+| AMERICA     | BRAZIL      | 2                |
+| AMERICA     | ARGENTINA   | 1                |
+| MIDDLE EAST | EGYPT       | 4                |
+| AFRICA      | ALGERIA     | 0                |
++-------------+-------------+------------------+
+</codeblock>
+
+    </conbody>
+  </concept>
+
+  <concept audience="Cloudera" rev="2.1.0" id="except">
+
+    <title>EXCEPT Operator</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">EXCEPT operator</indexterm>
+      </p>
+    </conbody>
+  </concept>
+
+  <concept rev="2.0.0" id="exists">
+
+    <title>EXISTS Operator</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">EXISTS operator</indexterm>
+        <indexterm audience="Cloudera">NOT EXISTS operator</indexterm>
+        The <codeph>EXISTS</codeph> operator tests whether a subquery returns 
any results.
+        You typically use it to find values from one table that have 
corresponding values in another table.
+      </p>
+
+      <p>
+        The converse, <codeph>NOT EXISTS</codeph>, helps to find all the 
values from one table that do not have any
+        corresponding values in another table.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>EXISTS (<varname>subquery</varname>)
+NOT EXISTS (<varname>subquery</varname>)
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        The subquery can refer to a different table than the outer query 
block, or the same table. For example, you
+        might use <codeph>EXISTS</codeph> or <codeph>NOT EXISTS</codeph> to 
check the existence of parent/child
+        relationships between two columns of the same table.
+      </p>
+
+      <p>
+        You can also use operators and function calls within the subquery to 
test for other kinds of relationships
+        other than strict equality. For example, you might use a call to 
<codeph>COUNT()</codeph> in the subquery
+        to check whether the number of matching values is higher or lower than 
some limit. You might call a UDF in
+        the subquery to check whether values in one table matches a hashed 
representation of those same values in a
+        different table.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/null_blurb"/>
+
+      <p>
+        If the subquery returns any value at all (even <codeph>NULL</codeph>), 
<codeph>EXISTS</codeph> returns
+        <codeph>TRUE</codeph> and <codeph>NOT EXISTS</codeph> returns false.
+      </p>
+
+      <p>
+        The following example shows how even when the subquery returns only 
<codeph>NULL</codeph> values,
+        <codeph>EXISTS</codeph> still returns <codeph>TRUE</codeph> and thus 
matches all the rows from the table in
+        the outer query block.
+      </p>
+
+<codeblock>[localhost:21000] &gt; create table all_nulls (x int);
+[localhost:21000] &gt; insert into all_nulls values (null), (null), (null);
+[localhost:21000] &gt; select y from t2 where exists (select x from all_nulls);
++---+
+| y |
++---+
+| 2 |
+| 4 |
+| 6 |
++---+
+</codeblock>
+
+      <p>
+        However, if the table in the subquery is empty and so the subquery 
returns an empty result set,
+        <codeph>EXISTS</codeph> returns <codeph>FALSE</codeph>:
+      </p>
+
+<codeblock>[localhost:21000] &gt; create table empty (x int);
+[localhost:21000] &gt; select y from t2 where exists (select x from empty);
+[localhost:21000] &gt;
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+      <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
+
+      <p conref="../shared/impala_common.xml#common/subquery_no_limit"/>
+
+      <p>
+        The <codeph>NOT EXISTS</codeph> operator requires a correlated 
subquery.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+      <p 
conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/>
+
+<!-- To do: construct an EXISTS / NOT EXISTS example for complex types. -->
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+<!-- Maybe turn this into a conref if the same set of tables gets used for 
subqueries, EXISTS, other places. -->
+<!-- Yes, the material was reused under Subqueries for anti-joins. -->
+        The following examples refer to these simple tables containing small 
sets of integers or strings:
+<codeblock>[localhost:21000] &gt; create table t1 (x int);
+[localhost:21000] &gt; insert into t1 values (1), (2), (3), (4), (5), (6);
+
+[localhost:21000] &gt; create table t2 (y int);
+[localhost:21000] &gt; insert into t2 values (2), (4), (6);
+
+[localhost:21000] &gt; create table t3 (z int);
+[localhost:21000] &gt; insert into t3 values (1), (3), (5);
+
+[localhost:21000] &gt; create table month_names (m string);
+[localhost:21000] &gt; insert into month_names values
+                  &gt; ('January'), ('February'), ('March'),
+                  &gt; ('April'), ('May'), ('June'), ('July'),
+                  &gt; ('August'), ('September'), ('October'),
+                  &gt; ('November'), ('December');
+</codeblock>
+      </p>
+
+      <p>
+        The following example shows a correlated subquery that finds all the 
values in one table that exist in
+        another table. For each value <codeph>X</codeph> from 
<codeph>T1</codeph>, the query checks if the
+        <codeph>Y</codeph> column of <codeph>T2</codeph> contains an identical 
value, and the
+        <codeph>EXISTS</codeph> operator returns <codeph>TRUE</codeph> or 
<codeph>FALSE</codeph> as appropriate in
+        each case.
+      </p>
+
+<codeblock>localhost:21000] &gt; select x from t1 where exists (select y from 
t2 where t1.x = y);
++---+
+| x |
++---+
+| 2 |
+| 4 |
+| 6 |
++---+
+</codeblock>
+
+      <p>
+        An uncorrelated query is less interesting in this case. Because the 
subquery always returns
+        <codeph>TRUE</codeph>, all rows from <codeph>T1</codeph> are returned. 
If the table contents where changed
+        so that the subquery did not match any rows, none of the rows from 
<codeph>T1</codeph> would be returned.
+      </p>
+
+<codeblock>[localhost:21000] &gt; select x from t1 where exists (select y from 
t2 where y &gt; 5);
++---+
+| x |
++---+
+| 1 |
+| 2 |
+| 3 |
+| 4 |
+| 5 |
+| 6 |
++---+
+</codeblock>
+
+      <p>
+        The following example shows how an uncorrelated subquery can test for 
the existence of some condition
+        within a table. By using <codeph>LIMIT 1</codeph> or an aggregate 
function, the query returns a single
+        result or no result based on whether the subquery matches any rows. 
Here, we know that <codeph>T1</codeph>
+        and <codeph>T2</codeph> contain some even numbers, but 
<codeph>T3</codeph> does not.
+      </p>
+
+<codeblock>[localhost:21000] &gt; select "contains an even number" from t1 
where exists (select x from t1 where x % 2 = 0) limit 1;
++---------------------------+
+| 'contains an even number' |
++---------------------------+
+| contains an even number   |
++---------------------------+
+[localhost:21000] &gt; select "contains an even number" as assertion from t1 
where exists (select x from t1 where x % 2 = 0) limit 1;
++-------------------------+
+| assertion               |
++-------------------------+
+| contains an even number |
++-------------------------+
+[localhost:21000] &gt; select "contains an even number" as assertion from t2 
where exists (select x from t2 where y % 2 = 0) limit 1;
+ERROR: AnalysisException: couldn't resolve column reference: 'x'
+[localhost:21000] &gt; select "contains an even number" as assertion from t2 
where exists (select y from t2 where y % 2 = 0) limit 1;
++-------------------------+
+| assertion               |
++-------------------------+
+| contains an even number |
++-------------------------+
+[localhost:21000] &gt; select "contains an even number" as assertion from t3 
where exists (select z from t3 where z % 2 = 0) limit 1;
+[localhost:21000] &gt;
+</codeblock>
+
+      <p>
+        The following example finds numbers in one table that are 1 greater 
than numbers from another table. The
+        <codeph>EXISTS</codeph> notation is simpler than an equivalent 
<codeph>CROSS JOIN</codeph> between the
+        tables. (The example then also illustrates how the same test could be 
performed using an
+        <codeph>IN</codeph> operator.)
+      </p>
+
+<codeblock>[localhost:21000] &gt; select x from t1 where exists (select y from 
t2 where x = y + 1);
++---+
+| x |
++---+
+| 3 |
+| 5 |
++---+
+[localhost:21000] &gt; select x from t1 where x in (select y + 1 from t2);
++---+
+| x |
++---+
+| 3 |
+| 5 |
++---+
+</codeblock>
+
+      <p>
+        The following example finds values from one table that do not exist in 
another table.
+      </p>
+
+<codeblock>[localhost:21000] &gt; select x from t1 where not exists (select y 
from t2 where x = y);
++---+
+| x |
++---+
+| 1 |
+| 3 |
+| 5 |
++---+
+</codeblock>
+
+      <p>
+        The following example uses the <codeph>NOT EXISTS</codeph> operator to 
find all the leaf nodes in
+        tree-structured data. This simplified <q>tree of life</q> has multiple 
levels (class, order, family, and so
+        on), with each item pointing upward through a <codeph>PARENT</codeph> 
pointer. The example runs an outer
+        query and a subquery on the same table, returning only those items 
whose <codeph>ID</codeph> value is
+        <i>not</i> referenced by the <codeph>PARENT</codeph> of any other item.
+      </p>
+
+<codeblock>[localhost:21000] &gt; create table tree (id int, parent int, name 
string);
+[localhost:21000] &gt; insert overwrite tree values
+                  &gt; (0, null, "animals"),
+                  &gt; (1, 0, "placentals"),
+                  &gt; (2, 0, "marsupials"),
+                  &gt; (3, 1, "bats"),
+                  &gt; (4, 1, "cats"),
+                  &gt; (5, 2, "kangaroos"),
+                  &gt; (6, 4, "lions"),
+                  &gt; (7, 4, "tigers"),
+                  &gt; (8, 5, "red kangaroo"),
+                  &gt; (9, 2, "wallabies");
+[localhost:21000] &gt; select name as "leaf node" from tree one
+                  &gt; where not exists (select parent from tree two where 
one.id = two.parent);
++--------------+
+| leaf node    |
++--------------+
+| bats         |
+| lions        |
+| tigers       |
+| red kangaroo |
+| wallabies    |
++--------------+
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/related_info"/>
+
+      <p>
+        <xref href="impala_subqueries.xml#subqueries"/>
+      </p>
+    </conbody>
+  </concept>
+
+  <concept id="in">
+
+    <title>IN Operator</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">IN operator</indexterm>
+        <indexterm audience="Cloudera">NOT IN operator</indexterm>
+        The <codeph>IN</codeph> operator compares an argument value to a set 
of values, and returns
+        <codeph>TRUE</codeph> if the argument matches any value in the set. 
The <codeph>NOT IN</codeph> operator
+        reverses the comparison, and checks if the argument value is not part 
of a set of values.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock rev="2.0.0"><varname>expression</varname> IN 
(<varname>expression</varname> [, <varname>expression</varname>])
+<varname>expression</varname> IN (<varname>subquery</varname>)
+
+<varname>expression</varname> NOT IN (<varname>expression</varname> [, 
<varname>expression</varname>])
+<varname>expression</varname> NOT IN (<varname>subquery</varname>)
+</codeblock>
+
+      <p>
+        The left-hand expression and the set of comparison values must be of 
compatible types.
+      </p>
+
+      <p>
+        The left-hand expression must consist only of a single value, not a 
tuple. Although the left-hand
+        expression is typically a column name, it could also be some other 
value. For example, the
+        <codeph>WHERE</codeph> clauses <codeph>WHERE id IN (5)</codeph> and 
<codeph>WHERE 5 IN (id)</codeph>
+        produce the same results.
+      </p>
+
+      <p>
+        The set of values to check against can be specified as constants, 
function calls, column names, or other
+        expressions in the query text. When the values are listed explicitly, 
the maximum number of expressions is
+        10,000.
+      </p>
+
+      <p rev="2.0.0">
+        In Impala 2.0 and higher, the set of values can also be generated by a 
subquery. <codeph>IN</codeph> can
+        evaluate an unlimited number of results using a subquery.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        Any expression using the <codeph>IN</codeph> operator could be 
rewritten as a series of equality tests
+        connected with <codeph>OR</codeph>, but the <codeph>IN</codeph> syntax 
is often clearer, more concise, and
+        easier for Impala to optimize. For example, with partitioned tables, 
queries frequently use
+        <codeph>IN</codeph> clauses to filter data by comparing the partition 
key columns to specific values.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/null_blurb"/>
+
+      <p>
+        If there really is a matching non-null value, <codeph>IN</codeph> 
returns <codeph>TRUE</codeph>:
+      </p>
+
+<codeblock>[localhost:21000] &gt; select 1 in (1,null,2,3);
++----------------------+
+| 1 in (1, null, 2, 3) |
++----------------------+
+| true                 |
++----------------------+
+[localhost:21000] &gt; select 1 not in (1,null,2,3);
++--------------------------+
+| 1 not in (1, null, 2, 3) |
++--------------------------+
+| false                    |
++--------------------------+
+</codeblock>
+
+      <p>
+        If the searched value is not found in the comparison values, and the 
comparison values include
+        <codeph>NULL</codeph>, the result is <codeph>NULL</codeph>:
+      </p>
+
+<codeblock>[localhost:21000] &gt; select 5 in (1,null,2,3);
++----------------------+
+| 5 in (1, null, 2, 3) |
++----------------------+
+| NULL                 |
++----------------------+
+[localhost:21000] &gt; select 5 not in (1,null,2,3);
++--------------------------+
+| 5 not in (1, null, 2, 3) |
++--------------------------+
+| NULL                     |
++--------------------------+
+[localhost:21000] &gt; select 1 in (null);
++-------------+
+| 1 in (null) |
++-------------+
+| NULL        |
++-------------+
+[localhost:21000] &gt; select 1 not in (null);
++-----------------+
+| 1 not in (null) |
++-----------------+
+| NULL            |
++-----------------+
+</codeblock>
+
+      <p>
+        If the left-hand argument is <codeph>NULL</codeph>, 
<codeph>IN</codeph> always returns
+        <codeph>NULL</codeph>. This rule applies even if the comparison values 
include <codeph>NULL</codeph>.
+      </p>
+
+<codeblock>[localhost:21000] &gt; select null in (1,2,3);
++-------------------+
+| null in (1, 2, 3) |
++-------------------+
+| NULL              |
++-------------------+
+[localhost:21000] &gt; select null not in (1,2,3);
++-----------------------+
+| null not in (1, 2, 3) |
++-----------------------+
+| NULL                  |
++-----------------------+
+[localhost:21000] &gt; select null in (null);
++----------------+
+| null in (null) |
++----------------+
+| NULL           |
++----------------+
+[localhost:21000] &gt; select null not in (null);
++--------------------+
+| null not in (null) |
++--------------------+
+| NULL               |
++--------------------+
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/enhanced_in_20"/>
+
+      <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+      <p 
conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/>
+
+      <p rev="2.3.0">
+        The following example shows how to do an arithmetic operation using a 
numeric field of a <codeph>STRUCT</codeph> type
+        that is an item within an <codeph>ARRAY</codeph> column. Once the 
scalar numeric value <codeph>R_NATIONKEY</codeph>
+        is extracted, it can be used in an arithmetic expression, such as 
multiplying by 10:
+      </p>
+
+<codeblock rev="2.3.0">
+-- The SMALLINT is a field within an array of structs.
+describe region;
++-------------+-------------------------+---------+
+| name        | type                    | comment |
++-------------+-------------------------+---------+
+| r_regionkey | smallint                |         |
+| r_name      | string                  |         |
+| r_comment   | string                  |         |
+| r_nations   | array&lt;struct&lt;           |         |
+|             |   n_nationkey:smallint, |         |
+|             |   n_name:string,        |         |
+|             |   n_comment:string      |         |
+|             | &gt;&gt;                      |         |
++-------------+-------------------------+---------+
+
+-- When we refer to the scalar value using dot notation,
+-- we can use arithmetic and comparison operators on it
+-- like any other number.
+select r_name, nation.item.n_name, nation.item.n_nationkey
+from region, region.r_nations as nation
+where nation.item.n_nationkey in (1,3,5)
++---------+-------------+------------------+
+| r_name  | item.n_name | item.n_nationkey |
++---------+-------------+------------------+
+| AMERICA | CANADA      | 3                |
+| AMERICA | ARGENTINA   | 1                |
+| AFRICA  | ETHIOPIA    | 5                |
++---------+-------------+------------------+
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
+
+      <p conref="../shared/impala_common.xml#common/subquery_no_limit"/>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+<codeblock>-- Using IN is concise and self-documenting.
+SELECT * FROM t1 WHERE c1 IN (1,2,10);
+-- Equivalent to series of = comparisons ORed together.
+SELECT * FROM t1 WHERE c1 = 1 OR c1 = 2 OR c1 = 10;
+
+SELECT c1 AS "starts with vowel" FROM t2 WHERE upper(substr(c1,1,1)) IN 
('A','E','I','O','U');
+
+SELECT COUNT(DISTINCT(visitor_id)) FROM web_traffic WHERE month IN 
('January','June','July');</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/related_info"/>
+
+      <p>
+        <xref href="impala_subqueries.xml#subqueries"/>
+      </p>
+    </conbody>
+  </concept>
+
+  <concept audience="Cloudera" rev="2.1.0" id="intersect">
+
+    <title>INTERSECT Operator</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">INTERSECT operator</indexterm>
+      </p>
+    </conbody>
+  </concept>
+
+  <concept id="is_null">
+
+    <title>IS NULL Operator</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">IS NULL operator</indexterm>
+        <indexterm audience="Cloudera">IS NOT NULL operator</indexterm>
+        The <codeph>IS NULL</codeph> operator, and its converse the <codeph>IS 
NOT NULL</codeph> operator, test
+        whether a specified value is <codeph><xref 
href="impala_literals.xml#null">NULL</xref></codeph>. Because
+        using <codeph>NULL</codeph> with any of the other comparison operators 
such as <codeph>=</codeph> or
+        <codeph>!=</codeph> also returns <codeph>NULL</codeph> rather than 
<codeph>TRUE</codeph> or
+        <codeph>FALSE</codeph>, you use a special-purpose comparison operator 
to check for this special condition.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock><varname>expression</varname> IS NULL
+<varname>expression</varname> IS NOT NULL
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        In many cases, <codeph>NULL</codeph> values indicate some incorrect or 
incomplete processing during data
+        ingestion or conversion. You might check whether any values in a 
column are <codeph>NULL</codeph>, and if
+        so take some followup action to fill them in.
+      </p>
+
+      <p>
+        With sparse data, often represented in <q>wide</q> tables, it is 
common for most values to be
+        <codeph>NULL</codeph> with only an occasional 
non-<codeph>NULL</codeph> value. In those cases, you can use
+        the <codeph>IS NOT NULL</codeph> operator to identify the rows 
containing any data at all for a particular
+        column, regardless of the actual value.
+      </p>
+
+      <p>
+        With a well-designed database schema, effective use of 
<codeph>NULL</codeph> values and <codeph>IS
+        NULL</codeph> and <codeph>IS NOT NULL</codeph> operators can save 
having to design custom logic around
+        special values such as 0, -1, <codeph>'N/A'</codeph>, empty string, 
and so on. <codeph>NULL</codeph> lets
+        you distinguish between a value that is known to be 0, false, or 
empty, and a truly unknown value.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+<codeblock>-- If this value is non-zero, something is wrong.
+select count(*) from employees where employee_id is null;
+
+-- With data from disparate sources, some fields might be blank.
+-- Not necessarily an error condition.
+select count(*) from census where household_income is null;
+
+-- Sometimes we expect fields to be null, and followup action
+-- is needed when they are not.
+select count(*) from web_traffic where weird_http_code is not null;</codeblock>
+    </conbody>
+  </concept>
+
+  <concept id="like">
+
+    <title>LIKE Operator</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">LIKE operator</indexterm>
+        A comparison operator for <codeph>STRING</codeph> data, with basic 
wildcard capability using
+        <codeph>_</codeph> to match a single character and <codeph>%</codeph> 
to match multiple characters. The
+        argument expression must match the entire string value. Typically, it 
is more efficient to put any
+        <codeph>%</codeph> wildcard match at the end of the string.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock><varname>string_expression</varname> LIKE 
<varname>wildcard_expression</varname>
+<varname>string_expression</varname> NOT LIKE 
<varname>wildcard_expression</varname>
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+      <p 
conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/>
+
+<!-- To do: construct a LIKE example for complex types. -->
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+<codeblock>select distinct c_last_name from customer where c_last_name like 
'Mc%' or c_last_name like 'Mac%';
+select count(c_last_name) from customer where c_last_name like 'M%';
+select c_email_address from customer where c_email_address like '%.edu';
+
+-- We can find 4-letter names beginning with 'M' by calling functions...
+select distinct c_last_name from customer where length(c_last_name) = 4 and 
substr(c_last_name,1,1) = 'M';
+-- ...or in a more readable way by matching M followed by exactly 3 characters.
+select distinct c_last_name from customer where c_last_name like 
'M___';</codeblock>
+
+      <p>
+        For a more general kind of search operator using regular expressions, 
see
+        <xref href="impala_operators.xml#regexp"/>.
+      </p>
+    </conbody>
+  </concept>
+
+  <concept id="logical_operators">
+
+    <title>Logical Operators</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">logical operators</indexterm>
+        Logical operators return a <codeph>BOOLEAN</codeph> value, based on a 
binary or unary logical operation
+        between arguments that are also Booleans. Typically, the argument 
expressions use
+        <xref href="impala_operators.xml#comparison_operators">comparison 
operators</xref>.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock><varname>boolean_expression</varname> 
<varname>binary_logical_operator</varname> <varname>boolean_expression</varname>
+<varname>unary_logical_operator</varname> <varname>boolean_expression</varname>
+</codeblock>
+
+      <p>
+        The Impala logical operators are:
+      </p>
+
+      <ul>
+        <li>
+          <codeph>AND</codeph>: A binary operator that returns 
<codeph>true</codeph> if its left-hand and
+          right-hand arguments both evaluate to <codeph>true</codeph>, 
<codeph>NULL</codeph> if either argument is
+          <codeph>NULL</codeph>, and <codeph>false</codeph> otherwise.
+        </li>
+
+        <li>
+          <codeph>OR</codeph>: A binary operator that returns 
<codeph>true</codeph> if either of its left-hand and
+          right-hand arguments evaluate to <codeph>true</codeph>, 
<codeph>NULL</codeph> if one argument is
+          <codeph>NULL</codeph> and the other is either <codeph>NULL</codeph> 
or <codeph>false</codeph>, and
+          <codeph>false</codeph> otherwise.
+        </li>
+
+        <li>
+          <codeph>NOT</codeph>: A unary operator that flips the state of a 
Boolean expression from
+          <codeph>true</codeph> to <codeph>false</codeph>, or 
<codeph>false</codeph> to <codeph>true</codeph>. If
+          the argument expression is <codeph>NULL</codeph>, the result remains 
<codeph>NULL</codeph>. (When
+          <codeph>NOT</codeph> is used this way as a unary logical operator, 
it works differently than the
+          <codeph>IS NOT NULL</codeph> comparison operator, which returns 
<codeph>true</codeph> when applied to a
+          <codeph>NULL</codeph>.)
+        </li>
+      </ul>
+
+      <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+      <p 
conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/>
+
+      <p rev="2.3.0">
+        The following example shows how to do an arithmetic operation using a 
numeric field of a <codeph>STRUCT</codeph> type
+        that is an item within an <codeph>ARRAY</codeph> column. Once the 
scalar numeric value <codeph>R_NATIONKEY</codeph>
+        is extracted, it can be used in an arithmetic expression, such as 
multiplying by 10:
+      </p>
+
+<codeblock rev="2.3.0">
+-- The SMALLINT is a field within an array of structs.
+describe region;
++-------------+-------------------------+---------+
+| name        | type                    | comment |
++-------------+-------------------------+---------+
+| r_regionkey | smallint                |         |
+| r_name      | string                  |         |
+| r_comment   | string                  |         |
+| r_nations   | array&lt;struct&lt;           |         |
+|             |   n_nationkey:smallint, |         |
+|             |   n_name:string,        |         |
+|             |   n_comment:string      |         |
+|             | &gt;&gt;                      |         |
++-------------+-------------------------+---------+
+
+-- When we refer to the scalar value using dot notation,
+-- we can use arithmetic and comparison operators on it
+-- like any other number.
+select r_name, nation.item.n_name, nation.item.n_nationkey
+  from region, region.r_nations as nation
+where
+  nation.item.n_nationkey between 3 and 5
+  or nation.item.n_nationkey &lt; 15;
++-------------+----------------+------------------+
+| r_name      | item.n_name    | item.n_nationkey |
++-------------+----------------+------------------+
+| EUROPE      | UNITED KINGDOM | 23               |
+| EUROPE      | RUSSIA         | 22               |
+| EUROPE      | ROMANIA        | 19               |
+| ASIA        | VIETNAM        | 21               |
+| ASIA        | CHINA          | 18               |
+| AMERICA     | UNITED STATES  | 24               |
+| AMERICA     | PERU           | 17               |
+| AMERICA     | CANADA         | 3                |
+| MIDDLE EAST | SAUDI ARABIA   | 20               |
+| MIDDLE EAST | EGYPT          | 4                |
+| AFRICA      | MOZAMBIQUE     | 16               |
+| AFRICA      | ETHIOPIA       | 5                |
++-------------+----------------+------------------+
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        These examples demonstrate the <codeph>AND</codeph> operator:
+      </p>
+
+<codeblock>[localhost:21000] &gt; select true and true;
++---------------+
+| true and true |
++---------------+
+| true          |
++---------------+
+[localhost:21000] &gt; select true and false;
++----------------+
+| true and false |
++----------------+
+| false          |
++----------------+
+[localhost:21000] &gt; select false and false;
++-----------------+
+| false and false |
++-----------------+
+| false           |
++-----------------+
+[localhost:21000] &gt; select true and null;
++---------------+
+| true and null |
++---------------+
+| NULL          |
++---------------+
+[localhost:21000] &gt; select (10 &gt; 2) and (6 != 9);
++-----------------------+
+| (10 &gt; 2) and (6 != 9) |
++-----------------------+
+| true                  |
++-----------------------+
+</codeblock>
+
+      <p>
+        These examples demonstrate the <codeph>OR</codeph> operator:
+      </p>
+
+<codeblock>[localhost:21000] &gt; select true or true;
++--------------+
+| true or true |
++--------------+
+| true         |
++--------------+
+[localhost:21000] &gt; select true or false;
++---------------+
+| true or false |
++---------------+
+| true          |
++---------------+
+[localhost:21000] &gt; select false or false;
++----------------+
+| false or false |
++----------------+
+| false          |
++----------------+
+[localhost:21000] &gt; select true or null;
++--------------+
+| true or null |
++--------------+
+| true         |
++--------------+
+[localhost:21000] &gt; select null or true;
++--------------+
+| null or true |
++--------------+
+| true         |
++--------------+
+[localhost:21000] &gt; select false or null;
++---------------+
+| false or null |
++---------------+
+| NULL          |
++---------------+
+[localhost:21000] &gt; select (1 = 1) or ('hello' = 'world');
++--------------------------------+
+| (1 = 1) or ('hello' = 'world') |
++--------------------------------+
+| true                           |
++--------------------------------+
+[localhost:21000] &gt; select (2 + 2 != 4) or (-1 &gt; 0);
++--------------------------+
+| (2 + 2 != 4) or (-1 &gt; 0) |
++--------------------------+
+| false                    |
++--------------------------+
+</codeblock>
+
+      <p>
+        These examples demonstrate the <codeph>NOT</codeph> operator:
+      </p>
+
+<codeblock>[localhost:21000] &gt; select not true;
++----------+
+| not true |
++----------+
+| false    |
++----------+
+[localhost:21000] &gt; select not false;
++-----------+
+| not false |
++-----------+
+| true      |
++-----------+
+[localhost:21000] &gt; select not null;
++----------+
+| not null |
++----------+
+| NULL     |
++----------+
+[localhost:21000] &gt; select not (1=1);
++-------------+
+| not (1 = 1) |
++-------------+
+| false       |
++-------------+
+</codeblock>
+    </conbody>
+  </concept>
+
+  <concept id="regexp">
+
+    <title>REGEXP Operator</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">REGEXP operator</indexterm>
+        Tests whether a value matches a regular expression. Uses the POSIX 
regular expression syntax where
+        <codeph>^</codeph> and <codeph>$</codeph> match the beginning and end 
of the string, <codeph>.</codeph>
+        represents any single character, <codeph>*</codeph> represents a 
sequence of zero or more items,
+        <codeph>+</codeph> represents a sequence of one or more items, 
<codeph>?</codeph> produces a non-greedy
+        match, and so on.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+<codeblock><varname>string_expression</varname> REGEXP 
<varname>regular_expression</varname>
+</codeblock>
+
+      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+      <p>
+        The regular expression must match the entire value, not just occur 
somewhere inside it. Use
+        <codeph>.*</codeph> at the beginning and/or the end if you only need 
to match characters anywhere in the
+        middle. Thus, the <codeph>^</codeph> and <codeph>$</codeph> atoms are 
often redundant, although you might
+        already have them in your expression strings that you reuse from 
elsewhere.
+      </p>
+
+      <p>
+        The <codeph>RLIKE</codeph> operator is a synonym for 
<codeph>REGEXP</codeph>.
+      </p>
+
+      <p>
+        The <codeph>|</codeph> symbol is the alternation operator, typically 
used within <codeph>()</codeph> to
+        match different sequences. The <codeph>()</codeph> groups do not allow 
backreferences. To retrieve the part
+        of a value matched within a <codeph>()</codeph> section, use the
+        <codeph><xref 
href="impala_string_functions.xml#string_functions/regexp_extract">regexp_extract()</xref></codeph>
+        built-in function.
+      </p>
+
+      <note rev="1.3.1">
+        <p conref="../shared/impala_common.xml#common/regexp_matching"/>
+      </note>
+
+      <p conref="../shared/impala_common.xml#common/regexp_re2"/>
+
+      <p conref="../shared/impala_common.xml#common/regexp_re2_warning"/>
+
+      <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+      <p 
conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/>
+
+<!-- To do: construct a REGEXP example for complex types. -->
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        The following examples demonstrate the identical syntax for the 
<codeph>REGEXP</codeph> and
+        <codeph>RLIKE</codeph> operators.
+      </p>
+
+<!-- Same examples shown for both REGEXP and RLIKE operators. -->
+
+<codeblock conref="../shared/impala_common.xml#common/regexp_rlike_examples"/>
+    </conbody>
+  </concept>
+
+  <concept id="rlike">
+
+    <title>RLIKE Operator</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">RLIKE operator</indexterm>
+        Synonym for the <codeph>REGEXP</codeph> operator. See <xref 
href="impala_operators.xml#regexp"/> for
+        details.
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+      <p>
+        The following examples demonstrate the identical syntax for the 
<codeph>REGEXP</codeph> and
+        <codeph>RLIKE</codeph> operators.
+      </p>
+
+<!-- Same examples shown for both REGEXP and RLIKE operators. -->
+
+<codeblock conref="../shared/impala_common.xml#common/regexp_rlike_examples"/>
+    </conbody>
+  </concept>
+</concept>


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_order_by.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_order_by.xml b/docs/topics/impala_order_by.xml
new file mode 100644
index 0000000..f3042e5
--- /dev/null
+++ b/docs/topics/impala_order_by.xml
@@ -0,0 +1,316 @@
+<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD 
DITA Concept//EN" "concept.dtd">
+<concept id="order_by">
+
+  <title>ORDER BY Clause</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Querying"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      The familiar <codeph>ORDER BY</codeph> clause of a 
<codeph>SELECT</codeph> statement sorts the result set
+      based on the values from one or more columns.
+    </p>
+
+    <p>
+      For distributed queries, this is a relatively expensive operation, 
because the entire result set must be
+      produced and transferred to one node before the sorting can happen. This 
can require more memory capacity
+      than a query without <codeph>ORDER BY</codeph>. Even if the query takes 
approximately the same time to finish
+      with or without the <codeph>ORDER BY</codeph> clause, subjectively it 
can appear slower because no results
+      are available until all processing is finished, rather than results 
coming back gradually as rows matching
+      the <codeph>WHERE</codeph> clause are found. Therefore, if you only need 
the first N results from the sorted
+      result set, also include the <codeph>LIMIT</codeph> clause, which 
reduces network overhead and the memory
+      requirement on the coordinator node.
+    </p>
+
+    <note>
+      <p rev="1.4.0 obwl">
+        In Impala 1.4.0 and higher, the <codeph>LIMIT</codeph> clause is now 
optional (rather than required) for
+        queries that use the <codeph>ORDER BY</codeph> clause. Impala 
automatically uses a temporary disk work area
+        to perform the sort if the sort operation would otherwise exceed the 
Impala memory limit for a particular
+        data node.
+      </p>
+    </note>
+
+    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+    <p>
+      The full syntax for the <codeph>ORDER BY</codeph> clause is:
+    </p>
+
+<codeblock rev="1.2.1">ORDER BY <varname>col_ref</varname> [, 
<varname>col_ref</varname> ...] [ASC | DESC] [NULLS FIRST | NULLS LAST]
+
+col_ref ::= <varname>column_name</varname> | <varname>integer_literal</varname>
+</codeblock>
+
+    <p>
+      Although the most common usage is <codeph>ORDER BY 
<varname>column_name</varname></codeph>, you can also
+      specify <codeph>ORDER BY 1</codeph> to sort by the first column of the 
result set, <codeph>ORDER BY
+      2</codeph> to sort by the second column, and so on. The number must be a 
numeric literal, not some other kind
+      of constant expression. (If the argument is some other expression, even 
a <codeph>STRING</codeph> value, the
+      query succeeds but the order of results is undefined.)
+    </p>
+
+    <p>
+      <codeph>ORDER BY <varname>column_number</varname></codeph> can only be 
used when the query explicitly lists
+      the columns in the <codeph>SELECT</codeph> list, not with <codeph>SELECT 
*</codeph> queries.
+    </p>
+
+    <p>
+      <b>Ascending and descending sorts:</b>
+    </p>
+
+    <p>
+      The default sort order (the same as using the <codeph>ASC</codeph> 
keyword) puts the smallest values at the
+      start of the result set, and the largest values at the end. Specifying 
the <codeph>DESC</codeph> keyword
+      reverses that order.
+    </p>
+
+    <p>
+      <b>Sort order for NULL values:</b>
+    </p>
+
+    <p rev="1.2.1">
+      See <xref href="impala_literals.xml#null"/> for details about how 
<codeph>NULL</codeph> values are positioned
+      in the sorted result set, and how to use the <codeph>NULLS 
FIRST</codeph> and <codeph>NULLS LAST</codeph>
+      clauses. (The sort position for <codeph>NULL</codeph> values in 
<codeph>ORDER BY ... DESC</codeph> queries is
+      changed in Impala 1.2.1 and higher to be more standards-compliant, and 
the <codeph>NULLS FIRST</codeph> and
+      <codeph>NULLS LAST</codeph> keywords are new in Impala 1.2.1.)
+    </p>
+
+    <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_limit"/>
+
+    <!-- Good to show an example of cases where ORDER BY does and doesn't work 
with complex types. -->
+    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+    <p rev="2.3.0">
+      In CDH 5.5 / Impala 2.3 and higher, the complex data types 
<codeph>STRUCT</codeph>,
+      <codeph>ARRAY</codeph>, and <codeph>MAP</codeph> are available. These 
columns cannot
+      be referenced directly in the <codeph>ORDER BY</codeph> clause.
+      When you query a complex type column, you use join notation to 
<q>unpack</q> the elements
+      of the complex type, and within the join query you can include an 
<codeph>ORDER BY</codeph>
+      clause to control the order in the result set of the scalar elements 
from the complex type.
+      See <xref href="impala_complex_types.xml#complex_types"/> for details 
about Impala support for complex types.
+    </p>
+
+    <p>
+      The following query shows how a complex type column cannot be directly 
used in an <codeph>ORDER BY</codeph> clause:
+    </p>
+
+<codeblock>CREATE TABLE games (id BIGINT, score ARRAY &lt;BIGINT&gt;) STORED 
AS PARQUET;
+...use LOAD DATA to load externally created Parquet files into the table...
+SELECT id FROM games ORDER BY score DESC;
+ERROR: AnalysisException: ORDER BY expression 'score' with complex type 
'ARRAY&lt;BIGINT&gt;' is not supported.
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+    <p>
+      The following query retrieves the user ID and score, only for scores 
greater than one million,
+      with the highest scores for each user listed first.
+      Because the individual array elements are now represented as separate 
rows in the result set,
+      they can be used in the <codeph>ORDER BY</codeph> clause, referenced 
using the <codeph>ITEM</codeph>
+      pseudocolumn that represents each array element.
+    </p>
+
+<codeblock>SELECT id, item FROM games, games.score
+  WHERE item &gt; 1000000
+ORDER BY id, item desc;
+</codeblock>
+
+    <p>
+      The following queries use similar <codeph>ORDER BY</codeph> techniques 
with variations of the <codeph>GAMES</codeph>
+      table, where the complex type is an <codeph>ARRAY</codeph> containing 
<codeph>STRUCT</codeph> or <codeph>MAP</codeph>
+      elements to represent additional details about each game that was played.
+      For an array of structures, the fields of the structure are referenced 
as <codeph>ITEM.<varname>field_name</varname></codeph>.
+      For an array of maps, the keys and values within each array element are 
referenced as <codeph>ITEM.KEY</codeph>
+      and <codeph>ITEM.VALUE</codeph>.
+    </p>
+
+<codeblock>CREATE TABLE games2 (id BIGINT, play array &lt; struct 
&lt;game_name: string, score: BIGINT, high_score: boolean&gt; &gt;) STORED AS 
PARQUET
+...use LOAD DATA to load externally created Parquet files into the table...
+SELECT id, item.game_name, item.score FROM games2, games2.play
+  WHERE item.score &gt; 1000000
+ORDER BY id, item.score DESC;
+
+CREATE TABLE games3 (id BIGINT, play ARRAY &lt; MAP &lt;STRING, BIGINT&gt; 
&gt;) STORED AS PARQUET;  
+...use LOAD DATA to load externally created Parquet files into the table...
+SELECT id, info.key AS k, info.value AS v from games3, games3.play AS plays, 
games3.play.item AS info
+  WHERE info.KEY = 'score' AND info.VALUE &gt; 1000000
+ORDER BY id, info.value desc;
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+    <p>
+      Although the <codeph>LIMIT</codeph> clause is now optional on 
<codeph>ORDER BY</codeph> queries, if your
+      query only needs some number of rows that you can predict in advance, 
use the <codeph>LIMIT</codeph> clause
+      to reduce unnecessary processing. For example, if the query has a clause 
<codeph>LIMIT 10</codeph>, each data
+      node sorts its portion of the relevant result set and only returns 10 
rows to the coordinator node. The
+      coordinator node picks the 10 highest or lowest row values out of this 
small intermediate result set.
+    </p>
+
+    <p>
+      If an <codeph>ORDER BY</codeph> clause is applied to an early phase of 
query processing, such as a subquery
+      or a view definition, Impala ignores the <codeph>ORDER BY</codeph> 
clause. To get ordered results from a
+      subquery or view, apply an <codeph>ORDER BY</codeph> clause to the 
outermost or final <codeph>SELECT</codeph>
+      level.
+    </p>
+
+    <p>
+      <codeph>ORDER BY</codeph> is often used in combination with 
<codeph>LIMIT</codeph> to perform <q>top-N</q>
+      queries:
+    </p>
+
+<codeblock>SELECT user_id AS "Top 10 Visitors", SUM(page_views) FROM web_stats
+  GROUP BY page_views, user_id
+  ORDER BY SUM(page_views) DESC LIMIT 10;
+</codeblock>
+
+    <p>
+      <codeph>ORDER BY</codeph> is sometimes used in combination with 
<codeph>OFFSET</codeph> and
+      <codeph>LIMIT</codeph> to paginate query results, although it is 
relatively inefficient to issue multiple
+      queries like this against the large tables typically used with Impala:
+    </p>
+
+<codeblock>SELECT page_title AS "Page 1 of search results", page_url FROM 
search_content
+  WHERE LOWER(page_title) LIKE '%game%')
+  ORDER BY page_title LIMIT 10 OFFSET 0;
+SELECT page_title AS "Page 2 of search results", page_url FROM search_content
+  WHERE LOWER(page_title) LIKE '%game%')
+  ORDER BY page_title LIMIT 10 OFFSET 10;
+SELECT page_title AS "Page 3 of search results", page_url FROM search_content
+  WHERE LOWER(page_title) LIKE '%game%')
+  ORDER BY page_title LIMIT 10 OFFSET 20;
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/internals_blurb"/>
+
+    <p>
+      Impala sorts the intermediate results of an <codeph>ORDER BY</codeph> 
clause in memory whenever practical. In
+      a cluster of N data nodes, each node sorts roughly 1/Nth of the result 
set, the exact proportion varying
+      depending on how the data matching the query is distributed in HDFS.
+    </p>
+
+    <p>
+      If the size of the sorted intermediate result set on any data node would 
cause the query to exceed the Impala
+      memory limit, Impala sorts as much as practical in memory, then writes 
partially sorted data to disk. (This
+      technique is known in industry terminology as <q>external sorting</q> 
and <q>spilling to disk</q>.) As each
+      8 MB batch of data is written to disk, Impala frees the corresponding 
memory to sort a new 8 MB batch of
+      data. When all the data has been processed, a final merge sort operation 
is performed to correctly order the
+      in-memory and on-disk results as the result set is transmitted back to 
the coordinator node. When external
+      sorting becomes necessary, Impala requires approximately 60 MB of RAM at 
a minimum for the buffers needed to
+      read, write, and sort the intermediate results. If more RAM is available 
on the data node, Impala will use
+      the additional RAM to minimize the amount of disk I/O for sorting.
+    </p>
+
+    <p>
+      This external sort technique is used as appropriate on each data node 
(possibly including the coordinator
+      node) to sort the portion of the result set that is processed on that 
node. When the sorted intermediate
+      results are sent back to the coordinator node to produce the final 
result set, the coordinator node uses a
+      merge sort technique to produce a final sorted result set without using 
any extra resources on the
+      coordinator node.
+    </p>
+
+    <p rev="obwl">
+      <b>Configuration for disk usage:</b>
+    </p>
+
+    <p rev="obwl" 
conref="../shared/impala_common.xml#common/order_by_scratch_dir"/>
+
+<!-- Here is actually the more logical place to collect all those examples, 
move them from SELECT and cross-reference to here. -->
+
+<!--     <p rev="obwl" 
conref="/Content/impala_common_xi44078.xml#common/restrictions_blurb"/> -->
+
+    <p rev="obwl" 
conref="../shared/impala_common.xml#common/insert_sort_blurb"/>
+
+    <p rev="obwl" 
conref="../shared/impala_common.xml#common/order_by_view_restriction"/>
+
+    <p>
+      With the lifting of the requirement to include a <codeph>LIMIT</codeph> 
clause in every <codeph>ORDER
+      BY</codeph> query (in Impala 1.4 and higher):
+    </p>
+
+    <ul>
+      <li>
+        <p>
+          Now the use of scratch disk space raises the possibility of an 
<q>out of disk space</q> error on a
+          particular data node, as opposed to the previous possibility of an 
<q>out of memory</q> error. Make sure
+          to keep at least 1 GB free on the filesystem used for temporary 
sorting work.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          The query options
+          <xref 
href="impala_default_order_by_limit.xml#default_order_by_limit">DEFAULT_ORDER_BY_LIMIT</xref>
 and
+          <xref 
href="impala_abort_on_default_limit_exceeded.xml#abort_on_default_limit_exceeded">ABORT_ON_DEFAULT_LIMIT_EXCEEDED</xref>,
+          which formerly controlled the behavior of <codeph>ORDER BY</codeph> 
queries with no limit specified, are
+          now ignored.
+        </p>
+      </li>
+    </ul>
+
+    <p rev="obwl" 
conref="../shared/impala_common.xml#common/null_sorting_change"/>
+<codeblock>[localhost:21000] > create table numbers (x int);
+[localhost:21000] > insert into numbers values (1), (null), (2), (null), (3);
+[localhost:21000] > select x from numbers order by x nulls first;
++------+
+| x    |
++------+
+| NULL |
+| NULL |
+| 1    |
+| 2    |
+| 3    |
++------+
+[localhost:21000] > select x from numbers order by x desc nulls first;
++------+
+| x    |
++------+
+| NULL |
+| NULL |
+| 3    |
+| 2    |
+| 1    |
++------+
+[localhost:21000] > select x from numbers order by x nulls last;
++------+
+| x    |
++------+
+| 1    |
+| 2    |
+| 3    |
+| NULL |
+| NULL |
++------+
+[localhost:21000] > select x from numbers order by x desc nulls last;
++------+
+| x    |
++------+
+| 3    |
+| 2    |
+| 1    |
+| NULL |
+| NULL |
++------+
+</codeblock>
+
+    <p rev="obwl" conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p rev="obwl">
+      See <xref href="impala_select.xml#select"/> for further examples of 
queries with the <codeph>ORDER
+      BY</codeph> clause.
+    </p>
+
+    <p>
+      Analytic functions use the <codeph>ORDER BY</codeph> clause in a 
different context to define the sequence in
+      which rows are analyzed. See <xref 
href="impala_analytic_functions.xml#analytic_functions"/> for details.
+    </p>
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_parquet_compression_codec.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_parquet_compression_codec.xml 
b/docs/topics/impala_parquet_compression_codec.xml
new file mode 100644
index 0000000..d178a0d
--- /dev/null
+++ b/docs/topics/impala_parquet_compression_codec.xml
@@ -0,0 +1,25 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="parquet_compression_codec">
+
+  <title>PARQUET_COMPRESSION_CODEC Query Option</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Parquet"/>
+      <data name="Category" value="File Formats"/>
+      <data name="Category" value="Impala Query Options"/>
+      <data name="Category" value="Deprecated Features"/>
+      <data name="Category" value="Compression"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">PARQUET_COMPRESSION_CODEC query 
option</indexterm>
+      Deprecated. Use <codeph>COMPRESSION_CODEC</codeph> in Impala 2.0 and 
later. See
+      <xref href="impala_compression_codec.xml#compression_codec"/> for 
details.
+    </p>
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_parquet_file_size.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_parquet_file_size.xml 
b/docs/topics/impala_parquet_file_size.xml
new file mode 100644
index 0000000..396fa92
--- /dev/null
+++ b/docs/topics/impala_parquet_file_size.xml
@@ -0,0 +1,82 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="parquet_block_size" id="parquet_file_size">
+
+  <title>PARQUET_FILE_SIZE Query Option</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Parquet"/>
+      <data name="Category" value="File Formats"/>
+      <data name="Category" value="Impala Query Options"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">PARQUET_FILE_SIZE query option</indexterm>
+      Specifies the maximum size of each Parquet data file produced by Impala 
<codeph>INSERT</codeph> statements.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+    <p>
+      Specify the size in bytes, or with a trailing <codeph>m</codeph> or 
<codeph>g</codeph> character to indicate
+      megabytes or gigabytes. For example:
+    </p>
+
+<codeblock>-- 128 megabytes.
+set PARQUET_FILE_SIZE=134217728
+INSERT OVERWRITE parquet_table SELECT * FROM text_table;
+
+-- 512 megabytes.
+set PARQUET_FILE_SIZE=512m;
+INSERT OVERWRITE parquet_table SELECT * FROM text_table;
+
+-- 1 gigabyte.
+set PARQUET_FILE_SIZE=1g;
+INSERT OVERWRITE parquet_table SELECT * FROM text_table;
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+    <p>
+      With tables that are small or finely partitioned, the default Parquet 
block size (formerly 1 GB, now 256 MB
+      in Impala 2.0 and later) could be much larger than needed for each data 
file. For <codeph>INSERT</codeph>
+      operations into such tables, you can increase parallelism by specifying 
a smaller
+      <codeph>PARQUET_FILE_SIZE</codeph> value, resulting in more HDFS blocks 
that can be processed by different
+      nodes.
+<!-- Reducing the file size also reduces the memory required to buffer each 
block before writing it to disk. -->
+    </p>
+
+    <p>
+      <b>Type:</b> numeric, with optional unit specifier
+    </p>
+
+    <note type="important">
+    <p>
+      Currently, the maximum value for this setting is 1 gigabyte 
(<codeph>1g</codeph>).
+      Setting a value higher than 1 gigabyte could result in errors during
+      an <codeph>INSERT</codeph> operation.
+    </p>
+    </note>
+
+    <p>
+      <b>Default:</b> 0 (produces files with a target size of 256 MB; files 
might be larger for very wide tables)
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/isilon_blurb"/>
+    <p conref="../shared/impala_common.xml#common/isilon_block_size_caveat"/>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      For information about the Parquet file format, and how the number and 
size of data files affects query
+      performance, see <xref href="impala_parquet.xml#parquet"/>.
+    </p>
+
+<!-- Examples actually folded into Syntax earlier.   <p 
conref="../shared/impala_common.xml#common/example_blurb"/> -->
+
+  </conbody>
+</concept>

[07/22] incubator-impala git commit: First try at porting over the source files necessary for the Impala SQL Reference.

Reply via email to