http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_operators.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_operators.xml b/docs/topics/impala_operators.xml new file mode 100644 index 0000000..da3dab3 --- /dev/null +++ b/docs/topics/impala_operators.xml @@ -0,0 +1,1262 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="operators"> + + <title>SQL Operators</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">operators</indexterm> + SQL operators are a class of comparison functions that are widely used within the <codeph>WHERE</codeph> + clauses of <codeph>SELECT</codeph> statements. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept rev="1.4.0" id="arithmetic_operators"> + + <title>Arithmetic Operators</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">arithmetic operators</indexterm> + The arithmetic operators use expressions with a left-hand argument, the operator, and then (in most cases) + a right-hand argument. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock><varname>left_hand_arg</varname> <varname>binary_operator</varname> <varname>right_hand_arg</varname> +<varname>unary_operator</varname> <varname>single_arg</varname> +</codeblock> + + <ul> + <li> + <codeph>+</codeph> and <codeph>-</codeph>: Can be used either as unary or binary operators. + <ul> + <li> + <p> + With unary notation, such as <codeph>+5</codeph>, <codeph>-2.5</codeph>, or + <codeph>-<varname>col_name</varname></codeph>, they multiply their single numeric argument by + <codeph>+1</codeph> or <codeph>-1</codeph>. Therefore, unary <codeph>+</codeph> returns its + argument unchanged, while unary <codeph>-</codeph> flips the sign of its argument. Although you can + double up these operators in expressions such as <codeph>++5</codeph> (always positive) or + <codeph>-+2</codeph> or <codeph>+-2</codeph> (both always negative), you cannot double the unary + minus operator because <codeph>--</codeph> is interpreted as the start of a comment. (You can use a + double unary minus operator if you separate the <codeph>-</codeph> characters, for example with a + space or parentheses.) + </p> + </li> + + <li> + <p> + With binary notation, such as <codeph>2+2</codeph>, <codeph>5-2.5</codeph>, or + <codeph><varname>col1</varname> + <varname>col2</varname></codeph>, they add or subtract + respectively the right-hand argument to (or from) the left-hand argument. Both arguments must be of + numeric types. + </p> + </li> + </ul> + </li> + + <li> + <p> + <codeph>*</codeph> and <codeph>/</codeph>: Multiplication and division respectively. Both arguments + must be of numeric types. + </p> + <p> + When multiplying, the shorter argument is promoted if necessary (such as <codeph>SMALLINT</codeph> to + <codeph>INT</codeph> or <codeph>BIGINT</codeph>, or <codeph>FLOAT</codeph> to <codeph>DOUBLE</codeph>), + and then the result is promoted again to the next larger type. Thus, multiplying a + <codeph>TINYINT</codeph> and an <codeph>INT</codeph> produces a <codeph>BIGINT</codeph> result. + Multiplying a <codeph>FLOAT</codeph> and a <codeph>FLOAT</codeph> produces a <codeph>DOUBLE</codeph> + result. Multiplying a <codeph>FLOAT</codeph> and a <codeph>DOUBLE</codeph> or a <codeph>DOUBLE</codeph> + and a <codeph>DOUBLE</codeph> produces a <codeph>DECIMAL(38,17)</codeph>, because + <codeph>DECIMAL</codeph> values can represent much larger and more precise values than + <codeph>DOUBLE</codeph>. + </p> + <p> + When dividing, Impala always treats the arguments and result as <codeph>DOUBLE</codeph> values to avoid + losing precision. If you need to insert the results of a division operation into a + <codeph>FLOAT</codeph> column, use the <codeph>CAST()</codeph> function to convert the result to the + correct type. + </p> + </li> + + <li> + <p> + <codeph>%</codeph>: Modulo operator. Returns the remainder of the left-hand argument divided by the + right-hand argument. Both arguments must be of one of the integer types. + </p> + </li> + + <li> + <p> + <codeph>&</codeph>, <codeph>|</codeph>, <codeph>~</codeph>, and <codeph>^</codeph>: Bitwise operators that return the + logical AND, logical OR, <codeph>NOT</codeph>, or logical XOR (exclusive OR) of their argument values. Both arguments must be + of one of the integer types. If the arguments are of different type, the argument with the smaller type + is implicitly extended to match the argument with the longer type. + </p> + </li> + </ul> + + <p> + You can chain a sequence of arithmetic expressions, optionally grouping them with parentheses. + </p> + + <p> + The arithmetic operators generally do not have equivalent calling conventions using functional notation. + For example, prior to Impala 2.2.0 / CDH 5.4.0, there is no <codeph>MOD()</codeph> function equivalent to the <codeph>%</codeph> modulo + operator. Conversely, there are some arithmetic functions that do not have a corresponding operator. For + example, for exponentiation you use the <codeph>POW()</codeph> function, but there is no + <codeph>**</codeph> exponentiation operator. See <xref href="impala_math_functions.xml#math_functions"/> + for the arithmetic functions you can use. + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/> + + <p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/> + + <p conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/> + + <p rev="2.3.0"> + The following example shows how to do an arithmetic operation using a numeric field of a <codeph>STRUCT</codeph> type + that is an item within an <codeph>ARRAY</codeph> column. Once the scalar numeric value <codeph>R_NATIONKEY</codeph> + is extracted, it can be used in an arithmetic expression, such as multiplying by 10: + </p> + +<codeblock rev="2.3.0"> +-- The SMALLINT is a field within an array of structs. +describe region; ++-------------+-------------------------+---------+ +| name | type | comment | ++-------------+-------------------------+---------+ +| r_regionkey | smallint | | +| r_name | string | | +| r_comment | string | | +| r_nations | array<struct< | | +| | n_nationkey:smallint, | | +| | n_name:string, | | +| | n_comment:string | | +| | >> | | ++-------------+-------------------------+---------+ + +-- When we refer to the scalar value using dot notation, +-- we can use arithmetic and comparison operators on it +-- like any other number. +select r_name, nation.item.n_name, nation.item.n_nationkey * 10 + from region, region.r_nations as nation +where nation.item.n_nationkey < 5; ++-------------+-------------+------------------------------+ +| r_name | item.n_name | nation.item.n_nationkey * 10 | ++-------------+-------------+------------------------------+ +| AMERICA | CANADA | 30 | +| AMERICA | BRAZIL | 20 | +| AMERICA | ARGENTINA | 10 | +| MIDDLE EAST | EGYPT | 40 | +| AFRICA | ALGERIA | 0 | ++-------------+-------------+------------------------------+ +</codeblock> + + </conbody> + </concept> + + <concept id="between"> + + <title>BETWEEN Operator</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">BETWEEN operator</indexterm> + In a <codeph>WHERE</codeph> clause, compares an expression to both a lower and upper bound. The comparison + is successful is the expression is greater than or equal to the lower bound, and less than or equal to the + upper bound. If the bound values are switched, so the lower bound is greater than the upper bound, does not + match any values. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock><varname>expression</varname> BETWEEN <varname>lower_bound</varname> AND <varname>upper_bound</varname></codeblock> + + <p> + <b>Data types:</b> Typically used with numeric data types. Works with any data type, although not very + practical for <codeph>BOOLEAN</codeph> values. (<codeph>BETWEEN false AND true</codeph> will match all + <codeph>BOOLEAN</codeph> values.) Use <codeph>CAST()</codeph> if necessary to ensure the lower and upper + bound values are compatible types. Call string or date/time functions if necessary to extract or transform + the relevant portion to compare, especially if the value can be transformed into a number. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Be careful when using short string operands. A longer string that starts with the upper bound value will + not be included, because it is considered greater than the upper bound. For example, <codeph>BETWEEN 'A' + and 'M'</codeph> would not match the string value <codeph>'Midway'</codeph>. Use functions such as + <codeph>upper()</codeph>, <codeph>lower()</codeph>, <codeph>substr()</codeph>, <codeph>trim()</codeph>, and + so on if necessary to ensure the comparison works as expected. + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + +<codeblock>-- Retrieve data for January through June, inclusive. +select c1 from t1 where month <b>between 1 and 6</b>; + +-- Retrieve data for names beginning with 'A' through 'M' inclusive. +-- Only test the first letter to ensure all the values starting with 'M' are matched. +-- Do a case-insensitive comparison to match names with various capitalization conventions. +select last_name from customers where upper(substr(last_name,1,1)) <b>between 'A' and 'M'</b>; + +-- Retrieve data for only the first week of each month. +select count(distinct visitor_id)) from web_traffic where dayofmonth(when_viewed) <b>between 1 and 7</b>;</codeblock> + + <p rev="2.3.0"> + The following example shows how to do a <codeph>BETWEEN</codeph> comparison using a numeric field of a <codeph>STRUCT</codeph> type + that is an item within an <codeph>ARRAY</codeph> column. Once the scalar numeric value <codeph>R_NATIONKEY</codeph> + is extracted, it can be used in a comparison operator: + </p> + +<codeblock rev="2.3.0"> +-- The SMALLINT is a field within an array of structs. +describe region; ++-------------+-------------------------+---------+ +| name | type | comment | ++-------------+-------------------------+---------+ +| r_regionkey | smallint | | +| r_name | string | | +| r_comment | string | | +| r_nations | array<struct< | | +| | n_nationkey:smallint, | | +| | n_name:string, | | +| | n_comment:string | | +| | >> | | ++-------------+-------------------------+---------+ + +-- When we refer to the scalar value using dot notation, +-- we can use arithmetic and comparison operators on it +-- like any other number. +select r_name, nation.item.n_name, nation.item.n_nationkey +from region, region.r_nations as nation +where nation.item.n_nationkey between 3 and 5 ++-------------+-------------+------------------+ +| r_name | item.n_name | item.n_nationkey | ++-------------+-------------+------------------+ +| AMERICA | CANADA | 3 | +| MIDDLE EAST | EGYPT | 4 | +| AFRICA | ETHIOPIA | 5 | ++-------------+-------------+------------------+ +</codeblock> + + </conbody> + </concept> + + <concept id="comparison_operators"> + + <title>Comparison Operators</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">comparison operators</indexterm> + Impala supports the familiar comparison operators for checking equality and sort order for the column data + types: + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock><varname>left_hand_expression</varname> <varname>comparison_operator</varname> <varname>right_hand_expression</varname></codeblock> + + <ul> + <li> + <codeph>=</codeph>, <codeph>!=</codeph>, <codeph><></codeph>: apply to all types. + </li> + + <li> + <codeph><</codeph>, <codeph><=</codeph>, <codeph>></codeph>, <codeph>>=</codeph>: apply to + all types; for <codeph>BOOLEAN</codeph>, <codeph>TRUE</codeph> is considered greater than + <codeph>FALSE</codeph>. + </li> + </ul> + + <p> + <b>Alternatives:</b> + </p> + + <p> + The <codeph>IN</codeph> and <codeph>BETWEEN</codeph> operators provide shorthand notation for expressing + combinations of equality, less than, and greater than comparisons with a single operator. + </p> + + <p> + Because comparing any value to <codeph>NULL</codeph> produces <codeph>NULL</codeph> rather than + <codeph>TRUE</codeph> or <codeph>FALSE</codeph>, use the <codeph>IS NULL</codeph> and <codeph>IS NOT + NULL</codeph> operators to check if a value is <codeph>NULL</codeph> or not. + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/> + + <p rev="2.3.0"> + The following example shows how to do an arithmetic operation using a numeric field of a <codeph>STRUCT</codeph> type + that is an item within an <codeph>ARRAY</codeph> column. Once the scalar numeric value <codeph>R_NATIONKEY</codeph> + is extracted, it can be used with a comparison operator such as <codeph><</codeph>: + </p> + +<codeblock rev="2.3.0"> +-- The SMALLINT is a field within an array of structs. +describe region; ++-------------+-------------------------+---------+ +| name | type | comment | ++-------------+-------------------------+---------+ +| r_regionkey | smallint | | +| r_name | string | | +| r_comment | string | | +| r_nations | array<struct< | | +| | n_nationkey:smallint, | | +| | n_name:string, | | +| | n_comment:string | | +| | >> | | ++-------------+-------------------------+---------+ + +-- When we refer to the scalar value using dot notation, +-- we can use arithmetic and comparison operators on it +-- like any other number. +select r_name, nation.item.n_name, nation.item.n_nationkey +from region, region.r_nations as nation +where nation.item.n_nationkey < 5 ++-------------+-------------+------------------+ +| r_name | item.n_name | item.n_nationkey | ++-------------+-------------+------------------+ +| AMERICA | CANADA | 3 | +| AMERICA | BRAZIL | 2 | +| AMERICA | ARGENTINA | 1 | +| MIDDLE EAST | EGYPT | 4 | +| AFRICA | ALGERIA | 0 | ++-------------+-------------+------------------+ +</codeblock> + + </conbody> + </concept> + + <concept audience="Cloudera" rev="2.1.0" id="except"> + + <title>EXCEPT Operator</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">EXCEPT operator</indexterm> + </p> + </conbody> + </concept> + + <concept rev="2.0.0" id="exists"> + + <title>EXISTS Operator</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">EXISTS operator</indexterm> + <indexterm audience="Cloudera">NOT EXISTS operator</indexterm> + The <codeph>EXISTS</codeph> operator tests whether a subquery returns any results. + You typically use it to find values from one table that have corresponding values in another table. + </p> + + <p> + The converse, <codeph>NOT EXISTS</codeph>, helps to find all the values from one table that do not have any + corresponding values in another table. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>EXISTS (<varname>subquery</varname>) +NOT EXISTS (<varname>subquery</varname>) +</codeblock> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + The subquery can refer to a different table than the outer query block, or the same table. For example, you + might use <codeph>EXISTS</codeph> or <codeph>NOT EXISTS</codeph> to check the existence of parent/child + relationships between two columns of the same table. + </p> + + <p> + You can also use operators and function calls within the subquery to test for other kinds of relationships + other than strict equality. For example, you might use a call to <codeph>COUNT()</codeph> in the subquery + to check whether the number of matching values is higher or lower than some limit. You might call a UDF in + the subquery to check whether values in one table matches a hashed representation of those same values in a + different table. + </p> + + <p conref="../shared/impala_common.xml#common/null_blurb"/> + + <p> + If the subquery returns any value at all (even <codeph>NULL</codeph>), <codeph>EXISTS</codeph> returns + <codeph>TRUE</codeph> and <codeph>NOT EXISTS</codeph> returns false. + </p> + + <p> + The following example shows how even when the subquery returns only <codeph>NULL</codeph> values, + <codeph>EXISTS</codeph> still returns <codeph>TRUE</codeph> and thus matches all the rows from the table in + the outer query block. + </p> + +<codeblock>[localhost:21000] > create table all_nulls (x int); +[localhost:21000] > insert into all_nulls values (null), (null), (null); +[localhost:21000] > select y from t2 where exists (select x from all_nulls); ++---+ +| y | ++---+ +| 2 | +| 4 | +| 6 | ++---+ +</codeblock> + + <p> + However, if the table in the subquery is empty and so the subquery returns an empty result set, + <codeph>EXISTS</codeph> returns <codeph>FALSE</codeph>: + </p> + +<codeblock>[localhost:21000] > create table empty (x int); +[localhost:21000] > select y from t2 where exists (select x from empty); +[localhost:21000] > +</codeblock> + + <p conref="../shared/impala_common.xml#common/added_in_20"/> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p conref="../shared/impala_common.xml#common/subquery_no_limit"/> + + <p> + The <codeph>NOT EXISTS</codeph> operator requires a correlated subquery. + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/> + +<!-- To do: construct an EXISTS / NOT EXISTS example for complex types. --> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> +<!-- Maybe turn this into a conref if the same set of tables gets used for subqueries, EXISTS, other places. --> +<!-- Yes, the material was reused under Subqueries for anti-joins. --> + The following examples refer to these simple tables containing small sets of integers or strings: +<codeblock>[localhost:21000] > create table t1 (x int); +[localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6); + +[localhost:21000] > create table t2 (y int); +[localhost:21000] > insert into t2 values (2), (4), (6); + +[localhost:21000] > create table t3 (z int); +[localhost:21000] > insert into t3 values (1), (3), (5); + +[localhost:21000] > create table month_names (m string); +[localhost:21000] > insert into month_names values + > ('January'), ('February'), ('March'), + > ('April'), ('May'), ('June'), ('July'), + > ('August'), ('September'), ('October'), + > ('November'), ('December'); +</codeblock> + </p> + + <p> + The following example shows a correlated subquery that finds all the values in one table that exist in + another table. For each value <codeph>X</codeph> from <codeph>T1</codeph>, the query checks if the + <codeph>Y</codeph> column of <codeph>T2</codeph> contains an identical value, and the + <codeph>EXISTS</codeph> operator returns <codeph>TRUE</codeph> or <codeph>FALSE</codeph> as appropriate in + each case. + </p> + +<codeblock>localhost:21000] > select x from t1 where exists (select y from t2 where t1.x = y); ++---+ +| x | ++---+ +| 2 | +| 4 | +| 6 | ++---+ +</codeblock> + + <p> + An uncorrelated query is less interesting in this case. Because the subquery always returns + <codeph>TRUE</codeph>, all rows from <codeph>T1</codeph> are returned. If the table contents where changed + so that the subquery did not match any rows, none of the rows from <codeph>T1</codeph> would be returned. + </p> + +<codeblock>[localhost:21000] > select x from t1 where exists (select y from t2 where y > 5); ++---+ +| x | ++---+ +| 1 | +| 2 | +| 3 | +| 4 | +| 5 | +| 6 | ++---+ +</codeblock> + + <p> + The following example shows how an uncorrelated subquery can test for the existence of some condition + within a table. By using <codeph>LIMIT 1</codeph> or an aggregate function, the query returns a single + result or no result based on whether the subquery matches any rows. Here, we know that <codeph>T1</codeph> + and <codeph>T2</codeph> contain some even numbers, but <codeph>T3</codeph> does not. + </p> + +<codeblock>[localhost:21000] > select "contains an even number" from t1 where exists (select x from t1 where x % 2 = 0) limit 1; ++---------------------------+ +| 'contains an even number' | ++---------------------------+ +| contains an even number | ++---------------------------+ +[localhost:21000] > select "contains an even number" as assertion from t1 where exists (select x from t1 where x % 2 = 0) limit 1; ++-------------------------+ +| assertion | ++-------------------------+ +| contains an even number | ++-------------------------+ +[localhost:21000] > select "contains an even number" as assertion from t2 where exists (select x from t2 where y % 2 = 0) limit 1; +ERROR: AnalysisException: couldn't resolve column reference: 'x' +[localhost:21000] > select "contains an even number" as assertion from t2 where exists (select y from t2 where y % 2 = 0) limit 1; ++-------------------------+ +| assertion | ++-------------------------+ +| contains an even number | ++-------------------------+ +[localhost:21000] > select "contains an even number" as assertion from t3 where exists (select z from t3 where z % 2 = 0) limit 1; +[localhost:21000] > +</codeblock> + + <p> + The following example finds numbers in one table that are 1 greater than numbers from another table. The + <codeph>EXISTS</codeph> notation is simpler than an equivalent <codeph>CROSS JOIN</codeph> between the + tables. (The example then also illustrates how the same test could be performed using an + <codeph>IN</codeph> operator.) + </p> + +<codeblock>[localhost:21000] > select x from t1 where exists (select y from t2 where x = y + 1); ++---+ +| x | ++---+ +| 3 | +| 5 | ++---+ +[localhost:21000] > select x from t1 where x in (select y + 1 from t2); ++---+ +| x | ++---+ +| 3 | +| 5 | ++---+ +</codeblock> + + <p> + The following example finds values from one table that do not exist in another table. + </p> + +<codeblock>[localhost:21000] > select x from t1 where not exists (select y from t2 where x = y); ++---+ +| x | ++---+ +| 1 | +| 3 | +| 5 | ++---+ +</codeblock> + + <p> + The following example uses the <codeph>NOT EXISTS</codeph> operator to find all the leaf nodes in + tree-structured data. This simplified <q>tree of life</q> has multiple levels (class, order, family, and so + on), with each item pointing upward through a <codeph>PARENT</codeph> pointer. The example runs an outer + query and a subquery on the same table, returning only those items whose <codeph>ID</codeph> value is + <i>not</i> referenced by the <codeph>PARENT</codeph> of any other item. + </p> + +<codeblock>[localhost:21000] > create table tree (id int, parent int, name string); +[localhost:21000] > insert overwrite tree values + > (0, null, "animals"), + > (1, 0, "placentals"), + > (2, 0, "marsupials"), + > (3, 1, "bats"), + > (4, 1, "cats"), + > (5, 2, "kangaroos"), + > (6, 4, "lions"), + > (7, 4, "tigers"), + > (8, 5, "red kangaroo"), + > (9, 2, "wallabies"); +[localhost:21000] > select name as "leaf node" from tree one + > where not exists (select parent from tree two where one.id = two.parent); ++--------------+ +| leaf node | ++--------------+ +| bats | +| lions | +| tigers | +| red kangaroo | +| wallabies | ++--------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_subqueries.xml#subqueries"/> + </p> + </conbody> + </concept> + + <concept id="in"> + + <title>IN Operator</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">IN operator</indexterm> + <indexterm audience="Cloudera">NOT IN operator</indexterm> + The <codeph>IN</codeph> operator compares an argument value to a set of values, and returns + <codeph>TRUE</codeph> if the argument matches any value in the set. The <codeph>NOT IN</codeph> operator + reverses the comparison, and checks if the argument value is not part of a set of values. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock rev="2.0.0"><varname>expression</varname> IN (<varname>expression</varname> [, <varname>expression</varname>]) +<varname>expression</varname> IN (<varname>subquery</varname>) + +<varname>expression</varname> NOT IN (<varname>expression</varname> [, <varname>expression</varname>]) +<varname>expression</varname> NOT IN (<varname>subquery</varname>) +</codeblock> + + <p> + The left-hand expression and the set of comparison values must be of compatible types. + </p> + + <p> + The left-hand expression must consist only of a single value, not a tuple. Although the left-hand + expression is typically a column name, it could also be some other value. For example, the + <codeph>WHERE</codeph> clauses <codeph>WHERE id IN (5)</codeph> and <codeph>WHERE 5 IN (id)</codeph> + produce the same results. + </p> + + <p> + The set of values to check against can be specified as constants, function calls, column names, or other + expressions in the query text. When the values are listed explicitly, the maximum number of expressions is + 10,000. + </p> + + <p rev="2.0.0"> + In Impala 2.0 and higher, the set of values can also be generated by a subquery. <codeph>IN</codeph> can + evaluate an unlimited number of results using a subquery. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Any expression using the <codeph>IN</codeph> operator could be rewritten as a series of equality tests + connected with <codeph>OR</codeph>, but the <codeph>IN</codeph> syntax is often clearer, more concise, and + easier for Impala to optimize. For example, with partitioned tables, queries frequently use + <codeph>IN</codeph> clauses to filter data by comparing the partition key columns to specific values. + </p> + + <p conref="../shared/impala_common.xml#common/null_blurb"/> + + <p> + If there really is a matching non-null value, <codeph>IN</codeph> returns <codeph>TRUE</codeph>: + </p> + +<codeblock>[localhost:21000] > select 1 in (1,null,2,3); ++----------------------+ +| 1 in (1, null, 2, 3) | ++----------------------+ +| true | ++----------------------+ +[localhost:21000] > select 1 not in (1,null,2,3); ++--------------------------+ +| 1 not in (1, null, 2, 3) | ++--------------------------+ +| false | ++--------------------------+ +</codeblock> + + <p> + If the searched value is not found in the comparison values, and the comparison values include + <codeph>NULL</codeph>, the result is <codeph>NULL</codeph>: + </p> + +<codeblock>[localhost:21000] > select 5 in (1,null,2,3); ++----------------------+ +| 5 in (1, null, 2, 3) | ++----------------------+ +| NULL | ++----------------------+ +[localhost:21000] > select 5 not in (1,null,2,3); ++--------------------------+ +| 5 not in (1, null, 2, 3) | ++--------------------------+ +| NULL | ++--------------------------+ +[localhost:21000] > select 1 in (null); ++-------------+ +| 1 in (null) | ++-------------+ +| NULL | ++-------------+ +[localhost:21000] > select 1 not in (null); ++-----------------+ +| 1 not in (null) | ++-----------------+ +| NULL | ++-----------------+ +</codeblock> + + <p> + If the left-hand argument is <codeph>NULL</codeph>, <codeph>IN</codeph> always returns + <codeph>NULL</codeph>. This rule applies even if the comparison values include <codeph>NULL</codeph>. + </p> + +<codeblock>[localhost:21000] > select null in (1,2,3); ++-------------------+ +| null in (1, 2, 3) | ++-------------------+ +| NULL | ++-------------------+ +[localhost:21000] > select null not in (1,2,3); ++-----------------------+ +| null not in (1, 2, 3) | ++-----------------------+ +| NULL | ++-----------------------+ +[localhost:21000] > select null in (null); ++----------------+ +| null in (null) | ++----------------+ +| NULL | ++----------------+ +[localhost:21000] > select null not in (null); ++--------------------+ +| null not in (null) | ++--------------------+ +| NULL | ++--------------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/enhanced_in_20"/> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/> + + <p rev="2.3.0"> + The following example shows how to do an arithmetic operation using a numeric field of a <codeph>STRUCT</codeph> type + that is an item within an <codeph>ARRAY</codeph> column. Once the scalar numeric value <codeph>R_NATIONKEY</codeph> + is extracted, it can be used in an arithmetic expression, such as multiplying by 10: + </p> + +<codeblock rev="2.3.0"> +-- The SMALLINT is a field within an array of structs. +describe region; ++-------------+-------------------------+---------+ +| name | type | comment | ++-------------+-------------------------+---------+ +| r_regionkey | smallint | | +| r_name | string | | +| r_comment | string | | +| r_nations | array<struct< | | +| | n_nationkey:smallint, | | +| | n_name:string, | | +| | n_comment:string | | +| | >> | | ++-------------+-------------------------+---------+ + +-- When we refer to the scalar value using dot notation, +-- we can use arithmetic and comparison operators on it +-- like any other number. +select r_name, nation.item.n_name, nation.item.n_nationkey +from region, region.r_nations as nation +where nation.item.n_nationkey in (1,3,5) ++---------+-------------+------------------+ +| r_name | item.n_name | item.n_nationkey | ++---------+-------------+------------------+ +| AMERICA | CANADA | 3 | +| AMERICA | ARGENTINA | 1 | +| AFRICA | ETHIOPIA | 5 | ++---------+-------------+------------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p conref="../shared/impala_common.xml#common/subquery_no_limit"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + +<codeblock>-- Using IN is concise and self-documenting. +SELECT * FROM t1 WHERE c1 IN (1,2,10); +-- Equivalent to series of = comparisons ORed together. +SELECT * FROM t1 WHERE c1 = 1 OR c1 = 2 OR c1 = 10; + +SELECT c1 AS "starts with vowel" FROM t2 WHERE upper(substr(c1,1,1)) IN ('A','E','I','O','U'); + +SELECT COUNT(DISTINCT(visitor_id)) FROM web_traffic WHERE month IN ('January','June','July');</codeblock> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_subqueries.xml#subqueries"/> + </p> + </conbody> + </concept> + + <concept audience="Cloudera" rev="2.1.0" id="intersect"> + + <title>INTERSECT Operator</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">INTERSECT operator</indexterm> + </p> + </conbody> + </concept> + + <concept id="is_null"> + + <title>IS NULL Operator</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">IS NULL operator</indexterm> + <indexterm audience="Cloudera">IS NOT NULL operator</indexterm> + The <codeph>IS NULL</codeph> operator, and its converse the <codeph>IS NOT NULL</codeph> operator, test + whether a specified value is <codeph><xref href="impala_literals.xml#null">NULL</xref></codeph>. Because + using <codeph>NULL</codeph> with any of the other comparison operators such as <codeph>=</codeph> or + <codeph>!=</codeph> also returns <codeph>NULL</codeph> rather than <codeph>TRUE</codeph> or + <codeph>FALSE</codeph>, you use a special-purpose comparison operator to check for this special condition. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock><varname>expression</varname> IS NULL +<varname>expression</varname> IS NOT NULL +</codeblock> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + In many cases, <codeph>NULL</codeph> values indicate some incorrect or incomplete processing during data + ingestion or conversion. You might check whether any values in a column are <codeph>NULL</codeph>, and if + so take some followup action to fill them in. + </p> + + <p> + With sparse data, often represented in <q>wide</q> tables, it is common for most values to be + <codeph>NULL</codeph> with only an occasional non-<codeph>NULL</codeph> value. In those cases, you can use + the <codeph>IS NOT NULL</codeph> operator to identify the rows containing any data at all for a particular + column, regardless of the actual value. + </p> + + <p> + With a well-designed database schema, effective use of <codeph>NULL</codeph> values and <codeph>IS + NULL</codeph> and <codeph>IS NOT NULL</codeph> operators can save having to design custom logic around + special values such as 0, -1, <codeph>'N/A'</codeph>, empty string, and so on. <codeph>NULL</codeph> lets + you distinguish between a value that is known to be 0, false, or empty, and a truly unknown value. + </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + +<codeblock>-- If this value is non-zero, something is wrong. +select count(*) from employees where employee_id is null; + +-- With data from disparate sources, some fields might be blank. +-- Not necessarily an error condition. +select count(*) from census where household_income is null; + +-- Sometimes we expect fields to be null, and followup action +-- is needed when they are not. +select count(*) from web_traffic where weird_http_code is not null;</codeblock> + </conbody> + </concept> + + <concept id="like"> + + <title>LIKE Operator</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">LIKE operator</indexterm> + A comparison operator for <codeph>STRING</codeph> data, with basic wildcard capability using + <codeph>_</codeph> to match a single character and <codeph>%</codeph> to match multiple characters. The + argument expression must match the entire string value. Typically, it is more efficient to put any + <codeph>%</codeph> wildcard match at the end of the string. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock><varname>string_expression</varname> LIKE <varname>wildcard_expression</varname> +<varname>string_expression</varname> NOT LIKE <varname>wildcard_expression</varname> +</codeblock> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/> + +<!-- To do: construct a LIKE example for complex types. --> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + +<codeblock>select distinct c_last_name from customer where c_last_name like 'Mc%' or c_last_name like 'Mac%'; +select count(c_last_name) from customer where c_last_name like 'M%'; +select c_email_address from customer where c_email_address like '%.edu'; + +-- We can find 4-letter names beginning with 'M' by calling functions... +select distinct c_last_name from customer where length(c_last_name) = 4 and substr(c_last_name,1,1) = 'M'; +-- ...or in a more readable way by matching M followed by exactly 3 characters. +select distinct c_last_name from customer where c_last_name like 'M___';</codeblock> + + <p> + For a more general kind of search operator using regular expressions, see + <xref href="impala_operators.xml#regexp"/>. + </p> + </conbody> + </concept> + + <concept id="logical_operators"> + + <title>Logical Operators</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">logical operators</indexterm> + Logical operators return a <codeph>BOOLEAN</codeph> value, based on a binary or unary logical operation + between arguments that are also Booleans. Typically, the argument expressions use + <xref href="impala_operators.xml#comparison_operators">comparison operators</xref>. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock><varname>boolean_expression</varname> <varname>binary_logical_operator</varname> <varname>boolean_expression</varname> +<varname>unary_logical_operator</varname> <varname>boolean_expression</varname> +</codeblock> + + <p> + The Impala logical operators are: + </p> + + <ul> + <li> + <codeph>AND</codeph>: A binary operator that returns <codeph>true</codeph> if its left-hand and + right-hand arguments both evaluate to <codeph>true</codeph>, <codeph>NULL</codeph> if either argument is + <codeph>NULL</codeph>, and <codeph>false</codeph> otherwise. + </li> + + <li> + <codeph>OR</codeph>: A binary operator that returns <codeph>true</codeph> if either of its left-hand and + right-hand arguments evaluate to <codeph>true</codeph>, <codeph>NULL</codeph> if one argument is + <codeph>NULL</codeph> and the other is either <codeph>NULL</codeph> or <codeph>false</codeph>, and + <codeph>false</codeph> otherwise. + </li> + + <li> + <codeph>NOT</codeph>: A unary operator that flips the state of a Boolean expression from + <codeph>true</codeph> to <codeph>false</codeph>, or <codeph>false</codeph> to <codeph>true</codeph>. If + the argument expression is <codeph>NULL</codeph>, the result remains <codeph>NULL</codeph>. (When + <codeph>NOT</codeph> is used this way as a unary logical operator, it works differently than the + <codeph>IS NOT NULL</codeph> comparison operator, which returns <codeph>true</codeph> when applied to a + <codeph>NULL</codeph>.) + </li> + </ul> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/> + + <p rev="2.3.0"> + The following example shows how to do an arithmetic operation using a numeric field of a <codeph>STRUCT</codeph> type + that is an item within an <codeph>ARRAY</codeph> column. Once the scalar numeric value <codeph>R_NATIONKEY</codeph> + is extracted, it can be used in an arithmetic expression, such as multiplying by 10: + </p> + +<codeblock rev="2.3.0"> +-- The SMALLINT is a field within an array of structs. +describe region; ++-------------+-------------------------+---------+ +| name | type | comment | ++-------------+-------------------------+---------+ +| r_regionkey | smallint | | +| r_name | string | | +| r_comment | string | | +| r_nations | array<struct< | | +| | n_nationkey:smallint, | | +| | n_name:string, | | +| | n_comment:string | | +| | >> | | ++-------------+-------------------------+---------+ + +-- When we refer to the scalar value using dot notation, +-- we can use arithmetic and comparison operators on it +-- like any other number. +select r_name, nation.item.n_name, nation.item.n_nationkey + from region, region.r_nations as nation +where + nation.item.n_nationkey between 3 and 5 + or nation.item.n_nationkey < 15; ++-------------+----------------+------------------+ +| r_name | item.n_name | item.n_nationkey | ++-------------+----------------+------------------+ +| EUROPE | UNITED KINGDOM | 23 | +| EUROPE | RUSSIA | 22 | +| EUROPE | ROMANIA | 19 | +| ASIA | VIETNAM | 21 | +| ASIA | CHINA | 18 | +| AMERICA | UNITED STATES | 24 | +| AMERICA | PERU | 17 | +| AMERICA | CANADA | 3 | +| MIDDLE EAST | SAUDI ARABIA | 20 | +| MIDDLE EAST | EGYPT | 4 | +| AFRICA | MOZAMBIQUE | 16 | +| AFRICA | ETHIOPIA | 5 | ++-------------+----------------+------------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + These examples demonstrate the <codeph>AND</codeph> operator: + </p> + +<codeblock>[localhost:21000] > select true and true; ++---------------+ +| true and true | ++---------------+ +| true | ++---------------+ +[localhost:21000] > select true and false; ++----------------+ +| true and false | ++----------------+ +| false | ++----------------+ +[localhost:21000] > select false and false; ++-----------------+ +| false and false | ++-----------------+ +| false | ++-----------------+ +[localhost:21000] > select true and null; ++---------------+ +| true and null | ++---------------+ +| NULL | ++---------------+ +[localhost:21000] > select (10 > 2) and (6 != 9); ++-----------------------+ +| (10 > 2) and (6 != 9) | ++-----------------------+ +| true | ++-----------------------+ +</codeblock> + + <p> + These examples demonstrate the <codeph>OR</codeph> operator: + </p> + +<codeblock>[localhost:21000] > select true or true; ++--------------+ +| true or true | ++--------------+ +| true | ++--------------+ +[localhost:21000] > select true or false; ++---------------+ +| true or false | ++---------------+ +| true | ++---------------+ +[localhost:21000] > select false or false; ++----------------+ +| false or false | ++----------------+ +| false | ++----------------+ +[localhost:21000] > select true or null; ++--------------+ +| true or null | ++--------------+ +| true | ++--------------+ +[localhost:21000] > select null or true; ++--------------+ +| null or true | ++--------------+ +| true | ++--------------+ +[localhost:21000] > select false or null; ++---------------+ +| false or null | ++---------------+ +| NULL | ++---------------+ +[localhost:21000] > select (1 = 1) or ('hello' = 'world'); ++--------------------------------+ +| (1 = 1) or ('hello' = 'world') | ++--------------------------------+ +| true | ++--------------------------------+ +[localhost:21000] > select (2 + 2 != 4) or (-1 > 0); ++--------------------------+ +| (2 + 2 != 4) or (-1 > 0) | ++--------------------------+ +| false | ++--------------------------+ +</codeblock> + + <p> + These examples demonstrate the <codeph>NOT</codeph> operator: + </p> + +<codeblock>[localhost:21000] > select not true; ++----------+ +| not true | ++----------+ +| false | ++----------+ +[localhost:21000] > select not false; ++-----------+ +| not false | ++-----------+ +| true | ++-----------+ +[localhost:21000] > select not null; ++----------+ +| not null | ++----------+ +| NULL | ++----------+ +[localhost:21000] > select not (1=1); ++-------------+ +| not (1 = 1) | ++-------------+ +| false | ++-------------+ +</codeblock> + </conbody> + </concept> + + <concept id="regexp"> + + <title>REGEXP Operator</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">REGEXP operator</indexterm> + Tests whether a value matches a regular expression. Uses the POSIX regular expression syntax where + <codeph>^</codeph> and <codeph>$</codeph> match the beginning and end of the string, <codeph>.</codeph> + represents any single character, <codeph>*</codeph> represents a sequence of zero or more items, + <codeph>+</codeph> represents a sequence of one or more items, <codeph>?</codeph> produces a non-greedy + match, and so on. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> +<codeblock><varname>string_expression</varname> REGEXP <varname>regular_expression</varname> +</codeblock> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + The regular expression must match the entire value, not just occur somewhere inside it. Use + <codeph>.*</codeph> at the beginning and/or the end if you only need to match characters anywhere in the + middle. Thus, the <codeph>^</codeph> and <codeph>$</codeph> atoms are often redundant, although you might + already have them in your expression strings that you reuse from elsewhere. + </p> + + <p> + The <codeph>RLIKE</codeph> operator is a synonym for <codeph>REGEXP</codeph>. + </p> + + <p> + The <codeph>|</codeph> symbol is the alternation operator, typically used within <codeph>()</codeph> to + match different sequences. The <codeph>()</codeph> groups do not allow backreferences. To retrieve the part + of a value matched within a <codeph>()</codeph> section, use the + <codeph><xref href="impala_string_functions.xml#string_functions/regexp_extract">regexp_extract()</xref></codeph> + built-in function. + </p> + + <note rev="1.3.1"> + <p conref="../shared/impala_common.xml#common/regexp_matching"/> + </note> + + <p conref="../shared/impala_common.xml#common/regexp_re2"/> + + <p conref="../shared/impala_common.xml#common/regexp_re2_warning"/> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_caveat_no_operator"/> + +<!-- To do: construct a REGEXP example for complex types. --> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following examples demonstrate the identical syntax for the <codeph>REGEXP</codeph> and + <codeph>RLIKE</codeph> operators. + </p> + +<!-- Same examples shown for both REGEXP and RLIKE operators. --> + +<codeblock conref="../shared/impala_common.xml#common/regexp_rlike_examples"/> + </conbody> + </concept> + + <concept id="rlike"> + + <title>RLIKE Operator</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">RLIKE operator</indexterm> + Synonym for the <codeph>REGEXP</codeph> operator. See <xref href="impala_operators.xml#regexp"/> for + details. + </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following examples demonstrate the identical syntax for the <codeph>REGEXP</codeph> and + <codeph>RLIKE</codeph> operators. + </p> + +<!-- Same examples shown for both REGEXP and RLIKE operators. --> + +<codeblock conref="../shared/impala_common.xml#common/regexp_rlike_examples"/> + </conbody> + </concept> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_order_by.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_order_by.xml b/docs/topics/impala_order_by.xml new file mode 100644 index 0000000..f3042e5 --- /dev/null +++ b/docs/topics/impala_order_by.xml @@ -0,0 +1,316 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="order_by"> + + <title>ORDER BY Clause</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + The familiar <codeph>ORDER BY</codeph> clause of a <codeph>SELECT</codeph> statement sorts the result set + based on the values from one or more columns. + </p> + + <p> + For distributed queries, this is a relatively expensive operation, because the entire result set must be + produced and transferred to one node before the sorting can happen. This can require more memory capacity + than a query without <codeph>ORDER BY</codeph>. Even if the query takes approximately the same time to finish + with or without the <codeph>ORDER BY</codeph> clause, subjectively it can appear slower because no results + are available until all processing is finished, rather than results coming back gradually as rows matching + the <codeph>WHERE</codeph> clause are found. Therefore, if you only need the first N results from the sorted + result set, also include the <codeph>LIMIT</codeph> clause, which reduces network overhead and the memory + requirement on the coordinator node. + </p> + + <note> + <p rev="1.4.0 obwl"> + In Impala 1.4.0 and higher, the <codeph>LIMIT</codeph> clause is now optional (rather than required) for + queries that use the <codeph>ORDER BY</codeph> clause. Impala automatically uses a temporary disk work area + to perform the sort if the sort operation would otherwise exceed the Impala memory limit for a particular + data node. + </p> + </note> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p> + The full syntax for the <codeph>ORDER BY</codeph> clause is: + </p> + +<codeblock rev="1.2.1">ORDER BY <varname>col_ref</varname> [, <varname>col_ref</varname> ...] [ASC | DESC] [NULLS FIRST | NULLS LAST] + +col_ref ::= <varname>column_name</varname> | <varname>integer_literal</varname> +</codeblock> + + <p> + Although the most common usage is <codeph>ORDER BY <varname>column_name</varname></codeph>, you can also + specify <codeph>ORDER BY 1</codeph> to sort by the first column of the result set, <codeph>ORDER BY + 2</codeph> to sort by the second column, and so on. The number must be a numeric literal, not some other kind + of constant expression. (If the argument is some other expression, even a <codeph>STRING</codeph> value, the + query succeeds but the order of results is undefined.) + </p> + + <p> + <codeph>ORDER BY <varname>column_number</varname></codeph> can only be used when the query explicitly lists + the columns in the <codeph>SELECT</codeph> list, not with <codeph>SELECT *</codeph> queries. + </p> + + <p> + <b>Ascending and descending sorts:</b> + </p> + + <p> + The default sort order (the same as using the <codeph>ASC</codeph> keyword) puts the smallest values at the + start of the result set, and the largest values at the end. Specifying the <codeph>DESC</codeph> keyword + reverses that order. + </p> + + <p> + <b>Sort order for NULL values:</b> + </p> + + <p rev="1.2.1"> + See <xref href="impala_literals.xml#null"/> for details about how <codeph>NULL</codeph> values are positioned + in the sorted result set, and how to use the <codeph>NULLS FIRST</codeph> and <codeph>NULLS LAST</codeph> + clauses. (The sort position for <codeph>NULL</codeph> values in <codeph>ORDER BY ... DESC</codeph> queries is + changed in Impala 1.2.1 and higher to be more standards-compliant, and the <codeph>NULLS FIRST</codeph> and + <codeph>NULLS LAST</codeph> keywords are new in Impala 1.2.1.) + </p> + + <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_limit"/> + + <!-- Good to show an example of cases where ORDER BY does and doesn't work with complex types. --> + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p rev="2.3.0"> + In CDH 5.5 / Impala 2.3 and higher, the complex data types <codeph>STRUCT</codeph>, + <codeph>ARRAY</codeph>, and <codeph>MAP</codeph> are available. These columns cannot + be referenced directly in the <codeph>ORDER BY</codeph> clause. + When you query a complex type column, you use join notation to <q>unpack</q> the elements + of the complex type, and within the join query you can include an <codeph>ORDER BY</codeph> + clause to control the order in the result set of the scalar elements from the complex type. + See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types. + </p> + + <p> + The following query shows how a complex type column cannot be directly used in an <codeph>ORDER BY</codeph> clause: + </p> + +<codeblock>CREATE TABLE games (id BIGINT, score ARRAY <BIGINT>) STORED AS PARQUET; +...use LOAD DATA to load externally created Parquet files into the table... +SELECT id FROM games ORDER BY score DESC; +ERROR: AnalysisException: ORDER BY expression 'score' with complex type 'ARRAY<BIGINT>' is not supported. +</codeblock> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following query retrieves the user ID and score, only for scores greater than one million, + with the highest scores for each user listed first. + Because the individual array elements are now represented as separate rows in the result set, + they can be used in the <codeph>ORDER BY</codeph> clause, referenced using the <codeph>ITEM</codeph> + pseudocolumn that represents each array element. + </p> + +<codeblock>SELECT id, item FROM games, games.score + WHERE item > 1000000 +ORDER BY id, item desc; +</codeblock> + + <p> + The following queries use similar <codeph>ORDER BY</codeph> techniques with variations of the <codeph>GAMES</codeph> + table, where the complex type is an <codeph>ARRAY</codeph> containing <codeph>STRUCT</codeph> or <codeph>MAP</codeph> + elements to represent additional details about each game that was played. + For an array of structures, the fields of the structure are referenced as <codeph>ITEM.<varname>field_name</varname></codeph>. + For an array of maps, the keys and values within each array element are referenced as <codeph>ITEM.KEY</codeph> + and <codeph>ITEM.VALUE</codeph>. + </p> + +<codeblock>CREATE TABLE games2 (id BIGINT, play array < struct <game_name: string, score: BIGINT, high_score: boolean> >) STORED AS PARQUET +...use LOAD DATA to load externally created Parquet files into the table... +SELECT id, item.game_name, item.score FROM games2, games2.play + WHERE item.score > 1000000 +ORDER BY id, item.score DESC; + +CREATE TABLE games3 (id BIGINT, play ARRAY < MAP <STRING, BIGINT> >) STORED AS PARQUET; +...use LOAD DATA to load externally created Parquet files into the table... +SELECT id, info.key AS k, info.value AS v from games3, games3.play AS plays, games3.play.item AS info + WHERE info.KEY = 'score' AND info.VALUE > 1000000 +ORDER BY id, info.value desc; +</codeblock> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Although the <codeph>LIMIT</codeph> clause is now optional on <codeph>ORDER BY</codeph> queries, if your + query only needs some number of rows that you can predict in advance, use the <codeph>LIMIT</codeph> clause + to reduce unnecessary processing. For example, if the query has a clause <codeph>LIMIT 10</codeph>, each data + node sorts its portion of the relevant result set and only returns 10 rows to the coordinator node. The + coordinator node picks the 10 highest or lowest row values out of this small intermediate result set. + </p> + + <p> + If an <codeph>ORDER BY</codeph> clause is applied to an early phase of query processing, such as a subquery + or a view definition, Impala ignores the <codeph>ORDER BY</codeph> clause. To get ordered results from a + subquery or view, apply an <codeph>ORDER BY</codeph> clause to the outermost or final <codeph>SELECT</codeph> + level. + </p> + + <p> + <codeph>ORDER BY</codeph> is often used in combination with <codeph>LIMIT</codeph> to perform <q>top-N</q> + queries: + </p> + +<codeblock>SELECT user_id AS "Top 10 Visitors", SUM(page_views) FROM web_stats + GROUP BY page_views, user_id + ORDER BY SUM(page_views) DESC LIMIT 10; +</codeblock> + + <p> + <codeph>ORDER BY</codeph> is sometimes used in combination with <codeph>OFFSET</codeph> and + <codeph>LIMIT</codeph> to paginate query results, although it is relatively inefficient to issue multiple + queries like this against the large tables typically used with Impala: + </p> + +<codeblock>SELECT page_title AS "Page 1 of search results", page_url FROM search_content + WHERE LOWER(page_title) LIKE '%game%') + ORDER BY page_title LIMIT 10 OFFSET 0; +SELECT page_title AS "Page 2 of search results", page_url FROM search_content + WHERE LOWER(page_title) LIKE '%game%') + ORDER BY page_title LIMIT 10 OFFSET 10; +SELECT page_title AS "Page 3 of search results", page_url FROM search_content + WHERE LOWER(page_title) LIKE '%game%') + ORDER BY page_title LIMIT 10 OFFSET 20; +</codeblock> + + <p conref="../shared/impala_common.xml#common/internals_blurb"/> + + <p> + Impala sorts the intermediate results of an <codeph>ORDER BY</codeph> clause in memory whenever practical. In + a cluster of N data nodes, each node sorts roughly 1/Nth of the result set, the exact proportion varying + depending on how the data matching the query is distributed in HDFS. + </p> + + <p> + If the size of the sorted intermediate result set on any data node would cause the query to exceed the Impala + memory limit, Impala sorts as much as practical in memory, then writes partially sorted data to disk. (This + technique is known in industry terminology as <q>external sorting</q> and <q>spilling to disk</q>.) As each + 8 MB batch of data is written to disk, Impala frees the corresponding memory to sort a new 8 MB batch of + data. When all the data has been processed, a final merge sort operation is performed to correctly order the + in-memory and on-disk results as the result set is transmitted back to the coordinator node. When external + sorting becomes necessary, Impala requires approximately 60 MB of RAM at a minimum for the buffers needed to + read, write, and sort the intermediate results. If more RAM is available on the data node, Impala will use + the additional RAM to minimize the amount of disk I/O for sorting. + </p> + + <p> + This external sort technique is used as appropriate on each data node (possibly including the coordinator + node) to sort the portion of the result set that is processed on that node. When the sorted intermediate + results are sent back to the coordinator node to produce the final result set, the coordinator node uses a + merge sort technique to produce a final sorted result set without using any extra resources on the + coordinator node. + </p> + + <p rev="obwl"> + <b>Configuration for disk usage:</b> + </p> + + <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_scratch_dir"/> + +<!-- Here is actually the more logical place to collect all those examples, move them from SELECT and cross-reference to here. --> + +<!-- <p rev="obwl" conref="/Content/impala_common_xi44078.xml#common/restrictions_blurb"/> --> + + <p rev="obwl" conref="../shared/impala_common.xml#common/insert_sort_blurb"/> + + <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_view_restriction"/> + + <p> + With the lifting of the requirement to include a <codeph>LIMIT</codeph> clause in every <codeph>ORDER + BY</codeph> query (in Impala 1.4 and higher): + </p> + + <ul> + <li> + <p> + Now the use of scratch disk space raises the possibility of an <q>out of disk space</q> error on a + particular data node, as opposed to the previous possibility of an <q>out of memory</q> error. Make sure + to keep at least 1 GB free on the filesystem used for temporary sorting work. + </p> + </li> + + <li> + <p> + The query options + <xref href="impala_default_order_by_limit.xml#default_order_by_limit">DEFAULT_ORDER_BY_LIMIT</xref> and + <xref href="impala_abort_on_default_limit_exceeded.xml#abort_on_default_limit_exceeded">ABORT_ON_DEFAULT_LIMIT_EXCEEDED</xref>, + which formerly controlled the behavior of <codeph>ORDER BY</codeph> queries with no limit specified, are + now ignored. + </p> + </li> + </ul> + + <p rev="obwl" conref="../shared/impala_common.xml#common/null_sorting_change"/> +<codeblock>[localhost:21000] > create table numbers (x int); +[localhost:21000] > insert into numbers values (1), (null), (2), (null), (3); +[localhost:21000] > select x from numbers order by x nulls first; ++------+ +| x | ++------+ +| NULL | +| NULL | +| 1 | +| 2 | +| 3 | ++------+ +[localhost:21000] > select x from numbers order by x desc nulls first; ++------+ +| x | ++------+ +| NULL | +| NULL | +| 3 | +| 2 | +| 1 | ++------+ +[localhost:21000] > select x from numbers order by x nulls last; ++------+ +| x | ++------+ +| 1 | +| 2 | +| 3 | +| NULL | +| NULL | ++------+ +[localhost:21000] > select x from numbers order by x desc nulls last; ++------+ +| x | ++------+ +| 3 | +| 2 | +| 1 | +| NULL | +| NULL | ++------+ +</codeblock> + + <p rev="obwl" conref="../shared/impala_common.xml#common/related_info"/> + + <p rev="obwl"> + See <xref href="impala_select.xml#select"/> for further examples of queries with the <codeph>ORDER + BY</codeph> clause. + </p> + + <p> + Analytic functions use the <codeph>ORDER BY</codeph> clause in a different context to define the sequence in + which rows are analyzed. See <xref href="impala_analytic_functions.xml#analytic_functions"/> for details. + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_parquet_compression_codec.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_parquet_compression_codec.xml b/docs/topics/impala_parquet_compression_codec.xml new file mode 100644 index 0000000..d178a0d --- /dev/null +++ b/docs/topics/impala_parquet_compression_codec.xml @@ -0,0 +1,25 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="parquet_compression_codec"> + + <title>PARQUET_COMPRESSION_CODEC Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Parquet"/> + <data name="Category" value="File Formats"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Deprecated Features"/> + <data name="Category" value="Compression"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">PARQUET_COMPRESSION_CODEC query option</indexterm> + Deprecated. Use <codeph>COMPRESSION_CODEC</codeph> in Impala 2.0 and later. See + <xref href="impala_compression_codec.xml#compression_codec"/> for details. + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_parquet_file_size.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_parquet_file_size.xml b/docs/topics/impala_parquet_file_size.xml new file mode 100644 index 0000000..396fa92 --- /dev/null +++ b/docs/topics/impala_parquet_file_size.xml @@ -0,0 +1,82 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="parquet_block_size" id="parquet_file_size"> + + <title>PARQUET_FILE_SIZE Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Parquet"/> + <data name="Category" value="File Formats"/> + <data name="Category" value="Impala Query Options"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">PARQUET_FILE_SIZE query option</indexterm> + Specifies the maximum size of each Parquet data file produced by Impala <codeph>INSERT</codeph> statements. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p> + Specify the size in bytes, or with a trailing <codeph>m</codeph> or <codeph>g</codeph> character to indicate + megabytes or gigabytes. For example: + </p> + +<codeblock>-- 128 megabytes. +set PARQUET_FILE_SIZE=134217728 +INSERT OVERWRITE parquet_table SELECT * FROM text_table; + +-- 512 megabytes. +set PARQUET_FILE_SIZE=512m; +INSERT OVERWRITE parquet_table SELECT * FROM text_table; + +-- 1 gigabyte. +set PARQUET_FILE_SIZE=1g; +INSERT OVERWRITE parquet_table SELECT * FROM text_table; +</codeblock> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + With tables that are small or finely partitioned, the default Parquet block size (formerly 1 GB, now 256 MB + in Impala 2.0 and later) could be much larger than needed for each data file. For <codeph>INSERT</codeph> + operations into such tables, you can increase parallelism by specifying a smaller + <codeph>PARQUET_FILE_SIZE</codeph> value, resulting in more HDFS blocks that can be processed by different + nodes. +<!-- Reducing the file size also reduces the memory required to buffer each block before writing it to disk. --> + </p> + + <p> + <b>Type:</b> numeric, with optional unit specifier + </p> + + <note type="important"> + <p> + Currently, the maximum value for this setting is 1 gigabyte (<codeph>1g</codeph>). + Setting a value higher than 1 gigabyte could result in errors during + an <codeph>INSERT</codeph> operation. + </p> + </note> + + <p> + <b>Default:</b> 0 (produces files with a target size of 256 MB; files might be larger for very wide tables) + </p> + + <p conref="../shared/impala_common.xml#common/isilon_blurb"/> + <p conref="../shared/impala_common.xml#common/isilon_block_size_caveat"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + For information about the Parquet file format, and how the number and size of data files affects query + performance, see <xref href="impala_parquet.xml#parquet"/>. + </p> + +<!-- Examples actually folded into Syntax earlier. <p conref="../shared/impala_common.xml#common/example_blurb"/> --> + + </conbody> +</concept>
