http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_functions.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_functions.xml b/docs/topics/impala_functions.xml new file mode 100644 index 0000000..527744b --- /dev/null +++ b/docs/topics/impala_functions.xml @@ -0,0 +1,162 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="builtins"> + + <title id="title_functions">Impala Built-In Functions</title> + <titlealts><navtitle>Built-In Functions</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> + + <conbody> + + <draft-comment translate="no"> +Opportunity to conref some material between here and the "Functions" topic under "Schema Objects". +</draft-comment> + + <p> + Impala supports several categories of built-in functions. These functions let you perform mathematical + calculations, string manipulation, date calculations, and other kinds of data transformations directly in + <codeph>SELECT</codeph> statements. The built-in functions let a SQL query return results with all + formatting, calculating, and type conversions applied, rather than performing time-consuming postprocessing + in another application. By applying function calls where practical, you can make a SQL query that is as + convenient as an expression in a procedural programming language or a formula in a spreadsheet. 
+ </p> + + <p> + The categories of functions supported by Impala are: + </p> + + <ul> + <li> + <xref href="impala_math_functions.xml#math_functions"/> + </li> + + <li> + <xref href="impala_conversion_functions.xml#conversion_functions"/> + </li> + + <li> + <xref href="impala_datetime_functions.xml#datetime_functions"/> + </li> + + <li> + <xref href="impala_conditional_functions.xml#conditional_functions"/> + </li> + + <li> + <xref href="impala_string_functions.xml#string_functions"/> + </li> + + <li> + Aggregation functions, explained in <xref href="impala_aggregate_functions.xml#aggregate_functions"/>. + </li> + </ul> + + <p> + You call any of these functions through the <codeph>SELECT</codeph> statement. For most functions, you can + omit the <codeph>FROM</codeph> clause and supply literal values for any required arguments: + </p> + +<codeblock>select abs(-1); ++---------+ +| abs(-1) | ++---------+ +| 1 | ++---------+ + +select concat('The rain ', 'in Spain'); ++---------------------------------+ +| concat('the rain ', 'in spain') | ++---------------------------------+ +| The rain in Spain | ++---------------------------------+ + +select power(2,5); ++-------------+ +| power(2, 5) | ++-------------+ +| 32 | ++-------------+ +</codeblock> + + <p> + When you use a <codeph>FROM</codeph> clause and specify a column name as a function argument, the function is + applied for each item in the result set: + </p> + +<!-- TK: make real output for these; change the queries if necessary to use tables I already have. 
--> + +<codeblock>select concat('Country = ',country_code) from all_countries where population > 100000000; +select round(price) as dollar_value from product_catalog where price between 0.0 and 100.0; +</codeblock> + + <p> + Typically, if any argument to a built-in function is <codeph>NULL</codeph>, the result value is also + <codeph>NULL</codeph>: + </p> + +<codeblock>select cos(null); ++-----------+ +| cos(null) | ++-----------+ +| NULL | ++-----------+ + +select power(2,null); ++----------------+ +| power(2, null) | ++----------------+ +| NULL | ++----------------+ + +select concat('a',null,'b'); ++------------------------+ +| concat('a', null, 'b') | ++------------------------+ +| NULL | ++------------------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/aggr1"/> + +<codeblock conref="../shared/impala_common.xml#common/aggr2"/> + + <p conref="../shared/impala_common.xml#common/aggr3"/> + + <p> + Aggregate functions are a special category with different rules. These functions calculate a return value + across all the items in a result set, so they do require a <codeph>FROM</codeph> clause in the query: + </p> + +<!-- TK: make real output for these; change the queries if necessary to use tables I already have. --> + +<codeblock>select count(product_id) from product_catalog; +select max(height), avg(height) from census_data where age > 20; +</codeblock> + + <p> + Aggregate functions also ignore <codeph>NULL</codeph> values rather than returning a <codeph>NULL</codeph> + result. For example, if some rows have <codeph>NULL</codeph> for a particular column, those rows are ignored + when computing the AVG() for that column. Likewise, specifying <codeph>COUNT(col_name)</codeph> in a query + counts only those rows where <codeph>col_name</codeph> contains a non-<codeph>NULL</codeph> value. + </p> + + <p rev="2.0.0"> + Analytic functions are a variation on aggregate functions. 
Instead of returning a single value, or an + identical value for each group of rows, they can compute values that vary based on a <q>window</q> consisting + of other rows around them in the result set. + </p> + + <p outputclass="toc"/> + + </conbody> + +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_functions_overview.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_functions_overview.xml b/docs/topics/impala_functions_overview.xml new file mode 100644 index 0000000..26a4d35 --- /dev/null +++ b/docs/topics/impala_functions_overview.xml @@ -0,0 +1,116 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="functions"> + + <title>Overview of Impala Functions</title> + <titlealts><navtitle>Functions</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> + + <conbody> + + <p> + Functions let you apply arithmetic, string, or other computations and transformations to Impala data. You + typically use them in <codeph>SELECT</codeph> lists and <codeph>WHERE</codeph> clauses to filter and format + query results so that the result set is exactly what you want, with no further processing needed on the + application side. + </p> + + <p> + Scalar functions return a single result for each input row. See <xref href="impala_functions.xml#builtins"/>. 
+ </p> + +<codeblock>[localhost:21000] > select name, population from country where continent = 'North America' order by population desc limit 4; +[localhost:21000] > select upper(name), population from country where continent = 'North America' order by population desc limit 4; ++-------------+------------+ +| upper(name) | population | ++-------------+------------+ +| USA | 320000000 | +| MEXICO | 122000000 | +| CANADA | 25000000 | +| GUATEMALA | 16000000 | ++-------------+------------+ +</codeblock> + <p> + Aggregate functions combine the results from multiple rows: + either a single result for the entire table, or a separate result for each group of rows. + Aggregate functions are frequently used in combination with <codeph>GROUP BY</codeph> + and <codeph>HAVING</codeph> clauses in the <codeph>SELECT</codeph> statement. + See <xref href="impala_aggregate_functions.xml#aggregate_functions"/>. + </p> + +<codeblock>[localhost:21000] > select continent, <b>sum(population)</b> as howmany from country <b>group by continent</b> order by howmany desc; ++---------------+------------+ +| continent | howmany | ++---------------+------------+ +| Asia | 4298723000 | +| Africa | 1110635000 | +| Europe | 742452000 | +| North America | 565265000 | +| South America | 406740000 | +| Oceania | 38304000 | ++---------------+------------+ +</codeblock> + + <p> + User-defined functions (UDFs) let you code your own logic. They can be either scalar or aggregate functions. + UDFs let you implement important business or scientific logic using high-performance code for Impala to automatically parallelize. + You can also use UDFs to implement convenience functions to simplify reporting or porting SQL from other database systems. + See <xref href="impala_udf.xml#udfs"/>. + </p> + +<codeblock>[localhost:21000] > select <b>rot13('Hello world!')</b> as 'Weak obfuscation'; ++------------------+ +| weak obfuscation | ++------------------+ +| Uryyb jbeyq! 
| ++------------------+ +[localhost:21000] > select <b>likelihood_of_new_subatomic_particle(sensor1, sensor2, sensor3)</b> as probability + > from experimental_results group by experiment; +</codeblock> + + <p> + Each function is associated with a specific database. For example, if you issue a <codeph>USE somedb</codeph> + statement followed by <codeph>CREATE FUNCTION somefunc</codeph>, the new function is created in the + <codeph>somedb</codeph> database, and you could refer to it through the fully qualified name + <codeph>somedb.somefunc</codeph>. You could then issue another <codeph>USE</codeph> statement + and create a function with the same name in a different database. + </p> + + <p> + Impala built-in functions are associated with a special database named <codeph>_impala_builtins</codeph>, + which lets you refer to them from any database without qualifying the name. + </p> + +<codeblock>[localhost:21000] > show databases; ++-------------------------+ +| name | ++-------------------------+ +| <b>_impala_builtins</b> | +| analytic_functions | +| avro_testing | +| data_file_size | +... 
+[localhost:21000] > show functions in _impala_builtins like '*subs*'; ++-------------+-----------------------------------+ +| return type | signature | ++-------------+-----------------------------------+ +| STRING | substr(STRING, BIGINT) | +| STRING | substr(STRING, BIGINT, BIGINT) | +| STRING | substring(STRING, BIGINT) | +| STRING | substring(STRING, BIGINT, BIGINT) | ++-------------+-----------------------------------+ +</codeblock> + + <p> + <b>Related statements:</b> <xref href="impala_create_function.xml#create_function"/>, + <xref href="impala_drop_function.xml#drop_function"/> + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_grant.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_grant.xml b/docs/topics/impala_grant.xml new file mode 100644 index 0000000..6aad41e --- /dev/null +++ b/docs/topics/impala_grant.xml @@ -0,0 +1,117 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.0.0" id="grant"> + + <title>GRANT Statement (CDH 5.2 or higher only)</title> + <titlealts><navtitle>GRANT (CDH 5.2 or higher only)</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="DDL"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Sentry"/> + <data name="Category" value="Roles"/> + <!-- Consider whether to go deeper into categories like Security for the Sentry-related statements. --> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">GRANT statement</indexterm> +<!-- Copied from Sentry docs. Turn into conref. I did some rewording for clarity. --> + The <codeph>GRANT</codeph> statement grants roles or privileges on specified objects to groups. Only Sentry + administrative users can grant roles to a group. 
+ </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock rev="2.3.0 collevelauth">GRANT ROLE <varname>role_name</varname> TO GROUP <varname>group_name</varname> + +GRANT <varname>privilege</varname> ON <varname>object_type</varname> <varname>object_name</varname> + TO [ROLE] <varname>role_name</varname> + [WITH GRANT OPTION] + +<ph rev="2.3.0">privilege ::= SELECT | SELECT(<varname>column_name</varname>) | INSERT | ALL</ph> +object_type ::= TABLE | DATABASE | SERVER | URI +</codeblock> + + <p> + Typically, the object name is an identifier. For URIs, it is a string literal. + </p> + +<!-- Turn privilege info into a conref or series of conrefs. (In both GRANT and REVOKE.) --> + + <p conref="../shared/impala_common.xml#common/privileges_blurb"/> + + <p> +<!-- To do: The wording here can be fluid, and it's reused in several statements. Turn into a conref. --> + Only administrative users (initially, a predefined set of users specified in the Sentry service configuration + file) can use this statement. + </p> + + <p> + The <codeph>WITH GRANT OPTION</codeph> clause allows members of the specified role to issue + <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements for those same privileges. +<!-- Copied from Sentry docs. Turn into conref. I did some rewording for clarity. --> + Hence, if a role has the <codeph>ALL</codeph> privilege on a database and the <codeph>WITH GRANT + OPTION</codeph> set, users granted that role can execute <codeph>GRANT</codeph>/<codeph>REVOKE</codeph> + statements only for that database or child tables of the database. This means a user could revoke the + privileges of the user that provided them the <codeph>GRANT OPTION</codeph>. + </p> + + <p> +<!-- Copied from Sentry docs. Turn into conref. Except I changed Hive to Impala. --> + Impala does not currently support revoking only the <codeph>WITH GRANT OPTION</codeph> from a privilege + previously granted to a role. 
To remove the <codeph>WITH GRANT OPTION</codeph>, revoke the privilege and + grant it again without the <codeph>WITH GRANT OPTION</codeph> flag. + </p> + + <p rev="2.3.0 collevelauth"> + The ability to grant or revoke <codeph>SELECT</codeph> privilege on specific columns is available + in CDH 5.5 / Impala 2.3 and higher. See <xref href="sg_hive_sql.xml#concept_c2q_4qx_p4/col_level_auth_sentry"/> + for details. + </p> + +<!-- Turn compatibility info into a conref or series of conrefs. (In both GRANT and REVOKE.) --> + +<!-- If they diverge during development, consider the version here in GRANT the authoritative one. --> + + <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> + + <p> + <ul> + <li> + The Impala <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements are available in CDH 5.2 and + later. + </li> + + <li> + In CDH 5.1 and later, Impala can make use of any roles and privileges specified by the + <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in Hive, when your system is configured to + use the Sentry service instead of the file-based policy mechanism. + </li> + + <li> + The Impala <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements for privileges do not require + the <codeph>ROLE</codeph> keyword to be repeated before each role name, unlike the equivalent Hive + statements. 
+ </li> + + <li conref="../shared/impala_common.xml#common/grant_revoke_single"/> + </ul> + </p> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> + + <p conref="../shared/impala_common.xml#common/permissions_blurb_no"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_authorization.xml#authorization"/>, <xref href="impala_revoke.xml#revoke"/>, + <xref href="impala_create_role.xml#create_role"/>, <xref href="impala_drop_role.xml#drop_role"/>, + <xref href="impala_show.xml#show"/> + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_group_by.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_group_by.xml b/docs/topics/impala_group_by.xml new file mode 100644 index 0000000..10b7de4 --- /dev/null +++ b/docs/topics/impala_group_by.xml @@ -0,0 +1,137 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="group_by"> + + <title>GROUP BY Clause</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Aggregate Functions"/> + </metadata> + </prolog> + + <conbody> + + <p> + Specify the <codeph>GROUP BY</codeph> clause in queries that use aggregation functions, such as + <codeph><xref href="impala_count.xml#count">COUNT()</xref></codeph>, + <codeph><xref href="impala_sum.xml#sum">SUM()</xref></codeph>, + <codeph><xref href="impala_avg.xml#avg">AVG()</xref></codeph>, + <codeph><xref href="impala_min.xml#min">MIN()</xref></codeph>, and + <codeph><xref href="impala_max.xml#max">MAX()</xref></codeph>. Specify in the + <codeph><xref href="impala_group_by.xml#group_by">GROUP BY</xref></codeph> clause the names of all the + columns that do not participate in the aggregation operation. 
+ </p> + + <!-- Good to show an example of cases where GROUP BY does and doesn't work with complex types. --> + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p rev="2.3.0"> + In CDH 5.5 / Impala 2.3 and higher, the complex data types <codeph>STRUCT</codeph>, + <codeph>ARRAY</codeph>, and <codeph>MAP</codeph> are available. These columns cannot + be referenced directly in the <codeph>GROUP BY</codeph> clause. + When you query a complex type column, you use join notation to <q>unpack</q> the elements + of the complex type, and within the join query you can include a <codeph>GROUP BY</codeph> + clause to aggregate the scalar elements from the complex type. + See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types. + </p> + + <p conref="../shared/impala_common.xml#common/zero_length_strings"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + For example, the following query finds the 5 items that sold the highest total quantity (using the + <codeph>SUM()</codeph> function), and also counts the number of sales transactions for those items (using the + <codeph>COUNT()</codeph> function). Because the column representing the item IDs is not used in any + aggregation functions, we specify that column in the <codeph>GROUP BY</codeph> clause. 
+ </p> + +<codeblock>select + <b>ss_item_sk</b> as Item, + <b>count</b>(ss_item_sk) as Times_Purchased, + <b>sum</b>(ss_quantity) as Total_Quantity_Purchased +from store_sales + <b>group by ss_item_sk</b> + order by sum(ss_quantity) desc + limit 5; ++-------+-----------------+--------------------------+ +| item | times_purchased | total_quantity_purchased | ++-------+-----------------+--------------------------+ +| 9325 | 372 | 19072 | +| 4279 | 357 | 18501 | +| 7507 | 371 | 18475 | +| 5953 | 369 | 18451 | +| 16753 | 375 | 18446 | ++-------+-----------------+--------------------------+</codeblock> + + <p> + The <codeph>HAVING</codeph> clause lets you filter the results of aggregate functions, because you cannot + refer to those expressions in the <codeph>WHERE</codeph> clause. For example, to find the 5 lowest-selling + items that were included in at least 100 sales transactions, we could use this query: + </p> + +<codeblock>select + <b>ss_item_sk</b> as Item, + <b>count</b>(ss_item_sk) as Times_Purchased, + <b>sum</b>(ss_quantity) as Total_Quantity_Purchased +from store_sales + <b>group by ss_item_sk</b> + <b>having times_purchased >= 100</b> + order by sum(ss_quantity) + limit 5; ++-------+-----------------+--------------------------+ +| item | times_purchased | total_quantity_purchased | ++-------+-----------------+--------------------------+ +| 13943 | 105 | 4087 | +| 2992 | 101 | 4176 | +| 4773 | 107 | 4204 | +| 14350 | 103 | 4260 | +| 11956 | 102 | 4275 | ++-------+-----------------+--------------------------+</codeblock> + + <p> + When performing calculations involving scientific or financial data, remember that columns with type + <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> are stored as true floating-point numbers, which cannot + precisely represent every possible fractional value. 
Thus, if you include a <codeph>FLOAT</codeph> or + <codeph>DOUBLE</codeph> column in a <codeph>GROUP BY</codeph> clause, the results might not precisely match + literal values in your query or from an original Text data file. Use rounding operations, the + <codeph>BETWEEN</codeph> operator, or another arithmetic technique to match floating-point values that are + <q>near</q> literal values you expect. For example, this query on the <codeph>ss_wholesale_cost</codeph> + column returns cost values that are close but not identical to the original figures that were entered as + decimal fractions. + </p> + +<codeblock>select ss_wholesale_cost, avg(ss_quantity * ss_sales_price) as avg_revenue_per_sale + from store_sales + group by ss_wholesale_cost + order by avg_revenue_per_sale desc + limit 5; ++-------------------+----------------------+ +| ss_wholesale_cost | avg_revenue_per_sale | ++-------------------+----------------------+ +| 96.94000244140625 | 4454.351539300434 | +| 95.93000030517578 | 4423.119941283189 | +| 98.37999725341797 | 4332.516490316291 | +| 97.97000122070312 | 4330.480601655014 | +| 98.52999877929688 | 4291.316953108634 | ++-------------------+----------------------+</codeblock> + + <p> + Notice how wholesale cost values originally entered as decimal fractions such as <codeph>96.94</codeph> and + <codeph>98.38</codeph> are slightly larger or smaller in the result set, due to precision limitations in the + hardware floating-point types. The imprecise representation of <codeph>FLOAT</codeph> and + <codeph>DOUBLE</codeph> values is why financial data processing systems often store currency using data types + that are less space-efficient but avoid these types of rounding errors. 
+ </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_select.xml#select"/>, + <xref href="impala_aggregate_functions.xml#aggregate_functions"/> + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_group_concat.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_group_concat.xml b/docs/topics/impala_group_concat.xml new file mode 100644 index 0000000..b2a7ff6 --- /dev/null +++ b/docs/topics/impala_group_concat.xml @@ -0,0 +1,133 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.2" id="group_concat"> + + <title>GROUP_CONCAT Function</title> + <titlealts><navtitle>GROUP_CONCAT</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="Aggregate Functions"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">group_concat() function</indexterm> + An aggregate function that returns a single string produced by concatenating the argument values from each + row of the result set. If the optional separator string is specified, the separator is added between + each pair of concatenated values. The default separator is a comma followed by a space. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<!-- Might allow DISTINCT at some point. Check: does it allow ALL now? --> + +<codeblock>GROUP_CONCAT([ALL] <varname>expression</varname> [, <varname>separator</varname>])</codeblock> + + <p conref="../shared/impala_common.xml#common/concat_blurb"/> + + <p> + By default, returns a single string covering the whole result set. 
To include other columns or values in the + result set, or to produce multiple concatenated strings for subsets of rows, include a <codeph>GROUP + BY</codeph> clause in the query. + </p> + + <p> + <b>Return type:</b> <codeph>STRING</codeph> + </p> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p> + You cannot apply the <codeph>DISTINCT</codeph> operator to the argument of this function. + </p> + + <p conref="../shared/impala_common.xml#common/analytic_not_allowed_caveat"/> + + <p> + Currently, Impala returns an error if the result value grows larger than 1 GiB. + </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following examples illustrate various aspects of the <codeph>GROUP_CONCAT()</codeph> function. + </p> + + <p> + You can call the function directly on a <codeph>STRING</codeph> column. To use it with a numeric column, cast + the value to <codeph>STRING</codeph>. + </p> + +<codeblock>[localhost:21000] > create table t1 (x int, s string); +[localhost:21000] > insert into t1 values (1, "one"), (3, "three"), (2, "two"), (1, "one"); +[localhost:21000] > select group_concat(s) from t1; ++----------------------+ +| group_concat(s) | ++----------------------+ +| one, three, two, one | ++----------------------+ +[localhost:21000] > select group_concat(cast(x as string)) from t1; ++---------------------------------+ +| group_concat(cast(x as string)) | ++---------------------------------+ +| 1, 3, 2, 1 | ++---------------------------------+ +</codeblock> + + <p> + The optional separator lets you format the result in flexible ways. The separator can be an arbitrary string + expression, not just a single character. 
+ </p> + +<codeblock>[localhost:21000] > select group_concat(s,"|") from t1; ++----------------------+ +| group_concat(s, '|') | ++----------------------+ +| one|three|two|one | ++----------------------+ +[localhost:21000] > select group_concat(s,'---') from t1; ++-------------------------+ +| group_concat(s, '---') | ++-------------------------+ +| one---three---two---one | ++-------------------------+ +</codeblock> + + <p> + The default separator is a comma followed by a space. To get a comma-delimited result without extra spaces, + specify a delimiter character that is only a comma. + </p> + +<codeblock>[localhost:21000] > select group_concat(s,',') from t1; ++----------------------+ +| group_concat(s, ',') | ++----------------------+ +| one,three,two,one | ++----------------------+ +</codeblock> + + <p> + Including a <codeph>GROUP BY</codeph> clause lets you produce a different concatenated result for each group + in the result set. In this example, the only <codeph>X</codeph> value that occurs more than once is + <codeph>1</codeph>, so that is the only row in the result set where <codeph>GROUP_CONCAT()</codeph> returns a + delimited value. For groups containing a single value, <codeph>GROUP_CONCAT()</codeph> returns the original + value of its <codeph>STRING</codeph> argument. 
+ </p> + +<codeblock>[localhost:21000] > select x, group_concat(s) from t1 group by x; ++---+-----------------+ +| x | group_concat(s) | ++---+-----------------+ +| 2 | two | +| 3 | three | +| 1 | one, one | ++---+-----------------+ +</codeblock> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_having.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_having.xml b/docs/topics/impala_having.xml new file mode 100644 index 0000000..064a4a8 --- /dev/null +++ b/docs/topics/impala_having.xml @@ -0,0 +1,42 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="having"> + + <title>HAVING Clause</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Aggregate Functions"/> + </metadata> + </prolog> + + <conbody> + + <p> + Performs a filter operation on a <codeph>SELECT</codeph> query, by examining the results of aggregation + functions rather than testing each individual table row. Therefore, it is always used in conjunction with a + function such as <codeph><xref href="impala_count.xml#count">COUNT()</xref></codeph>, + <codeph><xref href="impala_sum.xml#sum">SUM()</xref></codeph>, + <codeph><xref href="impala_avg.xml#avg">AVG()</xref></codeph>, + <codeph><xref href="impala_min.xml#min">MIN()</xref></codeph>, or + <codeph><xref href="impala_max.xml#max">MAX()</xref></codeph>, and typically with the + <codeph><xref href="impala_group_by.xml#group_by">GROUP BY</xref></codeph> clause also. + </p> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p rev="2.0.0"> + The filter expression in the <codeph>HAVING</codeph> clause cannot include a scalar subquery. 
+ </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_select.xml#select"/>, + <xref href="impala_group_by.xml#group_by"/>, + <xref href="impala_aggregate_functions.xml#aggregate_functions"/> + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_hbase_cache_blocks.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_hbase_cache_blocks.xml b/docs/topics/impala_hbase_cache_blocks.xml new file mode 100644 index 0000000..d42cbf6 --- /dev/null +++ b/docs/topics/impala_hbase_cache_blocks.xml @@ -0,0 +1,34 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="hbase_cache_blocks"> + + <title>HBASE_CACHE_BLOCKS Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="HBase"/> + <data name="Category" value="Impala Query Options"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">HBASE_CACHE_BLOCKS query option</indexterm> + Setting this option is equivalent to calling the <codeph>setCacheBlocks</codeph> method of the class + <xref href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" scope="external" format="html">org.apache.hadoop.hbase.client.Scan</xref>, + in an HBase Java application. Helps to control the memory pressure on the HBase region server, in conjunction + with the <codeph>HBASE_CACHING</codeph> query option. 
+ </p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false_0"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_hbase.xml#impala_hbase"/>, + <xref href="impala_hbase_caching.xml#hbase_caching"/> + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_hbase_caching.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_hbase_caching.xml b/docs/topics/impala_hbase_caching.xml new file mode 100644 index 0000000..e543792 --- /dev/null +++ b/docs/topics/impala_hbase_caching.xml @@ -0,0 +1,39 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="hbase_caching"> + + <title>HBASE_CACHING Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="HBase"/> + <data name="Category" value="Impala Query Options"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">HBASE_CACHING query option</indexterm> + Setting this option is equivalent to calling the <codeph>setCaching</codeph> method of the class + <xref href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" scope="external" format="html">org.apache.hadoop.hbase.client.Scan</xref>, + in an HBase Java application. Helps to control the memory pressure on the HBase region server, in conjunction + with the <codeph>HBASE_CACHE_BLOCKS</codeph> query option. 
+ </p>
+
+ <p>
+ <b>Type:</b> integer
+ </p>
+
+ <p>
+ <b>Default:</b> 0
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+ <p>
+ <xref href="impala_hbase.xml#impala_hbase"/>,
+ <xref href="impala_hbase_cache_blocks.xml#hbase_cache_blocks"/>
+ </p>
+
+ </conbody>
+</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_hints.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_hints.xml b/docs/topics/impala_hints.xml new file mode 100644 index 0000000..429fb19 --- /dev/null +++ b/docs/topics/impala_hints.xml @@ -0,0 +1,247 @@ +<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="hints">
+
+ <title>Query Hints in Impala SELECT Statements</title>
+ <titlealts><navtitle>Hints</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Querying"/>
+ <data name="Category" value="Performance"/>
+ <data name="Category" value="Troubleshooting"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">hints</indexterm>
+ The Impala SQL dialect supports query hints, for fine-tuning the inner workings of queries. Specify hints as
+ a temporary workaround for expensive queries, where missing statistics or other factors cause inefficient
+ performance.
+ </p>
+
+ <p>
+ Hints are most often used for the most resource-intensive kinds of Impala queries:
+ </p>
+
+ <ul>
+ <li>
+ Join queries involving large tables, where intermediate result sets are transmitted across the network to
+ evaluate the join conditions.
+ </li>
+
+ <li>
+ Inserting into partitioned Parquet tables, where many memory buffers could be allocated on each host to
+ hold intermediate results for each partition. 
+ </li> + </ul> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p> + You can represent the hints as keywords surrounded by <codeph>[]</codeph> square brackets; include the + brackets in the text of the SQL statement. + </p> + +<codeblock>SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM +<varname>join_left_hand_table</varname> + JOIN [{BROADCAST|SHUFFLE}] +<varname>join_right_hand_table</varname> +<varname>remainder_of_query</varname>; + +INSERT <varname>insert_clauses</varname> + [{SHUFFLE|NOSHUFFLE}] + SELECT <varname>remainder_of_query</varname>; +</codeblock> + + <p rev="2.0.0"> + In Impala 2.0 and higher, or CDH 5.2 and higher, you can also specify the hints inside comments that use + either the <codeph>/* */</codeph> or <codeph>--</codeph> notation. Specify a <codeph>+</codeph> symbol + immediately before the hint name. + </p> + +<codeblock rev="2.0.0">SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM +<varname>join_left_hand_table</varname> + JOIN /* +BROADCAST|SHUFFLE */ +<varname>join_right_hand_table</varname> +<varname>remainder_of_query</varname>; + +SELECT <varname>select_list</varname> FROM +<varname>join_left_hand_table</varname> + JOIN -- +BROADCAST|SHUFFLE +<varname>join_right_hand_table</varname> +<varname>remainder_of_query</varname>; + +INSERT <varname>insert_clauses</varname> + /* +SHUFFLE|NOSHUFFLE */ + SELECT <varname>remainder_of_query</varname>; + +INSERT <varname>insert_clauses</varname> + -- +SHUFFLE|NOSHUFFLE + SELECT <varname>remainder_of_query</varname>; +</codeblock> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + With both forms of hint syntax, include the <codeph>STRAIGHT_JOIN</codeph> + keyword immediately after the <codeph>SELECT</codeph> keyword to prevent Impala from + reordering the tables in a way that makes the hint ineffective. 
+ </p> + + <p> + To reduce the need to use hints, run the <codeph>COMPUTE STATS</codeph> statement against all tables involved + in joins, or used as the source tables for <codeph>INSERT ... SELECT</codeph> operations where the + destination is a partitioned Parquet table. Do this operation after loading data or making substantial + changes to the data within each table. Having up-to-date statistics helps Impala choose more efficient query + plans without the need for hinting. See <xref href="impala_perf_stats.xml#perf_stats"/> for details and + examples. + </p> + + <p> + To see which join strategy is used for a particular query, examine the <codeph>EXPLAIN</codeph> output for + that query. See <xref href="impala_explain_plan.xml#perf_explain"/> for details and examples. + </p> + + <p> + <b>Hints for join queries:</b> + </p> + + <p> + The <codeph>[BROADCAST]</codeph> and <codeph>[SHUFFLE]</codeph> hints control the execution strategy for join + queries. Specify one of the following constructs immediately after the <codeph>JOIN</codeph> keyword in a + query: + </p> + + <ul> + <li> + <codeph>[SHUFFLE]</codeph> - Makes that join operation use the <q>partitioned</q> technique, which divides + up corresponding rows from both tables using a hashing algorithm, sending subsets of the rows to other + nodes for processing. (The keyword <codeph>SHUFFLE</codeph> is used to indicate a <q>partitioned join</q>, + because that type of join is not related to <q>partitioned tables</q>.) Since the alternative + <q>broadcast</q> join mechanism is the default when table and index statistics are unavailable, you might + use this hint for queries where broadcast joins are unsuitable; typically, partitioned joins are more + efficient for joins between large tables of similar size. 
+ </li> + + <li> + <codeph>[BROADCAST]</codeph> - Makes that join operation use the <q>broadcast</q> technique that sends the + entire contents of the right-hand table to all nodes involved in processing the join. This is the default + mode of operation when table and index statistics are unavailable, so you would typically only need it if + stale metadata caused Impala to mistakenly choose a partitioned join operation. Typically, broadcast joins + are more efficient in cases where one table is much smaller than the other. (Put the smaller table on the + right side of the <codeph>JOIN</codeph> operator.) + </li> + </ul> + + <p> + <b>Hints for INSERT ... SELECT queries:</b> + </p> + + <p conref="../shared/impala_common.xml#common/insert_hints"/> + + <p> + <b>Suggestions versus directives:</b> + </p> + + <p> + In early Impala releases, hints were always obeyed and so acted more like directives. Once Impala gained join + order optimizations, sometimes join queries were automatically reordered in a way that made a hint + irrelevant. Therefore, the hints act more like suggestions in Impala 1.2.2 and higher. + </p> + + <p> + To force Impala to follow the hinted execution mechanism for a join query, include the + <codeph>STRAIGHT_JOIN</codeph> keyword in the <codeph>SELECT</codeph> statement. See + <xref href="impala_perf_joins.xml#straight_join"/> for details. When you use this technique, Impala does not + reorder the joined tables at all, so you must be careful to arrange the join order to put the largest table + (or subquery result set) first, then the smallest, second smallest, third smallest, and so on. This ordering lets Impala do the + most I/O-intensive parts of the query using local reads on the data nodes, and then reduce the size of the + intermediate result set as much as possible as each subsequent table or subquery result set is joined. 
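+ </p>
+
+ <p>
+ For example, the following sketch (with hypothetical table names) arranges the tables from largest to
+ smallest, as described above:
+ </p>
+
+<codeblock>select straight_join h.cust_id, c.name, r.region_name
+  from huge_fact_table h
+  join [shuffle] medium_dim_table c on h.cust_id = c.cust_id
+  join [broadcast] tiny_lookup_table r on c.region_id = r.region_id;
+</codeblock>
+
+ <p>
+ Because of the <codeph>STRAIGHT_JOIN</codeph> keyword, the joins are performed in exactly this order.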
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
+
+ <p>
+ Queries that include subqueries in the <codeph>WHERE</codeph> clause can be rewritten internally as join
+ queries. Currently, you cannot apply hints to the joins produced by these types of queries.
+ </p>
+
+ <p>
+ Because hints can prevent queries from taking advantage of new metadata or improvements in query planning,
+ use them only when required to work around performance issues, and be prepared to remove them when they are
+ no longer required, such as after a new Impala release or bug fix.
+ </p>
+
+ <p>
+ In particular, the <codeph>[BROADCAST]</codeph> and <codeph>[SHUFFLE]</codeph> hints are expected to be
+ needed much less frequently in Impala 1.2.2 and higher, because the join order optimization feature in
+ combination with the <codeph>COMPUTE STATS</codeph> statement now automatically chooses the join order and
+ join mechanism without the need to rewrite the query and add hints. See
+ <xref href="impala_perf_joins.xml#perf_joins"/> for details.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
+
+ <p rev="2.0.0">
+ The hints embedded within <codeph>--</codeph> comments are compatible with Hive queries. The hints embedded
+ within <codeph>/* */</codeph> comments or <codeph>[ ]</codeph> square brackets are not recognized by, or not
+ compatible with, Hive. For example, Hive raises an error for Impala hints within <codeph>/* */</codeph>
+ comments because it does not recognize the Impala hint names.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/view_blurb"/>
+
+ <p rev="2.0.0">
+ If you use a hint in the query that defines a view, the hint is preserved when you query the view. Impala
+ internally rewrites all hints in views to use the <codeph>--</codeph> comment notation, so that Hive can
+ query such views without errors due to unrecognized hint names. 
+ </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + For example, this query joins a large customer table with a small lookup table of less than 100 rows. The + right-hand table can be broadcast efficiently to all nodes involved in the join. Thus, you would use the + <codeph>[broadcast]</codeph> hint to force a broadcast join strategy: + </p> + +<codeblock>select straight_join customer.address, state_lookup.state_name + from customer join <b>[broadcast]</b> state_lookup + on customer.state_id = state_lookup.state_id;</codeblock> + + <p> + This query joins two large tables of unpredictable size. You might benchmark the query with both kinds of + hints and find that it is more efficient to transmit portions of each table to other nodes for processing. + Thus, you would use the <codeph>[shuffle]</codeph> hint to force a partitioned join strategy: + </p> + +<codeblock>select straight_join weather.wind_velocity, geospatial.altitude + from weather join <b>[shuffle]</b> geospatial + on weather.lat = geospatial.lat and weather.long = geospatial.long;</codeblock> + + <p> + For joins involving three or more tables, the hint applies to the tables on either side of that specific + <codeph>JOIN</codeph> keyword. The <codeph>STRAIGHT_JOIN</codeph> keyword ensures that joins are processed + in a predictable order from left to right. For example, this query joins + <codeph>t1</codeph> and <codeph>t2</codeph> using a partitioned join, then joins that result set to + <codeph>t3</codeph> using a broadcast join: + </p> + +<codeblock>select straight_join t1.name, t2.id, t3.price + from t1 join <b>[shuffle]</b> t2 join <b>[broadcast]</b> t3 + on t1.id = t2.id and t2.id = t3.id;</codeblock> + + <draft-comment translate="no"> This is a good place to add more sample output showing before and after EXPLAIN plans. 
</draft-comment> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + For more background information about join queries, see <xref href="impala_joins.xml#joins"/>. For + performance considerations, see <xref href="impala_perf_joins.xml#perf_joins"/>. + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_identifiers.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_identifiers.xml b/docs/topics/impala_identifiers.xml new file mode 100644 index 0000000..55477ed --- /dev/null +++ b/docs/topics/impala_identifiers.xml @@ -0,0 +1,114 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="identifiers"> + + <title>Overview of Impala Identifiers</title> + <titlealts><navtitle>Identifiers</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Databases"/> + <data name="Category" value="Tables"/> + </metadata> + </prolog> + + <conbody> + + <p> + Identifiers are the names of databases, tables, or columns that you specify in a SQL statement. The rules for + identifiers govern what names you can give to things you create, the notation for referring to names + containing unusual characters, and other aspects such as case sensitivity. + </p> + + <ul> + <li> + <p> + The minimum length of an identifier is 1 character. + </p> + </li> + + <li> + <p> + The maximum length of an identifier is currently 128 characters, enforced by the metastore database. + </p> + </li> + + <li> + <p> + An identifier must start with an alphabetic character. The remainder can contain any combination of + alphanumeric characters and underscores. 
Quoting the identifier with backticks has no effect on the allowed + characters in the name. + </p> + </li> + + <li> + <p> + An identifier can contain only ASCII characters. + </p> + </li> + + <li> + <p> + To use an identifier name that matches one of the Impala reserved keywords (listed in + <xref href="impala_reserved_words.xml#reserved_words"/>), surround the identifier with <codeph>``</codeph> + characters (backticks). Quote the reserved word even if it is part of a fully qualified name. + The following example shows how a reserved word can be used as a column name if it is quoted + with backticks in the <codeph>CREATE TABLE</codeph> statement, and how the column name + must also be quoted with backticks in a query: + </p> +<codeblock>[localhost:21000] > create table reserved (`data` string); + +[localhost:21000] > select data from reserved; +ERROR: AnalysisException: Syntax error in line 1: +select data from reserved + ^ +Encountered: DATA +Expected: ALL, CASE, CAST, DISTINCT, EXISTS, FALSE, IF, INTERVAL, NOT, NULL, STRAIGHT_JOIN, TRUE, IDENTIFIER +CAUSED BY: Exception: Syntax error + +[localhost:21000] > select reserved.data from reserved; +ERROR: AnalysisException: Syntax error in line 1: +select reserved.data from reserved + ^ +Encountered: DATA +Expected: IDENTIFIER +CAUSED BY: Exception: Syntax error + +[localhost:21000] > select reserved.`data` from reserved; + +[localhost:21000] > +</codeblock> + + <note type="important"> + Because the list of reserved words grows over time as new SQL syntax is added, + consider adopting coding conventions (especially for any automated scripts + or in packaged applications) to always quote all identifiers with backticks. + Quoting all identifiers protects your SQL from compatibility issues if + new reserved words are added in later releases. + </note> + + </li> + + <li> + <p> + Impala identifiers are always case-insensitive. 
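+ </p>
+
+ <p>
+ For example, in a hypothetical session:
+ </p>
+
+<codeblock>[localhost:21000] > create table MixedCase (ID int);
+[localhost:21000] > select id from mixedcase;
+[localhost:21000] > describe MIXEDCASE;
+</codeblock>
+
+ <p>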
That is, tables named <codeph>t1</codeph> and + <codeph>T1</codeph> always refer to the same table, regardless of quote characters. Internally, Impala + always folds all specified table and column names to lowercase. This is why the column headers in query + output are always displayed in lowercase. + </p> + </li> + </ul> + + <p> + See <xref href="impala_aliases.xml#aliases"/> for how to define shorter or easier-to-remember aliases if the + original names are long or cryptic identifiers. + <ph conref="../shared/impala_common.xml#common/aliases_vs_identifiers"/> + </p> + + <p conref="../shared/impala_common.xml#common/views_vs_identifiers"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_insert.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_insert.xml b/docs/topics/impala_insert.xml new file mode 100644 index 0000000..6d0f68b --- /dev/null +++ b/docs/topics/impala_insert.xml @@ -0,0 +1,676 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="insert"> + + <title>INSERT Statement</title> + <titlealts><navtitle>INSERT</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="ETL"/> + <data name="Category" value="Ingest"/> + <data name="Category" value="DML"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Tables"/> + <data audience="impala_next" name="Category" value="Kudu"/> + <!-- This is such an important statement, think if there are more applicable categories. 
--> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">INSERT statement</indexterm> + Impala supports inserting into tables and partitions that you create with the Impala <codeph>CREATE + TABLE</codeph> statement, or pre-defined tables and partitions created through Hive. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>[<varname>with_clause</varname>] +INSERT { INTO | OVERWRITE } [TABLE] <varname>table_name</varname> + [(<varname>column_list</varname>)] + [ PARTITION (<varname>partition_clause</varname>)] +{ + [<varname>hint_clause</varname>] <varname>select_statement</varname> + | VALUES (<varname>value</varname> [, <varname>value</varname> ...]) [, (<varname>value</varname> [, <varname>value</varname> ...]) ...] +} + +partition_clause ::= <varname>col_name</varname> [= <varname>constant</varname>] [, <varname>col_name</varname> [= <varname>constant</varname>] ...] + +hint_clause ::= [SHUFFLE] | [NOSHUFFLE] (Note: the square brackets are part of the syntax.) +</codeblock> + + <p> + <b>Appending or replacing (INTO and OVERWRITE clauses):</b> + </p> + + <p> + The <codeph>INSERT INTO</codeph> syntax appends data to a table. The existing data files are left as-is, and + the inserted data is put into one or more new data files. + </p> + + <p> + The <codeph>INSERT OVERWRITE</codeph> syntax replaces the data in a table. +<!-- What happens with INSERT OVERWRITE if the target is a single partition or multiple partitions? --> +<!-- If that gets too detailed, cover later under "Partitioning Considerations". --> + Currently, the overwritten data files are deleted immediately; they do not go through the HDFS trash + mechanism. 
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+ <p rev="2.3.0">
+ The <codeph>INSERT</codeph> statement currently does not support writing data files
+ containing complex types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>).
+ To prepare Parquet data for such tables, you generate the data files outside Impala and then
+ use <codeph>LOAD DATA</codeph> or <codeph>CREATE EXTERNAL TABLE</codeph> to associate those
+ data files with the table. Currently, such tables must use the Parquet file format.
+ See <xref href="impala_complex_types.xml#complex_types"/> for details about working with complex types.
+ </p>
+
+ <p rev="kudu" audience="impala_next">
+ <b>Ignoring duplicate primary keys for Kudu tables (IGNORE clause)</b>
+ </p>
+
+ <p rev="kudu" audience="impala_next">
+ Normally, an <codeph>INSERT</codeph> operation into a Kudu table fails if
+ it would result in duplicate primary key values for any rows.
+ Specify <codeph>INSERT IGNORE <varname>rest_of_statement</varname></codeph> to
+ make the <codeph>INSERT</codeph> continue in this case. The rows that would
+ have duplicate primary key values are not inserted.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+ <p>
+ Impala currently supports:
+ </p>
+
+ <ul>
+ <li>
+ Copying data from another table using a <codeph>SELECT</codeph> query. In Impala 1.2.1 and higher, you can
+ combine <codeph>CREATE TABLE</codeph> and <codeph>INSERT</codeph> operations into a single step with the
+ <codeph>CREATE TABLE AS SELECT</codeph> syntax, which bypasses the actual <codeph>INSERT</codeph> keyword.
+ </li>
+
+ <li>
+ An optional <xref href="impala_with.xml#with"><codeph>WITH</codeph> clause</xref> before the
+ <codeph>INSERT</codeph> keyword, to define a subquery referenced in the <codeph>SELECT</codeph> portion.
+ </li>
+
+ <li>
+ Creating one or more new rows using constant expressions through the <codeph>VALUES</codeph> clause. 
(The + <codeph>VALUES</codeph> clause was added in Impala 1.0.1.) + </li> + + <li rev="1.1"> + <p> + By default, the first column of each newly inserted row goes into the first column of the table, the + second column into the second column, and so on. + </p> + <p> + You can also specify the columns to be inserted, an arbitrarily ordered subset of the columns in the + destination table, by specifying a column list immediately after the name of the destination table. This + feature lets you adjust the inserted columns to match the layout of a <codeph>SELECT</codeph> statement, + rather than the other way around. (This feature was added in Impala 1.1.) + </p> + <p> + The number of columns mentioned in the column list (known as the <q>column permutation</q>) must match + the number of columns in the <codeph>SELECT</codeph> list or the <codeph>VALUES</codeph> tuples. The + order of columns in the column permutation can be different than in the underlying table, and the columns + of each input row are reordered to match. If the number of columns in the column permutation is less than + in the destination table, all unmentioned columns are set to <codeph>NULL</codeph>. + </p> + </li> + + <li> + <p> + For a partitioned table, the optional <codeph>PARTITION</codeph> clause identifies which partition or + partitions the new values go into. If a partition key column is given a constant value such as + <codeph>PARTITION (year=2012)</codeph> or <codeph>PARTITION (year=2012, month=2)</codeph>, all the + inserted rows use those same values for those partition key columns and you omit any corresponding + columns in the source table from the <codeph>SELECT</codeph> list. This form is known as <q>static + partitioning</q>. 
+ </p>
+ <p>
+ If a partition key column is mentioned but not assigned a value, such as in <codeph>PARTITION (year,
+ region)</codeph> (both columns unassigned) or <codeph>PARTITION(year, region='CA')</codeph>
+ (<codeph>year</codeph> column unassigned), the unassigned columns are filled in with the final columns of
+ the <codeph>SELECT</codeph> list. In this case, the number of columns in the <codeph>SELECT</codeph> list
+ must equal the number of columns in the column permutation plus the number of partition key columns not
+ assigned a constant value. This form is known as <q>dynamic partitioning</q>.
+ </p>
+ <p>
+ See <xref href="impala_partitioning.xml#partition_static_dynamic"/> for examples and performance
+ characteristics of static and dynamic partitioned inserts.
+ </p>
+ </li>
+
+ <li rev="1.2.2">
+ An optional hint clause immediately before the <codeph>SELECT</codeph> keyword, to fine-tune the behavior
+ when doing an <codeph>INSERT ... SELECT</codeph> operation into partitioned Parquet tables. The hint
+ keywords are <codeph>[SHUFFLE]</codeph> and <codeph>[NOSHUFFLE]</codeph>, including the square brackets.
+ Inserting into partitioned Parquet tables can be a resource-intensive operation because it potentially
+ involves many files being written to HDFS simultaneously, and separate
+ <ph rev="parquet_block_size">large</ph> memory buffers being allocated to buffer the data for each
+ partition. For usage details, see <xref href="impala_parquet.xml#parquet_etl"/>.
+ </li>
+ </ul>
+
+ <note>
+ <ul>
+ <li>
+ <codeph>INSERT</codeph> statements that create new partitions or add files result in changes to Hive
+ metadata. Because Impala uses Hive metadata, such changes may necessitate a metadata refresh. For more
+ information, see the <xref href="impala_refresh.xml#refresh" format="dita">REFRESH</xref> statement.
+ </li>
+
+ <li>
+ Currently, Impala can only insert data into tables that use the text and Parquet formats. 
For other file + formats, insert the data using Hive and use Impala to query it. + </li> + + <li> + As an alternative to the <codeph>INSERT</codeph> statement, if you have existing data files elsewhere in + HDFS, the <codeph>LOAD DATA</codeph> statement can move those files into a table. This statement works + with tables of any file format. + </li> + </ul> + </note> + + <p conref="../shared/impala_common.xml#common/dml_blurb"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + When you insert the results of an expression, particularly of a built-in function call, into a small numeric + column such as <codeph>INT</codeph>, <codeph>SMALLINT</codeph>, <codeph>TINYINT</codeph>, or + <codeph>FLOAT</codeph>, you might need to use a <codeph>CAST()</codeph> expression to coerce values into the + appropriate type. Impala does not automatically convert from a larger type to a smaller one. For example, to + insert cosine values into a <codeph>FLOAT</codeph> column, write <codeph>CAST(COS(angle) AS FLOAT)</codeph> + in the <codeph>INSERT</codeph> statement to make the conversion explicit. 
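+ </p>
+
+ <p>
+ For example, the following sketch (with hypothetical table and column names) shows the explicit
+ <codeph>CAST()</codeph> described above:
+ </p>
+
+<codeblock>create table angles (id int, angle double);
+create table cosines (id int, cosine float);
+-- COS() returns DOUBLE, so the value must be explicitly narrowed to FLOAT.
+insert into cosines select id, cast(cos(angle) as float) from angles;
+</codeblock>
+
+ <p>
+ Without the <codeph>CAST()</codeph>, the statement fails because Impala does not implicitly convert
+ <codeph>DOUBLE</codeph> values to the smaller <codeph>FLOAT</codeph> type.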
+ </p> + + <p conref="../shared/impala_common.xml#common/insert_parquet_blocksize"/> + + <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/> + + <note conref="../shared/impala_common.xml#common/compute_stats_next"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example sets up new tables with the same definition as the <codeph>TAB1</codeph> table from the + <xref href="impala_tutorial.xml#tutorial" format="dita">Tutorial</xref> section, using different file + formats, and demonstrates inserting data into the tables created with the <codeph>STORED AS TEXTFILE</codeph> + and <codeph>STORED AS PARQUET</codeph> clauses: + </p> + +<codeblock>CREATE DATABASE IF NOT EXISTS file_formats; +USE file_formats; + +DROP TABLE IF EXISTS text_table; +CREATE TABLE text_table +( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) +STORED AS TEXTFILE; + +DROP TABLE IF EXISTS parquet_table; +CREATE TABLE parquet_table +( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) +STORED AS PARQUET;</codeblock> + + <p> + With the <codeph>INSERT INTO TABLE</codeph> syntax, each new set of inserted rows is appended to any existing + data in the table. This is how you would record small amounts of data that arrive continuously, or ingest new + batches of data alongside the existing data. For example, after running 2 <codeph>INSERT INTO TABLE</codeph> + statements with 5 rows each, the table contains 10 rows total: + </p> + +<codeblock>[localhost:21000] > insert into table text_table select * from default.tab1; +Inserted 5 rows in 0.41s + +[localhost:21000] > insert into table text_table select * from default.tab1; +Inserted 5 rows in 0.46s + +[localhost:21000] > select count(*) from text_table; ++----------+ +| count(*) | ++----------+ +| 10 | ++----------+ +Returned 1 row(s) in 0.26s</codeblock> + + <p> + With the <codeph>INSERT OVERWRITE TABLE</codeph> syntax, each new set of inserted rows replaces any existing + data in the table. 
This is how you load data to query in a data warehousing scenario where you analyze just + the data for a particular day, quarter, and so on, discarding the previous data each time. You might keep the + entire set of data in one raw table, and transfer and transform certain rows into a more compact and + efficient form to perform intensive analysis on that subset. + </p> + + <p> + For example, here we insert 5 rows into a table using the <codeph>INSERT INTO</codeph> clause, then replace + the data by inserting 3 rows with the <codeph>INSERT OVERWRITE</codeph> clause. Afterward, the table only + contains the 3 rows from the final <codeph>INSERT</codeph> statement. + </p> + +<codeblock>[localhost:21000] > insert into table parquet_table select * from default.tab1; +Inserted 5 rows in 0.35s + +[localhost:21000] > insert overwrite table parquet_table select * from default.tab1 limit 3; +Inserted 3 rows in 0.43s +[localhost:21000] > select count(*) from parquet_table; ++----------+ +| count(*) | ++----------+ +| 3 | ++----------+ +Returned 1 row(s) in 0.43s</codeblock> + + <p> + The <codeph><xref href="impala_insert.xml#values">VALUES</xref></codeph> clause lets you insert one or more + rows by specifying constant values for all the columns. The number, types, and order of the expressions must + match the table definition. + </p> + + <note id="insert_values_warning"> + The <codeph>INSERT ... VALUES</codeph> technique is not suitable for loading large quantities of data into + HDFS-based tables, because the insert operations cannot be parallelized, and each one produces a separate + data file. Use it for setting up small dimension tables or tiny amounts of data for experimenting with SQL + syntax, or with HBase tables. Do not use it for large ETL jobs or benchmark tests for load operations. Do not + run scripts with thousands of <codeph>INSERT ... VALUES</codeph> statements that insert a single row each + time. If you do run <codeph>INSERT ... 
VALUES</codeph> operations to load data into a staging table as one + stage in an ETL pipeline, include multiple row values if possible within each <codeph>VALUES</codeph> clause, + and use a separate database to make cleanup easier if the operation does produce many tiny files. + </note> + + <p> + The following example shows how to insert one row or multiple rows, with expressions of different types, + using literal values, expressions, and function return values: + </p> + +<codeblock>create table val_test_1 (c1 int, c2 float, c3 string, c4 boolean, c5 timestamp); +insert into val_test_1 values (100, 99.9/10, 'abc', true, now()); +create table val_test_2 (id int, token string); +insert overwrite val_test_2 values (1, 'a'), (2, 'b'), (-1,'xyzzy');</codeblock> + + <p> + These examples show the type of <q>not implemented</q> error that you see when attempting to insert data into + a table with a file format that Impala currently does not write to: + </p> + +<codeblock>DROP TABLE IF EXISTS sequence_table; +CREATE TABLE sequence_table +( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) +STORED AS SEQUENCEFILE; + +DROP TABLE IF EXISTS rc_table; +CREATE TABLE rc_table +( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) +STORED AS RCFILE; + +[localhost:21000] > insert into table rc_table select * from default.tab1; +Remote error +Backend 0:RC_FILE not implemented. + +[localhost:21000] > insert into table sequence_table select * from default.tab1; +Remote error +Backend 0:SEQUENCE_FILE not implemented. </codeblock> + + <p> + Inserting data into partitioned tables requires slightly different syntax that divides the partitioning + columns from the others: + </p> + +<codeblock>create table t1 (i int) <b>partitioned by (x int, y string)</b>; +-- Select an INT column from another table. +-- All inserted rows will have the same x and y values, as specified in the INSERT statement. 
+-- This technique of specifying all the partition key values is known as static partitioning. +insert into t1 <b>partition(x=10, y='a')</b> select c1 from some_other_table; +-- Select two INT columns from another table. +-- All inserted rows will have the same y value, as specified in the INSERT statement. +-- Values from c2 go into t1.x. +-- Any partitioning columns whose value is not specified are filled in +-- from the columns specified last in the SELECT list. +-- This technique of omitting some partition key values is known as dynamic partitioning. +insert into t1 <b>partition(x, y='b')</b> select c1, c2 from some_other_table; +-- Select an INT and a STRING column from another table. +-- All inserted rows will have the same x value, as specified in the INSERT statement. +-- Values from c3 go into t1.y. +insert into t1 <b>partition(x=20, y)</b> select c1, c3 from some_other_table;</codeblock> + + <p rev="1.1"> + The following examples show how you can copy the data in all the columns from one table to another, copy the + data from only some columns, or specify the columns in the select list in a different order than they + actually appear in the table: + </p> + +<codeblock>-- Start with 2 identical tables. +create table t1 (c1 int, c2 int); +create table t2 like t1; + +-- If there is no () part after the destination table name, +-- all columns must be specified, either as * or by name. +insert into t2 select * from t1; +insert into t2 select c1, c2 from t1; + +-- With the () notation following the destination table name, +-- you can omit columns (all values for that column are NULL +-- in the destination table), and/or reorder the values +-- selected from the source table. This is the "column permutation" feature. +insert into t2 (c1) select c1 from t1; +insert into t2 (c2, c1) select c1, c2 from t1; + +-- The column names can be entirely different in the source and destination tables. 
+-- You can copy any columns, not just the corresponding ones, from the source table.
+-- But the number and type of selected columns must match the columns mentioned in the () part.
+alter table t2 replace columns (x int, y int);
+insert into t2 (y) select c1 from t1;
+
+-- For partitioned tables, all the partitioning columns must be mentioned in the () column list
+-- or a PARTITION clause; these columns cannot be defaulted to NULL.
+create table pt1 (x int, y int) partitioned by (z int);
+-- The values from c1 are copied into the column x in the new table,
+-- all in the same partition based on a constant value for z.
+-- The values of y in the new table are all NULL.
+insert into pt1 (x) partition (z=5) select c1 from t1;
+-- Again we omit the values for column y so they are all NULL.
+-- The inserted x values can go into different partitions, based on
+-- the different values (here taken from t2.y) inserted into the partitioning column z.
+insert into pt1 (x,z) select x, y from t2;
+</codeblock>
+
+    <p>
+      <codeph>SELECT *</codeph> for a partitioned table requires that all partition key columns in the source table
+      be declared as the last columns in the <codeph>CREATE TABLE</codeph> statement. You still include a
+      <codeph>PARTITION</codeph> clause listing all the partition key columns. These partition columns are
+      automatically mapped to the last columns from the <codeph>SELECT *</codeph> list.
+    </p>
+
+<codeblock>create table source (x int, y int, year int, month int, day int);
+create table destination (x int, y int) partitioned by (year int, month int, day int);
+...load some data into the unpartitioned source table...
+-- Insert a single partition of data.
+-- The SELECT * means you cannot specify partition (year=2014, month, day).
+insert overwrite destination partition (year, month, day) select * from source where year=2014;
+-- Insert the data for all year/month/day combinations.
+insert overwrite destination partition (year, month, day) select * from source;
+
+-- If one of the partition columns is omitted from the source table,
+-- then you can specify a constant value for that column in the PARTITION clause.
+-- Here the source table holds only data from 2014, and so does not include a year column.
+create table source_2014 (x int, y int, month int, day int);
+...load some data into the unpartitioned source_2014 table...
+insert overwrite destination partition (year=2014, month, day) select * from source_2014;
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/insert_sort_blurb"/>
+
+    <p>
+      <b>Concurrency considerations:</b> Each <codeph>INSERT</codeph> operation creates new data files with unique
+      names, so you can run multiple <codeph>INSERT INTO</codeph> statements simultaneously without filename
+      conflicts.
+<!--
+If data is inserted into a table by a statement issued to a different
+<cmdname>impalad</cmdname> node,
+issue a <codeph>REFRESH <varname>table_name</varname></codeph>
+statement to make the node you are connected to aware of this new data.
+-->
+      While data is being inserted into an Impala table, the data is staged temporarily in a subdirectory inside
+      the data directory; during this period, you cannot issue queries against that table in Hive. If an
+      <codeph>INSERT</codeph> operation fails, the temporary data file and the subdirectory could be left behind in
+      the data directory. If so, remove the relevant subdirectory and any data files it contains manually, by
+      issuing an <codeph>hdfs dfs -rm -r</codeph> command, specifying the full path of the work subdirectory, whose
+      name ends in <codeph>_dir</codeph>.
+    </p>
+  </conbody>
+
+  <concept id="values">
+
+    <title>VALUES Clause</title>
+
+    <conbody>
+
+      <p>
+        The <codeph>VALUES</codeph> clause is a general-purpose way to specify the columns of one or more rows,
+        typically within an <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph> statement.
+ </p> + + <note conref="../shared/impala_common.xml#common/insert_values_warning"> + <p/> + </note> + + <p> + The following examples illustrate: + </p> + + <ul> + <li> + How to insert a single row using a <codeph>VALUES</codeph> clause. + </li> + + <li> + How to insert multiple rows using a <codeph>VALUES</codeph> clause. + </li> + + <li> + How the row or rows from a <codeph>VALUES</codeph> clause can be appended to a table through + <codeph>INSERT INTO</codeph>, or replace the contents of the table through <codeph>INSERT + OVERWRITE</codeph>. + </li> + + <li> + How the entries in a <codeph>VALUES</codeph> clause can be literals, function results, or any other kind + of expression. See <xref href="impala_literals.xml#literals"/> for the notation to use for literal + values, especially <xref href="impala_literals.xml#string_literals"/> for quoting and escaping + conventions for strings. See <xref href="impala_operators.xml#operators"/> and + <xref href="impala_functions.xml#builtins"/> for other things you can include in expressions with the + <codeph>VALUES</codeph> clause. + </li> + </ul> + +<codeblock>[localhost:21000] > describe val_example; +Query: describe val_example +Query finished, fetching results ... 
++-------+---------+---------+ +| name | type | comment | ++-------+---------+---------+ +| id | int | | +| col_1 | boolean | | +| col_2 | double | | ++-------+---------+---------+ + +[localhost:21000] > insert into val_example values (1,true,100.0); +Inserted 1 rows in 0.30s +[localhost:21000] > select * from val_example; ++----+-------+-------+ +| id | col_1 | col_2 | ++----+-------+-------+ +| 1 | true | 100 | ++----+-------+-------+ + +[localhost:21000] > insert overwrite val_example values (10,false,pow(2,5)), (50,true,10/3); +Inserted 2 rows in 0.16s +[localhost:21000] > select * from val_example; ++----+-------+-------------------+ +| id | col_1 | col_2 | ++----+-------+-------------------+ +| 10 | false | 32 | +| 50 | true | 3.333333333333333 | ++----+-------+-------------------+</codeblock> + + <p> + When used in an <codeph>INSERT</codeph> statement, the Impala <codeph>VALUES</codeph> clause can specify + some or all of the columns in the destination table, and the columns can be specified in a different order + than they actually appear in the table. To specify a different set or order of columns than in the table, + use the syntax: + </p> + +<codeblock>INSERT INTO <varname>destination</varname> + (<varname>col_x</varname>, <varname>col_y</varname>, <varname>col_z</varname>) + VALUES + (<varname>val_x</varname>, <varname>val_y</varname>, <varname>val_z</varname>); +</codeblock> + + <p> + Any columns in the table that are not listed in the <codeph>INSERT</codeph> statement are set to + <codeph>NULL</codeph>. + </p> + +<!-- + <p> + does not support specifying a subset of the + columns in the table or specifying the columns in a different order. Use a + <codeph>VALUES</codeph> clause with all the column values in the same order as + the table definition, using <codeph>NULL</codeph> values for any columns you + want to omit from the <codeph>INSERT</codeph> operation. 
+ </p> +--> + + <p> + To use a <codeph>VALUES</codeph> clause like a table in other statements, wrap it in parentheses and use + <codeph>AS</codeph> clauses to specify aliases for the entire object and any columns you need to refer to: + </p> + +<codeblock>[localhost:21000] > select * from (values(4,5,6),(7,8,9)) as t; ++---+---+---+ +| 4 | 5 | 6 | ++---+---+---+ +| 4 | 5 | 6 | +| 7 | 8 | 9 | ++---+---+---+ +[localhost:21000] > select * from (values(1 as c1, true as c2, 'abc' as c3),(100,false,'xyz')) as t; ++-----+-------+-----+ +| c1 | c2 | c3 | ++-----+-------+-----+ +| 1 | true | abc | +| 100 | false | xyz | ++-----+-------+-----+</codeblock> + + <p> + For example, you might use a tiny table constructed like this from constant literals or function return + values as part of a longer statement involving joins or <codeph>UNION ALL</codeph>. + </p> + + <p conref="../shared/impala_common.xml#common/hdfs_blurb"/> + + <p> + Impala physically writes all inserted files under the ownership of its default user, typically + <codeph>impala</codeph>. Therefore, this user must have HDFS write permission in the corresponding table + directory. + </p> + + <p> + The permission requirement is independent of the authorization performed by the Sentry framework. (If the + connected user is not authorized to insert into a table, Sentry blocks that operation immediately, + regardless of the privileges available to the <codeph>impala</codeph> user.) Files created by Impala are + not owned by and do not inherit permissions from the connected user. + </p> + + <p> + The number of data files produced by an <codeph>INSERT</codeph> statement depends on the size of the + cluster, the number of data blocks that are processed, the partition key columns in a partitioned table, + and the mechanism Impala uses for dividing the work in parallel. Do not assume that an + <codeph>INSERT</codeph> statement will produce some particular number of output files. 
If you encounter
+      performance problems with data written by Impala, check whether the output files suffer from conditions such
+      as many tiny files or many tiny partitions. (In the Hadoop context, even files or partitions of a few tens
+      of megabytes are considered <q>tiny</q>.)
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/insert_hidden_work_directory"/>
+
+    <p conref="../shared/impala_common.xml#common/hbase_blurb"/>
+
+    <p>
+      You can use the <codeph>INSERT</codeph> statement with HBase tables as follows:
+    </p>
+
+    <ul>
+      <li>
+        <p>
+          You can insert a single row or a small set of rows into an HBase table with the <codeph>INSERT ...
+          VALUES</codeph> syntax. This is a good use case for HBase tables with Impala, because HBase tables are
+          not subject to the same kind of fragmentation from many small insert operations as HDFS tables are.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          You can insert any number of rows at once into an HBase table using the <codeph>INSERT ...
+          SELECT</codeph> syntax.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          If more than one inserted row has the same value for the HBase key column, only the last inserted row
+          with that value is visible to Impala queries. You can take advantage of this fact with <codeph>INSERT
+          ... VALUES</codeph> statements to effectively update rows one at a time, by inserting new rows with the
+          same key values as existing rows. Be aware that after an <codeph>INSERT ... SELECT</codeph> operation
+          copying from an HDFS table, the HBase table might contain fewer rows than were inserted, if the key
+          column in the source table contained duplicate values.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          You cannot <codeph>INSERT OVERWRITE</codeph> into an HBase table. New rows are always appended.
+ </p> + </li> + + <li> + <p> + When you create an Impala or Hive table that maps to an HBase table, the column order you specify with + the <codeph>INSERT</codeph> statement might be different than the order you declare with the + <codeph>CREATE TABLE</codeph> statement. Behind the scenes, HBase arranges the columns based on how + they are divided into column families. This might cause a mismatch during insert operations, especially + if you use the syntax <codeph>INSERT INTO <varname>hbase_table</varname> SELECT * FROM + <varname>hdfs_table</varname></codeph>. Before inserting data, verify the column order by issuing a + <codeph>DESCRIBE</codeph> statement for the table, and adjust the order of the select list in the + <codeph>INSERT</codeph> statement. + </p> + </li> + </ul> + + <p> + See <xref href="impala_hbase.xml#impala_hbase"/> for more details about using Impala with HBase. + </p> + + <p conref="../shared/impala_common.xml#common/s3_blurb"/> + <p conref="../shared/impala_common.xml#common/s3_dml"/> + + <p conref="../shared/impala_common.xml#common/security_blurb"/> + <p conref="../shared/impala_common.xml#common/redaction_yes"/> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_yes"/> + + <p conref="../shared/impala_common.xml#common/permissions_blurb"/> + <p rev="CDH-19187"> + The user ID that the <cmdname>impalad</cmdname> daemon runs under, + typically the <codeph>impala</codeph> user, must have read + permission for the files in the source directory of an <codeph>INSERT ... SELECT</codeph> + operation, and write permission for all affected directories in the destination table. + (An <codeph>INSERT</codeph> operation could write files to multiple different HDFS directories + if the destination table is partitioned.) + This user must also have write permission to create a temporary work directory + in the top-level HDFS directory of the destination table. 
+ An <codeph>INSERT OVERWRITE</codeph> operation does not require write permission on + the original data files in the table, only on the table directories themselves. + </p> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p conref="../shared/impala_common.xml#common/char_varchar_cast_from_string"/> + + <p conref="../shared/impala_common.xml#common/related_options"/> + + <p rev="1.3.1" conref="../shared/impala_common.xml#common/insert_inherit_permissions"/> + </conbody> + </concept> + +<!-- Values clause --> +</concept> +<!-- INSERT statement --> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_int.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_int.xml b/docs/topics/impala_int.xml new file mode 100644 index 0000000..514d377 --- /dev/null +++ b/docs/topics/impala_int.xml @@ -0,0 +1,95 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="int"> + + <title>INT Data Type</title> + <titlealts><navtitle>INT</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Data Types"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Schemas"/> + </metadata> + </prolog> + + <conbody> + + <p> + A 4-byte integer data type used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p> + In the column definition of a <codeph>CREATE TABLE</codeph> statement: + </p> + +<codeblock><varname>column_name</varname> INT</codeblock> + + <p> + <b>Range:</b> -2147483648 .. 2147483647. There is no <codeph>UNSIGNED</codeph> subtype. 
+  </p>
+
+  <p>
+    <b>Conversions:</b> Impala automatically converts <codeph>INT</codeph> values to a larger integer type
+    (<codeph>BIGINT</codeph>) or a floating-point type (<codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph>). Use
+    <codeph>CAST()</codeph> to convert to <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>,
+    <codeph>STRING</codeph>, or <codeph>TIMESTAMP</codeph>.
+    <ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/>
+  </p>
+
+  <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+  <p>
+    The data type <codeph>INTEGER</codeph> is an alias for <codeph>INT</codeph>.
+  </p>
+
+  <p>
+    For a convenient and automated way to check the bounds of the <codeph>INT</codeph> type, call the functions
+    <codeph>MIN_INT()</codeph> and <codeph>MAX_INT()</codeph>.
+  </p>
+
+  <p>
+    If an integer value is too large to be represented as an <codeph>INT</codeph>, use a <codeph>BIGINT</codeph>
+    instead.
+  </p>
+
+  <p conref="../shared/impala_common.xml#common/null_bad_numeric_cast"/>
+
+  <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+<codeblock>CREATE TABLE t1 (x INT);
+SELECT CAST(1000 AS INT);
+</codeblock>
+
+  <p conref="../shared/impala_common.xml#common/partitioning_good"/>
+
+  <p conref="../shared/impala_common.xml#common/hbase_ok"/>
+
+  <p conref="../shared/impala_common.xml#common/parquet_blurb"/>
+
+  <p conref="../shared/impala_common.xml#common/text_bulky"/>
+
+<!-- <p conref="/Content/impala_common_xi44078.xml#common/compatibility_blurb"/> -->
+
+  <p conref="../shared/impala_common.xml#common/internals_4_bytes"/>
+
+  <p conref="../shared/impala_common.xml#common/added_forever"/>
+
+  <p conref="../shared/impala_common.xml#common/column_stats_constant"/>
+
+<!-- <p conref="/Content/impala_common_xi44078.xml#common/restrictions_blurb"/> -->
+
+  <p conref="../shared/impala_common.xml#common/related_info"/>
+
+  <p>
+    <xref href="impala_literals.xml#numeric_literals"/>, <xref href="impala_tinyint.xml#tinyint"/>,
+    <xref
href="impala_smallint.xml#smallint"/>, <xref href="impala_int.xml#int"/>, + <xref href="impala_bigint.xml#bigint"/>, <xref href="impala_decimal.xml#decimal"/>, + <xref href="impala_math_functions.xml#math_functions"/> + </p> + </conbody> +</concept>
