http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/shared/impala_common.xml ---------------------------------------------------------------------- diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml new file mode 100644 index 0000000..37ebc34 --- /dev/null +++ b/docs/shared/impala_common.xml @@ -0,0 +1,2477 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept xmlns:ditaarch="http://dita.oasis-open.org/architecture/2005/" id="common" ditaarch:DITAArchVersion="1.2" domains="(topic concept) (topic hi-d) (topic ut-d) (topic indexing-d) (topic hazard-d) (topic abbrev-d) (topic pr-d) (topic sw-d) (topic ui-d) " xml:lang="en-US"> + + <title>Reusable Text, Paragraphs, List Items, and Other Elements for Impala</title> + + <conbody> + + <p> + All the elements in this file with IDs are intended to be conref'ed elsewhere. Practically all of the + conref'ed elements for the Impala docs are in this file, to avoid questions of when it's safe to remove or + move something in any of the 'main' files, and avoid having to change and conref references as a result. + </p> + + <p> + This file defines some dummy subheadings as section elements, just for self-documentation. Using sections + instead of nested concepts lets all the conref links point to a very simple name pattern, + '#common/id_within_the_file', rather than a 3-part reference with an intervening, variable concept ID. + </p> + + <section id="sentry"> + + <title>Sentry-Related Content</title> + + <p> + Material related to Sentry security, intended to be reused between Hive and Impala. Complicated by the fact + that most of it will probably be multi-paragraph or involve subheads, might need to be represented as + nested topics at the end of this file. + </p> + + <note id="authentication_vs_authorization"> + Regardless of the authentication mechanism used, Impala always creates HDFS directories and data files + owned by the same user (typically <codeph>impala</codeph>). To implement user-level access to different + databases, tables, columns, partitions, and so on, use the Sentry authorization feature, as explained in + <xref href="../topics/impala_authorization.xml#authorization"/>. + </note> + +<!-- Contrived nesting needed to allow <ph> with ID to be reused inside the <title> of a conref. --> + + <p> + <b><ph id="title_sentry_debug">Debugging Failed Sentry Authorization Requests</ph></b> + </p> + + <p id="sentry_debug"> + Sentry logs all facts that lead up to authorization decisions at the debug level. If you do not understand + why Sentry is denying access, the best way to debug is to temporarily turn on debug logging: + <ul> + <li> + In Cloudera Manager, add <codeph>log4j.logger.org.apache.sentry=DEBUG</codeph> to the logging settings + for your service through the corresponding <uicontrol>Logging Safety Valve</uicontrol> field for the + Impala, Hive Server 2, or Solr Server services. + </li> + + <li> + On systems not managed by Cloudera Manager, add <codeph>log4j.logger.org.apache.sentry=DEBUG</codeph> + to the <filepath>log4j.properties</filepath> file on each host in the cluster, in the appropriate + configuration directory for each service. + </li> + </ul> + Specifically, look for exceptions and messages such as: +<codeblock xml:space="preserve">FilePermission server..., RequestPermission server...., result [true|false]</codeblock> + which indicate each evaluation Sentry makes. 
The <codeph>FilePermission</codeph> is from the policy file,
+ while <codeph>RequestPermission</codeph> is the privilege required for the query. A
+ <codeph>RequestPermission</codeph> will iterate over all appropriate <codeph>FilePermission</codeph>
+ settings until a match is found. If no matching privilege is found, Sentry returns <codeph>false</codeph>
+ indicating <q>Access Denied</q>.
+<!--
+[1]
+Impala: Impala Daemon -> Advanced -> Impala Daemon Logging Safety Valve
+Hive: Hive Server 2 -> Advanced -> HiveServer2 Logging Safety Valve
+Search: Solr Server -> Advanced -> Solr Server Logging Safety Valve
+-->
+ </p>
+
+ </section>
+
+ <section id="cm">
+
+ <title>Cloudera Manager Terminology</title>
+
+ <p>
+ Especially during the transition from CM 4 to CM 5, we'll use some stock phraseology to talk about fields
+ and such.
+ </p>
+
+ <p>
+ <ph id="safety_valve"> In Cloudera Manager 4, these fields are labeled <uicontrol>Safety
+ Valve</uicontrol>; in Cloudera Manager 5, they are called <uicontrol>Advanced Configuration
+ Snippet</uicontrol>. </ph>
+ </p>
+
+ </section>
+
+ <section id="citi">
+
+ <title>Items from the Citibank Escalation Spreadsheet</title>
+
+ <p>
+ Paragraphs with IDs are intended to be reused both in the FAQ and the User's Guide. They refer to feature
+ requests or misunderstandings encountered by Citibank, captured in the escalation spreadsheet here:
+ <xref href="https://docs.google.com/a/cloudera.com/spreadsheet/ccc?key=0AplfwQJKyyTWdFdhY0E5WHVwNXZSTG9sMEZwQy1QZ1E&usp=drive_web#gid=0" scope="external" format="html"/>.
+ </p>
+
+ <p id="string_concatenation">
+ With Impala, you use the built-in <codeph>CONCAT()</codeph> function to concatenate two, three, or more
+ strings:
+<codeblock xml:space="preserve">select concat('some prefix: ', col1) from t1;
+select concat('abc','mno','xyz');</codeblock>
+ Impala does not currently support operators for string concatenation, such as <codeph>||</codeph> as seen
+ in some other database systems.
+ </p>
+
+ <p id="column_aliases">
+ You can specify column aliases with or without the <codeph>AS</codeph> keyword, and with no quotation
+ marks, single quotation marks, or double quotation marks. Some kind of quotation mark is required if the
+ column alias contains any spaces or other problematic characters. The alias text is displayed in the
+ <cmdname>impala-shell</cmdname> output as all-lowercase. For example:
+<codeblock xml:space="preserve">[localhost:21000] > select c1 First_Column from t;
+[localhost:21000] > select c1 as First_Column from t;
++--------------+
+| first_column |
++--------------+
+...
+
+[localhost:21000] > select c1 'First Column' from t;
+[localhost:21000] > select c1 as 'First Column' from t;
++--------------+
+| first column |
++--------------+
+...
+
+[localhost:21000] > select c1 "First Column" from t;
+[localhost:21000] > select c1 as "First Column" from t;
++--------------+
+| first column |
++--------------+
+...</codeblock>
+ </p>
+
+ <p id="temp_tables">
+ Currently, Impala does not support temporary tables. Some other database systems have a class of
+ <q>lightweight</q> tables that are held only in memory and/or that are only accessible by one connection
+ and disappear when the session ends.
In Impala, creating new databases is a relatively lightweight
+ operation. As an alternative, you can create a database with a unique name and use <codeph>CREATE
+ TABLE LIKE</codeph>, <codeph>CREATE TABLE AS SELECT</codeph>, and <codeph>INSERT</codeph> statements to
+ create a table in that database to hold the result set of a query, for use in subsequent queries. When
+ finished, issue a <codeph>DROP TABLE</codeph> statement followed by <codeph>DROP DATABASE</codeph>.
+ </p>
+
+ </section>
+
+ <section id="standards">
+
+ <title>Blurbs About Standards Compliance</title>
+
+ <p>
+ The following blurbs simplify the process of flagging which SQL standard various features were first
+ introduced in. The wording and the tagging can be modified by editing one central instance of each blurb.
+ Not extensively used yet, just here and there in the SQL Language Reference section.
+ </p>
+
+ <p id="sql1986">
+<!-- No Wikipedia page for SQL-1986, so no link. -->
+ <b>Standards compliance:</b> Introduced in SQL-1986.
+ </p>
+
+ <p id="sql1989">
+<!-- No Wikipedia page for SQL-1989, so no link. -->
+ <b>Standards compliance:</b> Introduced in SQL-1989.
+ </p>
+
+ <p id="sql1992">
+ <b>Standards compliance:</b> Introduced in
+ <xref href="http://en.wikipedia.org/wiki/SQL-92" scope="external" format="html">SQL-1992</xref>.
+ </p>
+
+ <p id="sql1999">
+ <b>Standards compliance:</b> Introduced in
+ <xref href="http://en.wikipedia.org/wiki/SQL:1999" scope="external" format="html">SQL:1999</xref>.
+ </p>
+
+ <p id="sql2003">
+ <b>Standards compliance:</b> Introduced in
+ <xref href="http://en.wikipedia.org/wiki/SQL:2003" scope="external" format="html">SQL:2003</xref>.
+ </p>
+
+ <p id="sql2008">
+ <b>Standards compliance:</b> Introduced in
+ <xref href="http://en.wikipedia.org/wiki/SQL:2008" scope="external" format="html">SQL:2008</xref>.
+ </p>
+
+ <p id="sql2011">
+ <b>Standards compliance:</b> Introduced in
+ <xref href="http://en.wikipedia.org/wiki/SQL:2011" scope="external" format="html">SQL:2011</xref>.
+ </p>
+
+ <p id="hiveql">
+ <b>Standards compliance:</b> Extension first introduced in HiveQL.
+ </p>
+
+ <p id="impalaql">
+ <b>Standards compliance:</b> Extension first introduced in Impala.
+ </p>
+
+ </section>
+
+ <section id="refresh_invalidate">
+
+ <title>Background Info for REFRESH, INVALIDATE METADATA, and General Metadata Discussion</title>
+
+ <p id="refresh_vs_invalidate">
+ <codeph>INVALIDATE METADATA</codeph> and <codeph>REFRESH</codeph> are counterparts: <codeph>INVALIDATE
+ METADATA</codeph> waits to reload the metadata when needed for a subsequent query, but reloads all the
+ metadata for the table, which can be an expensive operation, especially for large tables with many
+ partitions. <codeph>REFRESH</codeph> reloads the metadata immediately, but only loads the block location
+ data for newly added data files, making it a less expensive operation overall. If data was altered in some
+ more extensive way, such as being reorganized by the HDFS balancer, use <codeph>INVALIDATE
+ METADATA</codeph> to avoid a performance penalty from reduced local reads. If you have used Impala 1.0,
+ the <codeph>INVALIDATE METADATA</codeph> statement works just like the Impala 1.0 <codeph>REFRESH</codeph>
+ statement did, while the Impala 1.1 <codeph>REFRESH</codeph> is optimized for the common use case of adding
+ new data files to an existing table; thus, the table name argument is now required.
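+ For example (a sketch using hypothetical tables <codeph>T1</codeph> and <codeph>T2</codeph>), you might
+ use the two statements as follows:
+<codeblock xml:space="preserve">-- New data files were added to the directory for T1;
+-- reload just the new block location data.
+refresh t1;
+
+-- Data for T2 was reorganized more extensively, for example
+-- by the HDFS balancer; reload all metadata on next use.
+invalidate metadata t2;</codeblock>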
+ </p> + + </section> + + <section id="kudu"> + + <title>Kudu Snippets</title> + + <p> + If any advice, background info, or warnings are needed in multiple + places for interaction of Impala with Kudu, put them under here. + </p> + + </section> + + <section id="sql_ref"> + + <title>SQL Language Reference Snippets</title> + + <p> + These reusable chunks were taken from conrefs originally in <filepath>ciiu_langref_sql.xml</filepath>. Or + they are primarily used in new SQL syntax topics underneath that parent topic. + </p> + + <p id="live_reporting_details"> + The output from this query option is printed to standard error. The output is only displayed in interactive mode, + that is, not when the <codeph>-q</codeph> or <codeph>-f</codeph> options are used. + </p> + + <p id="live_progress_live_summary_asciinema"> + To see how the <codeph>LIVE_PROGRESS</codeph> and <codeph>LIVE_SUMMARY</codeph> query options + work in real time, see <xref href="https://asciinema.org/a/1rv7qippo0fe7h5k1b6k4nexk" scope="external" format="html">this animated demo</xref>. + </p> + + <p rev="2.3.0" id="impala_shell_progress_reports_compute_stats_caveat"> + The <codeph>LIVE_PROGRESS</codeph> and <codeph>LIVE_SUMMARY</codeph> query options + currently do not produce any output during <codeph>COMPUTE STATS</codeph> operations. + </p> + +<!-- This is a shorter version of the similar 'caveat' text. This shorter one can be reused more easily in various places. --> + <p rev="2.3.0" id="impala_shell_progress_reports_shell_only_blurb"> + The <codeph>LIVE_PROGRESS</codeph> and <codeph>LIVE_SUMMARY</codeph> query options only apply + inside the <cmdname>impala-shell</cmdname> interpreter. You cannot use them with the + <codeph>SET</codeph> statement from a JDBC or ODBC application. + </p> + + <p id="impala_shell_progress_reports_shell_only_caveat"> + Because the <codeph>LIVE_PROGRESS</codeph> and <codeph>LIVE_SUMMARY</codeph> query options + are available only within the <cmdname>impala-shell</cmdname> interpreter: + <ul> + <li> + <p> + You cannot change these query options through the SQL <codeph>SET</codeph> + statement using the JDBC or ODBC interfaces. The <codeph>SET</codeph> + command in <cmdname>impala-shell</cmdname> recognizes these names as + shell-only options. + </p> + </li> + <li> + <p> + Be careful when using <cmdname>impala-shell</cmdname> on a pre-CDH 5.5 + system to connect to Impala running on a CDH 5.5 or higher system. + The older <cmdname>impala-shell</cmdname> does not recognize these + query option names. Upgrade <cmdname>impala-shell</cmdname> on the + systems where you intend to use these query options. + </p> + </li> + <li> + <p> + Likewise, the <cmdname>impala-shell</cmdname> command relies on + some information only available in Impala 2.3 / CDH 5.5 and higher + to prepare live progress reports and query summaries. The + <codeph>LIVE_PROGRESS</codeph> and <codeph>LIVE_SUMMARY</codeph> + query options have no effect when <cmdname>impala-shell</cmdname> connects + to a cluster running an older version of Impala. + </p> + </li> + </ul> + </p> + +<!-- Same example used in both CREATE DATABASE and DROP DATABASE. --> +<codeblock id="create_drop_db_example">create database first_db; +use first_db; +create table t1 (x int); + +create database second_db; +use second_db; +-- Each database has its own namespace for tables. +-- You can reuse the same table names in each database. 
+create table t1 (s string);
+
+create database temp;
+
+-- You can either USE a database after creating it,
+-- or qualify all references to the table name with the name of the database.
+-- Here, tables T2 and T3 are both created in the TEMP database.
+
+create table temp.t2 (x int, y int);
+use temp;
+create table t3 (s string);
+
+-- You cannot drop a database while it is selected by the USE statement.
+drop database temp;
+<i>ERROR: AnalysisException: Cannot drop current default database: temp</i>
+
+-- The always-available database 'default' is a convenient one to USE
+-- before dropping a database you created.
+use default;
+
+-- Before dropping a database, first drop all the tables inside it,
+<ph rev="2.3.0">-- or in CDH 5.5 and higher use the CASCADE clause.</ph>
+drop database temp;
+<i>ERROR: ImpalaRuntimeException: Error making 'dropDatabase' RPC to Hive Metastore:
+CAUSED BY: InvalidOperationException: Database temp is not empty</i>
+show tables in temp;
++------+
+| name |
++------+
+| t3   |
++------+
+
+<ph rev="2.3.0">-- CDH 5.5 and higher:</ph>
+<ph rev="2.3.0">drop database temp cascade;</ph>
+
+-- CDH 5.4 and lower:
+drop table temp.t3;
+drop database temp;
+</codeblock>
+
+ <p id="cast_convenience_fn_example">
+ This example shows how to use the <codeph>castto*()</codeph> functions as an equivalent
+ to <codeph>CAST(<varname>value</varname> AS <varname>type</varname>)</codeph> expressions.
+ </p>
+
+ <p id="cast_convenience_fn_usage"><b>Usage notes:</b>
+ A convenience function to skip the SQL <codeph>CAST(<varname>value</varname> AS <varname>type</varname>)</codeph> syntax,
+ for example when programmatically generating SQL statements where a regular function call might be easier to construct.
+ </p>
+
+ <p rev="2.2.0" id="timezone_conversion_caveat">
+ The way this function deals with time zones when converting to or from <codeph>TIMESTAMP</codeph>
+ values is affected by the <codeph>-use_local_tz_for_unix_timestamp_conversions</codeph> startup flag for the
+ <cmdname>impalad</cmdname> daemon. See <xref href="../topics/impala_timestamp.xml#timestamp"/> for details about
+ how Impala handles time zone considerations for the <codeph>TIMESTAMP</codeph> data type.
+ </p>
+
+ <note rev="2.2.0" id="s3_caveat" type="important">
+ <p>
+ Impala query support for Amazon S3 is included in CDH 5.4.0, but is not currently supported or recommended for production use.
+ If you're interested in this feature, try it out in a test environment until the remaining issues and limitations for production readiness are addressed.
+ </p>
+ </note>
+
+ <p rev="2.2.0" id="s3_dml">
+ Currently, Impala cannot insert or load data into a table or partition that resides in the Amazon
+ Simple Storage Service (S3).
+ Bring data into S3 using the normal S3 transfer mechanisms, then use Impala to query the S3 data.
+ See <xref href="../topics/impala_s3.xml#s3"/> for details about using Impala with S3.
+ </p>
+
+ <p rev="2.2.0" id="s3_metadata">
+ The <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements also apply to tables
+ where the data resides in the Amazon Simple Storage Service (S3).
+ In particular, issue a <codeph>REFRESH</codeph> for a table after adding or removing files
+ in the associated S3 data directory.
+ See <xref href="../topics/impala_s3.xml#s3"/> for details about working with S3 tables.
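+ For example, after copying new data files into the S3 directory for a hypothetical
+ table named <codeph>SALES_S3</codeph>, you might issue:
+<codeblock xml:space="preserve">-- Pick up files added to or removed from the S3 data directory.
+refresh sales_s3;</codeblock>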
+ </p> + + <p id="y2k38" rev="2.2.0"> + In Impala 2.2.0 and higher, built-in functions that accept or return integers representing <codeph>TIMESTAMP</codeph> values + use the <codeph>BIGINT</codeph> type for parameters and return values, rather than <codeph>INT</codeph>. + This change lets the date and time functions avoid an overflow error that would otherwise occur + on January 19th, 2038 (known as the + <xref href="http://en.wikipedia.org/wiki/Year_2038_problem" scope="external" format="html"><q>Year 2038 problem</q> or <q>Y2K38 problem</q></xref>). + This change affects the <codeph>from_unixtime()</codeph> and <codeph>unix_timestamp()</codeph> functions. + You might need to change application code that interacts with these functions, change the types of + columns that store the return values, or add <codeph>CAST()</codeph> calls to SQL statements that + call these functions. + </p> + + <p id="timestamp_conversions"> + Impala automatically converts <codeph>STRING</codeph> literals of the correct format into + <codeph>TIMESTAMP</codeph> values. Timestamp values are accepted in the format + <codeph>"yyyy-MM-dd HH:mm:ss.SSSSSS"</codeph>, and can consist of just the date, or just the time, with or + without the fractional second portion. For example, you can specify <codeph>TIMESTAMP</codeph> values such as + <codeph>'1966-07-30'</codeph>, <codeph>'08:30:00'</codeph>, or <codeph>'1985-09-25 17:45:30.005'</codeph>. + <ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/> + </p> + + + <p> + <ph id="cast_int_to_timestamp">Casting an integer or floating-point value <codeph>N</codeph> to + <codeph>TIMESTAMP</codeph> produces a value that is <codeph>N</codeph> seconds past the start of the epoch + date (January 1, 1970). By default, the result value represents a date and time in the UTC time zone. + If the setting <codeph>-use_local_tz_for_unix_timestamp_conversions=true</codeph> is in effect, + the resulting <codeph>TIMESTAMP</codeph> represents a date and time in the local time zone.</ph> + </p> + + <p id="redaction_yes" rev="2.2.0"> + If these statements in your environment contain sensitive literal values such as credit card numbers or tax + identifiers, Impala can redact this sensitive information when displaying the statements in log files and + other administrative contexts. See + <xref audience="integrated" href="../topics/sg_redaction.xml#log_redact"/><xref audience="standalone" href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/sg_redaction.html" scope="external" format="html"/> + for details. + </p> + + <p id="incremental_partition_spec"> + The <codeph>PARTITION</codeph> clause is only allowed in combination with the <codeph>INCREMENTAL</codeph> + clause. It is optional for <codeph>COMPUTE INCREMENTAL STATS</codeph>, and required for <codeph>DROP + INCREMENTAL STATS</codeph>. Whenever you specify partitions through the <codeph>PARTITION + (<varname>partition_spec</varname>)</codeph> clause in a <codeph>COMPUTE INCREMENTAL STATS</codeph> or + <codeph>DROP INCREMENTAL STATS</codeph> statement, you must include all the partitioning columns in the + specification, and specify constant values for all the partition key columns. + </p> + + <p id="udf_persistence_restriction"> + Currently, Impala UDFs and UDAs are not persisted in the metastore database. Information + about these functions is held in the memory of the <cmdname>catalogd</cmdname> daemon. 
You must reload them
+ by running the <codeph>CREATE FUNCTION</codeph> statements again each time you restart the
+ <cmdname>catalogd</cmdname> daemon.
+ </p>
+
+ <note id="add_partition_set_location">
+ If you are creating a partition for the first time and specifying its location, for maximum efficiency, use
+ a single <codeph>ALTER TABLE</codeph> statement including both the <codeph>ADD PARTITION</codeph> and
+ <codeph>LOCATION</codeph> clauses, rather than separate statements with <codeph>ADD PARTITION</codeph> and
+ <codeph>SET LOCATION</codeph> clauses.
+ </note>
+
+ <p id="insert_hidden_work_directory">
+ The <codeph>INSERT</codeph> statement has always left behind a hidden work directory inside the data
+ directory of the table. Formerly, this hidden work directory was named
+ <filepath>.impala_insert_staging</filepath>. In Impala 2.0.1 and later, this directory name is changed to
+ <filepath>_impala_insert_staging</filepath>. (While HDFS tools are expected to treat names beginning
+ with either an underscore or a dot as hidden, in practice names beginning with an underscore are more widely
+ supported.) If you have any scripts, cleanup jobs, and so on that rely on the name of this work directory,
+ adjust them to use the new name.
+ </p>
+
+ <p id="check_internal_external_table">
+ To see whether a table is internal or external, and its associated HDFS location, issue the statement
+ <codeph>DESCRIBE FORMATTED <varname>table_name</varname></codeph>. The <codeph>Table Type</codeph> field
+ displays <codeph>MANAGED_TABLE</codeph> for internal tables and <codeph>EXTERNAL_TABLE</codeph> for
+ external tables. The <codeph>Location</codeph> field displays the path of the table directory as an HDFS
+ URI.
+ </p>
+
+ <p id="switch_internal_external_table">
+ You can switch a table from internal to external, or from external to internal, by using the <codeph>ALTER
+ TABLE</codeph> statement:
+<codeblock xml:space="preserve">
+-- Switch a table from internal to external.
+ALTER TABLE <varname>table_name</varname> SET TBLPROPERTIES('EXTERNAL'='TRUE');
+
+-- Switch a table from external to internal.
+ALTER TABLE <varname>table_name</varname> SET TBLPROPERTIES('EXTERNAL'='FALSE');
+</codeblock>
+ </p>
+
+<!-- The data to show sensible output from these queries is in the TPC-DS schema 'CUSTOMER' table.
+ If you want to show real output, add a LIMIT 5 or similar clause to each query to avoid
+ too-long output. -->
+
+<codeblock id="regexp_rlike_examples" xml:space="preserve">-- Find all customers whose first name starts with 'J', followed by 0 or more of any character.
+select c_first_name, c_last_name from customer where c_first_name regexp '^J.*';
+select c_first_name, c_last_name from customer where c_first_name rlike '^J.*';
+
+-- Find 'Macdonald', where the first 'a' is optional and the 'D' can be upper- or lowercase.
+-- The ^...$ are required, to match the start and end of the value.
+select c_first_name, c_last_name from customer where c_last_name regexp '^Ma?c[Dd]onald$';
+select c_first_name, c_last_name from customer where c_last_name rlike '^Ma?c[Dd]onald$';
+
+-- Match multiple character sequences, either 'Mac' or 'Mc'.
+select c_first_name, c_last_name from customer where c_last_name regexp '^(Mac|Mc)donald$';
+select c_first_name, c_last_name from customer where c_last_name rlike '^(Mac|Mc)donald$';
+
+-- Find names starting with 'S', then one or more vowels, then 'r', then any other characters.
+-- Matches 'Searcy', 'Sorenson', 'Sauer'.
+select c_first_name, c_last_name from customer where c_last_name regexp '^S[aeiou]+r.*$';
+select c_first_name, c_last_name from customer where c_last_name rlike '^S[aeiou]+r.*$';
+
+-- Find names that end with 2 or more vowels: letters from the set a,e,i,o,u.
+select c_first_name, c_last_name from customer where c_last_name regexp '.*[aeiou]{2,}$';
+select c_first_name, c_last_name from customer where c_last_name rlike '.*[aeiou]{2,}$';
+
+-- You can use letter ranges in the [] blocks, for example to find names starting with A, B, or C.
+select c_first_name, c_last_name from customer where c_last_name regexp '^[A-C].*';
+select c_first_name, c_last_name from customer where c_last_name rlike '^[A-C].*';
+
+-- If you are not sure about case, leading/trailing spaces, and so on, you can process the
+-- column using string functions first.
+select c_first_name, c_last_name from customer where lower(trim(c_last_name)) regexp '^de.*';
+select c_first_name, c_last_name from customer where lower(trim(c_last_name)) rlike '^de.*';
+</codeblock>
+
+ <p id="show_security">
+ When authorization is enabled, the output of the <codeph>SHOW</codeph> statement is limited to those
+ objects for which you have some privilege. There might be other databases, tables, and so on, but their
+ names are concealed. If you believe an object exists but you cannot see it in the <codeph>SHOW</codeph>
+ output, check with the system administrator about whether you need to be granted a new privilege for that object. See
+ <xref href="../topics/impala_authorization.xml#authorization"/> for how to set up authorization and add
+ privileges for specific kinds of objects.
+ </p>
+
+ <p rev="2.0.0" id="user_kerberized">
+ In Impala 2.0 and later, <codeph>user()</codeph> returns the full Kerberos principal string, such as
+ <codeph>[email protected]</codeph>, in a Kerberized environment.
+ </p>
+
+ <ul>
+ <li id="grant_revoke_single">
+ Currently, each Impala <codeph>GRANT</codeph> or <codeph>REVOKE</codeph> statement can only grant or
+ revoke a single privilege to or from a single role.
+ </li>
+ </ul>
+
+ <p id="blobs_are_strings">
+ All data in <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> columns must be in a character encoding that
+ is compatible with UTF-8. If you have binary data from another database system (that is, a BLOB type), use
+ a <codeph>STRING</codeph> column to hold it.
+ </p>
+
+<!-- The codeblock is nested inside this paragraph, so the intro text
+ and the code get conref'ed as a unit. -->
+
+ <p id="create_drop_view_examples">
+ The following example creates a series of views and then drops them. These examples illustrate how views
+ are associated with a particular database, and how the view definitions and the view names for
+ <codeph>CREATE VIEW</codeph> and <codeph>DROP VIEW</codeph> can refer either to a view in the current
+ database or to a fully qualified view name.
+<codeblock xml:space="preserve">
+-- Create and drop a view in the current database.
+CREATE VIEW few_rows_from_t1 AS SELECT * FROM t1 LIMIT 10;
+DROP VIEW few_rows_from_t1;
+
+-- Create and drop a view referencing a table in a different database.
+CREATE VIEW table_from_other_db AS SELECT x FROM db1.foo WHERE x IS NOT NULL;
+DROP VIEW table_from_other_db;
+
+USE db1;
+-- Create a view in a different database.
+CREATE VIEW db2.v1 AS SELECT * FROM db2.foo;
+-- Switch into the other database and drop the view.
+USE db2;
+DROP VIEW v1;
+
+USE db1;
+-- Create a view in a different database.
+CREATE VIEW db2.v1 AS SELECT * FROM db2.foo;
+-- Drop a view in the other database.
+DROP VIEW db2.v1;
+</codeblock>
+ </p>
+
+ <p id="char_varchar_cast_from_string">
+ For <codeph>INSERT</codeph> operations into <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> columns, you
+ must cast all <codeph>STRING</codeph> literals or expressions returning <codeph>STRING</codeph> to a
+ <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> type with the appropriate length.
+ </p>
+
+ <p rev="2.0.0" id="subquery_no_limit">
+ Correlated subqueries used in <codeph>EXISTS</codeph> and <codeph>IN</codeph> operators cannot include a
+ <codeph>LIMIT</codeph> clause.
+ </p>
+
+ <p id="avro_no_timestamp">
+ Currently, Avro tables cannot contain <codeph>TIMESTAMP</codeph> columns. If you need to store date and
+ time values in Avro tables, as a workaround you can use a <codeph>STRING</codeph> representation of the
+ values, convert the values to <codeph>BIGINT</codeph> with the <codeph>UNIX_TIMESTAMP()</codeph> function,
+ or create separate numeric columns for individual date and time fields using the <codeph>EXTRACT()</codeph>
+ function.
+ </p>
+
+ <p id="zero_length_strings">
+ <b>Zero-length strings:</b> For purposes of clauses such as <codeph>DISTINCT</codeph> and <codeph>GROUP
+ BY</codeph>, Impala considers zero-length strings (<codeph>""</codeph>), <codeph>NULL</codeph>, and space
+ to all be different values.
+ </p>
+
+ <p id="order_by_scratch_dir">
+ By default, intermediate files used during large sort, join, aggregation, or analytic function operations
+ are stored in the directory <filepath>/tmp/impala-scratch</filepath>. These files are removed when the
+ operation finishes. (Multiple concurrent queries can perform operations that use the <q>spill to disk</q>
+ technique, without any name conflicts for these temporary files.) You can specify a different location by
+ starting the <cmdname>impalad</cmdname> daemon with the
+ <codeph>--scratch_dirs="<varname>path_to_directory</varname>"</codeph> configuration option or the
+ equivalent configuration option in the Cloudera Manager user interface. You can specify a single directory,
+ or a comma-separated list of directories. The scratch directories must be on the local filesystem, not in
+ HDFS. You might specify different directory paths for different hosts, depending on the capacity and speed
+ of the available storage devices. In CDH 5.5 / Impala 2.3 or higher, Impala successfully starts (with a warning
+ written to the log) if it cannot create or read and write files in one of the scratch directories.
+ If there is less than 1 GB free on the filesystem where that directory resides, Impala still runs, but writes a
+ warning message to its log. If Impala encounters an error reading or writing files in a scratch directory during
+ a query, Impala logs the error and the query fails.
+ </p>
+
+ <p id="order_by_view_restriction">
+ An <codeph>ORDER BY</codeph> clause without an additional <codeph>LIMIT</codeph> clause is ignored in any
+ view definition. If you need to sort the entire result set from a view, use an <codeph>ORDER BY</codeph>
+ clause in the <codeph>SELECT</codeph> statement that queries the view.
You can still make a simple <q>top
+ 10</q> report by combining the <codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph> clauses in the same
+ view definition:
+<codeblock xml:space="preserve">[localhost:21000] > create table unsorted (x bigint);
+[localhost:21000] > insert into unsorted values (1), (9), (3), (7), (5), (8), (4), (6), (2);
+[localhost:21000] > create view sorted_view as select x from unsorted order by x;
+[localhost:21000] > select x from sorted_view; -- ORDER BY clause in view has no effect.
++---+
+| x |
++---+
+| 1 |
+| 9 |
+| 3 |
+| 7 |
+| 5 |
+| 8 |
+| 4 |
+| 6 |
+| 2 |
++---+
+[localhost:21000] > select x from sorted_view order by x; -- View query requires ORDER BY at outermost level.
++---+
+| x |
++---+
+| 1 |
+| 2 |
+| 3 |
+| 4 |
+| 5 |
+| 6 |
+| 7 |
+| 8 |
+| 9 |
++---+
+[localhost:21000] > create view top_3_view as select x from unsorted order by x limit 3;
+[localhost:21000] > select x from top_3_view; -- ORDER BY and LIMIT together in view definition are preserved.
++---+
+| x |
++---+
+| 1 |
+| 2 |
+| 3 |
++---+
+</codeblock>
+ </p>
+
+ <p id="precision_scale_example">
+ The following examples demonstrate how to check the precision and scale of numeric literals or other
+ numeric expressions. Impala represents numeric literals in the smallest appropriate type. 5 is a
+ <codeph>TINYINT</codeph> value, which ranges from -128 to 127; therefore, 3 decimal digits are needed to
+ represent the entire range, and because it is an integer value, there are no fractional digits. 1.333 is
+ interpreted as a <codeph>DECIMAL</codeph> value, with 4 digits total and 3 digits after the decimal point.
+<codeblock xml:space="preserve">[localhost:21000] > select precision(5), scale(5);
++--------------+----------+
+| precision(5) | scale(5) |
++--------------+----------+
+| 3            | 0        |
++--------------+----------+
+[localhost:21000] > select precision(1.333), scale(1.333);
++------------------+--------------+
+| precision(1.333) | scale(1.333) |
++------------------+--------------+
+| 4                | 3            |
++------------------+--------------+
+[localhost:21000] > with t1 as
+  ( select cast(12.34 as decimal(20,2)) x union select cast(1 as decimal(8,6)) x )
+  select precision(x), scale(x) from t1 limit 1;
++--------------+----------+
+| precision(x) | scale(x) |
++--------------+----------+
+| 24           | 6        |
++--------------+----------+
+</codeblock>
+ </p>
+
+<!-- These 'type_' entries are for query options, where the type doesn't match up exactly with an Impala data type. -->
+
+ <p id="type_boolean">
+ <b>Type:</b> Boolean; recognized values are 1 and 0, or <codeph>true</codeph> and <codeph>false</codeph>;
+ any other value is interpreted as <codeph>false</codeph>
+ </p>
+
+ <p id="type_string">
+ <b>Type:</b> string
+ </p>
+
+ <p id="default_false">
+ <b>Default:</b> <codeph>false</codeph>
+ </p>
+
+ <p id="default_false_0">
+ <b>Default:</b> <codeph>false</codeph> (shown as 0 in output of <codeph>SET</codeph> statement)
+ </p>
+
+ <p id="odd_return_type_string">
+ Currently, the return value is always a <codeph>STRING</codeph>. The return type is subject to change in
+ future releases. Always use <codeph>CAST()</codeph> to convert the result to whichever data type is
+ appropriate for your computations.
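+ For example (a sketch using <codeph>conv()</codeph>, one function whose result comes back as a
+ <codeph>STRING</codeph>), cast the result before using it in arithmetic:
+<codeblock xml:space="preserve">-- conv() returns the binary representation of 100 as the STRING '1100100'.
+-- Cast it to a numeric type before doing any further computation.
+select cast(conv(100, 10, 2) as bigint) + 1;</codeblock>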
+ </p> + + <p rev="2.0.0" id="former_odd_return_type_string"> + <b>Return type:</b> <codeph>DOUBLE</codeph> in Impala 2.0 and higher; <codeph>STRING</codeph> in earlier + releases + </p> + + <p id="for_compatibility_only"> + <b>Usage notes:</b> Primarily for compatibility with code containing industry extensions to SQL. + </p> + + <p id="return_type_boolean"> + <b>Return type:</b> <codeph>BOOLEAN</codeph> + </p> + + <p id="return_type_double"> + <b>Return type:</b> <codeph>DOUBLE</codeph> + </p> + + <p id="return_type_same"> + <b>Return type:</b> Same as the input value + </p> + + <p id="return_type_same_except_string"> + <b>Return type:</b> Same as the input value, except for <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> + arguments which produce a <codeph>STRING</codeph> result + </p> + + <p id="builtins_db"> + Impala includes another predefined database, <codeph>_impala_builtins</codeph>, that serves as the location + for the <xref href="../topics/impala_functions.xml#builtins">built-in functions</xref>. To see the built-in + functions, use a statement like the following: +<codeblock xml:space="preserve">show functions in _impala_builtins; +show functions in _impala_builtins like '*<varname>substring</varname>*'; +</codeblock> + </p> + + <p id="sum_double"> + Due to the way arithmetic on <codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph> columns uses + high-performance hardware instructions, and distributed queries can perform these operations in different + order for each query, results can vary slightly for aggregate function calls such as <codeph>SUM()</codeph> + and <codeph>AVG()</codeph> for <codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph> columns, particularly on + large data sets where millions or billions of values are summed or averaged. For perfect consistency and + repeatability, use the <codeph>DECIMAL</codeph> data type for such operations instead of + <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph>. + </p> + + <p id="float_double_decimal_caveat"> + The inability to exactly represent certain floating-point values means that + <codeph>DECIMAL</codeph> is sometimes a better choice than <codeph>DOUBLE</codeph> + or <codeph>FLOAT</codeph> when precision is critical, particularly when + transferring data from other database systems that use different representations + or file formats. + </p> + + <p rev="1.4.0" id="decimal_no_stats"> + Currently, the <codeph>COMPUTE STATS</codeph> statement under CDH 4 does not store any statistics for + <codeph>DECIMAL</codeph> columns. When Impala runs under CDH 5, which has better support for + <codeph>DECIMAL</codeph> in the metastore database, <codeph>COMPUTE STATS</codeph> does collect statistics + for <codeph>DECIMAL</codeph> columns and Impala uses the statistics to optimize query performance. + </p> + + <p id="datetime_function_chaining"> + <codeph>unix_timestamp()</codeph> and <codeph>from_unixtime()</codeph> are often used in combination to + convert a <codeph>TIMESTAMP</codeph> value into a particular string format. For example: +<codeblock xml:space="preserve">select from_unixtime(unix_timestamp(now() + interval 3 days), 'yyyy/MM/dd HH:mm'); +</codeblock> + </p> + + <p rev="1.4.0 obwl" id="insert_sort_blurb"> + <b>Sorting considerations:</b> Although you can specify an <codeph>ORDER BY</codeph> clause in an + <codeph>INSERT ... SELECT</codeph> statement, any <codeph>ORDER BY</codeph> clause is ignored and the + results are not necessarily sorted. An <codeph>INSERT ... 
SELECT</codeph> operation potentially creates + many different data files, prepared on different data nodes, and therefore the notion of the data being + stored in sorted order is impractical. + </p> + + <p rev="1.4.0" id="create_table_like_view"> + Prior to Impala 1.4.0, it was not possible to use the <codeph>CREATE TABLE LIKE + <varname>view_name</varname></codeph> syntax. In Impala 1.4.0 and higher, you can create a table with the + same column definitions as a view using the <codeph>CREATE TABLE LIKE</codeph> technique. Although + <codeph>CREATE TABLE LIKE</codeph> normally inherits the file format of the original table, a view has no + underlying file format, so <codeph>CREATE TABLE LIKE <varname>view_name</varname></codeph> produces a text + table by default. To specify a different file format, include a <codeph>STORED AS + <varname>file_format</varname></codeph> clause at the end of the <codeph>CREATE TABLE LIKE</codeph> + statement. + </p> + + <note rev="1.4.0" id="compute_stats_nulls"> + Prior to Impala 1.4.0, <codeph>COMPUTE STATS</codeph> counted the number of <codeph>NULL</codeph> values in + each column and recorded that figure in the metastore database. Because Impala does not currently make use + of the <codeph>NULL</codeph> count during query planning, Impala 1.4.0 and higher speeds up the + <codeph>COMPUTE STATS</codeph> statement by skipping this <codeph>NULL</codeph> counting. + </note> + + <p rev="1.3.1" id="regexp_matching"> + In Impala 1.3.1 and higher, the <codeph>REGEXP</codeph> and <codeph>RLIKE</codeph> operators now match a + regular expression string that occurs anywhere inside the target string, the same as if the regular + expression was enclosed on each side by <codeph>.*</codeph>. See + <xref href="../topics/impala_operators.xml#regexp"/> for examples. Previously, these operators only + succeeded when the regular expression matched the entire target string. This change improves compatibility + with the regular expression support for popular database systems. There is no change to the behavior of the + <codeph>regexp_extract()</codeph> and <codeph>regexp_replace()</codeph> built-in functions. + </p> + + <p rev="1.3.1" id="insert_inherit_permissions"> + By default, if an <codeph>INSERT</codeph> statement creates any new subdirectories underneath a partitioned + table, those subdirectories are assigned default HDFS permissions for the <codeph>impala</codeph> user. To + make each subdirectory have the same permissions as its parent directory in HDFS, specify the + <codeph>--insert_inherit_permissions</codeph> startup option for the <cmdname>impalad</cmdname> daemon. + </p> + + <note id="multiple_count_distinct"> + <p> + By default, Impala only allows a single <codeph>COUNT(DISTINCT <varname>columns</varname>)</codeph> + expression in each query. + </p> + <p> + If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by + specifying <codeph>NDV(<varname>column</varname>)</codeph>; a query can contain multiple instances of + <codeph>NDV(<varname>column</varname>)</codeph>. To make Impala automatically rewrite + <codeph>COUNT(DISTINCT)</codeph> expressions to <codeph>NDV()</codeph>, enable the + <codeph>APPX_COUNT_DISTINCT</codeph> query option. 
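+ For example, the following sketch (assuming a hypothetical table <codeph>T1</codeph> with columns
+ <codeph>COL1</codeph> and <codeph>COL2</codeph>) shows both approaches:
+<codeblock xml:space="preserve">-- Multiple NDV() calls are always allowed in a single query.
+select ndv(col1), ndv(col2) from t1;
+
+-- With APPX_COUNT_DISTINCT enabled, COUNT(DISTINCT) calls are
+-- rewritten to NDV(), so multiple ones can appear in one query.
+set APPX_COUNT_DISTINCT=true;
+select count(distinct col1), count(distinct col2) from t1;</codeblock>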
+ </p> + <p> + To produce the same result as multiple <codeph>COUNT(DISTINCT)</codeph> expressions, you can use the + following technique for queries involving a single table: + </p> +<codeblock xml:space="preserve">select v1.c1 result1, v2.c1 result2 from + (select count(distinct col1) as c1 from t1) v1 + cross join + (select count(distinct col2) as c1 from t1) v2; +</codeblock> + <p> + Because <codeph>CROSS JOIN</codeph> is an expensive operation, prefer to use the <codeph>NDV()</codeph> + technique wherever practical. + </p> + </note> + + <p> + <ph id="union_all_vs_union">Prefer <codeph>UNION ALL</codeph> over <codeph>UNION</codeph> when you know the + data sets are disjoint or duplicate values are not a problem; <codeph>UNION ALL</codeph> is more efficient + because it avoids materializing and sorting the entire result set to eliminate duplicate values.</ph> + </p> + + <note id="thorn"> + The <codeph>CREATE TABLE</codeph> clauses <codeph>FIELDS TERMINATED BY</codeph>, <codeph>ESCAPED + BY</codeph>, and <codeph>LINES TERMINATED BY</codeph> have special rules for the string literal used for + their argument, because they all require a single character. You can use a regular character surrounded by + single or double quotation marks, an octal sequence such as <codeph>'\054'</codeph> (representing a comma), + or an integer in the range '-127'..'128' (with quotation marks but no backslash), which is interpreted as a + single-byte ASCII character. Negative values are subtracted from 256; for example, <codeph>FIELDS + TERMINATED BY '-2'</codeph> sets the field delimiter to ASCII code 254, the <q>Icelandic Thorn</q> + character used as a delimiter by some data formats. + </note> + + <p id="command_line_blurb"> + <b>Command-line equivalent:</b> + </p> + + <p rev="2.3.0" id="complex_types_blurb"> + <b>Complex type considerations:</b> + </p> + + <p id="complex_types_combo"> + Because complex types are often used in combination, + for example an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> + elements, if you are unfamiliar with the Impala complex types, + start with <xref href="../topics/impala_complex_types.xml#complex_types"/> for + background information and usage examples. + </p> + + <ul id="complex_types_restrictions"> + <li> + Columns with this data type can only be used in tables or partitions with the Parquet file format. + </li> + <li> + Columns with this data type cannot be used as partition key columns in a partitioned table. + </li> + <li> + The <codeph>COMPUTE STATS</codeph> statement does not produce any statistics for columns of this data type. + </li> + <li> + See <xref href="../topics/impala_complex_types.xml#complex_types_limits"/> for a full list of limitations + and associated guidelines about complex type columns. + </li> + </ul> + + <p rev="2.3.0" id="complex_types_partitioning"> + Partitioned tables can contain complex type columns. + All the partition key columns must be scalar types. + </p> + + <p rev="2.3.0" id="complex_types_describe"> + You can pass a multi-part qualified name to <codeph>DESCRIBE</codeph> + to specify an <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, or <codeph>MAP</codeph> + column and visualize its structure as if it were a table. + For example, if table <codeph>T1</codeph> contains an <codeph>ARRAY</codeph> column + <codeph>A1</codeph>, you could issue the statement <codeph>DESCRIBE t1.a1</codeph>. 
If table <codeph>T1</codeph> contained a <codeph>STRUCT</codeph> column <codeph>S1</codeph>,
+ and a field <codeph>F1</codeph> within the <codeph>STRUCT</codeph> was a <codeph>MAP</codeph>,
+ you could issue the statement <codeph>DESCRIBE t1.s1.f1</codeph>.
+ An <codeph>ARRAY</codeph> is shown as a two-column table, with
+ <codeph>ITEM</codeph> and <codeph>POS</codeph> columns.
+ A <codeph>STRUCT</codeph> is shown as a table with each field
+ representing a column in the table.
+ A <codeph>MAP</codeph> is shown as a two-column table, with
+ <codeph>KEY</codeph> and <codeph>VALUE</codeph> columns.
+ </p>
+
+ <note id="complex_type_schema_pointer">
+ Many of the complex type examples refer to tables
+ such as <codeph>CUSTOMER</codeph> and <codeph>REGION</codeph>
+ adapted from the tables used in the TPC-H benchmark.
+ See <xref href="../topics/impala_complex_types.xml#complex_sample_schema"/>
+ for the table definitions.
+ </note>
+
+ <p rev="2.3.0" id="complex_types_unsupported_filetype">
+ <b>Complex type considerations:</b>
+ Although you can create tables in this file format using
+ the complex types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>,
+ and <codeph>MAP</codeph>) available in CDH 5.5 / Impala 2.3 and higher,
+ currently, Impala can query these types only in Parquet tables.
+ </p>
+
+ <p rev="2.3.0" id="complex_types_caveat_no_operator">
+ You cannot refer to a column with a complex data type (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, or <codeph>MAP</codeph>)
+ directly in an operator. You can apply operators only to scalar values that make up a complex type
+ (the fields of a <codeph>STRUCT</codeph>, the items of an <codeph>ARRAY</codeph>,
+ or the key or value portion of a <codeph>MAP</codeph>) as part of a join query that refers to
+ the scalar value using the appropriate dot notation or <codeph>ITEM</codeph>, <codeph>KEY</codeph>, or <codeph>VALUE</codeph>
+ pseudocolumn names.
+ </p>
+
+ <p rev="2.3.0" id="udfs_no_complex_types">
+ Currently, Impala UDFs cannot accept arguments or return values of the Impala complex types
+ (<codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, or <codeph>MAP</codeph>).
+ </p>
+
+ <p rev="2.3.0" id="complex_types_read_only">
+ Impala currently cannot write new data files containing complex type columns.
+ Therefore, although the <codeph>SELECT</codeph> statement works for queries
+ involving complex type columns, you cannot use a statement form that writes
+ data to complex type columns, such as <codeph>CREATE TABLE AS SELECT</codeph> or <codeph>INSERT ... SELECT</codeph>.
+ To create data files containing complex type data, use the Hive <codeph>INSERT</codeph> statement, or another
+ ETL mechanism such as MapReduce jobs, Spark jobs, Pig, and so on.
+ </p>
+
+ <p rev="2.3.0" id="complex_types_views">
+ For tables containing complex type columns (<codeph>ARRAY</codeph>,
+ <codeph>STRUCT</codeph>, or <codeph>MAP</codeph>), you typically use
+ join queries to refer to the complex values. You can use views to
+ hide the join notation, making such tables seem like traditional denormalized
+ tables, and making those tables queryable by business intelligence tools
+ that do not have built-in support for those complex types.
+ See <xref href="../topics/impala_complex_types.xml#complex_types_views"/> for details.
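+ For example, the following sketch (with hypothetical table, column, and view names) hides the join
+ notation for an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements behind a view:
+<codeblock xml:space="preserve">-- Hypothetical table CUSTOMERS with an ARRAY of STRUCT column C_ORDERS.
+create view flattened_orders as
+  select c.c_name,
+         o.item.o_orderkey as o_orderkey,
+         o.item.o_totalprice as o_totalprice
+  from customers c, c.c_orders o;
+
+-- A BI tool can now query the view with no special join notation.
+select c_name, o_orderkey from flattened_orders;</codeblock>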
+ </p>
+
+ <p rev="2.3.0" id="complex_types_views_caveat">
+ Because you cannot directly issue <codeph>SELECT <varname>col_name</varname></codeph>
+ against a column of complex type, you cannot use a view or a <codeph>WITH</codeph>
+ clause to <q>rename</q> a column by selecting it with a column alias.
+ </p>
+
+ <p rev="2.3.0" id="jdbc_odbc_complex_types">
+ The Impala complex types (<codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, or <codeph>MAP</codeph>)
+ are available in CDH 5.5 / Impala 2.3 and higher.
+ To use these types with JDBC requires version 2.5.28 or higher of the Cloudera JDBC Connector for Impala.
+ To use these types with ODBC requires version 2.5.30 or higher of the Cloudera ODBC Connector for Impala.
+ Consider upgrading all JDBC and ODBC drivers at the same time you upgrade to CDH 5.5 or higher.
+ </p>
+
+ <p rev="2.3.0" id="jdbc_odbc_complex_types_views">
+ Although the result sets from queries involving complex types consist of all scalar values,
+ the queries involve join notation and column references that might not be understood by
+ a particular JDBC or ODBC connector. Consider defining a view that represents the
+ flattened version of a table containing complex type columns, and pointing the JDBC
+ or ODBC application at the view.
+ See <xref href="../topics/impala_complex_types.xml#complex_types"/> for details.
+ </p>
+
+ <p rev="2.3.0" id="complex_types_aggregation_explanation">
+ To access a column with a complex type (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, or <codeph>MAP</codeph>)
+ in an aggregation function, you unpack the individual elements using join notation in the query,
+ and then apply the function to the final scalar item, field, key, or value at the bottom of any nested type hierarchy in the column.
+ See <xref href="../topics/impala_complex_types.xml#complex_types"/> for details about using complex types in Impala.
+ </p>
+
+<p rev="2.3.0" id="complex_types_aggregation_example">
+The following example demonstrates calls to several aggregation functions
+using values from a column containing nested complex types
+(an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> items).
+The array is unpacked inside the query using join notation.
+The array elements are referenced using the <codeph>ITEM</codeph>
+pseudocolumn, and the structure fields inside the array elements
+are referenced using dot notation.
+Numeric aggregate values such as <codeph>SUM()</codeph> and <codeph>AVG()</codeph>
+are computed from the numeric <codeph>R_NATIONKEY</codeph> field, and
+the general-purpose <codeph>MAX()</codeph> and <codeph>MIN()</codeph>
+values are computed from the string <codeph>N_NAME</codeph> field.
+<codeblock>describe region; ++-------------+-------------------------+---------+ +| name | type | comment | ++-------------+-------------------------+---------+ +| r_regionkey | smallint | | +| r_name | string | | +| r_comment | string | | +| r_nations | array<struct< | | +| | n_nationkey:smallint, | | +| | n_name:string, | | +| | n_comment:string | | +| | >> | | ++-------------+-------------------------+---------+ + +select r_name, r_nations.item.n_nationkey + from region, region.r_nations as r_nations +order by r_name, r_nations.item.n_nationkey; ++-------------+------------------+ +| r_name | item.n_nationkey | ++-------------+------------------+ +| AFRICA | 0 | +| AFRICA | 5 | +| AFRICA | 14 | +| AFRICA | 15 | +| AFRICA | 16 | +| AMERICA | 1 | +| AMERICA | 2 | +| AMERICA | 3 | +| AMERICA | 17 | +| AMERICA | 24 | +| ASIA | 8 | +| ASIA | 9 | +| ASIA | 12 | +| ASIA | 18 | +| ASIA | 21 | +| EUROPE | 6 | +| EUROPE | 7 | +| EUROPE | 19 | +| EUROPE | 22 | +| EUROPE | 23 | +| MIDDLE EAST | 4 | +| MIDDLE EAST | 10 | +| MIDDLE EAST | 11 | +| MIDDLE EAST | 13 | +| MIDDLE EAST | 20 | ++-------------+------------------+ + +select + r_name, + count(r_nations.item.n_nationkey) as count, + sum(r_nations.item.n_nationkey) as sum, + avg(r_nations.item.n_nationkey) as average, + min(r_nations.item.n_name) as minimum, + max(r_nations.item.n_name) as maximum, + ndv(r_nations.item.n_nationkey) as distinct_values +from + region, region.r_nations as r_nations +group by r_name +order by r_name; ++-------------+-------+-----+---------+-----------+----------------+-----------------+ +| r_name | count | sum | average | minimum | maximum | distinct_values | ++-------------+-------+-----+---------+-----------+----------------+-----------------+ +| AFRICA | 5 | 50 | 10 | ALGERIA | MOZAMBIQUE | 5 | +| AMERICA | 5 | 47 | 9.4 | ARGENTINA | UNITED STATES | 5 | +| ASIA | 5 | 68 | 13.6 | CHINA | VIETNAM | 5 | +| EUROPE | 5 | 77 | 15.4 | FRANCE | UNITED KINGDOM | 5 | +| MIDDLE EAST | 5 | 58 | 11.6 | EGYPT | SAUDI ARABIA | 5 | ++-------------+-------+-----+---------+-----------+----------------+-----------------+ +</codeblock> +</p> + + <p id="hive_blurb"> + <b>Hive considerations:</b> + </p> + + <p rev="CDH-19187" id="permissions_blurb"> + <b>HDFS permissions:</b> + </p> + + <p rev="CDH-19187" id="permissions_blurb_no"> + <b>HDFS permissions:</b> This statement does not touch any HDFS files or directories, + therefore no HDFS permissions are required. + </p> + + <p id="security_blurb"> + <b>Security considerations:</b> + </p> + + <p id="performance_blurb"> + <b>Performance considerations:</b> + </p> + + <p id="conversion_blurb"> + <b>Casting and conversions:</b> + </p> + + <p id="related_info"> + <b>Related information:</b> + </p> + + <p id="related_tasks"> + <b>Related tasks:</b> + </p> + + <p id="related_options"> + <b>Related startup options:</b> + </p> + + <p id="restrictions_blurb"> + <b>Restrictions:</b> + </p> + + <p rev="2.0.0" id="restrictions_sliding_window"> + <b>Restrictions:</b> In Impala 2.0 and higher, this function can be used as an analytic function, but with restrictions on any window clause. + For <codeph>MAX()</codeph> and <codeph>MIN()</codeph>, the window clause is only allowed if the start + bound is <codeph>UNBOUNDED PRECEDING</codeph>. + </p> + +<!-- This blurb has been superceded by analytic_not_allowed_caveat. Consider removing it if it turns out never to be needed. 
--> + <p rev="2.0.0" id="restrictions_non_analytic"> + <b>Restrictions:</b> This function cannot be used as an analytic function; it does not currently support + the <codeph>OVER()</codeph> clause. + </p> + + <p id="compatibility_blurb"> + <b>Compatibility:</b> + </p> + + <p id="null_blurb"> + <b>NULL considerations:</b> + </p> + + <p id="udf_blurb"> + <b>UDF considerations:</b> + </p> + + <p id="udf_blurb_no"> + <b>UDF considerations:</b> This type cannot be used for the argument or return type of a user-defined + function (UDF) or user-defined aggregate function (UDA). + </p> + + <p id="view_blurb"> + <b>Considerations for views:</b> + </p> + + <p id="null_bad_numeric_cast"> + <b>NULL considerations:</b> Casting any non-numeric value to this type produces a <codeph>NULL</codeph> + value. + </p> + + <p id="null_bad_timestamp_cast"> + <b>NULL considerations:</b> Casting any unrecognized <codeph>STRING</codeph> value to this type produces a + <codeph>NULL</codeph> value. + </p> + + <p id="null_null_arguments"> + <b>NULL considerations:</b> An expression of this type produces a <codeph>NULL</codeph> value if any + argument of the expression is <codeph>NULL</codeph>. + </p> + + <p id="privileges_blurb"> + <b>Required privileges:</b> + </p> + + <p id="parquet_blurb"> + <b>Parquet considerations:</b> + </p> + + <p id="parquet_tools_blurb"> + To examine the internal structure and data of Parquet files, you can use the + <cmdname>parquet-tools</cmdname> command that comes with CDH. Make sure this + command is in your <codeph>$PATH</codeph>. (Typically, it is symlinked from + <filepath>/usr/bin</filepath>; sometimes, depending on your installation setup, you + might need to locate it under a CDH-specific <codeph>bin</codeph> directory.) + The arguments to this command let you perform operations such as: + <ul> + <li> + <codeph>cat</codeph>: Print a file's contents to standard out. In CDH 5.5 and higher, you can use + the <codeph>-j</codeph> option to output JSON. + </li> + <li> + <codeph>head</codeph>: Print the first few records of a file to standard output. + </li> + <li> + <codeph>schema</codeph>: Print the Parquet schema for the file. + </li> + <li> + <codeph>meta</codeph>: Print the file footer metadata, including key-value properties (like Avro schema), compression ratios, + encodings, compression used, and row group information. + </li> + <li> + <codeph>dump</codeph>: Print all data and metadata. + </li> + </ul> + Use <codeph>parquet-tools -h</codeph> to see usage information for all the arguments. + Here are some examples showing <cmdname>parquet-tools</cmdname> usage: + +<codeblock><![CDATA[ +$ # Be careful doing this for a big file! Use parquet-tools head to be safe. +$ parquet-tools cat sample.parq +year = 1992 +month = 1 +day = 2 +dayofweek = 4 +dep_time = 748 +crs_dep_time = 750 +arr_time = 851 +crs_arr_time = 846 +carrier = US +flight_num = 53 +actual_elapsed_time = 63 +crs_elapsed_time = 56 +arrdelay = 5 +depdelay = -2 +origin = CMH +dest = IND +distince = 182 +cancelled = 0 +diverted = 0 + +year = 1992 +month = 1 +day = 3 +... +]]> +</codeblock> + +<codeblock><![CDATA[ +$ parquet-tools head -n 2 sample.parq +year = 1992 +month = 1 +day = 2 +dayofweek = 4 +dep_time = 748 +crs_dep_time = 750 +arr_time = 851 +crs_arr_time = 846 +carrier = US +flight_num = 53 +actual_elapsed_time = 63 +crs_elapsed_time = 56 +arrdelay = 5 +depdelay = -2 +origin = CMH +dest = IND +distince = 182 +cancelled = 0 +diverted = 0 + +year = 1992 +month = 1 +day = 3 +... 
+
+<codeblock><![CDATA[
+$ parquet-tools schema sample.parq
+message schema {
+ optional int32 year;
+ optional int32 month;
+ optional int32 day;
+ optional int32 dayofweek;
+ optional int32 dep_time;
+ optional int32 crs_dep_time;
+ optional int32 arr_time;
+ optional int32 crs_arr_time;
+ optional binary carrier;
+ optional int32 flight_num;
+...
+]]>
+</codeblock>
+
+<codeblock><![CDATA[
+$ parquet-tools meta sample.parq
+creator: impala version 2.2.0-cdh5.4.3 (build 517bb0f71cd604a00369254ac6d88394df83e0f6)
+
+file schema: schema
+-------------------------------------------------------------------
+year: OPTIONAL INT32 R:0 D:1
+month: OPTIONAL INT32 R:0 D:1
+day: OPTIONAL INT32 R:0 D:1
+dayofweek: OPTIONAL INT32 R:0 D:1
+dep_time: OPTIONAL INT32 R:0 D:1
+crs_dep_time: OPTIONAL INT32 R:0 D:1
+arr_time: OPTIONAL INT32 R:0 D:1
+crs_arr_time: OPTIONAL INT32 R:0 D:1
+carrier: OPTIONAL BINARY R:0 D:1
+flight_num: OPTIONAL INT32 R:0 D:1
+...
+
+row group 1: RC:20636601 TS:265103674
+-------------------------------------------------------------------
+year: INT32 SNAPPY DO:4 FPO:35 SZ:10103/49723/4.92 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+month: INT32 SNAPPY DO:10147 FPO:10210 SZ:11380/35732/3.14 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+day: INT32 SNAPPY DO:21572 FPO:21714 SZ:3071658/9868452/3.21 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+dayofweek: INT32 SNAPPY DO:3093276 FPO:3093319 SZ:2274375/5941876/2.61 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+dep_time: INT32 SNAPPY DO:5367705 FPO:5373967 SZ:28281281/28573175/1.01 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+crs_dep_time: INT32 SNAPPY DO:33649039 FPO:33654262 SZ:10220839/11574964/1.13 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+arr_time: INT32 SNAPPY DO:43869935 FPO:43876489 SZ:28562410/28797767/1.01 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+crs_arr_time: INT32 SNAPPY DO:72432398 FPO:72438151 SZ:10908972/12164626/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+carrier: BINARY SNAPPY DO:83341427 FPO:83341558 SZ:114916/128611/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+flight_num: INT32 SNAPPY DO:83456393 FPO:83488603 SZ:10216514/11474301/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+...
+]]>
+</codeblock>
+ </p>
+
+ <p id="parquet_ok">
+ <b>Parquet considerations:</b> This type is fully compatible with Parquet tables.
+ </p>
+
+ <p id="analytic_not_allowed_caveat">
+ This function cannot be used in an analytic context. That is, the <codeph>OVER()</codeph> clause is not allowed at all with this function.
+ </p>
+
+ <p id="impala_parquet_encodings_caveat">
+ Impala can query Parquet files that use the <codeph>PLAIN</codeph>, <codeph>PLAIN_DICTIONARY</codeph>,
+ <codeph>BIT_PACKED</codeph>, and <codeph>RLE</codeph> encodings.
+ Currently, Impala does not support the <codeph>RLE_DICTIONARY</codeph> encoding.
+ When creating files outside of Impala for use by Impala, make sure to use one of the supported encodings.
+ Specifically, for MapReduce jobs that write Parquet files, do not set the
+ <codeph>parquet.writer.version</codeph> property (especially not to <codeph>PARQUET_2_0</codeph>)
+ in the job configuration; use the default format version, 1.0, which includes some enhancements
+ that remain compatible with older versions.
+ Data written with the 2.0 format might not be consumable by Impala, because of its use of the
+ <codeph>RLE_DICTIONARY</codeph> encoding.
+ </p>
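+
+ <p id="impala_parquet_encodings_check">
+ One way to check whether an existing file uses a supported encoding is the
+ <cmdname>parquet-tools</cmdname> command shown earlier: in the <codeph>meta</codeph> output, the
+ <codeph>ENC:</codeph> portion of each column line lists the encodings used. A hypothetical,
+ abbreviated sketch (the file name is made up for illustration):
+<codeblock><![CDATA[
+$ parquet-tools meta file_written_elsewhere.parq
+...
+year: INT32 SNAPPY DO:4 FPO:35 SZ:... VC:... ENC:PLAIN_DICTIONARY,RLE,PLAIN
+...
+]]>
+</codeblock>
+ If <codeph>RLE_DICTIONARY</codeph> appears in any <codeph>ENC:</codeph> list, Impala might not be
+ able to read that file.
+ </p>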
+
+ <note id="restrictions_nonimpala_parquet">
+ <p>
+ Currently, Impala always decodes the column data in Parquet files based on the ordinal position of the
+ columns, not by looking up the position of each column based on its name. Parquet files produced outside
+ of Impala must write column data in the same order as the columns are declared in the Impala table. Any
+ optional columns that are omitted from the data files must be the rightmost columns in the Impala table
+ definition.
+ </p>
+
+ <p>
+ If you created compressed Parquet files through some tool other than Impala, make sure that any
+ compression codecs used are supported in Parquet by Impala. For example, Impala does not currently support LZO
+ compression in Parquet files. Also double-check that you used any recommended compatibility settings in
+ the other tool, such as <codeph>spark.sql.parquet.binaryAsString</codeph> when writing Parquet files
+ through Spark.
+ </p>
+ </note>
+
+ <p id="text_blurb">
+ <b>Text table considerations:</b>
+ </p>
+
+ <p id="text_bulky">
+ <b>Text table considerations:</b> Values of this type are potentially larger in text tables than in tables
+ using Parquet or other binary formats.
+ </p>
+
+ <p id="schema_evolution_blurb">
+ <b>Schema evolution considerations:</b>
+ </p>
+
+ <p id="column_stats_blurb">
+ <b>Column statistics considerations:</b>
+ </p>
+
+ <p id="column_stats_constant">
+ <b>Column statistics considerations:</b> Because this type has a fixed size, the maximum and average size
+ fields are always filled in for column statistics, even before you run the <codeph>COMPUTE STATS</codeph>
+ statement.
+ </p>
+
+ <p id="column_stats_variable">
+ <b>Column statistics considerations:</b> Because the values of this type have variable size, none of the
+ column statistics fields are filled in until you run the <codeph>COMPUTE STATS</codeph> statement.
+ </p>
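+
+ <p id="column_stats_example">
+ As a quick illustration of the column statistics notes above, for a hypothetical table
+ <codeph>t1</codeph> containing a <codeph>STRING</codeph> column, the size fields show as unknown
+ (-1) until the <codeph>COMPUTE STATS</codeph> statement runs:
+<codeblock xml:space="preserve">show column stats t1;  -- Max Size and Avg Size are -1 for the STRING column.
+compute stats t1;
+show column stats t1;  -- Now the size fields are filled in.</codeblock>
+ </p>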
+
+ <p id="usage_notes_blurb">
+ <b>Usage notes:</b>
+ </p>
+
+ <p id="example_blurb">
+ <b>Examples:</b>
+ </p>
+
+ <p id="result_set_blurb">
+ <b>Result set:</b>
+ </p>
+
+ <p id="jdbc_blurb">
+ <b>JDBC and ODBC considerations:</b>
+ </p>
+
+ <p id="cancel_blurb_no">
+ <b>Cancellation:</b> Cannot be cancelled.
+ </p>
+
+ <p id="cancel_blurb_yes">
+ <b>Cancellation:</b> Can be cancelled. To cancel this statement, use Ctrl-C from the
+ <cmdname>impala-shell</cmdname> interpreter, the <uicontrol>Cancel</uicontrol> button from the
+ <uicontrol>Watch</uicontrol> page in Hue, <uicontrol>Actions > Cancel</uicontrol> from the
+ <uicontrol>Queries</uicontrol> list in Cloudera Manager, or <uicontrol>Cancel</uicontrol> from the list of
+ in-flight queries (for a particular node) on the <uicontrol>Queries</uicontrol> tab in the Impala web UI
+ (port 25000).
+ </p>
+
+ <p id="cancel_blurb_maybe">
+ <b>Cancellation:</b> Certain multi-stage statements (<codeph>CREATE TABLE AS SELECT</codeph> and
+ <codeph>COMPUTE STATS</codeph>) can be cancelled during some stages, when running <codeph>INSERT</codeph>
+ or <codeph>SELECT</codeph> operations internally. To cancel this statement, use Ctrl-C from the
+ <cmdname>impala-shell</cmdname> interpreter, the <uicontrol>Cancel</uicontrol> button from the
+ <uicontrol>Watch</uicontrol> page in Hue, <uicontrol>Actions > Cancel</uicontrol> from the
+ <uicontrol>Queries</uicontrol> list in Cloudera Manager, or <uicontrol>Cancel</uicontrol> from the list of
+ in-flight queries (for a particular node) on the <uicontrol>Queries</uicontrol> tab in the Impala web UI
+ (port 25000).
+ </p>
+
+ <p id="partitioning_blurb">
+ <b>Partitioning:</b>
+ </p>
+
+ <p id="partitioning_good">
+ <b>Partitioning:</b> Prefer to use this type for a partition key column. Impala can process the numeric
+ type more efficiently than a <codeph>STRING</codeph> representation of the value.
+ </p>
+
+ <p id="partitioning_bad">
+ <b>Partitioning:</b> This type can be used for partition key columns. Because of the efficiency advantage
+ of numeric values over character-based values, if the partition key is a string representation of a number,
+ prefer to use an integer type with sufficient range (<codeph>INT</codeph>, <codeph>BIGINT</codeph>, and so
+ on) where practical.
+ </p>
+
+ <p id="partitioning_silly">
+ <b>Partitioning:</b> Because this type has so few distinct values, it is typically not a sensible choice
+ for a partition key column.
+ </p>
+
+ <p id="partitioning_imprecise">
+ <b>Partitioning:</b> Because fractional values of this type are not always represented precisely, when this
+ type is used for a partition key column, the underlying HDFS directories might not be named exactly as you
+ expect. Prefer to partition on a <codeph>DECIMAL</codeph> column instead.
+ </p>
+
+ <p id="partitioning_worrisome">
+ <b>Partitioning:</b> Because this type potentially has so many distinct values, it is often not a sensible
+ choice for a partition key column. For example, events 1 millisecond apart would be stored in different
+ partitions. Consider using the <codeph>TRUNC()</codeph> function to condense the number of distinct values,
+ and partition on a new column with the truncated values.
+ </p>
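+
+ <p id="partitioning_worrisome_example">
+ As a sketch of that technique, with hypothetical tables <codeph>raw_events</codeph> and
+ <codeph>events_by_hour</codeph>, the partition key holds the timestamp truncated to the hour
+ rather than the raw microsecond-precision value:
+<codeblock xml:space="preserve">-- Condense the distinct values by truncating each timestamp to the hour.
+insert into events_by_hour partition (event_hour)
+  select event_id, payload, trunc(event_ts, 'HH') as event_hour from raw_events;</codeblock>
+ </p>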
+
+ <p id="hdfs_blurb">
+ <b>HDFS considerations:</b>
+ </p>
+
+ <p id="file_format_blurb">
+ <b>File format considerations:</b>
+ </p>
+
+ <p id="s3_blurb" rev="2.2.0">
+ <b>Amazon S3 considerations:</b>
+ </p>
+
+ <p id="isilon_blurb" rev="5.4.3">
+ <b>Isilon considerations:</b>
+ </p>
+ <p id="isilon_block_size_caveat" rev="5.4.3">
+ Because the EMC Isilon storage devices use a global value for the block size
+ rather than a configurable value for each file, the <codeph>PARQUET_FILE_SIZE</codeph>
+ query option has no effect when Impala inserts data into a table or partition
+ residing on Isilon storage. Use the <codeph>isi</codeph> command to set the
+ default block size globally on the Isilon device. For example, to set the
+ Isilon default block size to 256 MB, the recommended size for Parquet
+ data files for Impala, issue the following command:
+<codeblock>isi hdfs settings modify --default-block-size=256MB</codeblock>
+ </p>
+
+ <p id="hbase_blurb">
+ <b>HBase considerations:</b>
+ </p>
+
+ <p id="hbase_ok">
+ <b>HBase considerations:</b> This data type is fully compatible with HBase tables.
+ </p>
+
+ <p id="hbase_no">
+ <b>HBase considerations:</b> This data type cannot be used with HBase tables.
+ </p>
+
+ <p id="internals_blurb">
+ <b>Internal details:</b>
+ </p>
+
+ <p id="internals_1_bytes">
+ <b>Internal details:</b> Represented in memory as a 1-byte value.
+ </p>
+
+ <p id="internals_2_bytes">
+ <b>Internal details:</b> Represented in memory as a 2-byte value.
+ </p>
+
+ <p id="internals_4_bytes">
+ <b>Internal details:</b> Represented in memory as a 4-byte value.
+ </p>
+
+ <p id="internals_8_bytes">
+ <b>Internal details:</b> Represented in memory as an 8-byte value.
+ </p>
+
+ <p id="internals_16_bytes">
+ <b>Internal details:</b> Represented in memory as a 16-byte value.
+ </p>
+
+ <p id="internals_max_bytes">
+ <b>Internal details:</b> Represented in memory as a byte array with the same size as the length
+ specification. Values that are shorter than the specified length are padded on the right with trailing
+ spaces.
+ </p>
+
+ <p id="internals_min_bytes">
+ <b>Internal details:</b> Represented in memory as a byte array with the minimum size needed to represent
+ each value.
+ </p>
+
+ <p rev="2.3.0" id="added_in_230">
+ <b>Added in:</b> CDH 5.5.0 (Impala 2.3.0)
+ </p>
+
+ <p rev="2.0.0" id="added_in_20">
+ <b>Added in:</b> CDH 5.2.0 (Impala 2.0.0)
+ </p>
+
+ <p rev="2.0.0" id="enhanced_in_20">
+ <b>Added in:</b> Available in earlier Impala releases, but new capabilities were added in
+ CDH 5.2.0 / Impala 2.0.0.
+ </p>
+
+ <p id="added_forever">
+ <b>Added in:</b> Available in all versions of Impala.
+ </p>
+
+ <p id="added_in_140">
+ <b>Added in:</b> Impala 1.4.0
+ </p>
+
+ <p id="added_in_130">
+ <b>Added in:</b> Impala 1.3.0
+ </p>
+
+ <p id="added_in_11">
+ <b>Added in:</b> Impala 1.1
+ </p>
+
+ <p id="added_in_111">
+ <b>Added in:</b> Impala 1.1.1
+ </p>
+
+ <p id="added_in_210">
+ <b>Added in:</b> CDH 5.3.0 (Impala 2.1.0)
+ </p>
+
+ <p id="added_in_220">
+ <b>Added in:</b> CDH 5.4.0 (Impala 2.2.0)
+ </p>
+
+ <p id="syntax_blurb">
+ <b>Syntax:</b>
+ </p>
+
+ <p id="disk_space_blurb">
+ For other tips about managing and reclaiming Impala disk space, see
+ <xref href="../topics/impala_disk_space.xml#disk_space"/>.
+ </p>
+
+ <p id="join_types">
+ Impala supports a wide variety of <codeph>JOIN</codeph> clauses. Left, right, semi, full, and outer joins
+ are supported in all Impala versions. The <codeph>CROSS JOIN</codeph> operator is available in Impala 1.2.2
+ and higher. During performance tuning, you can override the reordering of join clauses that Impala does
+ internally by including the keyword <codeph>STRAIGHT_JOIN</codeph> immediately after the
+ <codeph>SELECT</codeph> keyword.
+ </p>
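+
+ <p id="join_types_example">
+ For example, with hypothetical tables <codeph>customers</codeph> and <codeph>orders</codeph>,
+ the <codeph>STRAIGHT_JOIN</codeph> keyword makes Impala join the tables in the order you wrote
+ them rather than in the order the planner chooses:
+<codeblock xml:space="preserve">select straight_join c.name, o.total
+  from customers c join orders o on (c.id = o.cust_id);</codeblock>
+ </p>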
+
+ <p id="catalog_server_124">
+ In Impala 1.2.4 and higher, you can specify a table name with <codeph>INVALIDATE METADATA</codeph> after
+ the table is created in Hive, allowing you to make individual tables visible to Impala without doing a full
+ reload of the catalog metadata. Impala 1.2.4 also includes other changes to make the metadata broadcast
+ mechanism faster and more responsive, especially during Impala startup. See
+ <xref href="../topics/impala_new_features.xml#new_features_124"/> for details.
+ </p>
+
+ <p id="explain_interpret">
+ Read the <codeph>EXPLAIN</codeph> plan from bottom to top:
+ <ul>
+ <li>
+ The last part of the plan shows the low-level details such as the expected amount of data that will be
+ read, where you can judge the effectiveness of your partitioning strategy and estimate how long it will
+ take to scan a table based on total data size and the size of the cluster.
+ </li>
+
+ <li>
+ As you work your way up, next you see the operations that will be parallelized and performed on each
+ Impala node.
+ </li>
+
+ <li>
+ At the higher levels, you see how data flows when intermediate result sets are combined and transmitted
+ from one node to another.
+ </li>
+
+ <li>
+ See <xref href="../topics/impala_explain_level.xml#explain_level"/> for details about the
+ <codeph>EXPLAIN_LEVEL</codeph> query option, which lets you customize how much detail to show in the
+ <codeph>EXPLAIN</codeph> plan depending on whether you are doing high-level or low-level tuning,
+ dealing with logical or physical aspects of the query.
+ </li>
+ </ul>
+ </p>
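+
+ <p id="explain_interpret_example">
+ For orientation, here is a condensed, hypothetical plan (the operator numbers, file counts, and
+ sizes are made up for illustration, and the exact output varies by Impala version and
+ <codeph>EXPLAIN_LEVEL</codeph> setting):
+<codeblock xml:space="preserve">[localhost:21000] > explain select count(*) from big_table;
++------------------------------------------------+
+| Explain String                                 |
++------------------------------------------------+
+| 03:AGGREGATE [FINALIZE]                        |
+| 02:EXCHANGE [UNPARTITIONED]                    |
+| 01:AGGREGATE                                   |
+| 00:SCAN HDFS [default.big_table]               |
+|    partitions=1/1 files=4 size=2.15GB          |
++------------------------------------------------+</codeblock>
+ Reading from the bottom: the scan of <codeph>big_table</codeph> happens first on each node, partial
+ aggregates are computed in parallel, and the results are combined into the final value at the top.
+ </p>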
+
+<!-- This sequence of paragraph + codeblock + paragraph is typically referenced in sequence wherever it's reused. -->
+
+ <p id="aggr1">
+ Aggregate functions are a special category with different rules. These functions calculate a return value
+ across all the items in a result set, so they require a <codeph>FROM</codeph> clause in the query:
+ </p>
+
+<codeblock id="aggr2" xml:space="preserve">select count(product_id) from product_catalog;
+select max(height), avg(height) from census_data where age > 20;
+</codeblock>
+
+ <p id="aggr3">
+ Aggregate functions also ignore <codeph>NULL</codeph> values rather than returning a <codeph>NULL</codeph>
+ result. For example, if some rows have <codeph>NULL</codeph> for a particular column, those rows are
+ ignored when computing the <codeph>AVG()</codeph> for that column. Likewise, specifying
+ <codeph>COUNT(<varname>col_name</varname>)</codeph> in a query counts only those rows where
+ <varname>col_name</varname> contains a non-<codeph>NULL</codeph> value.
+ </p>
+
+ <p>
+ <ph id="aliases_vs_identifiers"> Aliases follow the same rules as identifiers when it comes to case
+ insensitivity. Aliases can be longer than identifiers (up to the maximum length of a Java string) and can
+ include additional characters such as spaces and dashes when they are quoted using backtick characters.
+ </ph>
+ </p>
+
+ <p id="views_vs_identifiers">
+ Another way to define different names for the same tables or columns is to create views. See
+ <xref href="../topics/impala_views.xml#views"/> for details.
+ </p>
+
+ <p id="insert_hints" rev="1.2.2">
+ When inserting into partitioned tables, especially using the Parquet file format, you can include a hint in
+ the <codeph>INSERT</codeph> statement to fine-tune the overall performance of the operation and its
+ resource usage:
+ <ul>
+ <li>
+ These hints are available in Impala 1.2.2 and higher.
+ </li>
+
+ <li>
+ You would only use these hints if an <codeph>INSERT</codeph> into a partitioned Parquet table was
+ failing due to capacity limits, or if such an <codeph>INSERT</codeph> was succeeding but with
+ less-than-optimal performance.
+ </li>
+
+ <li>
+ To use these hints, put the hint keyword <codeph>[SHUFFLE]</codeph> or <codeph>[NOSHUFFLE]</codeph>
+ (including the square brackets) after the <codeph>PARTITION</codeph> clause, immediately before the
+ <codeph>SELECT</codeph> keyword, as shown in the example following this list.
+ </li>
+
+ <li>
+ <codeph>[SHUFFLE]</codeph> selects an execution plan that minimizes the number of files being written
+ simultaneously to HDFS, and the number of memory buffers holding data for individual partitions. Thus
+ it reduces overall resource usage for the <codeph>INSERT</codeph> operation, allowing some
+ <codeph>INSERT</codeph> operations to succeed that otherwise would fail. It does involve some data
+ transfer between the nodes so that the data files for a particular partition are all constructed on the
+ same node.
+ </li>
+
+ <li>
+ <codeph>[NOSHUFFLE]</codeph> selects an execution plan that might be faster overall, but might also
+ produce a larger number of small data files or exceed capacity limits, causing the
+ <codeph>INSERT</codeph> operation to fail. Use <codeph>[SHUFFLE]</codeph> in cases where an
+ <codeph>INSERT</codeph> statement fails or runs inefficiently due to all nodes attempting to construct
+ data for all partitions.
+ </li>
+
+ <li>
+ Impala automatically uses the <codeph>[SHUFFLE]</codeph> method if any partition key column in the
+ source table, mentioned in the <codeph>INSERT ... SELECT</codeph> query, does not have column
+ statistics. In this case, only the <codeph>[NOSHUFFLE]</codeph> hint would have any effect.
+ </li>
+
+ <li>
+ If column statistics are available for all partition key columns in the source table mentioned in the
+ <codeph>INSERT ... SELECT</codeph> query, Impala chooses whether to use the <codeph>[SHUFFLE]</codeph>
+ or <codeph>[NOSHUFFLE]</codeph> technique based on the estimated number of distinct values in those
+ columns and the number of nodes involved in the <codeph>INSERT</codeph> operation. In this case, you
+ might need the <codeph>[SHUFFLE]</codeph> or the <codeph>[NOSHUFFLE]</codeph> hint to override the
+ execution plan selected by Impala.
+ </li>
+ </ul>
+ </p>
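+
+ <p id="insert_hints_example">
+ For example, with a hypothetical partitioned Parquet table <codeph>sales</codeph> loaded from a
+ staging table, the hint goes between the <codeph>PARTITION</codeph> clause and the
+ <codeph>SELECT</codeph> keyword:
+<codeblock xml:space="preserve">insert into sales partition (year, month) [SHUFFLE] select * from staging_sales;
+insert into sales partition (year, month) [NOSHUFFLE] select * from staging_sales;</codeblock>
+ </p>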
+
+ <p id="insert_parquet_blocksize">
+ Any <codeph>INSERT</codeph> statement for a Parquet table requires enough free space in the HDFS filesystem
+ to write one block. Because Parquet data files use a block size of 1 GB by default, an
+ <codeph>INSERT</codeph> might fail (even for a very small amount of data) if your HDFS is running low on
+ space.
+ </p>
+
+ <note id="compute_stats_next" type="important">
+ After adding or replacing data in a table used in performance-critical queries, issue a <codeph>COMPUTE
+ STATS</codeph> statement to make sure all statistics are up-to-date. Consider updating statistics for a
+ table after any <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, or <codeph>CREATE TABLE AS
+ SELECT</codeph> statement in Impala, or after loading data through Hive and doing a <codeph>REFRESH
+ <varname>table_name</varname></codeph> in Impala. This technique is especially important for tables that
+ are very large, used in join queries, or both.
+ </note>
+
+ <p id="concat_blurb">
+ <b>Usage notes:</b> <codeph>concat()</codeph> and <codeph>concat_ws()</codeph> are appropriate for
+ concatenating the values of multiple columns within the same row, while <codeph>group_concat()</codeph>
+ joins together values from different rows.
+ </p>
+
+ <p id="null_sorting_change">
+ In Impala 1.2.1 and higher, all <codeph>NULL</codeph> values come at the end of the result set for
+ <codeph>ORDER BY ... ASC</codeph> queries, and at the beginning of the result set for <codeph>ORDER BY ...
+ DESC</codeph> queries. In effect, <codeph>NULL</codeph> is considered greater than all other values for
+ sorting purposes. The original Impala behavior always put <codeph>NULL</codeph> values at the end, even for
+ <codeph>ORDER BY ... DESC</codeph> queries. The new behavior in Impala 1.2.1 makes Impala more compatible
+ with other popular database systems. In Impala 1.2.1 and higher, you can override or specify the sorting
+ behavior for <codeph>NULL</codeph> by adding the clause <codeph>NULLS FIRST</codeph> or <codeph>NULLS
+ LAST</codeph> at the end of the <codeph>ORDER BY</codeph> clause.
+ </p>
+
+ <p id="return_same_type">
+ <b>Return type:</b> same as the initial argument value, except that integer values are promoted to
+ <codeph>BIGINT</codeph> and floating-point values are promoted to <codeph>DOUBLE</codeph>; use
+ <codeph>CAST()</codeph> when inserting into a smaller numeric column
+ </p>
+
+ <p id="ddl_blurb">
+ <b>Statement type:</b> DDL
+ </p>
+
+ <p id="dml_blurb">
+ <b>Statement type:</b> DML (but still affected by
+ <xref href="../topics/impala_sync_ddl.xml#sync_ddl">SYNC_DDL</xref> query option)
+ </p>
+
+ <p rev="1.2" id="sync_ddl_blurb">
+ If you connect to different Impala nodes within an <cmdname>impala-shell</cmdname> session for
+ load-balancing purposes, you can enable the <codeph>SYNC_DDL</codeph> query option to make each DDL
+ statement wait before returning, until the new or changed metadata has been received by all the Impala
+ nodes. See <xref href="../topics/impala_sync_ddl.xml#sync_ddl"/> for details.
+ </p>
+
+<!-- Boost no longer used in Impala 2.0 and later, so this conref is no longer referenced anywhere. -->
+
+ <p id="regexp_boost">
+ The Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used by the
+ Boost library. For details, see
+ <xref href="http://www.boost.org/doc/libs/1_46_0/libs/regex/doc/html/boost_regex/syntax/basic_extended.html" scope="external" format="html">the
+ Boost documentation</xref>. It has most idioms familiar from regular expressions in Perl, Python, and so
+ on. It does not support <codeph>.*?</codeph> for non-greedy matches.
+ </p>
+
+ <p rev="2.0.0" id="regexp_re2">
+ In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX Extended Regular
+ Expression syntax used by the Google RE2 library. For details, see
+ <xref href="https://code.google.com/p/re2/" scope="external" format="html">the RE2 documentation</xref>. It
+ has most idioms familiar from regular expressions in Perl, Python, and so on, including
+ <codeph>.*?</codeph> for non-greedy matches.
+ </p>
+
+ <p rev="2.0.0" id="regexp_re2_warning">
+ In Impala 2.0 and later, a change in the underlying regular expression library could cause changes in the
+ way regular expressions are interpreted by this function. Test any queries that use regular expressions and
+ adjust the expression patterns if necessary. See
+ <xref href="../topics/impala_incompatible_changes.xml#incompatible_changes_200"/> for details.
+ </p>
+
+ <p id="regexp_escapes">
+ Because the <cmdname>impala-shell</cmdname> interpreter uses the <codeph>\</codeph> character for escaping,
+ use <codeph>\\</codeph> to represent the regular expression escape character in any regular expressions
+ that you submit through <cmdname>impala-shell</cmdname>. You might prefer to use the equivalent character
+ class names, such as <codeph>[[:digit:]]</codeph> instead of <codeph>\d</codeph>, which you would have to
+ escape
<TRUNCATED>
