http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/shared/impala_common.xml ---------------------------------------------------------------------- diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml new file mode 100644 index 0000000..b0c9ee5 --- /dev/null +++ b/docs/shared/impala_common.xml @@ -0,0 +1,3690 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="common"> + + <title>Reusable Text, Paragraphs, List Items, and Other Elements for Impala</title> + + <conbody> + + <p> + All the elements in this file with IDs are intended to be conref'ed elsewhere. Practically all of the + conref'ed elements for the Impala docs are in this file, to avoid questions of when it's safe to remove or + move something in any of the 'main' files, and avoid having to change and conref references as a result. + </p> + + <p> + This file defines some dummy subheadings as section elements, just for self-documentation. Using sections + instead of nested concepts lets all the conref links point to a very simple name pattern, + '#common/id_within_the_file', rather than a 3-part reference with an intervening, variable concept ID. + </p> + + <section id="concepts"> + + <title>Conceptual Content</title> + + <p> + Overview and conceptual information for Impala as a whole. + </p> + + <!-- Reconcile the 'advantages' and 'benefits' elements; be mindful of where each is used. --> + + <p id="impala_advantages"> + The following are some of the key advantages of Impala: + + <ul> + <li> + Impala integrates with the existing <keyword keyref="hadoop_distro"/> ecosystem, meaning data can be stored, shared, and accessed using + the various solutions included with <keyword keyref="hadoop_distro"/>. This also avoids data silos and minimizes expensive data movement. + </li> + + <li> + Impala provides access to data stored in <keyword keyref="hadoop_distro"/> without requiring the Java skills required for MapReduce jobs. + Impala can access data directly from the HDFS file system. Impala also provides a SQL front-end to access + data in the HBase database system, <ph rev="2.2.0">or in the Amazon Simple Storage System (S3)</ph>. + </li> + + <li> + Impala returns results typically within seconds or a few minutes, rather than the many minutes or hours + that are often required for Hive queries to complete. + </li> + + <li> + Impala is pioneering the use of the Parquet file format, a columnar storage layout that is optimized for + large-scale queries typical in data warehouse scenarios. + </li> + </ul> + </p> + + <p id="impala_benefits"> + Impala provides: + + <ul> + <li> + Familiar SQL interface that data scientists and analysts already know. + </li> + + <li> + Ability to query high volumes of data (<q>big data</q>) in Apache Hadoop. + </li> + + <li> + Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective + commodity hardware. + </li> + + <li> + Ability to share data files between different components with no copy or export/import step; for example, + to write with Pig, transform with Hive and query with Impala. Impala can read from and write to Hive + tables, enabling simple data interchange using Impala for analytics on Hive-produced data. + </li> + + <li> + Single system for big data processing and analytics, so customers can avoid costly modeling and ETL just + for analytics. 
+ </li> + </ul> + </p> + + </section> + + <section id="sentry"> + + <title>Sentry-Related Content</title> + + <p> + Material related to Sentry security, intended to be reused between Hive and Impala. Complicated by the fact + that most of it will probably be multi-paragraph or involve subheads, might need to be represented as + nested topics at the end of this file. + </p> + + <table id="sentry_privileges_objects"> + <title>Valid privilege types and objects they apply to</title> + <tgroup cols="2"> + <colspec colnum="1" colname="col1" colwidth="1*"/> + <colspec colnum="2" colname="col2" colwidth="2*"/> + <thead> + <row> + <entry><b>Privilege</b></entry> + <entry><b>Object</b></entry> + </row> + </thead> + <tbody> + <row> + <entry>INSERT</entry> + <entry>DB, TABLE</entry> + </row> + <row> + <entry>SELECT</entry> + <entry>DB, TABLE, COLUMN</entry> + </row> + <row> + <entry>ALL</entry> + <entry>SERVER, TABLE, DB, URI</entry> + </row> + </tbody> + </tgroup> + </table> + + <table id="privileges_sql"> + <title>Privilege table for Hive & Impala operations</title> + <tgroup cols="4"> + <colspec colnum="1" colname="col1" colwidth="1.31*"/> + <colspec colnum="2" colname="col2" colwidth="1.17*"/> + <colspec colnum="3" colname="col3" colwidth="1*"/> + <colspec colname="newCol4" colnum="4" colwidth="1*"/> + <thead> + <row> + <entry>Operation</entry> + <entry>Scope</entry> + <entry>Privileges Required</entry> + <entry>URI</entry> + </row> + </thead> + <tbody> + <row id="create_database_privs"> + <entry>CREATE DATABASE</entry> + <entry>SERVER</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="drop_database_privs"> + <entry>DROP DATABASE</entry> + <entry>DATABASE</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="create_table_privs"> + <entry>CREATE TABLE</entry> + <entry>DATABASE</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="drop_table_privs"> + <entry>DROP TABLE</entry> + <entry>TABLE</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="create_view_privs"> + <entry>CREATE VIEW<p>-This operation is allowed if you have + column-level <codeph>SELECT</codeph> access to the columns + being used.</p></entry> + <entry>DATABASE; SELECT on TABLE; </entry> + <entry>ALL</entry> + <entry/> + </row> + <row> + <entry>ALTER VIEW<p>-This operation is allowed if you have + column-level <codeph>SELECT</codeph> access to the columns + being used.</p></entry> + <entry>VIEW/TABLE</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="drop_view_privs"> + <entry>DROP VIEW</entry> + <entry>VIEW/TABLE</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="alter_table_add_columns_privs"> + <entry>ALTER TABLE .. ADD COLUMNS</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="alter_table_replace_columns_privs"> + <entry>ALTER TABLE .. REPLACE COLUMNS</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="alter_table_change_column_privs"> + <entry>ALTER TABLE .. CHANGE column</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="alter_table_rename_privs"> + <entry>ALTER TABLE .. RENAME</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="alter_table_set_tblproperties_privs"> + <entry>ALTER TABLE .. SET TBLPROPERTIES</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="alter_table_set_fileformat_privs"> + <entry>ALTER TABLE .. 
SET FILEFORMAT</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="alter_table_set_location_privs"> + <entry>ALTER TABLE .. SET LOCATION</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry>URI</entry> + </row> + <row id="alter_table_add_partition_privs"> + <entry>ALTER TABLE .. ADD PARTITION</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="alter_table_add_partition_location_privs"> + <entry>ALTER TABLE .. ADD PARTITION location</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry>URI</entry> + </row> + <row id="alter_table_drop_partition_privs"> + <entry>ALTER TABLE .. DROP PARTITION</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="alter_table_partition_set_fileformat_privs"> + <entry>ALTER TABLE .. PARTITION SET FILEFORMAT</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="show_create_table_privs"> + <entry>SHOW CREATE TABLE</entry> + <entry>TABLE</entry> + <entry>SELECT/INSERT</entry> + <entry/> + </row> + <row id="show_partitions_privs"> + <entry>SHOW PARTITIONS</entry> + <entry>TABLE</entry> + <entry>SELECT/INSERT</entry> + <entry/> + </row> + <row> + <entry>SHOW TABLES<p>-Output includes all the tables for which + the user has table-level privileges and all the tables for + which the user has some column-level privileges.</p></entry> + <entry>TABLE</entry> + <entry>SELECT/INSERT</entry> + <entry/> + </row> + <row> + <entry>SHOW GRANT ROLE<p>-Output includes an additional field + for any column-level privileges.</p></entry> + <entry>TABLE</entry> + <entry>SELECT/INSERT</entry> + <entry/> + </row> + <row id="describe_table_privs"> + <entry>DESCRIBE TABLE<p>-Output shows <i>all</i> columns if the + user has table level-privileges or <codeph>SELECT</codeph> + privilege on at least one table column</p></entry> + <entry>TABLE</entry> + <entry>SELECT/INSERT</entry> + <entry/> + </row> + <row id="load_data_privs"> + <entry>LOAD DATA</entry> + <entry>TABLE</entry> + <entry>INSERT</entry> + <entry>URI</entry> + </row> + <row id="select_privs"> + <entry>SELECT<p>-You can grant the SELECT privilege on a view to + give users access to specific columns of a table they do not + otherwise have access to.</p><p>-See + <xref audience="integrated" href="../topics/sg_hive_sql.xml#concept_c2q_4qx_p4/col_level_auth_sentry"/><xref audience="standalone" href="https://www.cloudera.com/documentation/enterprise/latest/topics/sg_hive_sql.html" format="html" scope="external"/> + for details on allowed column-level + operations.</p></entry> + <entry>VIEW/TABLE; COLUMN</entry> + <entry>SELECT</entry> + <entry/> + </row> + <row id="insert_overwrite_table_privs"> + <entry>INSERT OVERWRITE TABLE</entry> + <entry>TABLE</entry> + <entry>INSERT</entry> + <entry/> + </row> + <row id="create_table_as_select_privs"> + <entry>CREATE TABLE .. AS SELECT<p>-This operation is allowed if + you have column-level <codeph>SELECT</codeph> access to the + columns being used.</p></entry> + <entry>DATABASE; SELECT on TABLE</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="use_privs"> + <entry>USE <dbName></entry> + <entry>Any</entry> + <entry/> + <entry/> + </row> + <row id="create_function_privs"> + <entry>CREATE FUNCTION</entry> + <entry>SERVER</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="alter_table_set_serdeproperties_privs"> + <entry>ALTER TABLE .. 
SET SERDEPROPERTIES</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row> + <entry>ALTER TABLE .. PARTITION SET SERDEPROPERTIES</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="_privs"> + <entry namest="col1" nameend="newCol4"><b>Hive-Only + Operations</b></entry> + </row> + <row id="insert_overwrite_directory_privs"> + <entry>INSERT OVERWRITE DIRECTORY</entry> + <entry>TABLE</entry> + <entry>INSERT</entry> + <entry>URI</entry> + </row> + <row id="analyze_table_privs"> + <entry>Analyze TABLE</entry> + <entry>TABLE</entry> + <entry>SELECT + INSERT</entry> + <entry/> + </row> + <row id="import_table_privs"> + <entry>IMPORT TABLE</entry> + <entry>DATABASE</entry> + <entry>ALL</entry> + <entry>URI</entry> + </row> + <row id="export_table_privs"> + <entry>EXPORT TABLE</entry> + <entry>TABLE</entry> + <entry>SELECT</entry> + <entry>URI</entry> + </row> + <row id="alter_table_touch_privs"> + <entry>ALTER TABLE TOUCH</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="alter_table_touch_partition_privs"> + <entry>ALTER TABLE TOUCH PARTITION</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="alter_table_clustered_by_privs"> + <entry>ALTER TABLE .. CLUSTERED BY SORTED BY</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="alter_table_enable_privs"> + <entry>ALTER TABLE .. ENABLE/DISABLE</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="alter_table_partition_enable_privs"> + <entry>ALTER TABLE .. PARTITION ENABLE/DISABLE</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row id="alter_table_partition_rename_privs"> + <entry>ALTER TABLE .. PARTITION.. 
RENAME TO PARTITION</entry> + <entry>TABLE</entry> + <entry>ALL on DATABASE</entry> + <entry/> + </row> + <row> + <entry>MSCK REPAIR TABLE</entry> + <entry>TABLE</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="alter_database_privs"> + <entry>ALTER DATABASE</entry> + <entry>DATABASE</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="describe_database_privs"> + <entry>DESCRIBE DATABASE</entry> + <entry>DATABASE</entry> + <entry>SELECT/INSERT</entry> + <entry/> + </row> + <row id="show_columns_privs"> + <entry>SHOW COLUMNS<p>-Output for this operation filters columns + to which the user does not have explicit + <codeph>SELECT</codeph> access </p></entry> + <entry>TABLE</entry> + <entry>SELECT/INSERT</entry> + <entry/> + </row> + <row id="create_index_privs"> + <entry>CREATE INDEX</entry> + <entry>TABLE</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="drop_index_privs"> + <entry>DROP INDEX</entry> + <entry>TABLE</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="show_indexes_privs"> + <entry>SHOW INDEXES</entry> + <entry>TABLE</entry> + <entry>SELECT/INSERT</entry> + <entry/> + </row> + <row id="grant_privs"> + <entry>GRANT PRIVILEGE</entry> + <entry>Allowed only for Sentry admin users</entry> + <entry/> + <entry/> + </row> + <row id="revoke_privs"> + <entry>REVOKE PRIVILEGE</entry> + <entry>Allowed only for Sentry admin users</entry> + <entry/> + <entry/> + </row> + <row id="show_grants_privs"> + <entry>SHOW GRANTS</entry> + <entry>Allowed only for Sentry admin users</entry> + <entry/> + <entry/> + </row> + <row id="show_tblproperties_privs"> + <entry>SHOW TBLPROPERTIES</entry> + <entry>TABLE</entry> + <entry>SELECT/INSERT</entry> + <entry/> + </row> + <row id="describe_table_partition_privs"> + <entry>DESCRIBE TABLE .. 
PARTITION</entry> + <entry>TABLE</entry> + <entry>SELECT/INSERT</entry> + <entry/> + </row> + <row id="add_jar_privs"> + <entry>ADD JAR</entry> + <entry>Not Allowed</entry> + <entry/> + <entry/> + </row> + <row id="add_file_privs"> + <entry>ADD FILE</entry> + <entry>Not Allowed</entry> + <entry/> + <entry/> + </row> + <row id="dfs_privs"> + <entry>DFS</entry> + <entry>Not Allowed</entry> + <entry/> + <entry/> + </row> + <row> + <entry namest="col1" nameend="newCol4"><b>Impala-Only + Operations</b></entry> + </row> + <row id="explain_privs"> + <entry>EXPLAIN</entry> + <entry>TABLE; COLUMN</entry> + <entry>SELECT</entry> + <entry/> + </row> + <row id="invalidate_metadata_privs"> + <entry>INVALIDATE METADATA</entry> + <entry>SERVER</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="invalidate_metadata_table_privs"> + <entry>INVALIDATE METADATA <table name></entry> + <entry>TABLE</entry> + <entry>SELECT/INSERT</entry> + <entry/> + </row> + <row id="refresh_privs"> + <entry>REFRESH <table name> or REFRESH <table name> PARTITION (<partition_spec>)</entry> + <entry>TABLE</entry> + <entry>SELECT/INSERT</entry> + <entry/> + </row> + <row id="drop_function_privs"> + <entry>DROP FUNCTION</entry> + <entry>SERVER</entry> + <entry>ALL</entry> + <entry/> + </row> + <row id="compute_stats_privs"> + <entry>COMPUTE STATS</entry> + <entry>TABLE</entry> + <entry>ALL</entry> + <entry/> + </row> + </tbody> + </tgroup> + </table> + + <p rev="IMPALA-2660 CDH-40241" id="auth_to_local_instructions"> + In <keyword keyref="impala26_full"/> and higher, Impala recognizes the <codeph>auth_to_local</codeph> setting, + specified through the HDFS configuration setting + <codeph>hadoop.security.auth_to_local</codeph> + or the Cloudera Manager setting + <uicontrol>Additional Rules to Map Kerberos Principals to Short Names</uicontrol>. + This feature is disabled by default, to avoid an unexpected change in security-related behavior. + To enable it: + <ul> + <li> + <p> + For clusters not managed by Cloudera Manager, specify <codeph>--load_auth_to_local_rules=true</codeph> + in the <cmdname>impalad</cmdname> and <cmdname>catalogd</cmdname>configuration settings. + </p> + </li> + <li> + <p> + For clusters managed by Cloudera Manager, select the + <uicontrol>Use HDFS Rules to Map Kerberos Principals to Short Names</uicontrol> + checkbox to enable the service-wide <codeph>load_auth_to_local_rules</codeph> configuration setting. + Then restart the Impala service. + </p> + </li> + </ul> + See <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/sg_auth_to_local_isolate.html" scope="external" format="html">Using Auth-to-Local Rules to Isolate Cluster Users</xref> for general information about this feature. + </p> + + <note id="authentication_vs_authorization"> + Regardless of the authentication mechanism used, Impala always creates HDFS directories and data files + owned by the same user (typically <codeph>impala</codeph>). To implement user-level access to different + databases, tables, columns, partitions, and so on, use the Sentry authorization feature, as explained in + <xref href="../topics/impala_authorization.xml#authorization"/>. + </note> + +<!-- Contrived nesting needed to allow <ph> with ID to be reused inside the <title> of a conref. --> + + <p> + <b><ph id="title_sentry_debug">Debugging Failed Sentry Authorization Requests</ph></b> + </p> + + <p id="sentry_debug"> + Sentry logs all facts that lead up to authorization decisions at the debug level. 
If you do not understand + why Sentry is denying access, the best way to debug is to temporarily turn on debug logging: + <ul> + <li> + In Cloudera Manager, add <codeph>log4j.logger.org.apache.sentry=DEBUG</codeph> to the logging settings + for your service through the corresponding <uicontrol>Logging Safety Valve</uicontrol> field for the + Impala, Hive Server 2, or Solr Server services. + </li> + + <li> + On systems not managed by Cloudera Manager, add <codeph>log4j.logger.org.apache.sentry=DEBUG</codeph> + to the <filepath>log4j.properties</filepath> file on each host in the cluster, in the appropriate + configuration directory for each service. + </li> + </ul> + Specifically, look for exceptions and messages such as: +<codeblock xml:space="preserve">FilePermission server..., RequestPermission server...., result [true|false]</codeblock> + which indicate each evaluation Sentry makes. The <codeph>FilePermission</codeph> is from the policy file, + while <codeph>RequestPermission</codeph> is the privilege required for the query. A + <codeph>RequestPermission</codeph> will iterate over all appropriate <codeph>FilePermission</codeph> + settings until a match is found. If no matching privilege is found, Sentry returns <codeph>false</codeph> + indicating <q>Access Denied</q> . +<!-- +[1] +Impala: Impala Daemon -> Advanced -> Impala Daemon Logging Safety Valve +Hive: Hive Server 2 -> Advanced -> HiveServer2 Logging Safety Valve +Search: Solr Server -> Advanced -> HiveServer2 Logging Safety Valve +--> + </p> + + </section> + + <section id="cm"> + + <title>Cloudera Manager Terminology</title> + + <p> + Especially during the transition from CM 4 to CM 5, we'll use some stock phraseology to talk about fields + and such. Also there are some task steps etc. to conref under the Impala Service page that are easier + to keep track of here instead of in cm_common_elements.xml. (Although as part of Apache work, anything + CM might naturally move out of this file.) + </p> + + <p> + <ph id="safety_valve"> In Cloudera Manager 4, these fields are labelled <uicontrol>Safety + Valve</uicontrol>; in Cloudera Manager 5, they are called <uicontrol>Advanced Configuration + Snippet</uicontrol>. </ph> + </p> + + <ul> + <li id="go_impala_service">Go to the Impala service.</li> + <li id="restart_impala_service">Restart the Impala service.</li> + </ul> + + </section> + + <section id="citi"> + + <title>Items from the Citibank Escalation Spreadsheet</title> + + <p> + Paragraphs with IDs are intended to be reused both in the FAQ and the User's Guide. They refer to feature + requests or misunderstandings encountered by Citibank, captured in the escalation spreadsheet here: + <xref href="https://docs.google.com/a/cloudera.com/spreadsheet/ccc?key=0AplfwQJKyyTWdFdhY0E5WHVwNXZSTG9sMEZwQy1QZ1E&usp=drive_web#gid=0" scope="external" format="html"/>. + </p> + + <p id="string_concatenation"> + With Impala, you use the built-in <codeph>CONCAT()</codeph> function to concatenate two, three, or more + strings: +<codeblock xml:space="preserve">select concat('some prefix: ', col1) from t1; +select concat('abc','mno','xyz');</codeblock> + Impala does not currently support operators for string concatenation, such as <codeph>||</codeph> as seen + in some other database systems. + </p> + + <p id="column_aliases"> + You can specify column aliases with or without the <codeph>AS</codeph> keyword, and with no quotation + marks, single quotation marks, or double quotation marks. 
Some kind of quotation marks are required if the + column alias contains any spaces or other problematic characters. The alias text is displayed in the + <cmdname>impala-shell</cmdname> output as all-lowercase. For example: +<codeblock xml:space="preserve">[localhost:21000] > select c1 First_Column from t; +[localhost:21000] > select c1 as First_Column from t; ++--------------+ +| first_column | ++--------------+ +... + +[localhost:21000] > select c1 'First Column' from t; +[localhost:21000] > select c1 as 'First Column' from t; ++--------------+ +| first column | ++--------------+ +... + +[localhost:21000] > select c1 "First Column" from t; +[localhost:21000] > select c1 as "First Column" from t; ++--------------+ +| first column | ++--------------+ +...</codeblock> + </p> + + <p id="temp_tables"> + Currently, Impala does not support temporary tables. Some other database systems have a class of + <q>lightweight</q> tables that are held only in memory and/or that are only accessible by one connection + and disappear when the session ends. In Impala, creating new databases is a relatively lightweight + operation, so as an alternative, you could create a database with a unique name and use <codeph>CREATE + TABLE LIKE</codeph>, <codeph>CREATE TABLE AS SELECT</codeph>, and <codeph>INSERT</codeph> statements to + create a table in that database to hold the result set of a query, to use in subsequent queries. When + finished, issue a <codeph>DROP TABLE</codeph> statement followed by <codeph>DROP DATABASE</codeph>. + </p> + + </section> + + <section id="standards"> + + <title>Blurbs About Standards Compliance</title> + + <p> + The following blurbs simplify the process of flagging which SQL standard various features were first + introduced in. The wording and the tagging can be modified by editing one central instance of each blurb. + Not extensively used yet, just here and there in the SQL Language Reference section. + </p> + + <p id="sql1986"> +<!-- No Wikipedia page for SQL-1986, so no link. --> + <b>Standards compliance:</b> Introduced in SQL-1986. + </p> + + <p id="sql1989"> +<!-- No Wikipedia page for SQL-1989, so no link. --> + <b>Standards compliance:</b> Introduced in SQL-1989. + </p> + + <p id="sql1992"> + <b>Standards compliance:</b> Introduced in + <xref href="http://en.wikipedia.org/wiki/SQL-92" scope="external" format="html">SQL-1992</xref>. + </p> + + <p id="sql1999"> + <b>Standards compliance:</b> Introduced in + <xref href="http://en.wikipedia.org/wiki/SQL:1999" scope="external" format="html">SQL:1999</xref>. + </p> + + <p id="sql2003"> + <b>Standards compliance:</b> Introduced in + <xref href="http://en.wikipedia.org/wiki/SQL:2003" scope="external" format="html">SQL:2003</xref>. + </p> + + <p id="sql2008"> + <b>Standards compliance:</b> Introduced in + <xref href="http://en.wikipedia.org/wiki/SQL:2008" scope="external" format="html">SQL:2008</xref>. + </p> + + <p id="sql2011"> + <b>Standards compliance:</b> Introduced in + <xref href="http://en.wikipedia.org/wiki/SQL:2011" scope="external" format="html">SQL:2011</xref>. + </p> + + <p id="hiveql"> + <b>Standards compliance:</b> Extension first introduced in HiveQL. + </p> + + <p id="impalaql"> + <b>Standards compliance:</b> Extension first introduced in Impala. 
+ </p> + + </section> + + <section id="refresh_invalidate"> + + <title>Background Info for REFRESH, INVALIDATE METADATA, and General Metadata Discussion</title> + + <p id="invalidate_then_refresh" rev="DOCS-1013"> + Because <codeph>REFRESH <varname>table_name</varname></codeph> only works for tables that the current + Impala node is already aware of, when you create a new table in the Hive shell, enter + <codeph>INVALIDATE METADATA <varname>new_table</varname></codeph> before you can see the new table in + <cmdname>impala-shell</cmdname>. Once the table is known by Impala, you can issue <codeph>REFRESH + <varname>table_name</varname></codeph> after you add data files for that table. + </p> + + <p id="refresh_vs_invalidate"> + <codeph>INVALIDATE METADATA</codeph> and <codeph>REFRESH</codeph> are counterparts: <codeph>INVALIDATE + METADATA</codeph> waits to reload the metadata when needed for a subsequent query, but reloads all the + metadata for the table, which can be an expensive operation, especially for large tables with many + partitions. <codeph>REFRESH</codeph> reloads the metadata immediately, but only loads the block location + data for newly added data files, making it a less expensive operation overall. If data was altered in some + more extensive way, such as being reorganized by the HDFS balancer, use <codeph>INVALIDATE + METADATA</codeph> to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, + the <codeph>INVALIDATE METADATA</codeph> statement works just like the Impala 1.0 <codeph>REFRESH</codeph> + statement did, while the Impala 1.1 <codeph>REFRESH</codeph> is optimized for the common use case of adding + new data files to an existing table, thus the table name argument is now required. + </p> + + </section> + + <section id="sql_ref"> + + <title>SQL Language Reference Snippets</title> + + <p> + These reusable chunks were taken from conrefs originally in <filepath>ciiu_langref_sql.xml</filepath>. Or + they are primarily used in new SQL syntax topics underneath that parent topic. + </p> + +<codeblock id="parquet_fallback_schema_resolution_example"><![CDATA[ +create database schema_evolution; +use schema_evolution; +create table t1 (c1 int, c2 boolean, c3 string, c4 timestamp) + stored as parquet; +insert into t1 values + (1, true, 'yes', now()), + (2, false, 'no', now() + interval 1 day); + +select * from t1; ++----+-------+-----+-------------------------------+ +| c1 | c2 | c3 | c4 | ++----+-------+-----+-------------------------------+ +| 1 | true | yes | 2016-06-28 14:53:26.554369000 | +| 2 | false | no | 2016-06-29 14:53:26.554369000 | ++----+-------+-----+-------------------------------+ + +desc formatted t1; +... +| Location: | /user/hive/warehouse/schema_evolution.db/t1 | +... + +-- Make T2 have the same data file as in T1, including 2 +-- unused columns and column order different than T2 expects. +load data inpath '/user/hive/warehouse/schema_evolution.db/t1' + into table t2; ++----------------------------------------------------------+ +| summary | ++----------------------------------------------------------+ +| Loaded 1 file(s). Total files in destination location: 1 | ++----------------------------------------------------------+ + +-- 'position' is the default setting. +-- Impala cannot read the Parquet file if the column order does not match. 
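+-- ('position' matches the file's columns to the table's columns by ordinal position;
+-- 'name' matches them by column name, regardless of their order in the file.)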
+set PARQUET_FALLBACK_SCHEMA_RESOLUTION=position; +PARQUET_FALLBACK_SCHEMA_RESOLUTION set to position + +select * from t2; +WARNINGS: +File 'schema_evolution.db/t2/45331705_data.0.parq' +has an incompatible Parquet schema for column 'schema_evolution.t2.c4'. +Column type: TIMESTAMP, Parquet schema: optional int32 c1 [i:0 d:1 r:0] + +File 'schema_evolution.db/t2/45331705_data.0.parq' +has an incompatible Parquet schema for column 'schema_evolution.t2.c4'. +Column type: TIMESTAMP, Parquet schema: optional int32 c1 [i:0 d:1 r:0] + +-- With the 'name' setting, Impala can read the Parquet data files +-- despite mismatching column order. +set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; +PARQUET_FALLBACK_SCHEMA_RESOLUTION set to name + +select * from t2; ++-------------------------------+-------+ +| c4 | c2 | ++-------------------------------+-------+ +| 2016-06-28 14:53:26.554369000 | true | +| 2016-06-29 14:53:26.554369000 | false | ++-------------------------------+-------+ +]]> +</codeblock> + + <note rev="IMPALA-3334" id="one_but_not_true"> + In CDH 5.7.0 / Impala 2.5.0, only the value 1 enables the option, and the value + <codeph>true</codeph> is not recognized. This limitation is + tracked by the issue + <xref href="https://issues.cloudera.org/browse/IMPALA-3334" scope="external" format="html">IMPALA-3334</xref>, + which shows the releases where the problem is fixed. + </note> + + <p rev="IMPALA-3732" id="avro_2gb_strings"> + The Avro specification allows string values up to 2**64 bytes in length. + Impala queries for Avro tables use 32-bit integers to hold string lengths. + In <keyword keyref="impala25_full"/> and higher, Impala truncates <codeph>CHAR</codeph> + and <codeph>VARCHAR</codeph> values in Avro tables to (2**31)-1 bytes. + If a query encounters a <codeph>STRING</codeph> value longer than (2**31)-1 + bytes in an Avro table, the query fails. In earlier releases, + encountering such long values in an Avro table could cause a crash. + </p> + + <p rev="2.6.0 IMPALA-3369" id="set_column_stats_example"> + You specify a case-insensitive symbolic name for the kind of statistics: + <codeph>numDVs</codeph>, <codeph>numNulls</codeph>, <codeph>avgSize</codeph>, <codeph>maxSize</codeph>. + The key names and values are both quoted. This operation applies to an entire table, + not a specific partition. 
For example: +<codeblock> +create table t1 (x int, s string); +insert into t1 values (1, 'one'), (2, 'two'), (2, 'deux'); +show column stats t1; ++--------+--------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------+--------+------------------+--------+----------+----------+ +| x | INT | -1 | -1 | 4 | 4 | +| s | STRING | -1 | -1 | -1 | -1 | ++--------+--------+------------------+--------+----------+----------+ +alter table t1 set column stats x ('numDVs'='2','numNulls'='0'); +alter table t1 set column stats s ('numdvs'='3','maxsize'='4'); +show column stats t1; ++--------+--------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------+--------+------------------+--------+----------+----------+ +| x | INT | 2 | 0 | 4 | 4 | +| s | STRING | 3 | -1 | 4 | -1 | ++--------+--------+------------------+--------+----------+----------+ +</codeblock> + </p> + +<codeblock id="set_numrows_example">create table analysis_data stored as parquet as select * from raw_data; +Inserted 1000000000 rows in 181.98s +compute stats analysis_data; +insert into analysis_data select * from smaller_table_we_forgot_before; +Inserted 1000000 rows in 15.32s +-- Now there are 1001000000 rows. We can update this single data point in the stats. +alter table analysis_data set tblproperties('numRows'='1001000000', 'STATS_GENERATED_VIA_STATS_TASK'='true');</codeblock> + +<codeblock id="set_numrows_partitioned_example">-- If the table originally contained 1 million rows, and we add another partition with 30 thousand rows, +-- change the numRows property for the partition and the overall table. +alter table partitioned_data partition(year=2009, month=4) set tblproperties ('numRows'='30000', 'STATS_GENERATED_VIA_STATS_TASK'='true'); +alter table partitioned_data set tblproperties ('numRows'='1030000', 'STATS_GENERATED_VIA_STATS_TASK'='true');</codeblock> + + <p id="int_overflow_behavior"> + Impala does not return column overflows as <codeph>NULL</codeph>, so that customers can distinguish + between <codeph>NULL</codeph> data and overflow conditions similar to how they do so with traditional + database systems. Impala returns the largest or smallest value in the range for the type. For example, + valid values for a <codeph>tinyint</codeph> range from -128 to 127. In Impala, a <codeph>tinyint</codeph> + with a value of -200 returns -128 rather than <codeph>NULL</codeph>. A <codeph>tinyint</codeph> with a + value of 200 returns 127. + </p> + + <p rev="2.5.0" id="partition_key_optimization"> + If you frequently run aggregate functions such as <codeph>MIN()</codeph>, <codeph>MAX()</codeph>, and + <codeph>COUNT(DISTINCT)</codeph> on partition key columns, consider enabling the <codeph>OPTIMIZE_PARTITION_KEY_SCANS</codeph> + query option, which optimizes such queries. This feature is available in <keyword keyref="impala25_full"/> and higher. + See <xref href="../topics/impala_optimize_partition_key_scans.xml"/> + for the kinds of queries that this option applies to, and slight differences in how partitions are + evaluated when this query option is enabled. + </p> + + <p id="live_reporting_details"> + The output from this query option is printed to standard error. The output is only displayed in interactive mode, + that is, not when the <codeph>-q</codeph> or <codeph>-f</codeph> options are used. 
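+      For example, in an interactive <cmdname>impala-shell</cmdname> session you might enable the
+      option before running a long query (shown here with <codeph>LIVE_PROGRESS</codeph>; the table
+      name is illustrative):
+<codeblock xml:space="preserve">[localhost:21000] > set live_progress=true;
+[localhost:21000] > select count(*) from huge_table;</codeblock>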
+ </p> + + <p id="live_progress_live_summary_asciinema"> + To see how the <codeph>LIVE_PROGRESS</codeph> and <codeph>LIVE_SUMMARY</codeph> query options + work in real time, see <xref href="https://asciinema.org/a/1rv7qippo0fe7h5k1b6k4nexk" scope="external" format="html">this animated demo</xref>. + </p> + + <p rev="2.5.0" id="runtime_filter_mode_blurb"> + Because the runtime filtering feature is enabled by default only for local processing, + the other filtering-related query options have the greatest effect when used in + combination with the setting <codeph>RUNTIME_FILTER_MODE=GLOBAL</codeph>. + </p> + + <p rev="2.5.0" id="runtime_filtering_option_caveat"> + Because the runtime filtering feature applies mainly to resource-intensive + and long-running queries, only adjust this query option when tuning long-running queries + involving some combination of large partitioned tables and joins involving large tables. + </p> + + <p rev="2.3.0" id="impala_shell_progress_reports_compute_stats_caveat"> + The <codeph>LIVE_PROGRESS</codeph> and <codeph>LIVE_SUMMARY</codeph> query options + currently do not produce any output during <codeph>COMPUTE STATS</codeph> operations. + </p> + +<!-- This is a shorter version of the similar 'caveat' text. This shorter one can be reused more easily in various places. --> + <p rev="2.3.0" id="impala_shell_progress_reports_shell_only_blurb"> + The <codeph>LIVE_PROGRESS</codeph> and <codeph>LIVE_SUMMARY</codeph> query options only apply + inside the <cmdname>impala-shell</cmdname> interpreter. You cannot use them with the + <codeph>SET</codeph> statement from a JDBC or ODBC application. + </p> + + <p id="impala_shell_progress_reports_shell_only_caveat"> + Because the <codeph>LIVE_PROGRESS</codeph> and <codeph>LIVE_SUMMARY</codeph> query options + are available only within the <cmdname>impala-shell</cmdname> interpreter: + <ul> + <li> + <p> + You cannot change these query options through the SQL <codeph>SET</codeph> + statement using the JDBC or ODBC interfaces. The <codeph>SET</codeph> + command in <cmdname>impala-shell</cmdname> recognizes these names as + shell-only options. + </p> + </li> + <li> + <p> + Be careful when using <cmdname>impala-shell</cmdname> on a pre-CDH 5.5 + system to connect to Impala running on a CDH 5.5 or higher system. + The older <cmdname>impala-shell</cmdname> does not recognize these + query option names. Upgrade <cmdname>impala-shell</cmdname> on the + systems where you intend to use these query options. + </p> + </li> + <li> + <p> + Likewise, the <cmdname>impala-shell</cmdname> command relies on + some information only available in <keyword keyref="impala23_full"/> and higher + to prepare live progress reports and query summaries. The + <codeph>LIVE_PROGRESS</codeph> and <codeph>LIVE_SUMMARY</codeph> + query options have no effect when <cmdname>impala-shell</cmdname> connects + to a cluster running an older version of Impala. + </p> + </li> + </ul> + </p> + +<!-- Same example used in both CREATE DATABASE and DROP DATABASE. --> +<codeblock id="create_drop_db_example">create database first_db; +use first_db; +create table t1 (x int); + +create database second_db; +use second_db; +-- Each database has its own namespace for tables. +-- You can reuse the same table names in each database. +create table t1 (s string); + +create database temp; + +-- You can either USE a database after creating it, +-- or qualify all references to the table name with the name of the database. +-- Here, tables T2 and T3 are both created in the TEMP database. 
+ +create table temp.t2 (x int, y int); +use database temp; +create table t3 (s string); + +-- You cannot drop a database while it is selected by the USE statement. +drop database temp; +<i>ERROR: AnalysisException: Cannot drop current default database: temp</i> + +-- The always-available database 'default' is a convenient one to USE +-- before dropping a database you created. +use default; + +-- Before dropping a database, first drop all the tables inside it, +<ph rev="2.3.0">-- or in <keyword keyref="impala23_full"/> and higher use the CASCADE clause.</ph> +drop database temp; +ERROR: ImpalaRuntimeException: Error making 'dropDatabase' RPC to Hive Metastore: +CAUSED BY: InvalidOperationException: Database temp is not empty +show tables in temp; ++------+ +| name | ++------+ +| t3 | ++------+ + +<ph rev="2.3.0">-- <keyword keyref="impala23_full"/> and higher:</ph> +<ph rev="2.3.0">drop database temp cascade;</ph> + +-- CDH 5.4 and lower: +drop table temp.t3; +drop database temp; +</codeblock> + + <p id="cast_convenience_fn_example"> + This example shows how to use the <codeph>castto*()</codeph> functions as an equivalent + to <codeph>CAST(<varname>value</varname> AS <varname>type</varname>)</codeph> expressions. + </p> + + <p id="cast_convenience_fn_usage"><b>Usage notes:</b> + A convenience function to skip the SQL <codeph>CAST <varname>value</varname> AS <varname>type</varname></codeph> syntax, + for example when programmatically generating SQL statements where a regular function call might be easier to construct. + </p> + + <p rev="2.3.0" id="current_timezone_tip"> + To determine the time zone of the server you are connected to, in CDH 5.5 / Impala 2.3 and + higher you can call the <codeph>timeofday()</codeph> function, which includes the time zone + specifier in its return value. Remember that with cloud computing, the server you interact + with might be in a different time zone than you are, or different sessions might connect to + servers in different time zones, or a cluster might include servers in more than one time zone. + </p> + + <p rev="2.2.0" id="timezone_conversion_caveat"> + The way this function deals with time zones when converting to or from <codeph>TIMESTAMP</codeph> + values is affected by the <codeph>-use_local_tz_for_unix_timestamp_conversions</codeph> startup flag for the + <cmdname>impalad</cmdname> daemon. See <xref href="../topics/impala_timestamp.xml#timestamp"/> for details about + how Impala handles time zone considerations for the <codeph>TIMESTAMP</codeph> data type. + </p> + + <p rev="2.6.0 CDH-39913 IMPALA-3558" id="s3_drop_table_purge"> + For best compatibility with the S3 write support in CDH 5.8 / Impala 2.6 + and higher: + <ul> + <li>Use native Hadoop techniques to create data files in S3 for querying through Impala.</li> + <li>Use the <codeph>PURGE</codeph> clause of <codeph>DROP TABLE</codeph> when dropping internal (managed) tables.</li> + </ul> + By default, when you drop an internal (managed) table, the data files are + moved to the HDFS trashcan. This operation is expensive for tables that + reside on the Amazon S3 filesystem. Therefore, for S3 tables, prefer to use + <codeph>DROP TABLE <varname>table_name</varname> PURGE</codeph> rather than the default <codeph>DROP TABLE</codeph> statement. + The <codeph>PURGE</codeph> clause makes Impala delete the data files immediately, + skipping the HDFS trashcan. 
+ For the <codeph>PURGE</codeph> clause to work effectively, you must originally create the + data files on S3 using one of the tools from the Hadoop ecosystem, such as + <codeph>hadoop fs -cp</codeph>, or <codeph>INSERT</codeph> in Impala or Hive. + </p> + + <p rev="2.6.0 CDH-39913 IMPALA-1878" id="s3_dml_performance"> + Because of differences between S3 and traditional filesystems, DML operations + for S3 tables can take longer than for tables on HDFS. For example, both the + <codeph>LOAD DATA</codeph> statement and the final stage of the <codeph>INSERT</codeph> + and <codeph>CREATE TABLE AS SELECT</codeph> statements involve moving files from one directory + to another. (In the case of <codeph>INSERT</codeph> and <codeph>CREATE TABLE AS SELECT</codeph>, + the files are moved from a temporary staging directory to the final destination directory.) + Because S3 does not support a <q>rename</q> operation for existing objects, in these cases Impala + actually copies the data files from one location to another and then removes the original files. + In CDH 5.8 / Impala 2.6, the <codeph>S3_SKIP_INSERT_STAGING</codeph> query option provides a way + to speed up <codeph>INSERT</codeph> statements for S3 tables and partitions, with the tradeoff + that a problem during statement execution could leave data in an inconsistent state. + It does not apply to <codeph>INSERT OVERWRITE</codeph> or <codeph>LOAD DATA</codeph> statements. + See <xref href="../topics/impala_s3_skip_insert_staging.xml#s3_skip_insert_staging"/> for details. + </p> + + <p rev="2.6.0 CDH-40329 IMPALA-3453" id="s3_block_splitting"> + In <keyword keyref="impala26_full"/> and higher, Impala queries are optimized for files stored in Amazon S3. + For Impala tables that use the file formats Parquet, RCFile, SequenceFile, + Avro, and uncompressed text, the setting <codeph>fs.s3a.block.size</codeph> + in the <filepath>core-site.xml</filepath> configuration file determines + how Impala divides the I/O work of reading the data files. This configuration + setting is specified in bytes. By default, this + value is 33554432 (32 MB), meaning that Impala parallelizes S3 read operations on the files + as if they were made up of 32 MB blocks. For example, if your S3 queries primarily access + Parquet files written by MapReduce or Hive, increase <codeph>fs.s3a.block.size</codeph> + to 134217728 (128 MB) to match the row group size of those files. If most S3 queries involve + Parquet files written by Impala, increase <codeph>fs.s3a.block.size</codeph> + to 268435456 (256 MB) to match the row group size produced by Impala. + </p> + + <note rev="2.6.0 CDH-39913 IMPALA-1878" id="s3_production" type="important"> + <p> + In <keyword keyref="impala26_full"/> and higher, Impala supports both queries (<codeph>SELECT</codeph>) + and DML (<codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, <codeph>CREATE TABLE AS SELECT</codeph>) + for data residing on Amazon S3. With the inclusion of write support, + <!-- and configuration settings for more secure S3 key management, --> + the Impala support for S3 is now considered ready for production use. + </p> + </note> + + <note rev="2.2.0" id="s3_caveat" type="important"> + <p> Impala query support for Amazon S3 is included in CDH 5.4.0, but is + not currently supported or recommended for production use. To try this + feature, use it in a test environment until Cloudera resolves + currently existing issues and limitations to make it ready for + production use. 
</p> + </note> + + <p rev="2.6.0 CDH-39913 IMPALA-1878" id="s3_ddl"> + In <keyword keyref="impala26_full"/> and higher, Impala DDL statements such as + <codeph>CREATE DATABASE</codeph>, <codeph>CREATE TABLE</codeph>, <codeph>DROP DATABASE CASCADE</codeph>, + <codeph>DROP TABLE</codeph>, and <codeph>ALTER TABLE [ADD|DROP] PARTITION</codeph> can create or remove folders + as needed in the Amazon S3 system. Prior to CDH 5.8 / Impala 2.6, you had to create folders yourself and point + Impala database, tables, or partitions at them, and manually remove folders when no longer needed. + See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala. + </p> + + <p rev="2.6.0 CDH-39913 IMPALA-1878" id="s3_dml"> + In <keyword keyref="impala26_full"/> and higher, the Impala DML statements (<codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, + and <codeph>CREATE TABLE AS SELECT</codeph>) can write data into a table or partition that resides in the + Amazon Simple Storage Service (S3). + The syntax of the DML statements is the same as for any other tables, because the S3 location for tables and + partitions is specified by an <codeph>s3a://</codeph> prefix in the + <codeph>LOCATION</codeph> attribute of + <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements. + If you bring data into S3 using the normal S3 transfer mechanisms instead of Impala DML statements, + issue a <codeph>REFRESH</codeph> statement for the table before using Impala to query the S3 data. + </p> + + <!-- Formerly part of s3_dml element. Moved out to avoid a circular link in the S3 topic itelf. --> + <!-- See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala. --> + + <p rev="2.2.0" id="s3_metadata"> + The <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements also cache metadata + for tables where the data resides in the Amazon Simple Storage Service (S3). + In particular, issue a <codeph>REFRESH</codeph> for a table after adding or removing files + in the associated S3 data directory. + See <xref href="../topics/impala_s3.xml#s3"/> for details about working with S3 tables. + </p> + + <p id="y2k38" rev="2.2.0"> + In Impala 2.2.0 and higher, built-in functions that accept or return integers representing <codeph>TIMESTAMP</codeph> values + use the <codeph>BIGINT</codeph> type for parameters and return values, rather than <codeph>INT</codeph>. + This change lets the date and time functions avoid an overflow error that would otherwise occur + on January 19th, 2038 (known as the + <xref href="http://en.wikipedia.org/wiki/Year_2038_problem" scope="external" format="html"><q>Year 2038 problem</q> or <q>Y2K38 problem</q></xref>). + This change affects the <codeph>from_unixtime()</codeph> and <codeph>unix_timestamp()</codeph> functions. + You might need to change application code that interacts with these functions, change the types of + columns that store the return values, or add <codeph>CAST()</codeph> calls to SQL statements that + call these functions. + </p> + + <p id="timestamp_conversions"> + Impala automatically converts <codeph>STRING</codeph> literals of the correct format into + <codeph>TIMESTAMP</codeph> values. Timestamp values are accepted in the format + <codeph>"yyyy-MM-dd HH:mm:ss.SSSSSS"</codeph>, and can consist of just the date, or just the time, with or + without the fractional second portion. 
For example, you can specify <codeph>TIMESTAMP</codeph> values such as + <codeph>'1966-07-30'</codeph>, <codeph>'08:30:00'</codeph>, or <codeph>'1985-09-25 17:45:30.005'</codeph>. + <ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/> + </p> + + + <p> + <ph id="cast_int_to_timestamp">Casting an integer or floating-point value <codeph>N</codeph> to + <codeph>TIMESTAMP</codeph> produces a value that is <codeph>N</codeph> seconds past the start of the epoch + date (January 1, 1970). By default, the result value represents a date and time in the UTC time zone. + If the setting <codeph>-use_local_tz_for_unix_timestamp_conversions=true</codeph> is in effect, + the resulting <codeph>TIMESTAMP</codeph> represents a date and time in the local time zone.</ph> + </p> + + <p id="redaction_yes" rev="2.2.0"> + If these statements in your environment contain sensitive literal values such as credit card numbers or tax + identifiers, Impala can redact this sensitive information when displaying the statements in log files and + other administrative contexts. See + <xref audience="integrated" href="../topics/sg_redaction.xml#log_redact"/><xref audience="standalone" href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/sg_redaction.html" scope="external" format="html"/> + for details. + </p> + + <p id="incremental_partition_spec"> + The <codeph>PARTITION</codeph> clause is only allowed in combination with the <codeph>INCREMENTAL</codeph> + clause. It is optional for <codeph>COMPUTE INCREMENTAL STATS</codeph>, and required for <codeph>DROP + INCREMENTAL STATS</codeph>. Whenever you specify partitions through the <codeph>PARTITION + (<varname>partition_spec</varname>)</codeph> clause in a <codeph>COMPUTE INCREMENTAL STATS</codeph> or + <codeph>DROP INCREMENTAL STATS</codeph> statement, you must include all the partitioning columns in the + specification, and specify constant values for all the partition key columns. + </p> + + <p id="udf_persistence_restriction" rev="2.5.0 IMPALA-1748"> + In <keyword keyref="impala25_full"/> and higher, Impala UDFs and UDAs written in C++ are persisted in the metastore database. + Java UDFs are also persisted, if they were created with the new <codeph>CREATE FUNCTION</codeph> syntax for Java UDFs, + where the Java function argument and return types are omitted. + Java-based UDFs created with the old <codeph>CREATE FUNCTION</codeph> syntax do not persist across restarts + because they are held in the memory of the <cmdname>catalogd</cmdname> daemon. + Until you re-create such Java UDFs using the new <codeph>CREATE FUNCTION</codeph> syntax, + you must reload those Java-based UDFs by running the original <codeph>CREATE FUNCTION</codeph> statements again each time + you restart the <cmdname>catalogd</cmdname> daemon. + Prior to <keyword keyref="impala25_full"/> the requirement to reload functions after a restart applied to both C++ and Java functions. + </p> + + <p id="current_user_caveat" rev="CDH-36552"> + The Hive <codeph>current_user()</codeph> function cannot be + called from a Java UDF through Impala. + </p> + + <note id="add_partition_set_location"> + If you are creating a partition for the first time and specifying its location, for maximum efficiency, use + a single <codeph>ALTER TABLE</codeph> statement including both the <codeph>ADD PARTITION</codeph> and + <codeph>LOCATION</codeph> clauses, rather than separate statements with <codeph>ADD PARTITION</codeph> and + <codeph>SET LOCATION</codeph> clauses. 
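+      For example (the table, partition, and path names here are illustrative):
+<codeblock xml:space="preserve">-- Create the partition and point it at its data location in one statement.
+ALTER TABLE sales ADD PARTITION (year=2016, month=1)
+  LOCATION '/user/impala/data/sales/year=2016/month=1';
+
+-- Less efficient: separate ADD PARTITION and SET LOCATION statements.
+ALTER TABLE sales ADD PARTITION (year=2016, month=1);
+ALTER TABLE sales PARTITION (year=2016, month=1)
+  SET LOCATION '/user/impala/data/sales/year=2016/month=1';</codeblock>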
+ </note> + + <p id="insert_hidden_work_directory"> + The <codeph>INSERT</codeph> statement has always left behind a hidden work directory inside the data + directory of the table. Formerly, this hidden work directory was named + <filepath>.impala_insert_staging</filepath> . In Impala 2.0.1 and later, this directory name is changed to + <filepath>_impala_insert_staging</filepath> . (While HDFS tools are expected to treat names beginning + either with underscore and dot as hidden, in practice names beginning with an underscore are more widely + supported.) If you have any scripts, cleanup jobs, and so on that rely on the name of this work directory, + adjust them to use the new name. + </p> + + <p id="check_internal_external_table"> + To see whether a table is internal or external, and its associated HDFS location, issue the statement + <codeph>DESCRIBE FORMATTED <varname>table_name</varname></codeph>. The <codeph>Table Type</codeph> field + displays <codeph>MANAGED_TABLE</codeph> for internal tables and <codeph>EXTERNAL_TABLE</codeph> for + external tables. The <codeph>Location</codeph> field displays the path of the table directory as an HDFS + URI. + </p> + + <p id="switch_internal_external_table"> + You can switch a table from internal to external, or from external to internal, by using the <codeph>ALTER + TABLE</codeph> statement: +<codeblock xml:space="preserve"> +-- Switch a table from internal to external. +ALTER TABLE <varname>table_name</varname> SET TBLPROPERTIES('EXTERNAL'='TRUE'); + +-- Switch a table from external to internal. +ALTER TABLE <varname>table_name</varname> SET TBLPROPERTIES('EXTERNAL'='FALSE'); +</codeblock> + </p> + +<!-- The data to show sensible output from these queries is in the TPC-DS schema 'CUSTOMER' table. + If you want to show real output, add a LIMIT 5 or similar clause to each query to avoid + too-long output. --> + +<codeblock id="regexp_rlike_examples" xml:space="preserve">-- Find all customers whose first name starts with 'J', followed by 0 or more of any character. +select c_first_name, c_last_name from customer where c_first_name regexp '^J.*'; +select c_first_name, c_last_name from customer where c_first_name rlike '^J.*'; + +-- Find 'Macdonald', where the first 'a' is optional and the 'D' can be upper- or lowercase. +-- The ^...$ are required, to match the start and end of the value. +select c_first_name, c_last_name from customer where c_last_name regexp '^Ma?c[Dd]onald$'; +select c_first_name, c_last_name from customer where c_last_name rlike '^Ma?c[Dd]onald$'; + +-- Match multiple character sequences, either 'Mac' or 'Mc'. +select c_first_name, c_last_name from customer where c_last_name regexp '^(Mac|Mc)donald$'; +select c_first_name, c_last_name from customer where c_last_name rlike '^(Mac|Mc)donald$'; + +-- Find names starting with 'S', then one or more vowels, then 'r', then any other characters. +-- Matches 'Searcy', 'Sorenson', 'Sauer'. +select c_first_name, c_last_name from customer where c_last_name regexp '^S[aeiou]+r.*$'; +select c_first_name, c_last_name from customer where c_last_name rlike '^S[aeiou]+r.*$'; + +-- Find names that end with 2 or more vowels: letters from the set a,e,i,o,u. +select c_first_name, c_last_name from customer where c_last_name regexp '.*[aeiou]{2,}$'; +select c_first_name, c_last_name from customer where c_last_name rlike '.*[aeiou]{2,}$'; + +-- You can use letter ranges in the [] blocks, for example to find names starting with A, B, or C. 
+select c_first_name, c_last_name from customer where c_last_name regexp '^[A-C].*'; +select c_first_name, c_last_name from customer where c_last_name rlike '^[A-C].*'; + +-- If you are not sure about case, leading/trailing spaces, and so on, you can process the +-- column using string functions first. +select c_first_name, c_last_name from customer where lower(trim(c_last_name)) regexp '^de.*'; +select c_first_name, c_last_name from customer where lower(trim(c_last_name)) rlike '^de.*'; +</codeblock> + + <p id="case_insensitive_comparisons_tip" rev="2.5.0 IMPALA-1787"> + In <keyword keyref="impala25_full"/> and higher, you can simplify queries that + use many <codeph>UPPER()</codeph> and <codeph>LOWER()</codeph> calls + to do case-insensitive comparisons, by using the <codeph>ILIKE</codeph> + or <codeph>IREGEXP</codeph> operators instead. See + <xref href="../topics/impala_operators.xml#ilike"/> and + <xref href="../topics/impala_operators.xml#iregexp"/> for details. + </p> + + <p id="show_security"> + When authorization is enabled, the output of the <codeph>SHOW</codeph> statement is limited to those + objects for which you have some privilege. There might be other database, tables, and so on, but their + names are concealed. If you believe an object exists but you cannot see it in the <codeph>SHOW</codeph> + output, check with the system administrator if you need to be granted a new privilege for that object. See + <xref href="../topics/impala_authorization.xml#authorization"/> for how to set up authorization and add + privileges for specific kinds of objects. + </p> + + <p id="infinity_and_nan" rev="IMPALA-3267"> + Infinity and NaN can be specified in text data files as <codeph>inf</codeph> and <codeph>nan</codeph> + respectively, and Impala interprets them as these special values. They can also be produced by certain + arithmetic expressions; for example, <codeph>pow(-1, 0.5)</codeph> returns <codeph>Infinity</codeph> and + <codeph>1/0</codeph> returns <codeph>NaN</codeph>. Or you can cast the literal values, such as <codeph>CAST('nan' AS + DOUBLE)</codeph> or <codeph>CAST('inf' AS DOUBLE)</codeph>. + </p> + + <p rev="2.0.0" id="user_kerberized"> + In Impala 2.0 and later, <codeph>user()</codeph> returns the full Kerberos principal string, such as + <codeph>[email protected]</codeph>, in a Kerberized environment. + </p> + + <ul> + <li id="grant_revoke_single"> + Currently, each Impala <codeph>GRANT</codeph> or <codeph>REVOKE</codeph> statement can only grant or + revoke a single privilege to or from a single role. + </li> + </ul> + + <p id="blobs_are_strings"> + All data in <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> columns must be in a character encoding that + is compatible with UTF-8. If you have binary data from another database system (that is, a BLOB type), use + a <codeph>STRING</codeph> column to hold it. + </p> + +<!-- The codeblock is nested inside this paragraph, so the intro text + and the code get conref'ed as a unit. --> + + <p id="create_drop_view_examples"> + The following example creates a series of views and then drops them. These examples illustrate how views + are associated with a particular database, and both the view definitions and the view names for + <codeph>CREATE VIEW</codeph> and <codeph>DROP VIEW</codeph> can refer to a view in the current database or + a fully qualified view name. +<codeblock xml:space="preserve"> +-- Create and drop a view in the current database. 
+CREATE VIEW few_rows_from_t1 AS SELECT * FROM t1 LIMIT 10;
+DROP VIEW few_rows_from_t1;
+
+-- Create and drop a view referencing a table in a different database.
+CREATE VIEW table_from_other_db AS SELECT x FROM db1.foo WHERE x IS NOT NULL;
+DROP VIEW table_from_other_db;
+
+USE db1;
+-- Create a view in a different database.
+CREATE VIEW db2.v1 AS SELECT * FROM db2.foo;
+-- Switch into the other database and drop the view.
+USE db2;
+DROP VIEW v1;
+
+USE db1;
+-- Create a view in a different database.
+CREATE VIEW db2.v1 AS SELECT * FROM db2.foo;
+-- Drop a view in the other database.
+DROP VIEW db2.v1;
+</codeblock>
+ </p>
+
+ <p id="char_varchar_cast_from_string">
+ For <codeph>INSERT</codeph> operations into <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> columns, you
+ must cast all <codeph>STRING</codeph> literals or expressions returning <codeph>STRING</codeph> to a
+ <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> type with the appropriate length.
+ </p>
+
+ <p rev="2.0.0" id="subquery_no_limit">
+ Correlated subqueries used in <codeph>EXISTS</codeph> and <codeph>IN</codeph> operators cannot include a
+ <codeph>LIMIT</codeph> clause.
+ </p>
+
+ <p id="avro_no_timestamp">
+ Currently, Avro tables cannot contain <codeph>TIMESTAMP</codeph> columns. If you need to store date and
+ time values in Avro tables, as a workaround you can use a <codeph>STRING</codeph> representation of the
+ values, convert the values to <codeph>BIGINT</codeph> with the <codeph>UNIX_TIMESTAMP()</codeph> function,
+ or create separate numeric columns for individual date and time fields using the <codeph>EXTRACT()</codeph>
+ function.
+ </p>
+
+ <p id="zero_length_strings">
+ <b>Zero-length strings:</b> For purposes of clauses such as <codeph>DISTINCT</codeph> and <codeph>GROUP
+ BY</codeph>, Impala considers zero-length strings (<codeph>""</codeph>), <codeph>NULL</codeph>, and space
+ to all be different values.
+ </p>
+
+ <p rev="2.5.0 IMPALA-3054" id="spill_to_disk_vs_dynamic_partition_pruning">
+ When the spill-to-disk feature is activated for a join node within a query, Impala does not
+ produce any runtime filters for that join operation on that host. Other join nodes within
+ the query are not affected.
+ </p>
+
+<codeblock id="simple_dpp_example">
+create table yy (s string) partitioned by (year int) stored as parquet;
+insert into yy partition (year) values ('1999', 1999), ('2000', 2000),
+ ('2001', 2001), ('2010', 2010);
+compute stats yy;
+
+create table yy2 (s string) partitioned by (year int) stored as parquet;
+insert into yy2 partition (year) values ('1999', 1999), ('2000', 2000),
+ ('2001', 2001);
+compute stats yy2;
+
+-- The query reads an unknown number of partitions, whose key values are only
+-- known at run time. The 'runtime filters' lines show how the information about
+-- the partitions is calculated in query fragment 02, and then used in query
+-- fragment 00 to decide which partitions to skip.
+explain select s from yy2 where year in (select year from yy where year between 2000 and 2005); ++----------------------------------------------------------+ +| Explain String | ++----------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=16.00MB VCores=2 | +| | +| 04:EXCHANGE [UNPARTITIONED] | +| | | +| 02:HASH JOIN [LEFT SEMI JOIN, BROADCAST] | +| | hash predicates: year = year | +| | <b>runtime filters: RF000 <- year</b> | +| | | +| |--03:EXCHANGE [BROADCAST] | +| | | | +| | 01:SCAN HDFS [dpp.yy] | +| | partitions=2/4 files=2 size=468B | +| | | +| 00:SCAN HDFS [dpp.yy2] | +| partitions=2/3 files=2 size=468B | +| <b>runtime filters: RF000 -> year</b> | ++----------------------------------------------------------+ +</codeblock> + <p id="order_by_scratch_dir"> + By default, intermediate files used during large sort, join, aggregation, or analytic function operations + are stored in the directory <filepath>/tmp/impala-scratch</filepath> . These files are removed when the + operation finishes. (Multiple concurrent queries can perform operations that use the <q>spill to disk</q> + technique, without any name conflicts for these temporary files.) You can specify a different location by + starting the <cmdname>impalad</cmdname> daemon with the + <codeph>--scratch_dirs="<varname>path_to_directory</varname>"</codeph> configuration option or the + equivalent configuration option in the Cloudera Manager user interface. You can specify a single directory, + or a comma-separated list of directories. The scratch directories must be on the local filesystem, not in + HDFS. You might specify different directory paths for different hosts, depending on the capacity and speed + of the available storage devices. In CDH 5.5 / Impala 2.3 or higher, Impala successfully starts (with a warning + written to the log) if it cannot create or read and write files in one of the scratch directories. + If there is less than 1 GB free on the filesystem where that directory resides, Impala still runs, but writes a + warning message to its log. If Impala encounters an error reading or writing files in a scratch directory during + a query, Impala logs the error and the query fails. + </p> + + <p id="order_by_view_restriction"> + An <codeph>ORDER BY</codeph> clause without an additional <codeph>LIMIT</codeph> clause is ignored in any + view definition. If you need to sort the entire result set from a view, use an <codeph>ORDER BY</codeph> + clause in the <codeph>SELECT</codeph> statement that queries the view. You can still make a simple <q>top + 10</q> report by combining the <codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph> clauses in the same + view definition: +<codeblock xml:space="preserve">[localhost:21000] > create table unsorted (x bigint); +[localhost:21000] > insert into unsorted values (1), (9), (3), (7), (5), (8), (4), (6), (2); +[localhost:21000] > create view sorted_view as select x from unsorted order by x; +[localhost:21000] > select x from sorted_view; -- ORDER BY clause in view has no effect. ++---+ +| x | ++---+ +| 1 | +| 9 | +| 3 | +| 7 | +| 5 | +| 8 | +| 4 | +| 6 | +| 2 | ++---+ +[localhost:21000] > select x from sorted_view order by x; -- View query requires ORDER BY at outermost level. 
++---+
+| x |
++---+
+| 1 |
+| 2 |
+| 3 |
+| 4 |
+| 5 |
+| 6 |
+| 7 |
+| 8 |
+| 9 |
++---+
+[localhost:21000] > create view top_3_view as select x from unsorted order by x limit 3;
+[localhost:21000] > select x from top_3_view; -- ORDER BY and LIMIT together in view definition are preserved.
++---+
+| x |
++---+
+| 1 |
+| 2 |
+| 3 |
++---+
+</codeblock>
+ </p>
+
+ <p id="precision_scale_example">
+ The following examples demonstrate how to check the precision and scale of numeric literals or other
+ numeric expressions. Impala represents numeric literals in the smallest appropriate type. 5 is a
+ <codeph>TINYINT</codeph> value, which ranges from -128 to 127; therefore, 3 decimal digits are needed to
+ represent the entire range, and because it is an integer value there are no fractional digits. 1.333 is
+ interpreted as a <codeph>DECIMAL</codeph> value, with 4 digits total and 3 digits after the decimal point.
+<codeblock xml:space="preserve">[localhost:21000] > select precision(5), scale(5);
++--------------+----------+
+| precision(5) | scale(5) |
++--------------+----------+
+| 3            | 0        |
++--------------+----------+
+[localhost:21000] > select precision(1.333), scale(1.333);
++------------------+--------------+
+| precision(1.333) | scale(1.333) |
++------------------+--------------+
+| 4                | 3            |
++------------------+--------------+
+[localhost:21000] > with t1 as
+ ( select cast(12.34 as decimal(20,2)) x union select cast(1 as decimal(8,6)) x )
+ select precision(x), scale(x) from t1 limit 1;
++--------------+----------+
+| precision(x) | scale(x) |
++--------------+----------+
+| 24           | 6        |
++--------------+----------+
+</codeblock>
+ </p>
+
+<!-- These 'type_' entries are for query options, where the type doesn't match up exactly with an Impala data type. -->
+
+ <p id="type_boolean">
+ <b>Type:</b> Boolean; recognized values are 1 and 0, or <codeph>true</codeph> and <codeph>false</codeph>;
+ any other value is interpreted as <codeph>false</codeph>
+ </p>
+
+ <p id="type_string">
+ <b>Type:</b> string
+ </p>
+
+ <p id="type_integer">
+ <b>Type:</b> integer
+ </p>
+
+ <p id="default_false">
+ <b>Default:</b> <codeph>false</codeph>
+ </p>
+
+ <p id="default_false_0">
+ <b>Default:</b> <codeph>false</codeph> (shown as 0 in output of <codeph>SET</codeph> statement)
+ </p>
+
+ <p id="default_true_1">
+ <b>Default:</b> <codeph>true</codeph> (shown as 1 in output of <codeph>SET</codeph> statement)
+ </p>
+
+ <p id="odd_return_type_string">
+ Currently, the return value is always a <codeph>STRING</codeph>. The return type is subject to change in
+ future releases. Always use <codeph>CAST()</codeph> to convert the result to whichever data type is
+ appropriate for your computations.
+ </p>
+
+ <p rev="2.0.0" id="former_odd_return_type_string">
+ <b>Return type:</b> <codeph>DOUBLE</codeph> in Impala 2.0 and higher; <codeph>STRING</codeph> in earlier
+ releases
+ </p>
+
+ <p id="for_compatibility_only">
+ <b>Usage notes:</b> Primarily for compatibility with code containing industry extensions to SQL.
+ </p> + + <p id="return_type_boolean"> + <b>Return type:</b> <codeph>BOOLEAN</codeph> + </p> + + <p id="return_type_double"> + <b>Return type:</b> <codeph>DOUBLE</codeph> + </p> + + <p id="return_type_same"> + <b>Return type:</b> Same as the input value + </p> + + <p id="return_type_same_except_string"> + <b>Return type:</b> Same as the input value, except for <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> + arguments which produce a <codeph>STRING</codeph> result + </p> + + <p id="builtins_db"> + Impala includes another predefined database, <codeph>_impala_builtins</codeph>, that serves as the location + for the <xref href="../topics/impala_functions.xml#builtins">built-in functions</xref>. To see the built-in + functions, use a statement like the following: +<codeblock xml:space="preserve">show functions in _impala_builtins; +show functions in _impala_builtins like '*<varname>substring</varname>*'; +</codeblock> + </p> + + <p id="sum_double"> + Due to the way arithmetic on <codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph> columns uses + high-performance hardware instructions, and distributed queries can perform these operations in different + order for each query, results can vary slightly for aggregate function calls such as <codeph>SUM()</codeph> + and <codeph>AVG()</codeph> for <codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph> columns, particularly on + large data sets where millions or billions of values are summed or averaged. For perfect consistency and + repeatability, use the <codeph>DECIMAL</codeph> data type for such operations instead of + <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph>. + </p> + + <p id="float_double_decimal_caveat"> + The inability to exactly represent certain floating-point values means that + <codeph>DECIMAL</codeph> is sometimes a better choice than <codeph>DOUBLE</codeph> + or <codeph>FLOAT</codeph> when precision is critical, particularly when + transferring data from other database systems that use different representations + or file formats. + </p> + + <p rev="1.4.0" id="decimal_no_stats"> + Currently, the <codeph>COMPUTE STATS</codeph> statement under CDH 4 does not store any statistics for + <codeph>DECIMAL</codeph> columns. When Impala runs under CDH 5, which has better support for + <codeph>DECIMAL</codeph> in the metastore database, <codeph>COMPUTE STATS</codeph> does collect statistics + for <codeph>DECIMAL</codeph> columns and Impala uses the statistics to optimize query performance. + </p> + + <p rev="CDH-35866" id="hive_column_stats_caveat"> + If you run the Hive statement <codeph>ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS</codeph>, + Impala can only use the resulting column statistics if the table is unpartitioned. + Impala cannot use Hive-generated column statistics for a partitioned table. + </p> + + <p id="datetime_function_chaining"> + <codeph>unix_timestamp()</codeph> and <codeph>from_unixtime()</codeph> are often used in combination to + convert a <codeph>TIMESTAMP</codeph> value into a particular string format. For example: +<codeblock xml:space="preserve">select from_unixtime(unix_timestamp(now() + interval 3 days), + 'yyyy/MM/dd HH:mm') as yyyy_mm_dd_hh_mm; ++------------------+ +| yyyy_mm_dd_hh_mm | ++------------------+ +| 2016/06/03 11:38 | ++------------------+ +</codeblock> + </p> + + <p rev="1.4.0 obwl" id="insert_sort_blurb"> + <b>Sorting considerations:</b> Although you can specify an <codeph>ORDER BY</codeph> clause in an + <codeph>INSERT ... 
SELECT</codeph> statement, any <codeph>ORDER BY</codeph> clause is ignored and the + results are not necessarily sorted. An <codeph>INSERT ... SELECT</codeph> operation potentially creates + many different data files, prepared on different data nodes, and therefore the notion of the data being + stored in sorted order is impractical. + </p> + + <p rev="1.4.0" id="create_table_like_view"> + Prior to Impala 1.4.0, it was not possible to use the <codeph>CREATE TABLE LIKE + <varname>view_name</varname></codeph> syntax. In Impala 1.4.0 and higher, you can create a table with the + same column definitions as a view using the <codeph>CREATE TABLE LIKE</codeph> technique. Although + <codeph>CREATE TABLE LIKE</codeph> normally inherits the file format of the original table, a view has no + underlying file format, so <codeph>CREATE TABLE LIKE <varname>view_name</varname></codeph> produces a text + table by default. To specify a different file format, include a <codeph>STORED AS + <varname>file_format</varname></codeph> clause at the end of the <codeph>CREATE TABLE LIKE</codeph> + statement. + </p> + + <note rev="1.4.0" id="compute_stats_nulls"> Prior to Impala 1.4.0, + <codeph>COMPUTE STATS</codeph> counted the number of + <codeph>NULL</codeph> values in each column and recorded that figure + in the metastore database. Because Impala does not currently use the + <codeph>NULL</codeph> count during query planning, Impala 1.4.0 and + higher speeds up the <codeph>COMPUTE STATS</codeph> statement by + skipping this <codeph>NULL</codeph> counting. </note> + + <p id="regular_expression_whole_string"> + The regular expression must match the entire value, not just occur somewhere inside it. Use <codeph>.*</codeph> at the beginning, + the end, or both if you only need to match characters anywhere in the middle. Thus, the <codeph>^</codeph> and <codeph>$</codeph> + atoms are often redundant, although you might already have them in your expression strings that you reuse from elsewhere. + </p> + + <p rev="1.3.1" id="regexp_matching"> + In Impala 1.3.1 and higher, the <codeph>REGEXP</codeph> and <codeph>RLIKE</codeph> operators now match a + regular expression string that occurs anywhere inside the target string, the same as if the regular + expression was enclosed on each side by <codeph>.*</codeph>. See + <xref href="../topics/impala_operators.xml#regexp"/> for examples. Previously, these operators only + succeeded when the regular expression matched the entire target string. This change improves compatibility + with the regular expression support for popular database systems. There is no change to the behavior of the + <codeph>regexp_extract()</codeph> and <codeph>regexp_replace()</codeph> built-in functions. + </p> + + <p rev="1.3.1" id="insert_inherit_permissions"> + By default, if an <codeph>INSERT</codeph> statement creates any new subdirectories underneath a partitioned + table, those subdirectories are assigned default HDFS permissions for the <codeph>impala</codeph> user. To + make each subdirectory have the same permissions as its parent directory in HDFS, specify the + <codeph>--insert_inherit_permissions</codeph> startup option for the <cmdname>impalad</cmdname> daemon. + </p> + + <note id="multiple_count_distinct"> + <p> + By default, Impala only allows a single <codeph>COUNT(DISTINCT <varname>columns</varname>)</codeph> + expression in each query. 
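+ For example, a query such as the following, which uses two different
+ <codeph>COUNT(DISTINCT)</codeph> expressions, is not allowed with the default settings.
+ (The table and column names here are arbitrary placeholders, the same ones used in the
+ workaround example below.)
+<codeblock xml:space="preserve">-- Not allowed by default: more than one COUNT(DISTINCT) expression,
+-- over different columns, in the same query.
+select count(distinct col1), count(distinct col2) from t1;
+</codeblock>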
+ </p> + <p> + If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by + specifying <codeph>NDV(<varname>column</varname>)</codeph>; a query can contain multiple instances of + <codeph>NDV(<varname>column</varname>)</codeph>. To make Impala automatically rewrite + <codeph>COUNT(DISTINCT)</codeph> expressions to <codeph>NDV()</codeph>, enable the + <codeph>APPX_COUNT_DISTINCT</codeph> query option. + </p> + <p> + To produce the same result as multiple <codeph>COUNT(DISTINCT)</codeph> expressions, you can use the + following technique for queries involving a single table: + </p> +<codeblock xml:space="preserve">select v1.c1 result1, v2.c1 result2 from + (select count(distinct col1) as c1 from t1) v1 + cross join + (select count(distinct col2) as c1 from t1) v2; +</codeblock> + <p> + Because <codeph>CROSS JOIN</codeph> is an expensive operation, prefer to use the <codeph>NDV()</codeph> + technique wherever practical. + </p> + </note> + + <p> + <ph id="union_all_vs_union">Prefer <codeph>UNION ALL</codeph> over <codeph>UNION</codeph> when you know the + data sets are disjoint or duplicate values are not a problem; <codeph>UNION ALL</codeph> is more efficient + because it avoids materializing and sorting the entire result set to eliminate duplicate values.</ph> + </p> + + <note id="thorn"> + The <codeph>CREATE TABLE</codeph> clauses <codeph>FIELDS TERMINATED BY</codeph>, <codeph>ESCAPED + BY</codeph>, and <codeph>LINES TERMINATED BY</codeph> have special rules for the string literal used for + their argument, because they all require a single character. You can use a regular character surrounded by + single or double quotation marks, an octal sequence such as <codeph>'\054'</codeph> (representing a comma), + or an integer in the range '-127'..'128' (with quotation marks but no backslash), which is interpreted as a + single-byte ASCII character. Negative values are subtracted from 256; for example, <codeph>FIELDS + TERMINATED BY '-2'</codeph> sets the field delimiter to ASCII code 254, the <q>Icelandic Thorn</q> + character used as a delimiter by some data formats. + </note> + + <p id="sqoop_blurb"> + <b>Sqoop considerations:</b> + </p> + + <p id="sqoop_timestamp_caveat" rev="IMPALA-2111 CDH-37399"> If you use Sqoop to + convert RDBMS data to Parquet, be careful with interpreting any + resulting values from <codeph>DATE</codeph>, <codeph>DATETIME</codeph>, + or <codeph>TIMESTAMP</codeph> columns. The underlying values are + represented as the Parquet <codeph>INT64</codeph> type, which is + represented as <codeph>BIGINT</codeph> in the Impala table. The Parquet + values represent the time in milliseconds, while Impala interprets + <codeph>BIGINT</codeph> as the time in seconds. Therefore, if you have + a <codeph>BIGINT</codeph> column in a Parquet table that was imported + this way from Sqoop, divide the values by 1000 when interpreting as the + <codeph>TIMESTAMP</codeph> type.</p> + + <p id="command_line_blurb"> + <b>Command-line equivalent:</b> + </p> + + <p rev="2.3.0" id="complex_types_blurb"> + <b>Complex type considerations:</b> + </p> + + <p id="complex_types_combo"> + Because complex types are often used in combination, + for example an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> + elements, if you are unfamiliar with the Impala complex types, + start with <xref href="../topics/impala_complex_types.xml#complex_types"/> for + background information and usage examples. 
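+ As a brief sketch (the table and column names here are hypothetical, for illustration only),
+ an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements might be declared and queried
+ like this:
+<codeblock xml:space="preserve">-- Hypothetical table; complex types currently require the Parquet file format.
+CREATE TABLE contacts
+(
+  id BIGINT,
+  phones ARRAY&lt;STRUCT&lt;category: STRING, number: STRING&gt;&gt;
+)
+STORED AS PARQUET;
+
+-- Reference the array column like a joined table; the fields of each
+-- STRUCT element are available through the ITEM pseudocolumn.
+SELECT c.id, p.item.category, p.item.number
+  FROM contacts c, c.phones p;
+</codeblock>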
+ </p> + + <ul id="complex_types_restrictions"> + <li> + <p> + Columns with this da
<TRUNCATED>
