http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_sync_ddl.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_sync_ddl.xml b/docs/topics/impala_sync_ddl.xml new file mode 100644 index 0000000..7453e94 --- /dev/null +++ b/docs/topics/impala_sync_ddl.xml @@ -0,0 +1,59 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.2.1" id="sync_ddl"> + + <title>SYNC_DDL Query Option</title> + <titlealts audience="PDF"><navtitle>SYNC_DDL</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="DDL"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">SYNC_DDL query option</indexterm> + When enabled, causes any DDL operation such as <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> + to return only when the changes have been propagated to all other Impala nodes in the cluster by the Impala + catalog service. That way, if you issue a subsequent <codeph>CONNECT</codeph> statement in + <cmdname>impala-shell</cmdname> to connect to a different node in the cluster, you can be sure that other + node will already recognize any added or changed tables. (The catalog service automatically broadcasts the + DDL changes to all nodes automatically, but without this option there could be a period of inconsistency if + you quickly switched to another node, such as by issuing a subsequent query through a load-balancing proxy.) + </p> + + <p> + Although <codeph>INSERT</codeph> is classified as a DML statement, when the <codeph>SYNC_DDL</codeph> option + is enabled, <codeph>INSERT</codeph> statements also delay their completion until all the underlying data and + metadata changes are propagated to all Impala nodes. Internally, Impala inserts have similarities with DDL + statements in traditional database systems, because they create metadata needed to track HDFS block locations + for new files and they potentially add new partitions to partitioned tables. + </p> + + <note> + Because this option can introduce a delay after each write operation, if you are running a sequence of + <codeph>CREATE DATABASE</codeph>, <codeph>CREATE TABLE</codeph>, <codeph>ALTER TABLE</codeph>, + <codeph>INSERT</codeph>, and similar statements within a setup script, to minimize the overall delay you can + enable the <codeph>SYNC_DDL</codeph> query option only near the end, before the final DDL statement. + </note> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false_0"/> + + <!-- To do: + Example could be useful here. + --> + + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_ddl.xml#ddl"/> + </p> + + </conbody> +</concept>
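<p>
  A minimal sketch of how this option might be used, following the note above about enabling it only
  near the end of a setup script (the database and table names here are made up for illustration and
  are not part of the shipped documentation):
</p>
<codeblock>-- Run the bulk of the setup without the extra synchronization delay.
CREATE DATABASE staging_db;
CREATE TABLE staging_db.t1 (x INT);
-- Enable SYNC_DDL just before the final DDL statement.
SET SYNC_DDL=1;
ALTER TABLE staging_db.t1 ADD COLUMNS (y STRING);
-- From this point, connecting to any other coordinator (for example through a
-- load-balancing proxy) already shows the new database, table, and column.
</codeblock>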
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_tables.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_tables.xml b/docs/topics/impala_tables.xml new file mode 100644 index 0000000..80b4000 --- /dev/null +++ b/docs/topics/impala_tables.xml @@ -0,0 +1,258 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="tables"> + + <title>Overview of Impala Tables</title> + <titlealts audience="PDF"><navtitle>Tables</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Databases"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Tables"/> + <data name="Category" value="Schemas"/> + </metadata> + </prolog> + + <conbody> + + <p/> + + <p> + Tables are the primary containers for data in Impala. They have the familiar row and column layout similar to + other database systems, plus some features such as partitioning often associated with higher-end data + warehouse systems. + </p> + + <p> + Logically, each table has a structure based on the definition of its columns, partitions, and other + properties. + </p> + + <p> + Physically, each table that uses HDFS storage is associated with a directory in HDFS. The table data consists of all the data files + underneath that directory: + </p> + + <ul> + <li> + <xref href="impala_tables.xml#internal_tables">Internal tables</xref> are managed by Impala, and use directories + inside the designated Impala work area. + </li> + + <li> + <xref href="impala_tables.xml#external_tables">External tables</xref> use arbitrary HDFS directories, where + the data files are typically shared between different Hadoop components. + </li> + + <li> + Large-scale data is usually handled by partitioned tables, where the data files are divided among different + HDFS subdirectories. + </li> + </ul> + + <p rev="2.2.0"> + Impala tables can also represent data that is stored in HBase, or in the Amazon S3 filesystem (CDH 5.4.0 or higher), + or on Isilon storage devices (CDH 5.4.3 or higher). See <xref href="impala_hbase.xml#impala_hbase"/>, + <xref href="impala_s3.xml#s3"/>, and <xref href="impala_isilon.xml#impala_isilon"/> + for details about those special kinds of tables. + </p> + + <p conref="../shared/impala_common.xml#common/ignore_file_extensions"/> + + <p> + <b>Related statements:</b> <xref href="impala_create_table.xml#create_table"/>, + <xref href="impala_drop_table.xml#drop_table"/>, <xref href="impala_alter_table.xml#alter_table"/> + <xref href="impala_insert.xml#insert"/>, <xref href="impala_load_data.xml#load_data"/>, + <xref href="impala_select.xml#select"/> + </p> + </conbody> + + <concept id="internal_tables"> + + <title>Internal Tables</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">internal tables</indexterm> + The default kind of table produced by the <codeph>CREATE TABLE</codeph> statement is known as an internal + table. (Its counterpart is the external table, produced by the <codeph>CREATE EXTERNAL TABLE</codeph> + syntax.) + </p> + + <ul> + <li> + <p> + Impala creates a directory in HDFS to hold the data files. + </p> + </li> + + <li> + <p> + You can create data in internal tables by issuing <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> + statements. 
+ </p> + </li> + + <li> + <p> + If you add or replace data using HDFS operations, issue the <codeph>REFRESH</codeph> command in + <cmdname>impala-shell</cmdname> so that Impala recognizes the changes in data files, block locations, + and so on. + </p> + </li> + + <li> + <p> + When you issue a <codeph>DROP TABLE</codeph> statement, Impala physically removes all the data files + from the directory. + </p> + </li> + + <li> + <p conref="../shared/impala_common.xml#common/check_internal_external_table"/> + </li> + + <li> + <p> + When you issue an <codeph>ALTER TABLE</codeph> statement to rename an internal table, all data files + are moved into the new HDFS directory for the table. The files are moved even if they were formerly in + a directory outside the Impala data directory, for example in an internal table with a + <codeph>LOCATION</codeph> attribute pointing to an outside HDFS directory. + </p> + </li> + </ul> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p conref="../shared/impala_common.xml#common/switch_internal_external_table"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_tables.xml#external_tables"/>, <xref href="impala_create_table.xml#create_table"/>, + <xref href="impala_drop_table.xml#drop_table"/>, <xref href="impala_alter_table.xml#alter_table"/>, + <xref href="impala_describe.xml#describe"/> + </p> + </conbody> + </concept> + + <concept id="external_tables"> + + <title>External Tables</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">external tables</indexterm> + The syntax <codeph>CREATE EXTERNAL TABLE</codeph> sets up an Impala table that points at existing data + files, potentially in HDFS locations outside the normal Impala data directories.. This operation saves the + expense of importing the data into a new table when you already have the data files in a known location in + HDFS, in the desired file format. + </p> + + <ul> + <li> + <p> + You can use Impala to query the data in this table. + </p> + </li> + + <li> + <p> + You can create data in external tables by issuing <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> + statements. + </p> + </li> + + <li> + <p> + If you add or replace data using HDFS operations, issue the <codeph>REFRESH</codeph> command in + <cmdname>impala-shell</cmdname> so that Impala recognizes the changes in data files, block locations, + and so on. + </p> + </li> + + <li> + <p> + When you issue a <codeph>DROP TABLE</codeph> statement in Impala, that removes the connection that + Impala has with the associated data files, but does not physically remove the underlying data. You can + continue to use the data files with other Hadoop components and HDFS operations. + </p> + </li> + + <li> + <p conref="../shared/impala_common.xml#common/check_internal_external_table"/> + </li> + + <li> + <p> + When you issue an <codeph>ALTER TABLE</codeph> statement to rename an external table, all data files + are left in their original locations. + </p> + </li> + + <li> + <p> + You can point multiple external tables at the same HDFS directory by using the same + <codeph>LOCATION</codeph> attribute for each one. The tables could have different column definitions, + as long as the number and types of columns are compatible with the schema evolution considerations for + the underlying file type. For example, for text data files, one table might define a certain column as + a <codeph>STRING</codeph> while another defines the same column as a <codeph>BIGINT</codeph>. 
+ </p> + </li> + </ul> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p conref="../shared/impala_common.xml#common/switch_internal_external_table"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_tables.xml#internal_tables"/>, <xref href="impala_create_table.xml#create_table"/>, + <xref href="impala_drop_table.xml#drop_table"/>, <xref href="impala_alter_table.xml#alter_table"/>, + <xref href="impala_describe.xml#describe"/> + </p> + </conbody> + </concept> + + <concept id="table_file_formats"> + <title>File Formats</title> + <conbody> + <p> + Each table has an associated file format, which determines how Impala interprets the + associated data files. See <xref href="impala_file_formats.xml#file_formats"/> for details. + </p> + <p> + You set the file format during the <codeph>CREATE TABLE</codeph> statement, + or change it later using the <codeph>ALTER TABLE</codeph> statement. + Partitioned tables can have a different file format for individual partitions, + allowing you to change the file format used in your ETL process for new data + without going back and reconverting all the existing data in the same table. + </p> + <p> + Any <codeph>INSERT</codeph> statements produce new data files with the current file format of the table. + For existing data files, changing the file format of the table does not automatically do any data conversion. + You must use <codeph>TRUNCATE TABLE</codeph> or <codeph>INSERT OVERWRITE</codeph> to remove any previous data + files that use the old file format. + Then you use the <codeph>LOAD DATA</codeph> statement, <codeph>INSERT ... SELECT</codeph>, or other mechanism + to put data files of the correct format into the table. + </p> + <p> + The default file format, text, is the most flexible and easy to produce when you are just getting started with + Impala. The Parquet file format offers the highest query performance and uses compression to reduce storage + requirements; therefore, <ph rev="upstream">Cloudera</ph> recommends using Parquet for Impala tables with substantial amounts of data. + <ph rev="2.3.0">Also, the complex types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>) + available in <keyword keyref="impala23_full"/> and higher are currently only supported with the Parquet file type.</ph> + Based on your existing ETL workflow, you might use other file formats such as Avro, possibly doing a final + conversion step to Parquet to take advantage of its performance for analytic queries. 
+ </p> + </conbody> + </concept> + +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_timeouts.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_timeouts.xml b/docs/topics/impala_timeouts.xml new file mode 100644 index 0000000..03ad773 --- /dev/null +++ b/docs/topics/impala_timeouts.xml @@ -0,0 +1,172 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.2.1" id="timeouts"> + + <title>Setting Timeout Periods for Daemons, Queries, and Sessions</title> + + <titlealts audience="PDF"> + + <navtitle>Setting Timeouts</navtitle> + + </titlealts> + + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Scheduling"/> + <data name="Category" value="Scalability"/> + </metadata> + </prolog> + + <conbody> + + <p> + Depending on how busy your CDH cluster is, you might increase or decrease various timeout + values. Increase timeouts if Impala is cancelling operations prematurely, when the system + is responding slower than usual but the operations are still successful if given extra + time. Decrease timeouts if operations are idle or hanging for long periods, and the idle + or hung operations are consuming resources and reducing concurrency. + </p> + + <p outputclass="toc inpage"/> + + </conbody> + + <concept id="statestore_timeout"> + + <title>Increasing the Statestore Timeout</title> + + <conbody> + + <p rev="IMP-1210"> + If you have an extensive Impala schema, for example with hundreds of databases, tens of + thousands of tables, and so on, you might encounter timeout errors during startup as the + Impala catalog service broadcasts metadata to all the Impala nodes using the statestore + service. To avoid such timeout errors on startup, increase the statestore timeout value + from its default of 10 seconds. Specify the timeout value using the + <codeph>-statestore_subscriber_timeout_seconds</codeph> option for the statestore + service, using the configuration instructions in + <xref href="impala_config_options.xml#config_options"/>. The symptom of this problem is + messages in the <codeph>impalad</codeph> log such as: + </p> + +<codeblock>Connection with state-store lost +Trying to re-register with state-store</codeblock> + + <p> + See <xref href="impala_scalability.xml#statestore_scalability"/> for more details about + statestore operation and settings on clusters with a large number of Impala-related + objects such as tables and partitions. + </p> + + </conbody> + + </concept> + + <concept id="impalad_timeout"> + + <title>Setting the Idle Query and Idle Session Timeouts for impalad</title> + + <conbody> + + <p> + To keep long-running queries or idle sessions from tying up cluster resources, you can + set timeout intervals for both individual queries, and entire sessions. + </p> + + <note conref="../shared/impala_common.xml#common/timeout_clock_blurb"/> + + <p> + Specify the following startup options for the <cmdname>impalad</cmdname> daemon: + </p> + + <ul> + <li> + The <codeph>--idle_query_timeout</codeph> option specifies the time in seconds after + which an idle query is cancelled. This could be a query whose results were all fetched + but was never closed, or one whose results were partially fetched and then the client + program stopped requesting further results. 
This condition is most likely to occur in + a client program using the JDBC or ODBC interfaces, rather than in the interactive + <cmdname>impala-shell</cmdname> interpreter. Once the query is cancelled, the client + program cannot retrieve any further results. + </li> + + <li> + The <codeph>--idle_session_timeout</codeph> option specifies the time in seconds after + which an idle session is expired. A session is idle when no activity is occurring for + any of the queries in that session, and the session has not started any new queries. + Once a session is expired, you cannot issue any new query requests to it. The session + remains open, but the only operation you can perform is to close it. The default value + of 0 means that sessions never expire. + </li> + </ul> + + <p> + For instructions on changing <cmdname>impalad</cmdname> startup options, see + <xref href="impala_config_options.xml#config_options"/>. + </p> + + <p rev="2.0.0"> + You can reduce the idle query timeout by using the <codeph>QUERY_TIMEOUT_S</codeph> + query option. Any value specified for the <codeph>--idle_query_timeout</codeph> startup + option serves as an upper limit for the <codeph>QUERY_TIMEOUT_S</codeph> query option. + See <xref href="impala_query_timeout_s.xml#query_timeout_s"/> for details. + </p> + + </conbody> + + </concept> + <concept id="concept_rfy_jl1_rx"> + <title>Setting Timeout and Retries for Thrift Connections to the Backend + Client</title> + <conbody> + <p>Impala connections to the backend client are subject to failure in + cases when the network is momentarily overloaded. To avoid failed + queries due to transient network problems, you can configure the number + of Thrift connection retries using the following option: </p> + <ul id="ul_bj3_ql1_rx"> + <li>The <codeph>--backend_client_connection_num_retries</codeph> option + specifies the number of times Impala will try connecting to the + backend client after the first connection attempt fails. By default, + <cmdname>impalad</cmdname> will attempt three re-connections before + it returns a failure. </li> + </ul> + <p>You can configure timeouts for sending and receiving data from the + backend client. Therefore, if for some reason a query hangs, instead of + waiting indefinitely for a response, Impala will terminate the + connection after a configurable timeout.</p> + <ul id="ul_vm2_2v1_rx"> + <li>The <codeph>--backend_client_rpc_timeout_ms</codeph> option can be + used to specify the number of milliseconds Impala should wait for a + response from the backend client before it terminates the connection + and signals a failure. The default value for this property is 300000 + milliseconds, or 5 minutes. </li> + </ul> + </conbody> + </concept> + + <concept id="cancel_query"> + + <title>Cancelling a Query</title> + + <conbody> + + <p> + Sometimes, an Impala query might run for an unexpectedly long time, tying up resources + in the cluster. You can cancel the query explicitly, independent of the timeout period, + by going into the web UI for the <cmdname>impalad</cmdname> host (on port 25000 by + default), and using the link on the <codeph>/queries</codeph> tab to cancel the running + query. Various client applications let you interactively cancel queries submitted or + monitored through those applications. 
For example, by pressing <codeph>^C</codeph> in + <cmdname>impala-shell</cmdname>, clicking the <uicontrol>Cancel</uicontrol> button from + the <uicontrol>Watch</uicontrol> page in Hue, clicking <uicontrol>Actions > + Cancel</uicontrol> from the <uicontrol>Queries</uicontrol> list in Cloudera Manager, and + so on. + </p> + + </conbody> + + </concept> + +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_timestamp.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_timestamp.xml b/docs/topics/impala_timestamp.xml new file mode 100644 index 0000000..2f4fa40 --- /dev/null +++ b/docs/topics/impala_timestamp.xml @@ -0,0 +1,445 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="timestamp"> + + <title>TIMESTAMP Data Type</title> + <titlealts audience="PDF"><navtitle>TIMESTAMP</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Data Types"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Dates and Times"/> + </metadata> + </prolog> + + <conbody> + + <p> + A data type used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements, representing a + point in time. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p> + In the column definition of a <codeph>CREATE TABLE</codeph> statement: + </p> + +<codeblock><varname>column_name</varname> TIMESTAMP</codeblock> + + <p> + <b>Range:</b> Allowed date values range from 1400-01-01 to 9999-12-31; this range is different from the Hive + <codeph>TIMESTAMP</codeph> type. Internally, the resolution of the time portion of a + <codeph>TIMESTAMP</codeph> value is in nanoseconds. + </p> + + <p> + <b>INTERVAL expressions:</b> + </p> + + <p> + You can perform date arithmetic by adding or subtracting a specified number of time units, using the + <codeph>INTERVAL</codeph> keyword and the <codeph>+</codeph> and <codeph>-</codeph> operators or + <codeph>date_add()</codeph> and <codeph>date_sub()</codeph> functions. You can specify units as + <codeph>YEAR[S]</codeph>, <codeph>MONTH[S]</codeph>, <codeph>WEEK[S]</codeph>, <codeph>DAY[S]</codeph>, + <codeph>HOUR[S]</codeph>, <codeph>MINUTE[S]</codeph>, <codeph>SECOND[S]</codeph>, + <codeph>MILLISECOND[S]</codeph>, <codeph>MICROSECOND[S]</codeph>, and <codeph>NANOSECOND[S]</codeph>. You can + only specify one time unit in each interval expression, for example <codeph>INTERVAL 3 DAYS</codeph> or + <codeph>INTERVAL 25 HOURS</codeph>, but you can produce any granularity by adding together successive + <codeph>INTERVAL</codeph> values, such as <codeph><varname>timestamp_value</varname> + INTERVAL 3 WEEKS - + INTERVAL 1 DAY + INTERVAL 10 MICROSECONDS</codeph>. + </p> + + <p> + For example: + </p> + +<codeblock>select now() + interval 1 day; +select date_sub(now(), interval 5 minutes); +insert into auction_details + select auction_id, auction_start_time, auction_start_time + interval 2 days + interval 12 hours + from new_auctions;</codeblock> + + <p> + <b>Time zones:</b> + </p> + + <p> + By default, Impala does not store timestamps using the local timezone, to avoid undesired results from + unexpected time zone issues. 
Timestamps are stored and interpreted relative to UTC, both when written to or + read from data files, or when converted to or from Unix time values through functions such as + <codeph>from_unixtime()</codeph> or <codeph>unix_timestamp()</codeph>. To convert such a + <codeph>TIMESTAMP</codeph> value to one that represents the date and time in a specific time zone, convert + the original value with the <codeph>from_utc_timestamp()</codeph> function. + </p> + + <p> + Because Impala does not assume that <codeph>TIMESTAMP</codeph> values are in any particular time zone, you + must be conscious of the time zone aspects of data that you query, insert, or convert. + </p> + + <p> + For consistency with Unix system calls, the <codeph>TIMESTAMP</codeph> returned by the <codeph>now()</codeph> + function represents the local time in the system time zone, rather than in UTC. To store values relative to + the current time in a portable way, convert any <codeph>now()</codeph> return values using the + <codeph>to_utc_timestamp()</codeph> function first. For example, the following example shows that the current + time in California (where this Impala cluster is located) is shortly after 2 PM. If that value was written to a data + file, and shipped off to a distant server to be analyzed alongside other data from far-flung locations, the + dates and times would not match up precisely because of time zone differences. Therefore, the + <codeph>to_utc_timestamp()</codeph> function converts it using a common reference point, the UTC time zone + (descended from the old Greenwich Mean Time standard). The <codeph>'PDT'</codeph> argument indicates that the + original value is from the Pacific time zone with Daylight Saving Time in effect. When servers in all + geographic locations run the same transformation on any local date and time values (with the appropriate time + zone argument), the stored data uses a consistent representation. Impala queries can use functions such as + <codeph>EXTRACT()</codeph>, <codeph>MIN()</codeph>, <codeph>AVG()</codeph>, and so on to do time-series + analysis on those timestamps. + </p> + +<codeblock>[localhost:21000] > select now(); ++-------------------------------+ +| now() | ++-------------------------------+ +| 2015-04-09 14:07:46.580465000 | ++-------------------------------+ +[localhost:21000] > select to_utc_timestamp(now(), 'PDT'); ++--------------------------------+ +| to_utc_timestamp(now(), 'pdt') | ++--------------------------------+ +| 2015-04-09 21:08:07.664547000 | ++--------------------------------+ +</codeblock> + + <p> + The converse function, <codeph>from_utc_timestamp()</codeph>, lets you take stored <codeph>TIMESTAMP</codeph> + data or calculated results and convert back to local date and time for processing on the application side. + The following example shows how you might represent some future date (such as the ending date and time of an + auction) in UTC, and then convert back to local time when convenient for reporting or other processing. The + final query in the example tests whether this arbitrary UTC date and time has passed yet, by converting it + back to the local time zone and comparing it against the current date and time. 
+ </p> + +<codeblock>[localhost:21000] > select to_utc_timestamp(now() + interval 2 weeks, 'PDT'); ++---------------------------------------------------+ +| to_utc_timestamp(now() + interval 2 weeks, 'pdt') | ++---------------------------------------------------+ +| 2015-04-23 21:08:34.152923000 | ++---------------------------------------------------+ +[localhost:21000] > select from_utc_timestamp('2015-04-23 21:08:34.152923000','PDT'); ++------------------------------------------------------------+ +| from_utc_timestamp('2015-04-23 21:08:34.152923000', 'pdt') | ++------------------------------------------------------------+ +| 2015-04-23 14:08:34.152923000 | ++------------------------------------------------------------+ +[localhost:21000] > select from_utc_timestamp('2015-04-23 21:08:34.152923000','PDT') < now(); ++--------------------------------------------------------------------+ +| from_utc_timestamp('2015-04-23 21:08:34.152923000', 'pdt') < now() | ++--------------------------------------------------------------------+ +| false | ++--------------------------------------------------------------------+ +</codeblock> + + <p rev="2.2.0"> + If you have data files written by Hive, those <codeph>TIMESTAMP</codeph> values represent the local timezone + of the host where the data was written, potentially leading to inconsistent results when processed by Impala. + To avoid compatibility problems or having to code workarounds, you can specify one or both of these + <cmdname>impalad</cmdname> startup flags: <codeph>-use_local_tz_for_unix_timestamp_conversions=true</codeph> + <codeph>-convert_legacy_hive_parquet_utc_timestamps=true</codeph>. Although + <codeph>-convert_legacy_hive_parquet_utc_timestamps</codeph> is turned off by default to avoid performance overhead, <ph rev="upstream">Cloudera</ph> recommends + turning it on when processing <codeph>TIMESTAMP</codeph> columns in Parquet files written by Hive, to avoid unexpected behavior. + </p> + + <p rev="2.2.0"> + The <codeph>-use_local_tz_for_unix_timestamp_conversions</codeph> setting affects conversions from + <codeph>TIMESTAMP</codeph> to <codeph>BIGINT</codeph>, or from <codeph>BIGINT</codeph> + to <codeph>TIMESTAMP</codeph>. By default, Impala treats all <codeph>TIMESTAMP</codeph> values as UTC, + to simplify analysis of time-series data from different geographic regions. When you enable the + <codeph>-use_local_tz_for_unix_timestamp_conversions</codeph> setting, these operations + treat the input values as if they are in the local tie zone of the host doing the processing. + See <xref href="impala_datetime_functions.xml#datetime_functions"/> for the list of functions + affected by the <codeph>-use_local_tz_for_unix_timestamp_conversions</codeph> setting. + </p> + + <p> + The following sequence of examples shows how the interpretation of <codeph>TIMESTAMP</codeph> values in + Parquet tables is affected by the setting of the <codeph>-convert_legacy_hive_parquet_utc_timestamps</codeph> + setting. 
+ </p> + + <p> + Regardless of the <codeph>-convert_legacy_hive_parquet_utc_timestamps</codeph> setting, + <codeph>TIMESTAMP</codeph> columns in text tables can be written and read interchangeably by Impala and Hive: + </p> + +<codeblock>Impala DDL and queries for text table: + +[localhost:21000] > create table t1 (x timestamp); +[localhost:21000] > insert into t1 values (now()), (now() + interval 1 day); +[localhost:21000] > select x from t1; ++-------------------------------+ +| x | ++-------------------------------+ +| 2015-04-07 15:43:02.892403000 | +| 2015-04-08 15:43:02.892403000 | ++-------------------------------+ +[localhost:21000] > select to_utc_timestamp(x, 'PDT') from t1; ++-------------------------------+ +| to_utc_timestamp(x, 'pdt') | ++-------------------------------+ +| 2015-04-07 22:43:02.892403000 | +| 2015-04-08 22:43:02.892403000 | ++-------------------------------+ + +Hive query for text table: + +hive> select * from t1; +OK +2015-04-07 15:43:02.892403 +2015-04-08 15:43:02.892403 +Time taken: 1.245 seconds, Fetched: 2 row(s) +</codeblock> + + <p> + When the table uses Parquet format, Impala expects any time zone adjustment to be applied prior to writing, + while <codeph>TIMESTAMP</codeph> values written by Hive are adjusted to be in the UTC time zone. When Hive + queries Parquet data files that it wrote, it adjusts the <codeph>TIMESTAMP</codeph> values back to the local + time zone, while Impala does no conversion. Hive does no time zone conversion when it queries Impala-written + Parquet files. + </p> + +<codeblock>Impala DDL and queries for Parquet table: + +[localhost:21000] > create table p1 stored as parquet as select x from t1; ++-------------------+ +| summary | ++-------------------+ +| Inserted 2 row(s) | ++-------------------+ +[localhost:21000] > select x from p1; ++-------------------------------+ +| x | ++-------------------------------+ +| 2015-04-07 15:43:02.892403000 | +| 2015-04-08 15:43:02.892403000 | ++-------------------------------+ + +Hive DDL and queries for Parquet table: + +hive> create table h1 (x timestamp) stored as parquet; +OK +hive> insert into h1 select * from p1; +... +OK +Time taken: 35.573 seconds +hive> select x from p1; +OK +2015-04-07 15:43:02.892403 +2015-04-08 15:43:02.892403 +Time taken: 0.324 seconds, Fetched: 2 row(s) +hive> select x from h1; +OK +2015-04-07 15:43:02.892403 +2015-04-08 15:43:02.892403 +Time taken: 0.197 seconds, Fetched: 2 row(s) +</codeblock> + + <p> + The discrepancy arises when Impala queries the Hive-created Parquet table. The underlying values in the + <codeph>TIMESTAMP</codeph> column are different from the ones written by Impala, even though they were copied + from one table to another by an <codeph>INSERT ... SELECT</codeph> statement in Hive. Hive did an implicit + conversion from the local time zone to UTC as it wrote the values to Parquet. 
+ </p> + +<codeblock>Impala query for TIMESTAMP values from Impala-written and Hive-written data: + +[localhost:21000] > select * from p1; ++-------------------------------+ +| x | ++-------------------------------+ +| 2015-04-07 15:43:02.892403000 | +| 2015-04-08 15:43:02.892403000 | ++-------------------------------+ +Fetched 2 row(s) in 0.29s +[localhost:21000] > select * from h1; ++-------------------------------+ +| x | ++-------------------------------+ +| 2015-04-07 22:43:02.892403000 | +| 2015-04-08 22:43:02.892403000 | ++-------------------------------+ +Fetched 2 row(s) in 0.41s + +Underlying integer values for Impala-written and Hive-written data: + +[localhost:21000] > select cast(x as bigint) from p1; ++-------------------+ +| cast(x as bigint) | ++-------------------+ +| 1428421382 | +| 1428507782 | ++-------------------+ +Fetched 2 row(s) in 0.38s +[localhost:21000] > select cast(x as bigint) from h1; ++-------------------+ +| cast(x as bigint) | ++-------------------+ +| 1428446582 | +| 1428532982 | ++-------------------+ +Fetched 2 row(s) in 0.20s +</codeblock> + + <p> + When the <codeph>-convert_legacy_hive_parquet_utc_timestamps</codeph> setting is enabled, Impala recognizes + the Parquet data files written by Hive, and applies the same UTC-to-local-timezone conversion logic during + the query as Hive uses, making the contents of the Impala-written <codeph>P1</codeph> table and the + Hive-written <codeph>H1</codeph> table appear identical, whether represented as <codeph>TIMESTAMP</codeph> + values or the underlying <codeph>BIGINT</codeph> integers: + </p> + +<codeblock>[localhost:21000] > select x from p1; ++-------------------------------+ +| x | ++-------------------------------+ +| 2015-04-07 15:43:02.892403000 | +| 2015-04-08 15:43:02.892403000 | ++-------------------------------+ +Fetched 2 row(s) in 0.37s +[localhost:21000] > select x from h1; ++-------------------------------+ +| x | ++-------------------------------+ +| 2015-04-07 15:43:02.892403000 | +| 2015-04-08 15:43:02.892403000 | ++-------------------------------+ +Fetched 2 row(s) in 0.19s +[localhost:21000] > select cast(x as bigint) from p1; ++-------------------+ +| cast(x as bigint) | ++-------------------+ +| 1428446582 | +| 1428532982 | ++-------------------+ +Fetched 2 row(s) in 0.29s +[localhost:21000] > select cast(x as bigint) from h1; ++-------------------+ +| cast(x as bigint) | ++-------------------+ +| 1428446582 | +| 1428532982 | ++-------------------+ +Fetched 2 row(s) in 0.22s +</codeblock> + + <p> + <b>Conversions:</b> + </p> + + <p conref="../shared/impala_common.xml#common/timestamp_conversions"/> + + <p> + In Impala 1.3 and higher, the <codeph>FROM_UNIXTIME()</codeph> and <codeph>UNIX_TIMESTAMP()</codeph> + functions allow a wider range of format strings, with more flexibility in element order, repetition of letter + placeholders, and separator characters. In <keyword keyref="impala23_full"/> and higher, the <codeph>UNIX_TIMESTAMP()</codeph> + function also allows a numeric timezone offset to be specified as part of the input string. + See <xref href="impala_datetime_functions.xml#datetime_functions"/> for details. + </p> + + <p conref="../shared/impala_common.xml#common/y2k38"/> + + <p> + <b>Partitioning:</b> + </p> + + <p> + Although you cannot use a <codeph>TIMESTAMP</codeph> column as a partition key, you can extract the + individual years, months, days, hours, and so on and partition based on those columns. 
Because the partition + key column values are represented in HDFS directory names, rather than as fields in the data files + themselves, you can also keep the original <codeph>TIMESTAMP</codeph> values if desired, without duplicating + data or wasting storage space. See <xref href="impala_partitioning.xml#partition_key_columns"/> for more + details on partitioning with date and time values. + </p> + +<codeblock>[localhost:21000] > create table timeline (event string) partitioned by (happened timestamp); +ERROR: AnalysisException: Type 'TIMESTAMP' is not supported as partition-column type in column: happened +</codeblock> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + +<codeblock>select cast('1966-07-30' as timestamp); +select cast('1985-09-25 17:45:30.005' as timestamp); +select cast('08:30:00' as timestamp); +select hour('1970-01-01 15:30:00'); -- Succeeds, returns 15. +select hour('1970-01-01 15:30'); -- Returns NULL because seconds field required. +select hour('1970-01-01 27:30:00'); -- Returns NULL because hour value out of range. +select dayofweek('2004-06-13'); -- Returns 1, representing Sunday. +select dayname('2004-06-13'); -- Returns 'Sunday'. +select date_add('2004-06-13', 365); -- Returns 2005-06-13 with zeros for hh:mm:ss fields. +select day('2004-06-13'); -- Returns 13. +select datediff('1989-12-31','1984-09-01'); -- How many days between these 2 dates? +select now(); -- Returns current date and time in local timezone. + +create table dates_and_times (t timestamp); +insert into dates_and_times values + ('1966-07-30'), ('1985-09-25 17:45:30.005'), ('08:30:00'), (now()); +</codeblock> + + <p conref="../shared/impala_common.xml#common/null_bad_timestamp_cast"/> + + <p conref="../shared/impala_common.xml#common/partitioning_worrisome"/> + + <p conref="../shared/impala_common.xml#common/hbase_ok"/> + + <p conref="../shared/impala_common.xml#common/parquet_ok"/> + + <p conref="../shared/impala_common.xml#common/text_bulky"/> + +<!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> --> + + <p conref="../shared/impala_common.xml#common/internals_16_bytes"/> + + <p conref="../shared/impala_common.xml#common/added_forever"/> + + <p conref="../shared/impala_common.xml#common/column_stats_constant"/> + + <p conref="../shared/impala_common.xml#common/sqoop_blurb"/> + + <p conref="../shared/impala_common.xml#common/sqoop_timestamp_caveat"/> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p> + If you cast a <codeph>STRING</codeph> with an unrecognized format to a <codeph>TIMESTAMP</codeph>, the result + is <codeph>NULL</codeph> rather than an error. Make sure to test your data pipeline to be sure any textual + date and time values are in a format that Impala <codeph>TIMESTAMP</codeph> can recognize. + </p> + + <p conref="../shared/impala_common.xml#common/avro_no_timestamp"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <ul> + <li> +<!-- The Timestamp Literals topic is pretty brief. Consider adding more examples there. --> + <xref href="impala_literals.xml#timestamp_literals"/>. + </li> + + <li> + To convert to or from different date formats, or perform date arithmetic, use the date and time functions + described in <xref href="impala_datetime_functions.xml#datetime_functions"/>. 
In particular, the + <codeph>from_unixtime()</codeph> function requires a case-sensitive format string such as + <codeph>"yyyy-MM-dd HH:mm:ss.SSSS"</codeph>, matching one of the allowed variations of a + <codeph>TIMESTAMP</codeph> value (date plus time, only date, only time, optional fractional seconds). + </li> + + <li> + See <xref href="impala_langref_unsupported.xml#langref_hiveql_delta"/> for details about differences in + <codeph>TIMESTAMP</codeph> handling between Impala and Hive. + </li> + </ul> + + </conbody> + +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_tinyint.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_tinyint.xml b/docs/topics/impala_tinyint.xml new file mode 100644 index 0000000..2756cb8 --- /dev/null +++ b/docs/topics/impala_tinyint.xml @@ -0,0 +1,103 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="tinyint"> + + <title>TINYINT Data Type</title> + <titlealts audience="PDF"><navtitle>TINYINT</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Data Types"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Schemas"/> + </metadata> + </prolog> + + <conbody> + + <p> + A 1-byte integer data type used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p> + In the column definition of a <codeph>CREATE TABLE</codeph> statement: + </p> + +<codeblock><varname>column_name</varname> TINYINT</codeblock> + + <p> + <b>Range:</b> -128 .. 127. There is no <codeph>UNSIGNED</codeph> subtype. + </p> + + <p> + <b>Conversions:</b> Impala automatically converts to a larger integer type (<codeph>SMALLINT</codeph>, + <codeph>INT</codeph>, or <codeph>BIGINT</codeph>) or a floating-point type (<codeph>FLOAT</codeph> or + <codeph>DOUBLE</codeph>) automatically. Use <codeph>CAST()</codeph> to convert to <codeph>STRING</codeph> or + <codeph>TIMESTAMP</codeph>. + <ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/> + </p> + + <p conref="../shared/impala_common.xml#common/int_overflow_behavior"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + For a convenient and automated way to check the bounds of the <codeph>TINYINT</codeph> type, call the + functions <codeph>MIN_TINYINT()</codeph> and <codeph>MAX_TINYINT()</codeph>. + </p> + + <p> + If an integer value is too large to be represented as a <codeph>TINYINT</codeph>, use a + <codeph>SMALLINT</codeph> instead. + </p> + + <p conref="../shared/impala_common.xml#common/null_bad_numeric_cast"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + +<codeblock>CREATE TABLE t1 (x TINYINT); +SELECT CAST(100 AS TINYINT); +</codeblock> + + <p conref="../shared/impala_common.xml#common/parquet_blurb"/> + +<!-- Duplicated under TINYINT and SMALLINT. Turn into a conref in both places. --> + + <p rev="1.4.0"> + Physically, Parquet files represent <codeph>TINYINT</codeph> and <codeph>SMALLINT</codeph> values as 32-bit + integers. Although Impala rejects attempts to insert out-of-range values into such columns, if you create a + new table with the <codeph>CREATE TABLE ... 
LIKE PARQUET</codeph> syntax, any <codeph>TINYINT</codeph> or + <codeph>SMALLINT</codeph> columns in the original table turn into <codeph>INT</codeph> columns in the new + table. + </p> + +<!-- <p conref="../shared/impala_common.xml#common/partitioning_blurb"/> --> + + <p conref="../shared/impala_common.xml#common/hbase_ok"/> + + <p conref="../shared/impala_common.xml#common/text_bulky"/> + +<!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> --> + + <p conref="../shared/impala_common.xml#common/internals_1_bytes"/> + + <p conref="../shared/impala_common.xml#common/added_forever"/> + + <p conref="../shared/impala_common.xml#common/column_stats_constant"/> + +<!-- <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> --> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_literals.xml#numeric_literals"/>, <xref href="impala_tinyint.xml#tinyint"/>, + <xref href="impala_smallint.xml#smallint"/>, <xref href="impala_int.xml#int"/>, + <xref href="impala_bigint.xml#bigint"/>, <xref href="impala_decimal.xml#decimal"/>, + <xref href="impala_math_functions.xml#math_functions"/> + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_trouble_bad_results.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_trouble_bad_results.xml b/docs/topics/impala_trouble_bad_results.xml new file mode 100644 index 0000000..dc15c5f --- /dev/null +++ b/docs/topics/impala_trouble_bad_results.xml @@ -0,0 +1,25 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="trouble_bad_results"> + + <title>Troubleshooting Incorrect Results</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Troubleshooting"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">incorrect results</indexterm> + <indexterm audience="Cloudera">bad results</indexterm> + <indexterm audience="Cloudera">wrong results</indexterm> + </p> + + <p outputclass="toc inpage"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_trouble_memory.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_trouble_memory.xml b/docs/topics/impala_trouble_memory.xml new file mode 100644 index 0000000..ae6ab2a --- /dev/null +++ b/docs/topics/impala_trouble_memory.xml @@ -0,0 +1,25 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="trouble_memory"> + + <title>Troubleshooting Out-of-Memory Issues</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Troubleshooting"/> + <data name="Category" value="Memory"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">memory errors</indexterm> + <indexterm audience="Cloudera">out-of-memory errors</indexterm> + </p> + + <p outputclass="toc inpage"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_trouble_query_fail.xml 
---------------------------------------------------------------------- diff --git a/docs/topics/impala_trouble_query_fail.xml b/docs/topics/impala_trouble_query_fail.xml new file mode 100644 index 0000000..6b1d1ee --- /dev/null +++ b/docs/topics/impala_trouble_query_fail.xml @@ -0,0 +1,24 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="trouble_query_fail"> + + <title>Troubleshooting Query Errors</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Troubleshooting"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">query errors</indexterm> + </p> + + <p outputclass="toc inpage"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_trouble_sql.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_trouble_sql.xml b/docs/topics/impala_trouble_sql.xml new file mode 100644 index 0000000..25d79f9 --- /dev/null +++ b/docs/topics/impala_trouble_sql.xml @@ -0,0 +1,25 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="trouble_sql"> + + <title>Troubleshooting SQL Syntax Errors</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Troubleshooting"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">SQL errors</indexterm> + <indexterm audience="Cloudera">syntax errors</indexterm> + </p> + + <p outputclass="toc inpage"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_trouble_startup.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_trouble_startup.xml b/docs/topics/impala_trouble_startup.xml new file mode 100644 index 0000000..ff4b49e --- /dev/null +++ b/docs/topics/impala_trouble_startup.xml @@ -0,0 +1,25 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="trouble_startup"> + + <title>Troubleshooting Startup Problems</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Troubleshooting"/> + <data name="Category" value="Starting and Stopping"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">startup errors</indexterm> + </p> + + <p outputclass="toc inpage"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_troubleshooting.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_troubleshooting.xml b/docs/topics/impala_troubleshooting.xml new file mode 100644 index 0000000..f7ebe4e --- /dev/null +++ b/docs/topics/impala_troubleshooting.xml @@ -0,0 +1,447 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> 
+<concept id="troubleshooting"> + + <title>Troubleshooting Impala</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Troubleshooting"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">troubleshooting</indexterm> + Troubleshooting for Impala requires being able to diagnose and debug problems + with performance, network connectivity, out-of-memory conditions, disk space usage, + and crash or hang conditions in any of the Impala-related daemons. + </p> + + <p outputclass="toc inpage" audience="PDF"> + The following sections describe the general troubleshooting procedures to diagnose + different kinds of problems: + </p> + + </conbody> + + <concept id="trouble_sql"> + + <title>Troubleshooting Impala SQL Syntax Issues</title> + + <conbody> + + <p> + In general, if queries issued against Impala fail, you can try running these same queries against Hive. + </p> + + <ul> + <li> + If a query fails against both Impala and Hive, it is likely that there is a problem with your query or + other elements of your CDH environment: + <ul> + <li> + Review the <xref href="impala_langref.xml#langref">Language Reference</xref> to ensure your query is + valid. + </li> + + <li> + Check <xref href="impala_reserved_words.xml#reserved_words"/> to see if any database, table, + column, or other object names in your query conflict with Impala reserved words. + Quote those names with backticks (<codeph>``</codeph>) if so. + </li> + + <li> + Check <xref href="impala_functions.xml#builtins"/> to confirm whether Impala supports all the + built-in functions being used by your query, and whether argument and return types are the + same as you expect. + </li> + + <li> + Review the <xref href="impala_logging.xml#logs_debug">contents of the Impala logs</xref> for any information that may be useful in identifying the + source of the problem. + </li> + </ul> + </li> + + <li> + If a query fails against Impala but not Hive, it is likely that there is a problem with your Impala + installation. + </li> + </ul> + </conbody> + </concept> + + <concept id="trouble_io" rev="CDH-19201"> + <title>Troubleshooting I/O Capacity Problems</title> + <conbody> + <p> + Impala queries are typically I/O-intensive. If there is an I/O problem with storage devices, + or with HDFS itself, Impala queries could show slow response times with no obvious cause + on the Impala side. Slow I/O on even a single DataNode could result in an overall slowdown, because + queries involving clauses such as <codeph>ORDER BY</codeph>, <codeph>GROUP BY</codeph>, or <codeph>JOIN</codeph> + do not start returning results until all DataNodes have finished their work. 
+ </p> + <p> + To test whether the Linux I/O system itself is performing as expected, run Linux commands like + the following on each DataNode: + </p> +<codeblock> +$ sudo sysctl -w vm.drop_caches=3 vm.drop_caches=0 +vm.drop_caches = 3 +vm.drop_caches = 0 +$ sudo dd if=/dev/sda bs=1M of=/dev/null count=1k +1024+0 records in +1024+0 records out +1073741824 bytes (1.1 GB) copied, 5.60373 s, 192 MB/s +$ sudo dd if=/dev/sdb bs=1M of=/dev/null count=1k +1024+0 records in +1024+0 records out +1073741824 bytes (1.1 GB) copied, 5.51145 s, 195 MB/s +$ sudo dd if=/dev/sdc bs=1M of=/dev/null count=1k +1024+0 records in +1024+0 records out +1073741824 bytes (1.1 GB) copied, 5.58096 s, 192 MB/s +$ sudo dd if=/dev/sdd bs=1M of=/dev/null count=1k +1024+0 records in +1024+0 records out +1073741824 bytes (1.1 GB) copied, 5.43924 s, 197 MB/s +</codeblock> + <p> + On modern hardware, a throughput rate of less than 100 MB/s typically indicates + a performance issue with the storage device. Correct the hardware problem before + continuing with Impala tuning or benchmarking. + </p> + </conbody> + </concept> + + + <concept id="trouble_cookbook"> + + <title>Impala Troubleshooting Quick Reference</title> + + <conbody> + + <p> + The following table lists common problems and potential solutions. + </p> + + <table> + <tgroup cols="3"> + <colspec colname="1" colwidth="10*"/> + <colspec colname="2" colwidth="30*"/> + <colspec colname="3" colwidth="30*"/> + <thead> + <row> + <entry> + Symptom + </entry> + <entry> + Explanation + </entry> + <entry> + Recommendation + </entry> + </row> + </thead> + <tbody> + <row> + <entry> + Impala takes a long time to start. + </entry> + <entry> + Impala instances with large numbers of tables, partitions, or data files take longer to start + because the metadata for these objects is broadcast to all <cmdname>impalad</cmdname> nodes and + cached. + </entry> + <entry> + Adjust timeout and synchronicity settings. + </entry> + </row> + <row> + <entry> + <p> + Joins fail to complete. + </p> + </entry> + <entry> + <p> + There may be insufficient memory. During a join, data from the second, third, and so on sets to + be joined is loaded into memory. If Impala chooses an inefficient join order or join mechanism, + the query could exceed the total memory available. + </p> + </entry> + <entry> + <p> + Start by gathering statistics with the <codeph>COMPUTE STATS</codeph> statement for each table + involved in the join. Consider specifying the <codeph>[SHUFFLE]</codeph> hint so that data from + the joined tables is split up between nodes rather than broadcast to each node. If tuning at the + SQL level is not sufficient, add more memory to your system or join smaller data sets. + </p> + </entry> + </row> + <row> + <entry> + <p> + Queries return incorrect results. + </p> + </entry> + <entry> + <p> + Impala metadata may be outdated after changes are performed in Hive. + </p> + </entry> + <entry> + <p> + Where possible, use the appropriate Impala statement (<codeph>INSERT</codeph>, <codeph>LOAD + DATA</codeph>, <codeph>CREATE TABLE</codeph>, <codeph>ALTER TABLE</codeph>, <codeph>COMPUTE + STATS</codeph>, and so on) rather than switching back and forth between Impala and Hive. Impala + automatically broadcasts the results of DDL and DML operations to all Impala nodes in the + cluster, but does not automatically recognize when such changes are made through Hive. 
After + inserting data, adding a partition, or other operation in Hive, refresh the metadata for the + table as described in <xref href="impala_refresh.xml#refresh"/>. + </p> + </entry> + </row> + <row> + <entry> + <p> + Queries are slow to return results. + </p> + </entry> + <entry> + <p> + Some <codeph>impalad</codeph> instances may not have started. Using a browser, connect to the + host running the Impala state store. Connect using an address of the form + <codeph>http://<varname>hostname</varname>:<varname>port</varname>/metrics</codeph>. + </p> + + <p> + <note> + Replace <varname>hostname</varname> and <varname>port</varname> with the hostname and port of + your Impala state store host machine and web server port. The default port is 25010. + </note> + The number of <codeph>impalad</codeph> instances listed should match the expected number of + <codeph>impalad</codeph> instances installed in the cluster. There should also be one + <codeph>impalad</codeph> instance installed on each DataNode + </p> + </entry> + <entry> + <p> + Ensure Impala is installed on all DataNodes. Start any <codeph>impalad</codeph> instances that + are not running. + </p> + </entry> + </row> + <row> + <entry> + <p> + Queries are slow to return results. + </p> + </entry> + <entry> + <p> + Impala may not be configured to use native checksumming. Native checksumming uses + machine-specific instructions to compute checksums over HDFS data very quickly. Review Impala + logs. If you find instances of "<codeph>INFO util.NativeCodeLoader: Loaded the + native-hadoop</codeph>" messages, native checksumming is not enabled. + </p> + </entry> + <entry> + <p> + Ensure Impala is configured to use native checksumming as described in + <xref href="impala_config_performance.xml#config_performance"/>. + </p> + </entry> + </row> + <row> + <entry> + <p> + Queries are slow to return results. + </p> + </entry> + <entry> + <p> + Impala may not be configured to use data locality tracking. + </p> + </entry> + <entry> + <p> + Test Impala for data locality tracking and make configuration changes as necessary. Information + on this process can be found in <xref href="impala_config_performance.xml#config_performance"/>. + </p> + </entry> + </row> + <row> + <entry> + <p> + Attempts to complete Impala tasks such as executing INSERT-SELECT actions fail. The Impala logs + include notes that files could not be opened due to permission denied. + </p> + </entry> + <entry> + <p> + This can be the result of permissions issues. For example, you could use the Hive shell as the + hive user to create a table. After creating this table, you could attempt to complete some + action, such as an INSERT-SELECT on the table. Because the table was created using one user and + the INSERT-SELECT is attempted by another, this action may fail due to permissions issues. + </p> + </entry> + <entry> + <p> + In general, ensure the Impala user has sufficient permissions. In the preceding example, ensure + the Impala user has sufficient permissions to the table that the Hive user created. + </p> + </entry> + </row> + <row rev="IMP-1210"> + <entry> + <p> + Impala fails to start up, with the <cmdname>impalad</cmdname> logs referring to errors connecting + to the statestore service and attempts to re-register. + </p> + </entry> + <entry> + <p> + A large number of databases, tables, partitions, and so on can require metadata synchronization, + particularly on startup, that takes longer than the default timeout for the statestore service. 
+            </p>
+          </entry>
+          <entry>
+            <p>
+              Configure the statestore timeout value and possibly other settings related to the frequency of
+              statestore updates and metadata loading. See
+              <xref href="impala_timeouts.xml#statestore_timeout"/> and
+              <xref href="impala_scalability.xml#statestore_scalability"/>.
+            </p>
+          </entry>
+        </row>
+      </tbody>
+      </tgroup>
+      </table>
+
+      <p audience="Cloudera">
+        Some or all of these settings might also be useful.
+<codeblock>NUM_SCANNER_THREADS: 0
+ABORT_ON_DEFAULT_LIMIT_EXCEEDED: 0
+MAX_IO_BUFFERS: 0
+DEFAULT_ORDER_BY_LIMIT: -1
+BATCH_SIZE: 0
+NUM_NODES: 0
+DISABLE_CODEGEN: 0
+MAX_ERRORS: 0
+ABORT_ON_ERROR: 0
+MAX_SCAN_RANGE_LENGTH: 0
+ALLOW_UNSUPPORTED_FORMATS: 0
+SUPPORT_START_OVER: false
+DEBUG_ACTION:
+MEM_LIMIT: 0
+</codeblock>
+      </p>
+    </conbody>
+  </concept>
+
+  <concept audience="Cloudera" id="core_dumps">
+
+    <title>Enabling Core Dumps for Impala</title>
+
+    <conbody>
+
+      <p>
+        Fill in details, then unhide.
+      </p>
+
+      <p>
+        From Nong:
+      </p>
+
+      <p>
+        In a CM-managed cluster, search for "core" from the impala configuration page. You should see the "enable
+        core dump" config.
+      </p>
+
+      <p>
+        From <xref href="impala_config_options.xml#config_options"/>:
+      </p>
+
+<codeblock>export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}</codeblock>
+
+      <note conref="../shared/impala_common.xml#common/core_dump_considerations"/>
+
+      <p></p>
+    </conbody>
+  </concept>
+
+  <concept audience="Cloudera" id="io_throughput">
+    <title>Verifying I/O Throughput</title>
+    <conbody>
+      <p>
+        Optimal Impala query performance depends on being able to perform I/O across multiple storage devices
+        in parallel, with the data transferred at or close to the maximum throughput for each device.
+        If a hardware or configuration issue causes a reduction in I/O throughput, even if the problem only
+        affects a subset of storage devices, you might experience
+        slow query performance that cannot be improved by using regular SQL tuning techniques.
+      </p>
+      <p>
+        As a general guideline, expect each commodity storage device (for example, a standard rotational
+        hard drive) to be able to transfer approximately 100 MB per second. If you see persistent slow query
+        performance, examine the Impala logs to check whether the scan rate reported for each disk is
+        noticeably lower than that expected rate.
+      </p>
+
+<codeblock>
+<![CDATA[
+Useful test from beta at Visa.
+SME: Jayant@
+
+Symptoms:
+* Queries running slow
+* Scan rate of IO in Impala logs shows noticeably less than expected IO rate for each disk (a typical commodity disk should provide ~100 MB/s)
+
+Actions:
+* Validate disk read from OS to confirm no issue at hardware or OS level
+* Validate disk read at HDFS to see if issue at HDFS config
+
+Specifics:
+Testing Linux and hardware IO:
+# First run:
+sudo sysctl -w vm.drop_caches=3 vm.drop_caches=0
+
+# Then run the reads in parallel, one per device:
+sudo dd if=/dev/sda bs=1M of=/dev/null count=1k &
+sudo dd if=/dev/sdb bs=1M of=/dev/null count=1k &
+sudo dd if=/dev/sdc bs=1M of=/dev/null count=1k &
+sudo dd if=/dev/sdd bs=1M of=/dev/null count=1k &
+wait
+
+Testing HDFS IO:
+# You can use TestDFSIO. It is documented here: http://answers.oreilly.com/topic/460-how-to-benchmark-a-hadoop-cluster/
+# You can also use sar, dd and iostat for monitoring the disk.
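+# For example, illustrative monitoring invocations (both require the sysstat package; the interval and count are arbitrary):
+#   iostat -dxm 5       # extended per-device statistics in MB/s, sampled every 5 seconds
+#   sar -d -p 5 12      # block device activity, 12 samples at 5-second intervals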
+ +# writes 10 files each of 1000 MB +hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000 + +# run the read benchmark +hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000 + +# clean up the data +hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean +]]> +</codeblock> + + </conbody> + </concept> + + <concept id="webui_snippet" audience="PDF"> + <title conref="impala_webui.xml#webui/webui_title"/> + <conbody> + <p conref="impala_webui.xml#webui/webui_intro"/> + <p> + For full details, see <xref href="impala_webui.xml#webui"/>. + </p> + </conbody> + </concept> + +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_truncate_table.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_truncate_table.xml b/docs/topics/impala_truncate_table.xml new file mode 100644 index 0000000..72f072f --- /dev/null +++ b/docs/topics/impala_truncate_table.xml @@ -0,0 +1,184 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.3.0" id="truncate_table"> + + <title>TRUNCATE TABLE Statement (<keyword keyref="impala23"/> or higher only)</title> + <titlealts audience="PDF"><navtitle>TRUNCATE TABLE</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="DDL"/> + <data name="Category" value="S3"/> + <data name="Category" value="Tables"/> + <data name="Category" value="Disk Storage"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p rev="2.3.0"> + <indexterm audience="Cloudera">TRUNCATE TABLE statement</indexterm> + Removes the data from an Impala table while leaving the table itself. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>TRUNCATE TABLE <ph rev="2.5.0 IMPALA-2641">[IF EXISTS]</ph> [<varname>db_name</varname>.]<varname>table_name</varname></codeblock> + + <p conref="../shared/impala_common.xml#common/ddl_blurb"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Often used to empty tables that are used during ETL cycles, after the data has been copied to another + table for the next stage of processing. This statement is a low-overhead alternative to dropping and + recreating the table, or using <codeph>INSERT OVERWRITE</codeph> to replace the data during the + next ETL cycle. + </p> + + <p> + This statement removes all the data and associated data files in the table. It can remove data files from internal tables, + external tables, partitioned tables, and tables mapped to HBase or the Amazon Simple Storage Service (S3). + The data removal applies to the entire table, including all partitions of a partitioned table. + </p> + + <p> + Any statistics produced by the <codeph>COMPUTE STATS</codeph> statement are reset when the data is removed. + </p> + + <p> + Make sure that you are in the correct database before truncating a table, either by issuing a + <codeph>USE</codeph> statement first or by using a fully qualified name + <codeph><varname>db_name</varname>.<varname>table_name</varname></codeph>. + </p> + + <p rev="2.5.0 IMPALA-2641"> + The optional <codeph>IF EXISTS</codeph> clause makes the statement succeed whether or not the table exists. 
+      If the table does exist, it is truncated; if it does not exist, the statement has no effect. This capability is
+      useful in standardized setup scripts that might be run both before and after some of the tables exist.
+      This clause is available in <keyword keyref="impala25_full"/> and higher.
+    </p>
+
+    <p>
+      Any HDFS data files removed by this statement go into the HDFS trashcan, from which you can recover them
+      within a defined time interval if this operation turns out to be a mistake.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/disk_space_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/s3_blurb"/>
+    <p rev="2.2.0">
+      Although Impala cannot write new data to a table stored in the Amazon
+      S3 filesystem, the <codeph>TRUNCATE TABLE</codeph> statement can remove data files from S3.
+      See <xref href="impala_s3.xml#s3"/> for details about working with S3 tables.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+
+    <p conref="../shared/impala_common.xml#common/permissions_blurb"/>
+    <p rev="CDH-19187">
+      The user ID that the <cmdname>impalad</cmdname> daemon runs under,
+      typically the <codeph>impala</codeph> user, must have write
+      permission for all the files and directories that make up the table.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+    <p>
+      The following example shows a table containing some data and with table and column statistics.
+      After the <codeph>TRUNCATE TABLE</codeph> statement, the data is removed and the statistics
+      are reset.
+    </p>
+
+<codeblock>CREATE TABLE truncate_demo (x INT);
+INSERT INTO truncate_demo VALUES (1), (2), (4), (8);
+SELECT COUNT(*) FROM truncate_demo;
++----------+
+| count(*) |
++----------+
+| 4        |
++----------+
+COMPUTE STATS truncate_demo;
++-----------------------------------------+
+| summary                                 |
++-----------------------------------------+
+| Updated 1 partition(s) and 1 column(s). |
++-----------------------------------------+
+SHOW TABLE STATS truncate_demo;
++-------+--------+------+--------------+-------------------+--------+-------------------+
+| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
++-------+--------+------+--------------+-------------------+--------+-------------------+
+| 4     | 1      | 8B   | NOT CACHED   | NOT CACHED        | TEXT   | false             |
++-------+--------+------+--------------+-------------------+--------+-------------------+
+SHOW COLUMN STATS truncate_demo;
++--------+------+------------------+--------+----------+----------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++--------+------+------------------+--------+----------+----------+
+| x      | INT  | 4                | -1     | 4        | 4        |
++--------+------+------------------+--------+----------+----------+
+
+-- After this statement, the data and the table/column stats will be gone.
+TRUNCATE TABLE truncate_demo;
+
+SELECT COUNT(*) FROM truncate_demo;
++----------+
+| count(*) |
++----------+
+| 0        |
++----------+
+SHOW TABLE STATS truncate_demo;
++-------+--------+------+--------------+-------------------+--------+-------------------+
+| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
++-------+--------+------+--------------+-------------------+--------+-------------------+
+| -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | TEXT   | false             |
++-------+--------+------+--------------+-------------------+--------+-------------------+
+SHOW COLUMN STATS truncate_demo;
++--------+------+------------------+--------+----------+----------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++--------+------+------------------+--------+----------+----------+
+| x      | INT  | -1               | -1     | 4        | 4        |
++--------+------+------------------+--------+----------+----------+
+</codeblock>
+
+    <p>
+      The following example shows how the <codeph>IF EXISTS</codeph> clause allows the <codeph>TRUNCATE TABLE</codeph>
+      statement to be run without error whether or not the table exists:
+    </p>
+
+<codeblock rev="2.5.0 IMPALA-2641">CREATE TABLE staging_table1 (x INT, s STRING);
+Fetched 0 row(s) in 0.33s
+
+SHOW TABLES LIKE 'staging*';
++----------------+
+| name           |
++----------------+
+| staging_table1 |
++----------------+
+Fetched 1 row(s) in 0.25s
+
+-- Our ETL process involves removing all data from several staging tables,
+-- even though some might already be dropped or not yet created.
+
+TRUNCATE TABLE IF EXISTS staging_table1;
+Fetched 0 row(s) in 5.04s
+
+TRUNCATE TABLE IF EXISTS staging_table2;
+Fetched 0 row(s) in 0.25s
+
+TRUNCATE TABLE IF EXISTS staging_table3;
+Fetched 0 row(s) in 0.25s
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      <xref href="impala_tables.xml#tables"/>,
+      <xref href="impala_alter_table.xml#alter_table"/>, <xref href="impala_create_table.xml#create_table"/>,
+      <xref href="impala_partitioning.xml#partitioning"/>, <xref href="impala_tables.xml#internal_tables"/>,
+      <xref href="impala_tables.xml#external_tables"/>
+    </p>
+
+  </conbody>
+</concept>
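The usage notes in this topic mention emptying staging tables during ETL cycles after the data has been copied to the next stage of processing. A minimal sketch of that pattern, using hypothetical table names (staging_events and events_archive are placeholders, not tables from the examples above):

<codeblock>-- Copy the staged data onward, then empty the staging table so it is ready
-- for the next ETL cycle. IF EXISTS keeps the script safe to re-run.
INSERT INTO events_archive SELECT * FROM staging_events;
TRUNCATE TABLE IF EXISTS staging_events;
</codeblock>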
