http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_components.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_components.xml b/docs/topics/impala_components.xml new file mode 100644 index 0000000..44e5c34 --- /dev/null +++ b/docs/topics/impala_components.xml @@ -0,0 +1,180 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="intro_components"> + + <title>Components of the Impala Server</title> + <titlealts audience="PDF"><navtitle>Components</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Concepts"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + The Impala server is a distributed, massively parallel processing (MPP) database engine. It consists of + different daemon processes that run on specific hosts within your CDH cluster. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="intro_impalad"> + + <title>The Impala Daemon</title> + + <conbody> + + <p> + The core Impala component is a daemon process that runs on each DataNode of the cluster, physically represented + by the <codeph>impalad</codeph> process. It reads and writes to data files; accepts queries transmitted + from the <codeph>impala-shell</codeph> command, Hue, JDBC, or ODBC; parallelizes the queries and + distributes work across the cluster; and transmits intermediate query results back to the + central coordinator node. + </p> + + <p> + You can submit a query to the Impala daemon running on any DataNode, and that instance of the daemon serves as the + <term>coordinator node</term> for that query. 
The other nodes transmit partial results back to the
+ coordinator, which constructs the final result set for a query. When experimenting with functionality
+ through the <codeph>impala-shell</codeph> command, you might always connect to the same Impala daemon for
+ convenience. For clusters running production workloads, you might load-balance by
+ submitting each query to a different Impala daemon in round-robin style, using the JDBC or ODBC interfaces.
+ </p>
+
+ <p>
+ The Impala daemons are in constant communication with the <term>statestore</term>, to confirm which nodes
+ are healthy and can accept new work.
+ </p>
+
+ <p rev="1.2">
+ They also receive broadcast messages from the <cmdname>catalogd</cmdname> daemon (introduced in Impala 1.2)
+ whenever any Impala node in the cluster creates, alters, or drops any type of object, or when an
+ <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> statement is processed through Impala. This
+ background communication minimizes the need for <codeph>REFRESH</codeph> or <codeph>INVALIDATE
+ METADATA</codeph> statements that were needed to coordinate metadata across nodes prior to Impala 1.2.
+ </p>
+
+ <p>
+ <b>Related information:</b> <xref href="impala_config_options.xml#config_options"/>,
+ <xref href="impala_processes.xml#processes"/>, <xref href="impala_timeouts.xml#impalad_timeout"/>,
+ <xref href="impala_ports.xml#ports"/>, <xref href="impala_proxy.xml#proxy"/>
+ </p>
+ </conbody>
+ </concept>
+
+ <concept id="intro_statestore">
+
+ <title>The Impala Statestore</title>
+
+ <conbody>
+
+ <p>
+ The Impala component known as the <term>statestore</term> checks on the health of Impala daemons on all the
+ DataNodes in a cluster, and continuously relays its findings to each of those daemons. It is physically
+ represented by a daemon process named <codeph>statestored</codeph>; you only need such a process on one
+ host in the cluster.
If an Impala daemon goes offline due to hardware failure, network error, software issue, + or other reason, the statestore informs all the other Impala daemons so that future queries can avoid making + requests to the unreachable node. + </p> + + <p> + Because the statestore's purpose is to help when things go wrong, it is not critical to the normal + operation of an Impala cluster. If the statestore is not running or becomes unreachable, the Impala daemons + continue running and distributing work among themselves as usual; the cluster just becomes less robust if + other Impala daemons fail while the statestore is offline. When the statestore comes back online, it re-establishes + communication with the Impala daemons and resumes its monitoring function. + </p> + + <p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/> + + <p> + <b>Related information:</b> + </p> + + <p> + <xref href="impala_scalability.xml#statestore_scalability"/>, + <xref href="impala_config_options.xml#config_options"/>, <xref href="impala_processes.xml#processes"/>, + <xref href="impala_timeouts.xml#statestore_timeout"/>, <xref href="impala_ports.xml#ports"/> + </p> + </conbody> + </concept> + + <concept rev="1.2" id="intro_catalogd"> + + <title>The Impala Catalog Service</title> + + <conbody> + + <p> + The Impala component known as the <term>catalog service</term> relays the metadata changes from Impala SQL + statements to all the DataNodes in a cluster. It is physically represented by a daemon process named + <codeph>catalogd</codeph>; you only need such a process on one host in the cluster. Because the requests + are passed through the statestore daemon, it makes sense to run the <cmdname>statestored</cmdname> and + <cmdname>catalogd</cmdname> services on the same host. 
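When metadata does change through Hive or by direct file manipulation, you can bring an Impala node
+ up to date with the <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements. A minimal
+ sketch, using hypothetical table names:
+ </p>
+
+<codeblock>-- After a table is created through Hive:
+invalidate metadata new_table;
+-- After data files are added to an existing table outside of Impala:
+refresh existing_table;</codeblock>
+
+ <p>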
+ </p>
+
+ <p>
+ The catalog service avoids the need to issue
+ <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements when the metadata changes are
+ performed by statements issued through Impala. When you create a table, load data, and so on through Hive,
+ you do need to issue <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> on an Impala node
+ before executing a query there.
+ </p>
+
+ <p>
+ This feature touches a number of aspects of Impala:
+ </p>
+
+<!-- This was formerly a conref, but since the list of links also included a link
+ to this same topic, materializing the list here and removing that
+ circular link. (The conref is still used in Incompatible Changes.)
+
+ <ul conref="../shared/impala_common.xml#common/catalogd_xrefs">
+ <li/>
+ </ul>
+-->
+
+ <ul id="catalogd_xrefs">
+ <li>
+ <p>
+ See <xref href="impala_install.xml#install"/>, <xref href="impala_upgrading.xml#upgrading"/>, and
+ <xref href="impala_processes.xml#processes"/> for usage information about the
+ <cmdname>catalogd</cmdname> daemon.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ The <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements are not needed
+ when the <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, or other table-changing or
+ data-changing operation is performed through Impala. These statements are still needed if such
+ operations are done through Hive or by manipulating data files directly in HDFS, but in those cases the
+ statements only need to be issued on one Impala node rather than on all nodes. See
+ <xref href="impala_refresh.xml#refresh"/> and
+ <xref href="impala_invalidate_metadata.xml#invalidate_metadata"/> for the latest usage information for
+ those statements.
+ </p> + </li> + </ul> + + <p conref="../shared/impala_common.xml#common/load_catalog_in_background"/> + + <p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/> + + <note> + <p conref="../shared/impala_common.xml#common/catalog_server_124"/> + </note> + + <p> + <b>Related information:</b> <xref href="impala_config_options.xml#config_options"/>, + <xref href="impala_processes.xml#processes"/>, <xref href="impala_ports.xml#ports"/> + </p> + </conbody> + </concept> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_compression_codec.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_compression_codec.xml b/docs/topics/impala_compression_codec.xml new file mode 100644 index 0000000..739c651 --- /dev/null +++ b/docs/topics/impala_compression_codec.xml @@ -0,0 +1,98 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.0.0" id="compression_codec"> + + <title>COMPRESSION_CODEC Query Option (<keyword keyref="impala20"/> or higher only)</title> + <titlealts audience="PDF"><navtitle>COMPRESSION_CODEC</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Compression"/> + <data name="Category" value="File Formats"/> + <data name="Category" value="Parquet"/> + <data name="Category" value="Snappy"/> + <data name="Category" value="Gzip"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + +<!-- The initial part of this paragraph is copied straight from the #parquet_compression topic. --> + +<!-- Could turn into a conref. --> + + <p rev="2.0.0"> + <indexterm audience="Cloudera">COMPRESSION_CODEC query option</indexterm> + When Impala writes Parquet data files using the <codeph>INSERT</codeph> statement, the underlying compression + is controlled by the <codeph>COMPRESSION_CODEC</codeph> query option. + </p> + + <note> + Prior to Impala 2.0, this option was named <codeph>PARQUET_COMPRESSION_CODEC</codeph>. In Impala 2.0 and + later, the <codeph>PARQUET_COMPRESSION_CODEC</codeph> name is not recognized. Use the more general name + <codeph>COMPRESSION_CODEC</codeph> for new code. 
+ </note> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>SET COMPRESSION_CODEC=<varname>codec_name</varname>;</codeblock> + + <p> + The allowed values for this query option are <codeph>SNAPPY</codeph> (the default), <codeph>GZIP</codeph>, + and <codeph>NONE</codeph>. + </p> + + <note> + A Parquet file created with <codeph>COMPRESSION_CODEC=NONE</codeph> is still typically smaller than the + original data, due to encoding schemes such as run-length encoding and dictionary encoding that are applied + separately from compression. + </note> + + <p></p> + + <p> + The option value is not case-sensitive. + </p> + + <p> + If the option is set to an unrecognized value, all kinds of queries will fail due to the invalid option + setting, not just queries involving Parquet tables. (The value <codeph>BZIP2</codeph> is also recognized, but + is not compatible with Parquet tables.) + </p> + + <p> + <b>Type:</b> <codeph>STRING</codeph> + </p> + + <p> + <b>Default:</b> <codeph>SNAPPY</codeph> + </p> + + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + +<codeblock>set compression_codec=gzip; +insert into parquet_table_highly_compressed select * from t1; + +set compression_codec=snappy; +insert into parquet_table_compression_plus_fast_queries select * from t1; + +set compression_codec=none; +insert into parquet_table_no_compression select * from t1; + +set compression_codec=foo; +select * from t1 limit 5; +ERROR: Invalid compression codec: foo +</codeblock> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + For information about how compressing Parquet data files affects query performance, see + <xref href="impala_parquet.xml#parquet_compression"/>. 
+ </p>
+ </conbody>
+</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_compute_stats.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_compute_stats.xml b/docs/topics/impala_compute_stats.xml new file mode 100644 index 0000000..b915b77 --- /dev/null +++ b/docs/topics/impala_compute_stats.xml @@ -0,0 +1,432 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.2.2" id="compute_stats">
+
+ <title>COMPUTE STATS Statement</title>
+ <titlealts audience="PDF"><navtitle>COMPUTE STATS</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Performance"/>
+ <data name="Category" value="Scalability"/>
+ <data name="Category" value="ETL"/>
+ <data name="Category" value="Ingest"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Tables"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Data Analysts"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">COMPUTE STATS statement</indexterm>
+ Gathers information about the volume and distribution of data in a table and all associated columns and
+ partitions. The information is stored in the metastore database and used by Impala to help optimize queries.
+ For example, if Impala can determine that a table is large or small, or has many or few distinct values, it
+ can organize and parallelize the work appropriately for a join query or insert operation. For details about the
+ kinds of information gathered by this statement, see <xref href="impala_perf_stats.xml#perf_stats"/>.
+ </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock rev="2.1.0">COMPUTE STATS [<varname>db_name</varname>.]<varname>table_name</varname> +COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>partition_spec</varname>)] + +<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname> +</codeblock> + + <p conref="../shared/impala_common.xml#common/incremental_partition_spec"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Originally, Impala relied on users to run the Hive <codeph>ANALYZE TABLE</codeph> statement, but that method + of gathering statistics proved unreliable and difficult to use. The Impala <codeph>COMPUTE STATS</codeph> + statement is built from the ground up to improve the reliability and user-friendliness of this operation. + <codeph>COMPUTE STATS</codeph> does not require any setup steps or special configuration. You only run a + single Impala <codeph>COMPUTE STATS</codeph> statement to gather both table and column statistics, rather + than separate Hive <codeph>ANALYZE TABLE</codeph> statements for each kind of statistics. + </p> + + <p rev="2.1.0"> + The <codeph>COMPUTE INCREMENTAL STATS</codeph> variation is a shortcut for partitioned tables that works on a + subset of partitions rather than the entire table. The incremental nature makes it suitable for large tables + with many partitions, where a full <codeph>COMPUTE STATS</codeph> operation takes too long to be practical + each time a partition is added or dropped. See <xref href="impala_perf_stats.xml#perf_stats_incremental"/> + for full usage details. + </p> + + <p> + <codeph>COMPUTE INCREMENTAL STATS</codeph> only applies to partitioned tables. If you use the + <codeph>INCREMENTAL</codeph> clause for an unpartitioned table, Impala automatically uses the original + <codeph>COMPUTE STATS</codeph> statement. 
Such tables display <codeph>false</codeph> under the
+ <codeph>Incremental stats</codeph> column of the <codeph>SHOW TABLE STATS</codeph> output.
+ </p>
+
+ <note>
+ Because many of the most performance-critical and resource-intensive operations rely on table and column
+ statistics to construct accurate and efficient plans, <codeph>COMPUTE STATS</codeph> is an important step at
+ the end of your ETL process. Run <codeph>COMPUTE STATS</codeph> on all tables as your first step during
+ performance tuning for slow queries, or when troubleshooting out-of-memory conditions:
+ <ul>
+ <li>
+ Accurate statistics help Impala construct an efficient query plan for join queries, improving performance
+ and reducing memory usage.
+ </li>
+
+ <li>
+ Accurate statistics help Impala distribute the work effectively for insert operations into Parquet
+ tables, improving performance and reducing memory usage.
+ </li>
+
+ <li rev="1.3.0">
+ Accurate statistics help Impala estimate the memory required for each query, which is important when you
+ use resource management features, such as admission control and the YARN resource management framework.
+ The statistics help Impala to achieve high concurrency, make full use of available memory, and avoid
+ contention with workloads from other Hadoop components.
+ </li>
+ </ul>
+ </note>
+
+ <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+ <p rev="2.3.0">
+ Currently, the statistics created by the <codeph>COMPUTE STATS</codeph> statement do not include
+ information about complex type columns. The column stats metrics for complex columns are always shown
+ as -1. For queries involving complex type columns, Impala uses
+ heuristics to estimate the data distribution within such columns.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/hbase_blurb"/>
+
+ <p>
+ <codeph>COMPUTE STATS</codeph> works for HBase tables also.
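The syntax is the same as for tables stored in HDFS; for example, with a hypothetical
+ HBase-backed table named <codeph>hbase_events</codeph>:
+ </p>
+
+<codeblock>compute stats hbase_events;
+show table stats hbase_events;
+show column stats hbase_events;</codeblock>
+
+ <p>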
The statistics gathered for HBase tables are
+ somewhat different from those for HDFS-backed tables, but that metadata is still used for optimization when HBase
+ tables are involved in join queries.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/s3_blurb"/>
+
+ <p rev="2.2.0">
+ <codeph>COMPUTE STATS</codeph> also works for tables where data resides in the Amazon Simple Storage Service (S3).
+ See <xref href="impala_s3.xml#s3"/> for details.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/performance_blurb"/>
+
+ <p>
+ The statistics collected by <codeph>COMPUTE STATS</codeph> are used to optimize join queries,
+ <codeph>INSERT</codeph> operations into Parquet tables, and other resource-intensive kinds of SQL statements.
+ See <xref href="impala_perf_stats.xml#perf_stats"/> for details.
+ </p>
+
+ <p>
+ For large tables, the <codeph>COMPUTE STATS</codeph> statement itself might take a long time and you
+ might need to tune its performance. The <codeph>COMPUTE STATS</codeph> statement does not work with the
+ <codeph>EXPLAIN</codeph> statement or the <codeph>SUMMARY</codeph> command in <cmdname>impala-shell</cmdname>.
+ You can use the <codeph>PROFILE</codeph> statement in <cmdname>impala-shell</cmdname> to examine timing information
+ for the statement as a whole. If a basic <codeph>COMPUTE STATS</codeph> statement takes a long time for a
+ partitioned table, consider switching to the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax so that only
+ newly added partitions are analyzed each time.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+ <p>
+ This example shows two tables, <codeph>T1</codeph> and <codeph>T2</codeph>, with a small number of distinct
+ values, linked by a parent-child relationship between <codeph>T1.ID</codeph> and <codeph>T2.PARENT</codeph>.
+ <codeph>T1</codeph> is tiny, while <codeph>T2</codeph> has approximately 100K rows.
Initially, the statistics
+ include physical measurements such as the number of files, the total size, and size measurements for
+ fixed-length columns such as those of <codeph>INT</codeph> type. Unknown values are represented by -1. After
+ running <codeph>COMPUTE STATS</codeph> for each table, much more information is available through the
+ <codeph>SHOW STATS</codeph> statements. If you were running a join query involving both of these tables, you
+ would need statistics for both tables to get the most effective optimization for the query.
+ </p>
+
+<!-- Note: chopped off any excess characters at position 87 and after,
+ to avoid weird wrapping in PDF.
+ Applies to any subsequent examples with output from SHOW ... STATS too. -->
+
+<codeblock>[localhost:21000] > show table stats t1;
+Query: show table stats t1
++-------+--------+------+--------+
+| #Rows | #Files | Size | Format |
++-------+--------+------+--------+
+| -1 | 1 | 33B | TEXT |
++-------+--------+------+--------+
+Returned 1 row(s) in 0.02s
+[localhost:21000] > show table stats t2;
+Query: show table stats t2
++-------+--------+----------+--------+
+| #Rows | #Files | Size | Format |
++-------+--------+----------+--------+
+| -1 | 28 | 960.00KB | TEXT |
++-------+--------+----------+--------+
+Returned 1 row(s) in 0.01s
+[localhost:21000] > show column stats t1;
+Query: show column stats t1
++--------+--------+------------------+--------+----------+----------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++--------+--------+------------------+--------+----------+----------+
+| id | INT | -1 | -1 | 4 | 4 |
+| s | STRING | -1 | -1 | -1 | -1 |
++--------+--------+------------------+--------+----------+----------+
+Returned 2 row(s) in 1.71s
+[localhost:21000] > show column stats t2;
+Query: show column stats t2
++--------+--------+------------------+--------+----------+----------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++--------+--------+------------------+--------+----------+----------+ +| parent | INT | -1 | -1 | 4 | 4 | +| s | STRING | -1 | -1 | -1 | -1 | ++--------+--------+------------------+--------+----------+----------+ +Returned 2 row(s) in 0.01s +[localhost:21000] > compute stats t1; +Query: compute stats t1 ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 2 column(s). | ++-----------------------------------------+ +Returned 1 row(s) in 5.30s +[localhost:21000] > show table stats t1; +Query: show table stats t1 ++-------+--------+------+--------+ +| #Rows | #Files | Size | Format | ++-------+--------+------+--------+ +| 3 | 1 | 33B | TEXT | ++-------+--------+------+--------+ +Returned 1 row(s) in 0.01s +[localhost:21000] > show column stats t1; +Query: show column stats t1 ++--------+--------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------+--------+------------------+--------+----------+----------+ +| id | INT | 3 | -1 | 4 | 4 | +| s | STRING | 3 | -1 | -1 | -1 | ++--------+--------+------------------+--------+----------+----------+ +Returned 2 row(s) in 0.02s +[localhost:21000] > compute stats t2; +Query: compute stats t2 ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 2 column(s). 
| ++-----------------------------------------+ +Returned 1 row(s) in 5.70s +[localhost:21000] > show table stats t2; +Query: show table stats t2 ++-------+--------+----------+--------+ +| #Rows | #Files | Size | Format | ++-------+--------+----------+--------+ +| 98304 | 1 | 960.00KB | TEXT | ++-------+--------+----------+--------+ +Returned 1 row(s) in 0.03s +[localhost:21000] > show column stats t2; +Query: show column stats t2 ++--------+--------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------+--------+------------------+--------+----------+----------+ +| parent | INT | 3 | -1 | 4 | 4 | +| s | STRING | 6 | -1 | 14 | 9.3 | ++--------+--------+------------------+--------+----------+----------+ +Returned 2 row(s) in 0.01s</codeblock> + + <p rev="2.1.0"> + The following example shows how to use the <codeph>INCREMENTAL</codeph> clause, available in Impala 2.1.0 and + higher. The <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax lets you collect statistics for newly added or + changed partitions, without rescanning the entire table. + </p> + +<codeblock>-- Initially the table has no incremental stats, as indicated +-- by -1 under #Rows and false under Incremental stats. 
+show table stats item_partitioned; ++-------------+-------+--------+----------+--------------+---------+------------------ +| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats ++-------------+-------+--------+----------+--------------+---------+------------------ +| Books | -1 | 1 | 223.74KB | NOT CACHED | PARQUET | false +| Children | -1 | 1 | 230.05KB | NOT CACHED | PARQUET | false +| Electronics | -1 | 1 | 232.67KB | NOT CACHED | PARQUET | false +| Home | -1 | 1 | 232.56KB | NOT CACHED | PARQUET | false +| Jewelry | -1 | 1 | 223.72KB | NOT CACHED | PARQUET | false +| Men | -1 | 1 | 231.25KB | NOT CACHED | PARQUET | false +| Music | -1 | 1 | 237.90KB | NOT CACHED | PARQUET | false +| Shoes | -1 | 1 | 234.90KB | NOT CACHED | PARQUET | false +| Sports | -1 | 1 | 227.97KB | NOT CACHED | PARQUET | false +| Women | -1 | 1 | 226.27KB | NOT CACHED | PARQUET | false +| Total | -1 | 10 | 2.25MB | 0B | | ++-------------+-------+--------+----------+--------------+---------+------------------ + +-- After the first COMPUTE INCREMENTAL STATS, +-- all partitions have stats. +compute incremental stats item_partitioned; ++-------------------------------------------+ +| summary | ++-------------------------------------------+ +| Updated 10 partition(s) and 21 column(s). 
| ++-------------------------------------------+ +show table stats item_partitioned; ++-------------+-------+--------+----------+--------------+---------+------------------ +| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats ++-------------+-------+--------+----------+--------------+---------+------------------ +| Books | 1733 | 1 | 223.74KB | NOT CACHED | PARQUET | true +| Children | 1786 | 1 | 230.05KB | NOT CACHED | PARQUET | true +| Electronics | 1812 | 1 | 232.67KB | NOT CACHED | PARQUET | true +| Home | 1807 | 1 | 232.56KB | NOT CACHED | PARQUET | true +| Jewelry | 1740 | 1 | 223.72KB | NOT CACHED | PARQUET | true +| Men | 1811 | 1 | 231.25KB | NOT CACHED | PARQUET | true +| Music | 1860 | 1 | 237.90KB | NOT CACHED | PARQUET | true +| Shoes | 1835 | 1 | 234.90KB | NOT CACHED | PARQUET | true +| Sports | 1783 | 1 | 227.97KB | NOT CACHED | PARQUET | true +| Women | 1790 | 1 | 226.27KB | NOT CACHED | PARQUET | true +| Total | 17957 | 10 | 2.25MB | 0B | | ++-------------+-------+--------+----------+--------------+---------+------------------ + +-- Add a new partition... +alter table item_partitioned add partition (i_category='Camping'); +-- Add or replace files in HDFS outside of Impala, +-- rendering the stats for a partition obsolete. +!import_data_into_sports_partition.sh +refresh item_partitioned; +drop incremental stats item_partitioned partition (i_category='Sports'); +-- Now some partitions have incremental stats +-- and some do not. 
+show table stats item_partitioned; ++-------------+-------+--------+----------+--------------+---------+------------------ +| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats ++-------------+-------+--------+----------+--------------+---------+------------------ +| Books | 1733 | 1 | 223.74KB | NOT CACHED | PARQUET | true +| Camping | -1 | 1 | 408.02KB | NOT CACHED | PARQUET | false +| Children | 1786 | 1 | 230.05KB | NOT CACHED | PARQUET | true +| Electronics | 1812 | 1 | 232.67KB | NOT CACHED | PARQUET | true +| Home | 1807 | 1 | 232.56KB | NOT CACHED | PARQUET | true +| Jewelry | 1740 | 1 | 223.72KB | NOT CACHED | PARQUET | true +| Men | 1811 | 1 | 231.25KB | NOT CACHED | PARQUET | true +| Music | 1860 | 1 | 237.90KB | NOT CACHED | PARQUET | true +| Shoes | 1835 | 1 | 234.90KB | NOT CACHED | PARQUET | true +| Sports | -1 | 1 | 227.97KB | NOT CACHED | PARQUET | false +| Women | 1790 | 1 | 226.27KB | NOT CACHED | PARQUET | true +| Total | 17957 | 11 | 2.65MB | 0B | | ++-------------+-------+--------+----------+--------------+---------+------------------ + +-- After another COMPUTE INCREMENTAL STATS, +-- all partitions have incremental stats, and only the 2 +-- partitions without incremental stats were scanned. +compute incremental stats item_partitioned; ++------------------------------------------+ +| summary | ++------------------------------------------+ +| Updated 2 partition(s) and 21 column(s). 
| ++------------------------------------------+ +show table stats item_partitioned; ++-------------+-------+--------+----------+--------------+---------+------------------ +| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats ++-------------+-------+--------+----------+--------------+---------+------------------ +| Books | 1733 | 1 | 223.74KB | NOT CACHED | PARQUET | true +| Camping | 5328 | 1 | 408.02KB | NOT CACHED | PARQUET | true +| Children | 1786 | 1 | 230.05KB | NOT CACHED | PARQUET | true +| Electronics | 1812 | 1 | 232.67KB | NOT CACHED | PARQUET | true +| Home | 1807 | 1 | 232.56KB | NOT CACHED | PARQUET | true +| Jewelry | 1740 | 1 | 223.72KB | NOT CACHED | PARQUET | true +| Men | 1811 | 1 | 231.25KB | NOT CACHED | PARQUET | true +| Music | 1860 | 1 | 237.90KB | NOT CACHED | PARQUET | true +| Shoes | 1835 | 1 | 234.90KB | NOT CACHED | PARQUET | true +| Sports | 1783 | 1 | 227.97KB | NOT CACHED | PARQUET | true +| Women | 1790 | 1 | 226.27KB | NOT CACHED | PARQUET | true +| Total | 17957 | 11 | 2.65MB | 0B | | ++-------------+-------+--------+----------+--------------+---------+------------------ +</codeblock> + + <p conref="../shared/impala_common.xml#common/file_format_blurb"/> + + <p> + The <codeph>COMPUTE STATS</codeph> statement works with tables created with any of the file formats supported + by Impala. See <xref href="impala_file_formats.xml#file_formats"/> for details about working with the + different file formats. The following considerations apply to <codeph>COMPUTE STATS</codeph> depending on the + file format of the table. + </p> + + <p> + The <codeph>COMPUTE STATS</codeph> statement works with text tables with no restrictions. These tables can be + created through either Impala or Hive. + </p> + + <p> + The <codeph>COMPUTE STATS</codeph> statement works with Parquet tables. These tables can be created through + either Impala or Hive. 
+ </p> + + <p> + The <codeph>COMPUTE STATS</codeph> statement works with Avro tables without restriction in CDH 5.4 / Impala 2.2 + and higher. In earlier releases, <codeph>COMPUTE STATS</codeph> worked only for Avro tables created through Hive, + and required the <codeph>CREATE TABLE</codeph> statement to use SQL-style column names and types rather than an + Avro-style schema specification. + </p> + + <p> + The <codeph>COMPUTE STATS</codeph> statement works with RCFile tables with no restrictions. These tables can + be created through either Impala or Hive. + </p> + + <p> + The <codeph>COMPUTE STATS</codeph> statement works with SequenceFile tables with no restrictions. These + tables can be created through either Impala or Hive. + </p> + + <p> + The <codeph>COMPUTE STATS</codeph> statement works with partitioned tables, whether all the partitions use + the same file format, or some partitions are defined through <codeph>ALTER TABLE</codeph> to use different + file formats. + </p> + + <p conref="../shared/impala_common.xml#common/ddl_blurb"/> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_maybe"/> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p conref="../shared/impala_common.xml#common/decimal_no_stats"/> + + <note conref="../shared/impala_common.xml#common/compute_stats_nulls"/> + + <p conref="../shared/impala_common.xml#common/internals_blurb"/> + <p> + Behind the scenes, the <codeph>COMPUTE STATS</codeph> statement + executes two statements: one to count the rows of each partition + in the table (or the entire table if unpartitioned) through the + <codeph>COUNT(*)</codeph> function, + and another to count the approximate number of distinct values + in each column through the <codeph>NDV()</codeph> function. + You might see these queries in your monitoring and diagnostic displays. 
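For a hypothetical table <codeph>t1</codeph> with columns <codeph>id</codeph> and <codeph>s</codeph>, the
+ statements that <codeph>COMPUTE STATS</codeph> runs are roughly equivalent to:
+ </p>
+
+<codeblock>-- One query counts the rows (per partition, for a partitioned table):
+select count(*) from t1;
+-- Another query estimates the number of distinct values in each column:
+select ndv(id), ndv(s) from t1;</codeblock>
+
+ <p>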
The same factors that affect the performance, scalability, and
+ execution of other queries (such as parallel execution, memory usage,
+ admission control, and timeouts) also apply to the queries run by the
+ <codeph>COMPUTE STATS</codeph> statement.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/permissions_blurb"/>
+ <p rev="CDH-19187">
+ The user ID that the <cmdname>impalad</cmdname> daemon runs under,
+ typically the <codeph>impala</codeph> user, must have read
+ permission for all affected files in the source directory:
+ all the files in the table, whether it is partitioned or not, in the
+ case of <codeph>COMPUTE STATS</codeph>; or all the files in partitions
+ without incremental stats in the case of
+ <codeph>COMPUTE INCREMENTAL STATS</codeph>.
+ It must also have read and execute permissions for all
+ relevant directories holding the data files.
+ (Essentially, <codeph>COMPUTE STATS</codeph> requires the
+ same permissions as the underlying <codeph>SELECT</codeph> queries it runs
+ against the table.)
+ </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_drop_stats.xml#drop_stats"/>, <xref href="impala_show.xml#show_table_stats"/>, + <xref href="impala_show.xml#show_column_stats"/>, <xref href="impala_perf_stats.xml#perf_stats"/> + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_concepts.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_concepts.xml b/docs/topics/impala_concepts.xml new file mode 100644 index 0000000..74c1016 --- /dev/null +++ b/docs/topics/impala_concepts.xml @@ -0,0 +1,296 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="concepts"> + + <title>Impala Concepts and Architecture</title> + <titlealts audience="PDF"><navtitle>Concepts and Architecture</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Concepts"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Stub Pages"/> + </metadata> + </prolog> + + <conbody> + <draft-comment author="-dita-use-conref-target" audience="integrated" + conref="../shared/cdh_cm_common.xml#id_dgz_rhr_kv/draft-comment-test"/> + + <p> + The following sections provide background information to help you become productive using Impala and + its features. Where appropriate, the explanations include context to help understand how aspects of Impala + relate to other technologies you might already be familiar with, such as relational database management + systems and data warehouses, or other Hadoop components such as Hive, HDFS, and HBase. + </p> + + <p outputclass="toc"/> + </conbody> + +<!-- These other topics are waiting to be filled in. Could become subtopics or top-level topics depending on the depth of coverage in each case. 
--> + + <concept id="intro_data_lifecycle" audience="Cloudera"> + + <title>Overview of the Data Lifecycle for Impala</title> + + <conbody/> + </concept> + + <concept id="intro_etl" audience="Cloudera"> + + <title>Overview of the Extract, Transform, Load (ETL) Process for Impala</title> + <prolog> + <metadata> + <data name="Category" value="ETL"/> + <data name="Category" value="Ingest"/> + <data name="Category" value="Concepts"/> + </metadata> + </prolog> + + <conbody/> + </concept> + + <concept id="intro_hadoop_data" audience="Cloudera"> + + <title>How Impala Works with Hadoop Data Files</title> + + <conbody/> + </concept> + + <concept id="intro_web_ui" audience="Cloudera"> + + <title>Overview of the Impala Web Interface</title> + + <conbody/> + </concept> + + <concept id="intro_bi" audience="Cloudera"> + + <title>Using Impala with Business Intelligence Tools</title> + + <conbody/> + </concept> + + <concept id="intro_ha" audience="Cloudera"> + + <title>Overview of Impala Availability and Fault Tolerance</title> + + <conbody/> + </concept> + +<!-- This is pretty much ready to go. Decide if it should go under "Concepts" or "Performance", + and if it should be split out into a separate file, and then take out the audience= attribute + to make it visible. +--> + + <concept id="intro_llvm" audience="Cloudera"> + + <title>Overview of Impala Runtime Code Generation</title> + + <conbody> + +<!-- Adapted from the CIDR15 paper written by the Impala team. --> + + <p> + Impala uses <term>LLVM</term> (a compiler library and collection of related tools) to perform just-in-time + (JIT) compilation within the running <cmdname>impalad</cmdname> process. This runtime code generation + technique improves query execution times by generating native code optimized for the architecture of each + host in your particular cluster. Performance gains of 5 times or more are typical for representative + workloads. 
+      </p>
+
+      <p>
+        Impala uses runtime code generation to produce query-specific versions of functions that are critical to
+        performance. In particular, code generation is applied to <term>inner loop</term> functions, that is, those
+        that are executed many times (for every tuple) in a given query, and thus constitute a large portion of the
+        total time the query takes to execute. For example, when Impala scans a data file, it calls a function to
+        parse each record into Impala's in-memory tuple format. For queries scanning large tables, billions of
+        records could result in billions of function calls. This function must therefore be extremely efficient for
+        good query performance, and removing even a few instructions from each function call can result in large
+        query speedups.
+      </p>
+
+      <p>
+        Overall, JIT compilation has an effect similar to writing custom code to process a query. For example, it
+        eliminates branches, unrolls loops, propagates constants, offsets and pointers, and inlines functions.
+        Inlining is especially valuable for functions used internally to evaluate expressions, where the function
+        call itself is more expensive than the function body (for example, a function that adds two numbers).
+        Inlining functions also increases instruction-level parallelism, and allows the compiler to make further
+        optimizations such as subexpression elimination across expressions.
+      </p>
+
+      <p>
+        Impala generates runtime query code automatically, so you do not need to do anything special to get this
+        performance benefit. This technique is most effective for complex and long-running queries that process
+        large numbers of rows. If you need to issue a series of short, small queries, you might turn off this
+        feature to avoid the overhead of compilation time for each query. In this case, issue the statement
+        <codeph>SET DISABLE_CODEGEN=true</codeph> to turn off runtime code generation for the duration of the
+        current session.
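+      </p>
+
+      <p>
+        For example (a sketch, assuming a hypothetical table <codeph>t1</codeph>),
+        you might bracket a burst of short queries in an interactive session
+        like this:
+      </p>
+<codeblock>SET DISABLE_CODEGEN=true;  -- avoid JIT compilation overhead for tiny queries
+SELECT count(*) FROM t1 WHERE x = 1;
+SELECT count(*) FROM t1 WHERE x = 2;
+SET DISABLE_CODEGEN=false; -- restore code generation for long-running queries
+</codeblock>
+      <p>
+        The option remains in effect only for the current session.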
+ </p> + +<!-- + <p> + Without code generation, + functions tend to be suboptimal + to handle situations that cannot be predicted in advance. + For example, + a record-parsing function that + only handles integer types will be faster at parsing an integer-only file + than a function that handles other data types + such as strings and floating-point numbers. + However, the schemas of the files to + be scanned are unknown at compile time, + and so a general-purpose function must be used, even if at runtime + it is known that more limited functionality is sufficient. + </p> + + <p> + A source of large runtime overheads are virtual functions. Virtual function calls incur a large performance + penalty, particularly when the called function is very simple, as the calls cannot be inlined. + If the type of the object instance is known at runtime, we can use code generation to replace the virtual + function call with a call directly to the correct function, which can then be inlined. This is especially + valuable when evaluating expression trees. In Impala (as in many systems), expressions are composed of a + tree of individual operators and functions. + </p> + + <p> + Each type of expression that can appear in a query is implemented internally by overriding a virtual function. + Many of these expression functions are quite simple, for example, adding two numbers. + The virtual function call can be more expensive than the function body itself. By resolving the virtual + function calls with code generation and then inlining the resulting function calls, Impala can evaluate expressions + directly with no function call overhead. Inlining functions also increases + instruction-level parallelism, and allows the compiler to make further optimizations such as subexpression + elimination across expressions. + </p> +--> + </conbody> + </concept> + +<!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. 
--> + + <concept audience="Cloudera" id="intro_io"> + + <title>Overview of Impala I/O</title> + + <conbody> + + <p> + Efficiently retrieving data from HDFS is a challenge for all SQL-on-Hadoop systems. To perform + data scans from both disk and memory at or near hardware speed, Impala uses an HDFS feature called + <term>short-circuit local reads</term> to bypass the DataNode protocol when reading from local disk. Impala + can read at almost disk bandwidth (approximately 100 MB/s per disk) and is typically able to saturate all + available disks. For example, with 12 disks, Impala is typically capable of sustaining I/O at 1.2 GB/sec. + Furthermore, <term>HDFS caching</term> allows Impala to access memory-resident data at memory bus speed, + and saves CPU cycles as there is no need to copy or checksum data blocks within memory. + </p> + + <p> + The I/O manager component interfaces with storage devices to read and write data. I/O manager assigns a + fixed number of worker threads per physical disk (currently one thread per rotational disk and eight per + SSD), providing an asynchronous interface to clients (<term>scanner threads</term>). + </p> + </conbody> + </concept> + +<!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. --> + +<!-- Although good idea to get some answers from Henry first. --> + + <concept audience="Cloudera" id="intro_state_distribution"> + + <title>State distribution</title> + + <conbody> + + <p> + As a massively parallel database that can run on hundreds of nodes, Impala must coordinate and synchronize + its metadata across the entire cluster. Impala's symmetric-node architecture means that any node can accept + and execute queries, and thus each node needs up-to-date versions of the system catalog and a knowledge of + which hosts the <cmdname>impalad</cmdname> daemons run on. 
To avoid the overhead of TCP connections and + remote procedure calls to retrieve metadata during query planning, Impala implements a simple + publish-subscribe service called the <term>statestore</term> to push metadata changes to a set of + subscribers (the <cmdname>impalad</cmdname> daemons running on all the DataNodes). + </p> + + <p> + The statestore maintains a set of topics, which are arrays of <codeph>(<varname>key</varname>, + <varname>value</varname>, <varname>version</varname>)</codeph> triplets called <term>entries</term> where + <varname>key</varname> and <varname>value</varname> are byte arrays, and <varname>version</varname> is a + 64-bit integer. A topic is defined by an application, and so the statestore has no understanding of the + contents of any topic entry. Topics are persistent through the lifetime of the statestore, but are not + persisted across service restarts. Processes that receive updates to any topic are called + <term>subscribers</term>, and express their interest by registering with the statestore at startup and + providing a list of topics. The statestore responds to registration by sending the subscriber an initial + topic update for each registered topic, which consists of all the entries currently in that topic. + </p> + +<!-- Henry: OK, but in practice, what is in these topic messages for Impala? --> + + <p> + After registration, the statestore periodically sends two kinds of messages to each subscriber. The first + kind of message is a topic update, and consists of all changes to a topic (new entries, modified entries + and deletions) since the last update was successfully sent to the subscriber. Each subscriber maintains a + per-topic most-recent-version identifier which allows the statestore to only send the delta between + updates. In response to a topic update, each subscriber sends a list of changes it intends to make to its + subscribed topics. 
Those changes are guaranteed to have been applied by the time the next update is + received. + </p> + + <p> + The second kind of statestore message is a <term>heartbeat</term>, formerly sometimes called + <term>keepalive</term>. The statestore uses heartbeat messages to maintain the connection to each + subscriber, which would otherwise time out its subscription and attempt to re-register. + </p> + + <p> + Prior to Impala 2.0, both kinds of communication were combined in a single kind of message. Because these + messages could be very large in instances with thousands of tables, partitions, data files, and so on, + Impala 2.0 and higher divides the types of messages so that the small heartbeat pings can be transmitted + and acknowledged quickly, increasing the reliability of the statestore mechanism that detects when Impala + nodes become unavailable. + </p> + + <p> + If the statestore detects a failed subscriber (for example, by repeated failed heartbeat deliveries), it + stops sending updates to that node. +<!-- Henry: what are examples of these transient topic entries? --> + Some topic entries are marked as transient, meaning that if their owning subscriber fails, they are + removed. + </p> + + <p> + Although the asynchronous nature of this mechanism means that metadata updates might take some time to + propagate across the entire cluster, that does not affect the consistency of query planning or results. + Each query is planned and coordinated by a particular node, so as long as the coordinator node is aware of + the existence of the relevant tables, data files, and so on, it can distribute the query work to other + nodes even if those other nodes have not received the latest metadata updates. +<!-- Henry: need another example here of what's in a topic, e.g. is it the list of available tables? 
+-->
+<!--
+        For example, query planning is performed on a single node based on the
+        catalog metadata topic, and once a full plan has been computed, all information required to execute that
+        plan is distributed directly to the executing nodes.
+        There is no requirement that an executing node should
+        know about the same version of the catalog metadata topic.
+-->
+      </p>
+
+      <p>
+        We have found that the statestore process with default settings scales well to medium-sized clusters, and
+        can serve our largest deployments with some configuration changes.
+<!-- Henry: elaborate on the configuration changes. -->
+      </p>
+
+      <p>
+<!-- Henry: other examples like load information? How is load information used? -->
+        The statestore does not persist any metadata to disk: all current metadata is pushed to the statestore by
+        its subscribers (for example, load information). Therefore, should a statestore restart, its state can be
+        recovered during the initial subscriber registration phase. If the machine that the statestore is
+        running on fails, a new statestore process can be started elsewhere, and subscribers can fail over to it.
+        There is no built-in failover mechanism in Impala; instead, deployments commonly use a retargetable DNS
+        entry to force subscribers to automatically move to the new process instance.
+<!-- Henry: translate that last sentence into instructions / guidelines.
+-->
+      </p>
+    </conbody>
+  </concept>
+</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_conditional_functions.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_conditional_functions.xml b/docs/topics/impala_conditional_functions.xml new file mode 100644 index 0000000..23de779 --- /dev/null +++ b/docs/topics/impala_conditional_functions.xml @@ -0,0 +1,443 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="conditional_functions">
+
+  <title>Impala Conditional Functions</title>
+  <titlealts audience="PDF"><navtitle>Conditional Functions</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Functions"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Data Analysts"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Querying"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      Impala supports the following conditional functions for testing equality, comparison operators, and nullity:
+    </p>
+
+    <dl>
+      <dlentry id="case">
+
+        <dt>
+          <codeph>CASE a WHEN b THEN c [WHEN d THEN e]... [ELSE f] END</codeph>
+        </dt>
+
+        <dd>
+          <indexterm audience="Cloudera">CASE expression</indexterm>
+          <b>Purpose:</b> Compares an expression to one or more possible values, and returns a corresponding result
+          when a match is found.
+          <p conref="../shared/impala_common.xml#common/return_same_type"/>
+          <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+          <p>
+            In this form of the <codeph>CASE</codeph> expression, the initial value <codeph>a</codeph>
+            being evaluated for each row is typically a column reference, or an expression involving
+            a column.
This form can only compare against a set of specified values, not ranges, + multi-value comparisons such as <codeph>BETWEEN</codeph> or <codeph>IN</codeph>, + regular expressions, or <codeph>NULL</codeph>. + </p> + <p conref="../shared/impala_common.xml#common/example_blurb"/> + <p> + Although this example is split across multiple lines, you can put any or all parts of a <codeph>CASE</codeph> expression + on a single line, with no punctuation or other separators between the <codeph>WHEN</codeph>, + <codeph>ELSE</codeph>, and <codeph>END</codeph> clauses. + </p> +<codeblock>select case x + when 1 then 'one' + when 2 then 'two' + when 0 then 'zero' + else 'out of range' + end + from t1; +</codeblock> + </dd> + + </dlentry> + + <dlentry id="case2"> + + <dt> + <codeph>CASE WHEN a THEN b [WHEN c THEN d]... [ELSE e] END</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">CASE expression</indexterm> + <b>Purpose:</b> Tests whether any of a sequence of expressions is true, and returns a corresponding + result for the first true expression. + <p conref="../shared/impala_common.xml#common/return_same_type"/> + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + <p> + <codeph>CASE</codeph> expressions without an initial test value have more flexibility. + For example, they can test different columns in different <codeph>WHEN</codeph> clauses, + or use comparison operators such as <codeph>BETWEEN</codeph>, <codeph>IN</codeph> and <codeph>IS NULL</codeph> + rather than comparing against discrete values. + </p> + <p> + <codeph>CASE</codeph> expressions are often the foundation of long queries that + summarize and format results for easy-to-read reports. For example, you might + use a <codeph>CASE</codeph> function call to turn values from a numeric column + into category strings corresponding to integer values, or labels such as <q>Small</q>, + <q>Medium</q> and <q>Large</q> based on ranges. 
Then subsequent parts of the + query might aggregate based on the transformed values, such as how many + values are classified as small, medium, or large. You can also use <codeph>CASE</codeph> + to signal problems with out-of-bounds values, <codeph>NULL</codeph> values, + and so on. + </p> + <p> + By using operators such as <codeph>OR</codeph>, <codeph>IN</codeph>, + <codeph>REGEXP</codeph>, and so on in <codeph>CASE</codeph> expressions, + you can build extensive tests and transformations into a single query. + Therefore, applications that construct SQL statements often rely heavily on <codeph>CASE</codeph> + calls in the generated SQL code. + </p> + <p> + Because this flexible form of the <codeph>CASE</codeph> expressions allows you to perform + many comparisons and call multiple functions when evaluating each row, be careful applying + elaborate <codeph>CASE</codeph> expressions to queries that process large amounts of data. + For example, when practical, evaluate and transform values through <codeph>CASE</codeph> + after applying operations such as aggregations that reduce the size of the result set; + transform numbers to strings after performing joins with the original numeric values. + </p> + <p conref="../shared/impala_common.xml#common/example_blurb"/> + <p> + Although this example is split across multiple lines, you can put any or all parts of a <codeph>CASE</codeph> expression + on a single line, with no punctuation or other separators between the <codeph>WHEN</codeph>, + <codeph>ELSE</codeph>, and <codeph>END</codeph> clauses. 
+          </p>
+<codeblock>select case
+         when dayname(now()) in ('Saturday','Sunday') then 'result undefined on weekends'
+         when x > y then 'x greater than y'
+         when x = y then 'x and y are equal'
+         when x is null or y is null then 'one of the columns is null'
+         else null
+       end
+  from t1;
+</codeblock>
+        </dd>
+
+      </dlentry>
+
+      <dlentry id="coalesce">
+
+        <dt>
+          <codeph>coalesce(type v1, type v2, ...)</codeph>
+        </dt>
+
+        <dd>
+          <indexterm audience="Cloudera">coalesce() function</indexterm>
+          <b>Purpose:</b> Returns the first specified argument that is not <codeph>NULL</codeph>, or
+          <codeph>NULL</codeph> if all arguments are <codeph>NULL</codeph>.
+          <p conref="../shared/impala_common.xml#common/return_same_type"/>
+        </dd>
+
+      </dlentry>
+
+      <dlentry rev="2.0.0" id="decode">
+
+        <dt>
+          <codeph>decode(type expression, type search1, type result1 [, type search2, type result2 ...] [, type
+          default] )</codeph>
+        </dt>
+
+        <dd>
+          <indexterm audience="Cloudera">decode() function</indexterm>
+          <b>Purpose:</b> Compares an expression to one or more possible values, and returns a corresponding result
+          when a match is found.
+          <p conref="../shared/impala_common.xml#common/return_same_type"/>
+          <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+          <p>
+            Can be used as shorthand for a <codeph>CASE</codeph> expression.
+          </p>
+          <p>
+            The original expression and the search expressions must be of the same type or convertible types. The
+            result expression can be a different type, but all result expressions must be of the same type.
+          </p>
+          <p>
+            Returns a successful match if the original expression is <codeph>NULL</codeph> and a search expression
+            is also <codeph>NULL</codeph>.
+          </p>
+          <p>
+            Returns <codeph>NULL</codeph> if the final <codeph>default</codeph> value is omitted and none of the
+            search expressions match the original expression.
+ </p> + <p conref="../shared/impala_common.xml#common/example_blurb"/> + <p> + The following example translates numeric day values into descriptive names: + </p> +<codeblock>SELECT event, decode(day_of_week, 1, "Monday", 2, "Tuesday", 3, "Wednesday", + 4, "Thursday", 5, "Friday", 6, "Saturday", 7, "Sunday", "Unknown day") + FROM calendar; +</codeblock> + </dd> + + </dlentry> + + <dlentry id="if"> + + <dt> + <codeph>if(boolean condition, type ifTrue, type ifFalseOrNull)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">if() function</indexterm> + <b>Purpose:</b> Tests an expression and returns a corresponding result depending on whether the result is + true, false, or <codeph>NULL</codeph>. + <p> + <b>Return type:</b> Same as the <codeph>ifTrue</codeph> argument value + </p> + </dd> + + </dlentry> + + <dlentry rev="1.3.0" id="ifnull"> + + <dt> + <codeph>ifnull(type a, type ifNull)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">isnull() function</indexterm> + <b>Purpose:</b> Alias for the <codeph>isnull()</codeph> function, with the same behavior. To simplify + porting SQL with vendor extensions to Impala. + <p conref="../shared/impala_common.xml#common/added_in_130"/> + </dd> + + </dlentry> + + <dlentry id="isfalse" rev="2.2.0"> + + <dt> + <codeph>isfalse(<varname>boolean</varname>)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">isfalse() function</indexterm> + <b>Purpose:</b> Tests if a Boolean expression is <codeph>false</codeph> or not. + Returns <codeph>true</codeph> if so. + If the argument is <codeph>NULL</codeph>, returns <codeph>false</codeph>. + Identical to <codeph>isnottrue()</codeph>, except it returns the opposite value for a <codeph>NULL</codeph> argument. 
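+          <p>
+            For example (hypothetical literal arguments):
+          </p>
+<codeblock>SELECT isfalse(false);  -- returns true
+SELECT isfalse(true);   -- returns false
+SELECT isfalse(null);   -- returns false
+</codeblock>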
+ <p conref="../shared/impala_common.xml#common/return_type_boolean"/> + <p conref="../shared/impala_common.xml#common/added_in_220"/> + </dd> + + </dlentry> + + <dlentry id="isnotfalse" rev="2.2.0"> + + <dt> + <codeph>isnotfalse(<varname>boolean</varname>)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">isnotfalse() function</indexterm> + <b>Purpose:</b> Tests if a Boolean expression is not <codeph>false</codeph> (that is, either <codeph>true</codeph> or <codeph>NULL</codeph>). + Returns <codeph>true</codeph> if so. + If the argument is <codeph>NULL</codeph>, returns <codeph>true</codeph>. + Identical to <codeph>istrue()</codeph>, except it returns the opposite value for a <codeph>NULL</codeph> argument. + <p conref="../shared/impala_common.xml#common/return_type_boolean"/> + <p conref="../shared/impala_common.xml#common/for_compatibility_only"/> + <p conref="../shared/impala_common.xml#common/added_in_220"/> + </dd> + + </dlentry> + + <dlentry id="isnottrue" rev="2.2.0"> + + <dt> + <codeph>isnottrue(<varname>boolean</varname>)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">isnottrue() function</indexterm> + <b>Purpose:</b> Tests if a Boolean expression is not <codeph>true</codeph> (that is, either <codeph>false</codeph> or <codeph>NULL</codeph>). + Returns <codeph>true</codeph> if so. + If the argument is <codeph>NULL</codeph>, returns <codeph>true</codeph>. + Identical to <codeph>isfalse()</codeph>, except it returns the opposite value for a <codeph>NULL</codeph> argument. + <p conref="../shared/impala_common.xml#common/return_type_boolean"/> + <p conref="../shared/impala_common.xml#common/added_in_220"/> + </dd> + + </dlentry> + + <dlentry id="isnull"> + + <dt> + <codeph>isnull(type a, type ifNull)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">isnull() function</indexterm> + <b>Purpose:</b> Tests if an expression is <codeph>NULL</codeph>, and returns the expression result value + if not. 
If the first argument is <codeph>NULL</codeph>, returns the second argument. + <p> + <b>Compatibility notes:</b> Equivalent to the <codeph>nvl()</codeph> function from Oracle Database or + <codeph>ifnull()</codeph> from MySQL. The <codeph>nvl()</codeph> and <codeph>ifnull()</codeph> + functions are also available in Impala. + </p> + <p> + <b>Return type:</b> Same as the first argument value + </p> + </dd> + + </dlentry> + + <dlentry id="istrue" rev="2.2.0"> + + <dt> + <codeph>istrue(<varname>boolean</varname>)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">istrue() function</indexterm> + <b>Purpose:</b> Tests if a Boolean expression is <codeph>true</codeph> or not. + Returns <codeph>true</codeph> if so. + If the argument is <codeph>NULL</codeph>, returns <codeph>false</codeph>. + Identical to <codeph>isnotfalse()</codeph>, except it returns the opposite value for a <codeph>NULL</codeph> argument. + <p conref="../shared/impala_common.xml#common/return_type_boolean"/> + <p conref="../shared/impala_common.xml#common/for_compatibility_only"/> + <p conref="../shared/impala_common.xml#common/added_in_220"/> + </dd> + + </dlentry> + + <dlentry id="nonnullvalue" rev="2.2.0"> + + <dt> + <codeph>nonnullvalue(<varname>expression</varname>)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">function</indexterm> + <b>Purpose:</b> Tests if an expression (of any type) is <codeph>NULL</codeph> or not. + Returns <codeph>false</codeph> if so. + The converse of <codeph>nullvalue()</codeph>. 
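+          <p>
+            For example (hypothetical literal arguments):
+          </p>
+<codeblock>SELECT nonnullvalue(5);     -- returns true
+SELECT nonnullvalue(null);  -- returns false
+</codeblock>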
+ <p conref="../shared/impala_common.xml#common/return_type_boolean"/> + <p conref="../shared/impala_common.xml#common/for_compatibility_only"/> + <p conref="../shared/impala_common.xml#common/added_in_220"/> + </dd> + + </dlentry> + + <dlentry rev="1.3.0" id="nullif"> + + <dt> + <codeph>nullif(<varname>expr1</varname>,<varname>expr2</varname>)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">nullif() function</indexterm> + <b>Purpose:</b> Returns <codeph>NULL</codeph> if the two specified arguments are equal. If the specified + arguments are not equal, returns the value of <varname>expr1</varname>. The data types of the expressions + must be compatible, according to the conversion rules from <xref href="impala_datatypes.xml#datatypes"/>. + You cannot use an expression that evaluates to <codeph>NULL</codeph> for <varname>expr1</varname>; that + way, you can distinguish a return value of <codeph>NULL</codeph> from an argument value of + <codeph>NULL</codeph>, which would never match <varname>expr2</varname>. + <p> + <b>Usage notes:</b> This function is effectively shorthand for a <codeph>CASE</codeph> expression of + the form: + </p> +<codeblock>CASE + WHEN <varname>expr1</varname> = <varname>expr2</varname> THEN NULL + ELSE <varname>expr1</varname> +END</codeblock> + <p> + It is commonly used in division expressions, to produce a <codeph>NULL</codeph> result instead of a + divide-by-zero error when the divisor is equal to zero: + </p> +<codeblock>select 1.0 / nullif(c1,0) as reciprocal from t1;</codeblock> + <p> + You might also use it for compatibility with other database systems that support the same + <codeph>NULLIF()</codeph> function. 
+ </p> + <p conref="../shared/impala_common.xml#common/return_same_type"/> + <p conref="../shared/impala_common.xml#common/added_in_130"/> + </dd> + + </dlentry> + + <dlentry rev="1.3.0" id="nullifzero"> + + <dt> + <codeph>nullifzero(<varname>numeric_expr</varname>)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">nullifzero() function</indexterm> + <b>Purpose:</b> Returns <codeph>NULL</codeph> if the numeric expression evaluates to 0, otherwise returns + the result of the expression. + <p> + <b>Usage notes:</b> Used to avoid error conditions such as divide-by-zero in numeric calculations. + Serves as shorthand for a more elaborate <codeph>CASE</codeph> expression, to simplify porting SQL with + vendor extensions to Impala. + </p> + <p conref="../shared/impala_common.xml#common/return_same_type"/> + <p conref="../shared/impala_common.xml#common/added_in_130"/> + </dd> + + </dlentry> + + <dlentry id="nullvalue" rev="2.2.0"> + + <dt> + <codeph>nullvalue(<varname>expression</varname>)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">function</indexterm> + <b>Purpose:</b> Tests if an expression (of any type) is <codeph>NULL</codeph> or not. + Returns <codeph>true</codeph> if so. + The converse of <codeph>nonnullvalue()</codeph>. + <p conref="../shared/impala_common.xml#common/return_type_boolean"/> + <p conref="../shared/impala_common.xml#common/for_compatibility_only"/> + <p conref="../shared/impala_common.xml#common/added_in_220"/> + </dd> + + </dlentry> + + <dlentry id="nvl" rev="1.1"> + + <dt> + <codeph>nvl(type a, type ifNull)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">nvl() function</indexterm> + <b>Purpose:</b> Alias for the <codeph>isnull()</codeph> function. Tests if an expression is + <codeph>NULL</codeph>, and returns the expression result value if not. If the first argument is + <codeph>NULL</codeph>, returns the second argument. 
Equivalent to the <codeph>nvl()</codeph> function + from Oracle Database or <codeph>ifnull()</codeph> from MySQL. + <p> + <b>Return type:</b> Same as the first argument value + </p> + <p conref="../shared/impala_common.xml#common/added_in_11"/> + </dd> + + </dlentry> + + <dlentry rev="1.3.0" id="zeroifnull"> + + <dt> + <codeph>zeroifnull(<varname>numeric_expr</varname>)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">zeroifnull() function</indexterm> + <b>Purpose:</b> Returns 0 if the numeric expression evaluates to <codeph>NULL</codeph>, otherwise returns + the result of the expression. + <p> + <b>Usage notes:</b> Used to avoid unexpected results due to unexpected propagation of + <codeph>NULL</codeph> values in numeric calculations. Serves as shorthand for a more elaborate + <codeph>CASE</codeph> expression, to simplify porting SQL with vendor extensions to Impala. + </p> + <p conref="../shared/impala_common.xml#common/return_same_type"/> + <p conref="../shared/impala_common.xml#common/added_in_130"/> + </dd> + + </dlentry> + </dl> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_config.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_config.xml b/docs/topics/impala_config.xml new file mode 100644 index 0000000..7ea82e5 --- /dev/null +++ b/docs/topics/impala_config.xml @@ -0,0 +1,57 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="config"> + + <title>Managing Impala</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Configuring"/> + <data name="Category" value="JDBC"/> + <data name="Category" value="ODBC"/> + <data name="Category" value="Stub Pages"/> + </metadata> + </prolog> + + <conbody> + + <p> + This section explains how to configure Impala to 
accept connections from applications that use popular + programming APIs: + </p> + + <ul> + <li> + <xref href="impala_config_performance.xml#config_performance"/> + </li> + + <li> + <xref href="impala_odbc.xml#impala_odbc"/> + </li> + + <li> + <xref href="impala_jdbc.xml#impala_jdbc"/> + </li> + </ul> + + <p> + This type of configuration is especially useful when using Impala in combination with Business Intelligence + tools, which use these standard interfaces to query different kinds of database and Big Data systems. + </p> + + <p> + You can also configure these other aspects of Impala: + </p> + + <ul> + <li> + <xref href="impala_security.xml#security"/> + </li> + + <li> + <xref href="impala_config_options.xml#config_options"/> + </li> + </ul> + </conbody> +</concept>
