http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_new_features.html ---------------------------------------------------------------------- diff --git a/docs/build3x/html/topics/impala_new_features.html b/docs/build3x/html/topics/impala_new_features.html new file mode 100644 index 0000000..cd1ecc5 --- /dev/null +++ b/docs/build3x/html/topics/impala_new_features.html @@ -0,0 +1,3806 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_release_notes.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="new_features"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>New Features in Apache Impala</title></head><body id="new_features"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1"><span class="ph">New Features in Apache Impala</span></h1> + + + + <div class="body conbody"> + + <p class="p"> + This release of Impala contains the following changes and enhancements from previous releases. 
+ </p> + + <p class="p toc inpage"></p> + + </div> + + + <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_release_notes.html">Impala Release Notes</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="new_features__new_features_300"> + <h2 class="title topictitle2" id="ariaid-title2">New Features in <span class="keyword">Impala 3.0</span></h2> + <div class="body conbody"> + <p class="p"> + For the full list of issues closed in this release, including the + issues marked as <span class="q">"new features"</span> or <span class="q">"improvements"</span>, see the + <a class="xref" href="https://impala.apache.org/docs/changelog-3.0.html" target="_blank">changelog for <span class="keyword">Impala 3.0</span></a>. + </p> + </div> + </article> + + + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="new_features__new_features_2120"> + + <h2 class="title topictitle2" id="ariaid-title3">New Features in <span class="keyword">Impala 2.12</span></h2> + + <div class="body conbody"> + + <p class="p"> + For the full list of issues closed in this release, including the issues + marked as <span class="q">"new features"</span> or <span class="q">"improvements"</span>, see the + <a class="xref" href="https://impala.apache.org/docs/changelog-2.12.html" target="_blank">changelog for <span class="keyword">Impala 2.12</span></a>. 
+ </p> + + </div> + </article> + + + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="new_features__new_features_2110"> + + <h2 class="title topictitle2" id="ariaid-title4">New Features in <span class="keyword">Impala 2.11</span></h2> + + <div class="body conbody"> + + <p class="p"> + For the full list of issues closed in this release, including the issues + marked as <span class="q">"new features"</span> or <span class="q">"improvements"</span>, see the + <a class="xref" href="https://impala.apache.org/docs/changelog-2.11.html" target="_blank">changelog for <span class="keyword">Impala 2.11</span></a>. + </p> + + </div> + </article> + + + + <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="new_features__new_features_2100"> + + <h2 class="title topictitle2" id="ariaid-title5">New Features in <span class="keyword">Impala 2.10</span></h2> + + <div class="body conbody"> + + <p class="p"> + For the full list of issues closed in this release, including the issues + marked as <span class="q">"new features"</span> or <span class="q">"improvements"</span>, see the + <a class="xref" href="https://impala.apache.org/docs/changelog-2.10.html" target="_blank">changelog for <span class="keyword">Impala 2.10</span></a>. + </p> + + </div> + </article> + + + + <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="new_features__new_features_290"> + + <h2 class="title topictitle2" id="ariaid-title6">New Features in <span class="keyword">Impala 2.9</span></h2> + + <div class="body conbody"> + + <p class="p"> + For the full list of issues closed in this release, including the issues + marked as <span class="q">"new features"</span> or <span class="q">"improvements"</span>, see the + <a class="xref" href="https://impala.apache.org/docs/changelog-2.9.html" target="_blank">changelog for <span class="keyword">Impala 2.9</span></a>. 
+ </p> + + <p class="p"> + The following are some of the most significant new features in this release: + </p> + + <ul class="ul" id="new_features_290__feature_list"> + <li class="li"> + <p class="p"> + A new function, <code class="ph codeph">replace()</code>, which is faster than + <code class="ph codeph">regexp_replace()</code> for simple string substitutions. + See <a class="xref" href="impala_string_functions.html">Impala String Functions</a> for details. + </p> + </li> + <li class="li"> + <p class="p"> + Startup flags for the <span class="keyword cmdname">impalad</span> daemon, <code class="ph codeph">is_executor</code> + and <code class="ph codeph">is_coordinator</code>, let you divide the work on a large, busy cluster + between a small number of hosts acting as query coordinators, and a larger number of + hosts acting as query executors. By default, each host can act in both roles, + potentially introducing bottlenecks during heavily concurrent workloads. + See <a class="xref" href="impala_scalability.html">Scalability Considerations for Impala</a> for details. + </p> + </li> + </ul> + + </div> + </article> + + + + <article class="topic concept nested1" aria-labelledby="ariaid-title7" id="new_features__new_features_280"> + + <h2 class="title topictitle2" id="ariaid-title7">New Features in <span class="keyword">Impala 2.8</span></h2> + + <div class="body conbody"> + + <ul class="ul" id="new_features_280__feature_list"> + <li class="li"> + <p class="p"> + Performance and scalability improvements: + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + The <code class="ph codeph">COMPUTE STATS</code> statement can + take advantage of multithreading. + </p> + </li> + <li class="li"> + <p class="p"> + Improved scalability for highly concurrent loads by reducing the possibility of TCP/IP timeouts. + A configuration setting, <code class="ph codeph">accepted_cnxn_queue_depth</code>, can be adjusted upwards to + avoid this type of timeout on large clusters. 
+ </p> + </li> + <li class="li"> + <p class="p"> + Several performance improvements were made to the mechanism for generating native code: + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + Some queries involving analytic functions can take better advantage of native code generation. + </p> + </li> + <li class="li"> + <p class="p"> + Modules produced during intermediate code generation are organized + to be easier to cache and reuse during the lifetime of a long-running or complicated query. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">COMPUTE STATS</code> statement is more efficient + (less time for the codegen phase) for tables with a large number + of columns, especially for tables containing <code class="ph codeph">TIMESTAMP</code> + columns. + </p> + </li> + <li class="li"> + <p class="p"> + The logic for determining whether or not to use a runtime filter is more reliable, and the + evaluation process itself is faster because of native code generation. + </p> + </li> + </ul> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">MT_DOP</code> query option enables + multithreading for a number of Impala operations. + <code class="ph codeph">COMPUTE STATS</code> statements for Parquet tables + use a default of <code class="ph codeph">MT_DOP=4</code> to improve the + intra-node parallelism and CPU efficiency of this data-intensive + operation. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">COMPUTE STATS</code> statement is more efficient + (less time for the codegen phase) for tables with a large number + of columns. + </p> + </li> + <li class="li"> + <p class="p"> + A new hint, <code class="ph codeph">CLUSTERED</code>, + allows Impala <code class="ph codeph">INSERT</code> operations on a Parquet table + that use dynamic partitioning to process a high number of + partitions in a single statement. 
The data is ordered based on the + partition key columns, and each partition is only written + by a single host, reducing the amount of memory needed to buffer + Parquet data while the data blocks are being constructed. + </p> + </li> + <li class="li"> + <p class="p"> + The new configuration setting <code class="ph codeph">inc_stats_size_limit_bytes</code> + lets you reduce the load on the catalog server when running the + <code class="ph codeph">COMPUTE INCREMENTAL STATS</code> statement for very large tables. + </p> + </li> + <li class="li"> + <p class="p"> + Impala folds many constant expressions within query statements, + rather than evaluating them for each row. This optimization + is especially useful when using functions to manipulate and + format <code class="ph codeph">TIMESTAMP</code> values, such as the result + of an expression such as <code class="ph codeph">to_date(now() - interval 1 day)</code>. + </p> + </li> + <li class="li"> + <p class="p"> + Parsing of complicated expressions is faster. This speedup is + especially useful for queries containing large <code class="ph codeph">CASE</code> + expressions. + </p> + </li> + <li class="li"> + <p class="p"> + Evaluation is faster for <code class="ph codeph">IN</code> operators with many constant + arguments. The same performance improvement applies to other functions + with many constant arguments. + </p> + </li> + <li class="li"> + <p class="p"> + Impala optimizes identical comparison operators within multiple <code class="ph codeph">OR</code> + blocks. + </p> + </li> + <li class="li"> + <p class="p"> + The reporting for wall-clock times and total CPU time in profile output is more accurate. + </p> + </li> + <li class="li"> + <p class="p"> + A new query option, <code class="ph codeph">SCRATCH_LIMIT</code>, lets you restrict the amount of + space used when a query exceeds the memory limit and activates the <span class="q">"spill to disk"</span> mechanism. 
+ This option helps you avoid runaway queries or make queries <span class="q">"fail fast"</span> if they require more + memory than anticipated. You can prevent runaway queries from using excessive amounts of spill space, + without restarting the cluster to turn the spilling feature off entirely. + See <a class="xref" href="impala_scratch_limit.html#scratch_limit">SCRATCH_LIMIT Query Option</a> for details. + </p> + </li> + </ul> + </li> + <li class="li"> + <p class="p"> + Integration with Apache Kudu: + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + The experimental Impala support for the Kudu storage layer has been folded + into the main Impala development branch. Impala can now directly access Kudu tables, + opening up new capabilities such as enhanced DML operations and continuous ingestion. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">DELETE</code> statement is a flexible way to remove data from a Kudu table. Previously, + removing data from an Impala table involved removing or rewriting the underlying data files, dropping entire partitions, + or rewriting the entire table. This Impala statement only works for Kudu tables. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">UPDATE</code> statement is a flexible way to modify data within a Kudu table. Previously, + updating data in an Impala table involved replacing the underlying data files, dropping entire partitions, + or rewriting the entire table. This Impala statement only works for Kudu tables. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">UPSERT</code> statement is a flexible way to ingest data, modify data, or both within a Kudu table. Previously, + ingesting data that might contain duplicates involved an inefficient multi-stage operation, and there was no + built-in protection against duplicate data. 
The <code class="ph codeph">UPSERT</code> statement, in combination with + the primary key designation for Kudu tables, lets you add or replace rows in a single operation, and + automatically avoids creating any duplicate data. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">CREATE TABLE</code> statement gains some new clauses that are specific to Kudu tables: + <code class="ph codeph">PARTITION BY</code>, <code class="ph codeph">PARTITIONS</code>, <code class="ph codeph">STORED AS KUDU</code>, and column + attributes <code class="ph codeph">PRIMARY KEY</code>, <code class="ph codeph">NULL</code> and <code class="ph codeph">NOT NULL</code>, + <code class="ph codeph">ENCODING</code>, <code class="ph codeph">COMPRESSION</code>, <code class="ph codeph">DEFAULT</code>, and <code class="ph codeph">BLOCK_SIZE</code>. + These clauses replace the explicit <code class="ph codeph">TBLPROPERTIES</code> settings that were required in the + early experimental phases of integration between Impala and Kudu. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">ALTER TABLE</code> statement can change certain attributes of Kudu tables. + You can add, drop, or rename columns. + You can add or drop range partitions. + You can change the <code class="ph codeph">TBLPROPERTIES</code> value to rename or point to a different underlying Kudu table, + independently from the Impala table name in the metastore database. + You cannot change the data type of an existing column in a Kudu table. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">SHOW PARTITIONS</code> statement displays information about the distribution of data + between partitions in Kudu tables. A new variation, <code class="ph codeph">SHOW RANGE PARTITIONS</code>, + displays information about the Kudu-specific partitions that apply across ranges of key values. 
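+ </p>
+ <p class="p">
+ As a minimal sketch of how the Kudu-specific clauses and statements described in this list might fit together,
+ assume a hypothetical table named <code class="ph codeph">kudu_events</code>. The table name, column names,
+ and range boundaries are illustrative assumptions, not values taken from this release:
+ </p>
+<pre class="pre codeblock"><code>-- Hypothetical Kudu table; all names and boundaries are illustrative only.
+CREATE TABLE kudu_events (
+  id BIGINT,
+  name STRING,
+  PRIMARY KEY (id)
+)
+PARTITION BY RANGE (id) (
+  PARTITION VALUES &lt; 1000,
+  PARTITION 1000 &lt;= VALUES &lt; 2000
+)
+STORED AS KUDU;
+
+-- The first statement adds a row; the second targets the same primary key.
+UPSERT INTO kudu_events VALUES (1, 'first');
+UPSERT INTO kudu_events VALUES (1, 'replaced');
+
+-- Display the range partitions defined above.
+SHOW RANGE PARTITIONS kudu_events;
+</code></pre>
+ <p class="p">
+ Because <code class="ph codeph">id</code> is the primary key in this sketch, the second
+ <code class="ph codeph">UPSERT</code> replaces the existing row rather than creating a duplicate.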
+ </p> + </li> + <li class="li"> + <p class="p"> + Not all Impala data types are supported in Kudu tables. In particular, currently the Impala + <code class="ph codeph">TIMESTAMP</code> type is not allowed in a Kudu table. Impala does not recognize the + <code class="ph codeph">UNIXTIME_MICROS</code> Kudu type when it is present in a Kudu table. (These two + representations of date/time data use different units and are not directly compatible.) + You cannot create columns of type <code class="ph codeph">TIMESTAMP</code>, <code class="ph codeph">DECIMAL</code>, + <code class="ph codeph">VARCHAR</code>, or <code class="ph codeph">CHAR</code> within a Kudu table. Within a query, you can + cast values in a result set to these types. Certain types, such as <code class="ph codeph">BOOLEAN</code>, + cannot be used as primary key columns. + </p> + </li> + <li class="li"> + <p class="p"> + Currently, Kudu tables are not interchangeable between Impala and Hive the way other kinds of Impala tables are. + Although the metadata for Kudu tables is stored in the metastore database, currently Hive cannot access Kudu tables. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">INSERT</code> statement works for Kudu tables. The organization + of the Kudu data makes it more efficient than with HDFS-backed tables to insert + data in small batches, such as with the <code class="ph codeph">INSERT ... VALUES</code> syntax. + </p> + </li> + <li class="li"> + <p class="p"> + Some audit data is recorded for data governance purposes. + All <code class="ph codeph">UPDATE</code>, <code class="ph codeph">DELETE</code>, and <code class="ph codeph">UPSERT</code> statements are characterized + as <code class="ph codeph">INSERT</code> operations in the audit log. Currently, lineage metadata is not generated for + <code class="ph codeph">UPDATE</code> and <code class="ph codeph">DELETE</code> operations on Kudu tables. 
+ </p> + </li> + <li class="li"> + <div class="p"> + Currently, Kudu tables have limited support for Sentry: + <ul class="ul"> + <li class="li"> + <p class="p"> + Access to Kudu tables must be granted to roles as usual. + </p> + </li> + <li class="li"> + <p class="p"> + Currently, access to a Kudu table through Sentry is <span class="q">"all or nothing"</span>. + You cannot enforce finer-grained permissions such as at the column level, + or permissions on certain operations such as <code class="ph codeph">INSERT</code>. + </p> + </li> + <li class="li"> + <p class="p"> + Only users with <code class="ph codeph">ALL</code> privileges on <code class="ph codeph">SERVER</code> can create external Kudu tables. + </p> + </li> + </ul> + Because non-SQL APIs can access Kudu data without going through Sentry + authorization, currently the Sentry support is considered preliminary. + </div> + </li> + <li class="li"> + <p class="p"> + Equality and <code class="ph codeph">IN</code> predicates in Impala queries are pushed to + Kudu and evaluated efficiently by the Kudu storage layer. + </p> + </li> + </ul> + </li> + <li class="li"> + <p class="p"> + <strong class="ph b">Security:</strong> + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + Impala can take advantage of the S3 encrypted credential + store, to avoid exposing the secret key when accessing + data stored on S3. + </p> + </li> + </ul> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">REFRESH</code> statement now updates information about HDFS block locations. + Therefore, you can perform a fast and efficient <code class="ph codeph">REFRESH</code> after doing an HDFS + rebalancing operation instead of the more expensive <code class="ph codeph">INVALIDATE METADATA</code> statement. 
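+ </p>
+ <p class="p">
+ For instance, after an HDFS rebalancing operation that moved blocks belonging to a hypothetical table named
+ <code class="ph codeph">sales_data</code> (an illustrative name only), a sketch of the two alternatives might look like this:
+ </p>
+<pre class="pre codeblock"><code>-- Pick up the new block locations for this one table.
+REFRESH sales_data;
+
+-- The more expensive alternative, no longer needed for this case:
+-- INVALIDATE METADATA sales_data;
+</code></pre>
+ <p class="p">
+ The commented-out statement is shown only for comparison with the <code class="ph codeph">REFRESH</code> form.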
+ </p> + </li> + <li class="li"> + <p class="p"> + [<a class="xref" href="https://issues.apache.org/jira/browse/IMPALA-1654" target="_blank">IMPALA-1654</a>] + Several kinds of DDL operations + can now work on a range of partitions. The partitions can be specified + using operators such as <code class="ph codeph">&lt;</code>, <code class="ph codeph">&gt;=</code>, and + <code class="ph codeph">!=</code> rather than just an equality predicate applying to a single + partition. + This new feature extends the syntax of several clauses + of the <code class="ph codeph">ALTER TABLE</code> statement + (<code class="ph codeph">DROP PARTITION</code>, <code class="ph codeph">SET [UN]CACHED</code>, + <code class="ph codeph">SET FILEFORMAT | SERDEPROPERTIES | TBLPROPERTIES</code>), + the <code class="ph codeph">SHOW FILES</code> statement, and the + <code class="ph codeph">COMPUTE INCREMENTAL STATS</code> statement. + It does not apply to statements that are defined to only apply to a single + partition, such as <code class="ph codeph">LOAD DATA</code>, <code class="ph codeph">ALTER TABLE ... ADD PARTITION</code>, + <code class="ph codeph">SET LOCATION</code>, and <code class="ph codeph">INSERT</code> with a static + partitioning clause. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">instr()</code> function has optional second and third arguments, representing + the character position at which to begin searching for the substring, and which occurrence + of the substring to find. + </p> + </li> + <li class="li"> + <p class="p"> + Improved error handling for malformed Avro data. In particular, incorrect + precision or scale for <code class="ph codeph">DECIMAL</code> types is now handled. 
+ </p> + </li> + <li class="li"> + <p class="p"> + Impala debug web UI: + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + In addition to <span class="q">"inflight"</span> and <span class="q">"finished"</span> queries, the web UI + now also includes a section for <span class="q">"queued"</span> queries. + </p> + </li> + <li class="li"> + <p class="p"> + The <span class="ph uicontrol">/sessions</span> tab now clarifies how many of the displayed + sessions are active, and lets you sort by <span class="ph uicontrol">Expired</span> status + to distinguish active sessions from expired ones. + </p> + </li> + </ul> + </li> + <li class="li"> + <p class="p"> + Improved stability when DDL operations such as <code class="ph codeph">CREATE DATABASE</code> + or <code class="ph codeph">DROP DATABASE</code> are run in Hive at the same time as an Impala + <code class="ph codeph">INVALIDATE METADATA</code> statement. + </p> + </li> + <li class="li"> + <p class="p"> + The <span class="q">"out of memory"</span> error report was made more user-friendly, with additional + diagnostic information to help identify the spot where the memory limit was exceeded. + </p> + </li> + <li class="li"> + <p class="p"> + Improved disk space usage for Java-based UDFs. Temporary copies of the associated JAR + files are removed when no longer needed, so that they do not accumulate across restarts + of the <span class="keyword cmdname">catalogd</span> daemon and potentially cause an out-of-space condition. + These temporary files are also created in the directory specified by the <code class="ph codeph">local_library_dir</code> + configuration setting, so that the storage for these temporary files can be independent + from any capacity limits on the <span class="ph filepath">/tmp</span> filesystem. 
+ </p> + </li> + </ul> + + </div> + </article> + + + + <article class="topic concept nested1" aria-labelledby="ariaid-title8" id="new_features__new_features_270"> + + <h2 class="title topictitle2" id="ariaid-title8">New Features in <span class="keyword">Impala 2.7</span></h2> + + <div class="body conbody"> + + <ul class="ul" id="new_features_270__feature_list"> + <li class="li"> + <p class="p"> + Performance improvements: + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + [<a class="xref" href="https://issues.apache.org/jira/browse/IMPALA-3206" target="_blank">IMPALA-3206</a>] + Speedup for queries against <code class="ph codeph">DECIMAL</code> columns in Avro tables. + The code that parses <code class="ph codeph">DECIMAL</code> values from Avro now uses + native code generation. + </p> + </li> + <li class="li"> + <p class="p"> + [<a class="xref" href="https://issues.apache.org/jira/browse/IMPALA-3674" target="_blank">IMPALA-3674</a>] + Improved efficiency in LLVM code generation can reduce codegen time, especially + for short queries. + </p> + </li> + + <li class="li"> + <p class="p"> + [<a class="xref" href="https://issues.apache.org/jira/browse/IMPALA-2979" target="_blank">IMPALA-2979</a>] + Improvements to scheduling on worker nodes, + enabled by the <code class="ph codeph">REPLICA_PREFERENCE</code> query option. + See <a class="xref" href="impala_replica_preference.html#replica_preference">REPLICA_PREFERENCE Query Option (Impala 2.7 or higher only)</a> for details. + </p> + </li> + </ul> + </li> + + <li class="li"> + <p class="p"> + [<a class="xref" href="https://issues.apache.org/jira/browse/IMPALA-1683" target="_blank">IMPALA-1683</a>] + The <code class="ph codeph">REFRESH</code> statement can be applied to a single partition, + rather than the entire table. 
See <a class="xref" href="impala_refresh.html#refresh">REFRESH Statement</a> + and <a class="xref" href="impala_partitioning.html#partition_refresh">Refreshing a Single Partition</a> for details. + </p> + </li> + <li class="li"> + <p class="p"> + Improvements to the Impala web user interface: + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + [<a class="xref" href="https://issues.apache.org/jira/browse/IMPALA-2767" target="_blank">IMPALA-2767</a>] + You can now force a session to expire by clicking a link in the web UI, + on the <span class="ph uicontrol">/sessions</span> tab. + </p> + </li> + <li class="li"> + <p class="p"> + [<a class="xref" href="https://issues.apache.org/jira/browse/IMPALA-3715" target="_blank">IMPALA-3715</a>] + The <span class="ph uicontrol">/memz</span> tab includes more information about + Impala memory usage. + </p> + </li> + <li class="li"> + <p class="p"> + [<a class="xref" href="https://issues.apache.org/jira/browse/IMPALA-3716" target="_blank">IMPALA-3716</a>] + The <span class="ph uicontrol">Details</span> page for a query now includes + a <span class="ph uicontrol">Memory</span> tab. + </p> + </li> + </ul> + </li> + <li class="li"> + <p class="p"> + [<a class="xref" href="https://issues.apache.org/jira/browse/IMPALA-3499" target="_blank">IMPALA-3499</a>] + Scalability improvements to the catalog server. Impala handles internal communication + more efficiently for tables with large numbers of columns and partitions, where the + size of the metadata exceeds 2 GiB. + </p> + </li> + <li class="li"> + <p class="p"> + [<a class="xref" href="https://issues.apache.org/jira/browse/IMPALA-3677" target="_blank">IMPALA-3677</a>] + You can send a <code class="ph codeph">SIGUSR1</code> signal to any Impala-related daemon to write a + Breakpad minidump. For advanced troubleshooting, you can now produce a minidump + without triggering a crash. 
See <a class="xref" href="impala_breakpad.html#breakpad">Breakpad Minidumps for Impala (Impala 2.6 or higher only)</a> for + details about the Breakpad minidump feature. + </p> + </li> + <li class="li"> + <p class="p"> + [<a class="xref" href="https://issues.apache.org/jira/browse/IMPALA-3687" target="_blank">IMPALA-3687</a>] + The schema reconciliation rules for Avro tables have changed slightly + for <code class="ph codeph">CHAR</code> and <code class="ph codeph">VARCHAR</code> columns. Now, if + the definition of such a column is changed in the Avro schema file, + the column retains its <code class="ph codeph">CHAR</code> or <code class="ph codeph">VARCHAR</code> + type as specified in the SQL definition, but the column name and comment + from the Avro schema file take precedence. + See <a class="xref" href="impala_avro.html#avro_create_table">Creating Avro Tables</a> for details about + column definitions in Avro tables. + </p> + </li> + <li class="li"> + <p class="p"> + [<a class="xref" href="https://issues.apache.org/jira/browse/IMPALA-3575" target="_blank">IMPALA-3575</a>] + Some network + operations now have additional timeout and retry settings. The extra + configuration helps avoid failed queries for transient network + problems, to avoid hangs when a sender or receiver fails in the + middle of a network transmission, and to make cancellation requests + more reliable despite network issues. 
</p> + </li> + </ul> + + </div> + </article> + + + <article class="topic concept nested1" aria-labelledby="ariaid-title9" id="new_features__new_features_260"> + + <h2 class="title topictitle2" id="ariaid-title9">New Features in <span class="keyword">Impala 2.6</span></h2> + + <div class="body conbody"> + + <ul class="ul"> + <li class="li"> + <p class="p"> + Improvements to Impala support for the Amazon S3 filesystem: + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + Impala can now write to S3 tables through the <code class="ph codeph">INSERT</code> + or <code class="ph codeph">LOAD DATA</code> statements. + See <a class="xref" href="impala_s3.html#s3">Using Impala with the Amazon S3 Filesystem</a> for general information about + using Impala with S3. + </p> + </li> + <li class="li"> + <p class="p"> + A new query option, <code class="ph codeph">S3_SKIP_INSERT_STAGING</code>, lets you + trade off between fast <code class="ph codeph">INSERT</code> performance and + slower <code class="ph codeph">INSERT</code>s that are more consistent if a + problem occurs during the statement. The new behavior is enabled by default. + See <a class="xref" href="impala_s3_skip_insert_staging.html#s3_skip_insert_staging">S3_SKIP_INSERT_STAGING Query Option (Impala 2.6 or higher only)</a> for details + about this option. + </p> + </li> + </ul> + </li> + <li class="li"> + <p class="p"> + Performance improvements for the runtime filtering feature: + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + The default for the <code class="ph codeph">RUNTIME_FILTER_MODE</code> + query option is changed to <code class="ph codeph">GLOBAL</code> (the highest setting). + See <a class="xref" href="impala_runtime_filter_mode.html#runtime_filter_mode">RUNTIME_FILTER_MODE Query Option (Impala 2.5 or higher only)</a> for + details about this option. 
+ </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">RUNTIME_BLOOM_FILTER_SIZE</code> setting is now only used + as a fallback if statistics are not available; otherwise, Impala + uses the statistics to estimate the appropriate size to use for each filter. + See <a class="xref" href="impala_runtime_bloom_filter_size.html#runtime_bloom_filter_size">RUNTIME_BLOOM_FILTER_SIZE Query Option (Impala 2.5 or higher only)</a> for + details about this option. + </p> + </li> + <li class="li"> + <p class="p"> + New query options <code class="ph codeph">RUNTIME_FILTER_MIN_SIZE</code> and + <code class="ph codeph">RUNTIME_FILTER_MAX_SIZE</code> let you fine-tune + the sizes of the Bloom filter structures used for runtime filtering. + If the filter size derived from Impala internal estimates or from + the <code class="ph codeph">RUNTIME_BLOOM_FILTER_SIZE</code> setting falls outside the size + range specified by these options, any too-small filter size is adjusted + to the minimum, and any too-large filter size is adjusted to the maximum. + See <a class="xref" href="impala_runtime_filter_min_size.html#runtime_filter_min_size">RUNTIME_FILTER_MIN_SIZE Query Option (Impala 2.6 or higher only)</a> + and <a class="xref" href="impala_runtime_filter_max_size.html#runtime_filter_max_size">RUNTIME_FILTER_MAX_SIZE Query Option (Impala 2.6 or higher only)</a> + for details about these options. + </p> + </li> + <li class="li"> + <p class="p"> + Runtime filter propagation now applies to all the + operands of <code class="ph codeph">UNION</code> and <code class="ph codeph">UNION ALL</code> + operators. + </p> + </li> + <li class="li"> + <p class="p"> + Runtime filters can now be produced during join queries even + when the join processing activates the spill-to-disk mechanism. 
+ </p> + </li> + </ul> + See <a class="xref" href="impala_runtime_filtering.html#runtime_filtering">Runtime Filtering for Impala Queries (Impala 2.5 or higher only)</a> for + general information about the runtime filtering feature. + </li> + + <li class="li"> + <p class="p"> + Admission control and dynamic resource pools are enabled by default. + See <a class="xref" href="impala_admission.html#admission_control">Admission Control and Query Queuing</a> for details + about admission control. + </p> + </li> + + <li class="li"> + <p class="p"> + You can now manually set column statistics, + using the <code class="ph codeph">ALTER TABLE</code> statement with a + <code class="ph codeph">SET COLUMN STATS</code> clause. + See <a class="xref" href="impala_perf_stats.html#perf_column_stats_manual">impala_perf_stats.html#perf_column_stats_manual</a> for details. + </p> + </li> + <li class="li"> + <p class="p"> + Impala can now write lightweight <span class="q">"minidump"</span> files, rather + than large core files, to save diagnostic information when + any of the Impala-related daemons crashes. This feature uses the + open source <code class="ph codeph">breakpad</code> framework. + See <a class="xref" href="impala_breakpad.html#breakpad">Breakpad Minidumps for Impala (Impala 2.6 or higher only)</a> for details. + </p> + </li> + <li class="li"> + <div class="p"> + New query options improve interoperability with Parquet files: + <ul class="ul"> + <li class="li"> + <p class="p"> + The <code class="ph codeph">PARQUET_FALLBACK_SCHEMA_RESOLUTION</code> query option + lets Impala locate columns within Parquet files based on + column name rather than ordinal position. + This enhancement improves interoperability with applications + that write Parquet files with a different order or subset of + columns than are used in the Impala table. 
+ See <a class="xref" href="impala_parquet_fallback_schema_resolution.html#parquet_fallback_schema_resolution">PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (Impala 2.6 or higher only)</a> + for details. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">PARQUET_ANNOTATE_STRINGS_UTF8</code> query option + makes Impala include the <code class="ph codeph">UTF-8</code> annotation + metadata for <code class="ph codeph">STRING</code>, <code class="ph codeph">CHAR</code>, + and <code class="ph codeph">VARCHAR</code> columns in Parquet files created + by <code class="ph codeph">INSERT</code> or <code class="ph codeph">CREATE TABLE AS SELECT</code> + statements. + See <a class="xref" href="impala_parquet_annotate_strings_utf8.html#parquet_annotate_strings_utf8">PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (Impala 2.6 or higher only)</a> + for details. + </p> + </li> + </ul> + See <a class="xref" href="impala_parquet.html#parquet">Using the Parquet File Format with Impala Tables</a> for general information about working + with Parquet files. + </div> + </li> + <li class="li"> + <p class="p"> + Improvements to security and reduction in overhead for secure clusters: + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + Overall performance improvements for secure clusters. + (TPC-H queries on a secure cluster were benchmarked + at roughly 3x as fast as the previous release.) + </p> + </li> + <li class="li"> + <p class="p"> + Impala now recognizes the <code class="ph codeph">auth_to_local</code> setting, + specified through the HDFS configuration setting + <code class="ph codeph">hadoop.security.auth_to_local</code>. + This feature is disabled by default; to enable it, + specify <code class="ph codeph">--load_auth_to_local_rules=true</code> + in the <span class="keyword cmdname">impalad</span> configuration settings. 
+ See <a class="xref" href="impala_kerberos.html#auth_to_local">Mapping Kerberos Principals to Short Names for Impala</a> for details. + </p> + </li> + <li class="li"> + <p class="p"> + Timing improvements in the mechanism for the <span class="keyword cmdname">impalad</span> + daemon to acquire Kerberos tickets. This feature spreads out the overhead + on the KDC during Impala startup, especially for large clusters. + </p> + </li> + <li class="li"> + <p class="p"> + For Kerberized clusters, the Catalog service now uses + the Kerberos principal instead of the operating system user that runs + the <span class="keyword cmdname">catalogd</span> daemon. + This eliminates the requirement to configure a <code class="ph codeph">hadoop.user.group.static.mapping.overrides</code> + setting to put the OS user into the Sentry administrative group, on clusters where the principal + and the OS user name for this user are different. + </p> + </li> + </ul> + </li> + <li class="li"> + <p class="p"> + Overall performance improvements for join queries, by using a prefetching mechanism + while building the in-memory hash table to evaluate join predicates. + See <a class="xref" href="impala_prefetch_mode.html#prefetch_mode">PREFETCH_MODE Query Option (Impala 2.6 or higher only)</a> for the query option + to control this optimization. + </p> + </li> + <li class="li"> + <p class="p"> + The <span class="keyword cmdname">impala-shell</span> interpreter has a new command, + <code class="ph codeph">SOURCE</code>, that lets you run a set of SQL statements + or other <span class="keyword cmdname">impala-shell</span> commands stored in a file. + You can run additional <code class="ph codeph">SOURCE</code> commands from inside + a file, to set up flexible sequences of statements for use cases + such as schema setup, ETL, or reporting. 
+ See <a class="xref" href="impala_shell_commands.html#shell_commands">impala-shell Command Reference</a> for details + and <a class="xref" href="impala_shell_running_commands.html#shell_running_commands">Running Commands and SQL Statements in impala-shell</a> + for examples. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">millisecond()</code> built-in function lets you extract + the fractional seconds part of a <code class="ph codeph">TIMESTAMP</code> value. + See <a class="xref" href="impala_datetime_functions.html#datetime_functions">Impala Date and Time Functions</a> for details. + </p> + </li> + <li class="li"> + <p class="p"> + If an Avro table is created without column definitions in the + <code class="ph codeph">CREATE TABLE</code> statement, and columns are later + added through <code class="ph codeph">ALTER TABLE</code>, the resulting + table is now queryable. Missing values from the newly added + columns now default to <code class="ph codeph">NULL</code>. + See <a class="xref" href="impala_avro.html#avro">Using the Avro File Format with Impala Tables</a> for general details about + working with Avro files. + </p> + </li> + <li class="li"> + <div class="p"> + The mechanism for interpreting <code class="ph codeph">DECIMAL</code> literals is + improved, no longer going through an intermediate conversion step + to <code class="ph codeph">DOUBLE</code>: + <ul class="ul"> + <li class="li"> + <p class="p"> + Casting a <code class="ph codeph">DECIMAL</code> value to <code class="ph codeph">TIMESTAMP</code> + produces a more precise + value for the <code class="ph codeph">TIMESTAMP</code> than formerly. + </p> + </li> + <li class="li"> + <p class="p"> + Certain function calls involving <code class="ph codeph">DECIMAL</code> literals + now succeed, when formerly they failed due to lack of a function + signature with a <code class="ph codeph">DOUBLE</code> argument. 
+ </p> + </li> + <li class="li"> + <p class="p"> + Faster runtime performance for <code class="ph codeph">DECIMAL</code> constant + values, through improved native code generation for all combinations + of precision and scale. + </p> + </li> + </ul> + See <a class="xref" href="impala_decimal.html#decimal">DECIMAL Data Type (Impala 3.0 or higher only)</a> for details about the <code class="ph codeph">DECIMAL</code> type. + </div> + </li> + <li class="li"> + <p class="p"> + Improved type accuracy for <code class="ph codeph">CASE</code> return values. + If all <code class="ph codeph">WHEN</code> clauses of the <code class="ph codeph">CASE</code> + expression are of <code class="ph codeph">CHAR</code> type, the final result + is also <code class="ph codeph">CHAR</code> instead of being converted to + <code class="ph codeph">STRING</code>. + See <a class="xref" href="impala_conditional_functions.html#conditional_functions">Impala Conditional Functions</a> + for details about the <code class="ph codeph">CASE</code> function. + </p> + </li> + <li class="li"> + <p class="p"> + Uncorrelated queries using the <code class="ph codeph">NOT EXISTS</code> operator + are now supported. Formerly, the <code class="ph codeph">NOT EXISTS</code> + operator was only available for correlated subqueries. + </p> + </li> + <li class="li"> + <p class="p"> + Improved performance for reading Parquet files. + </p> + </li> + <li class="li"> + <p class="p"> + Improved performance for <dfn class="term">top-N</dfn> queries, that is, + those including both <code class="ph codeph">ORDER BY</code> and + <code class="ph codeph">LIMIT</code> clauses. + </p> + </li> + + <li class="li"> + <p class="p"> + Impala optionally skips an arbitrary number of header lines from text input + files on HDFS based on the <code class="ph codeph">skip.header.line.count</code> value + in the <code class="ph codeph">TBLPROPERTIES</code> field of the table metadata. 
+ See <a class="xref" href="impala_txtfile.html#text_data_files">Data Files for Text Tables</a> for details. + </p> + </li> + <li class="li"> + <p class="p"> + Trailing comments are now allowed in queries processed by + the <span class="keyword cmdname">impala-shell</span> options <code class="ph codeph">-q</code> + and <code class="ph codeph">-f</code>. + </p> + </li> + <li class="li"> + <p class="p"> + Impala can run <code class="ph codeph">COUNT</code> queries for RCFile tables + that include complex type columns. + See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for + general information about working with complex types, + and <a class="xref" href="impala_array.html#array">ARRAY Complex Type (Impala 2.3 or higher only)</a>, + <a class="xref" href="impala_map.html#map">MAP Complex Type (Impala 2.3 or higher only)</a>, and <a class="xref" href="impala_struct.html#struct">STRUCT Complex Type (Impala 2.3 or higher only)</a> + for syntax details of each type. + </p> + </li> + </ul> + + </div> + </article> + + + + <article class="topic concept nested1" aria-labelledby="ariaid-title10" id="new_features__new_features_250"> + + <h2 class="title topictitle2" id="ariaid-title10">New Features in <span class="keyword">Impala 2.5</span></h2> + + <div class="body conbody"> + + <ul class="ul"> + <li class="li"> + <p class="p"> + Dynamic partition pruning. When a query refers to a partition key column in a <code class="ph codeph">WHERE</code> + clause, and the exact set of column values is not known until the query is executed, + Impala evaluates the predicate and skips the I/O for entire partitions that are not needed. + For example, if a table was partitioned by year, Impala would apply this technique to a query + such as <code class="ph codeph">SELECT c1 FROM partitioned_table WHERE year = (SELECT MAX(year) FROM other_table)</code>. 
+ <span class="ph">See <a class="xref" href="impala_partitioning.html#dynamic_partition_pruning">Dynamic Partition Pruning</a> for details.</span> + </p> + <p class="p"> + The dynamic partition pruning optimization technique lets Impala avoid reading + data files from partitions that are not part of the result set, even when + that determination cannot be made in advance. This technique is especially valuable + when performing join queries involving partitioned tables. For example, if a join + query includes an <code class="ph codeph">ON</code> clause and a <code class="ph codeph">WHERE</code> clause + that refer to the same columns, the query can find the set of column values that + match the <code class="ph codeph">WHERE</code> clause, and only scan the associated partitions + when evaluating the <code class="ph codeph">ON</code> clause. + </p> + <p class="p"> + Dynamic partition pruning is controlled by the same settings as the runtime filtering feature. + By default, this feature is enabled at a medium level, because the maximum setting can use + slightly more memory for queries than in previous releases. + To fully enable this feature, set the query option <code class="ph codeph">RUNTIME_FILTER_MODE=GLOBAL</code>. + </p> + </li> + <li class="li"> + <p class="p"> + Runtime filtering. This is a wide-ranging set of optimizations that are especially valuable for join queries. + Using the same technique as with dynamic partition pruning, + Impala uses the predicates from <code class="ph codeph">WHERE</code> and <code class="ph codeph">ON</code> clauses + to determine the subset of column values from one of the joined tables that could possibly be part of the + result set. Impala sends a compact representation of the filter condition to the hosts in the cluster, + instead of the full set of values or the entire table. 
+ <span class="ph">See <a class="xref" href="impala_runtime_filtering.html#runtime_filtering">Runtime Filtering for Impala Queries (Impala 2.5 or higher only)</a> for details.</span> + </p> + <p class="p"> + By default, this feature is enabled at a medium level, because the maximum setting can use + slightly more memory for queries than in previous releases. + To fully enable this feature, set the query option <code class="ph codeph">RUNTIME_FILTER_MODE=GLOBAL</code>. + <span class="ph">See <a class="xref" href="impala_runtime_filter_mode.html#runtime_filter_mode">RUNTIME_FILTER_MODE Query Option (Impala 2.5 or higher only)</a> for details.</span> + </p> + <p class="p"> + This feature involves some new query options: + <a class="xref" href="impala_runtime_filter_mode.html">RUNTIME_FILTER_MODE</a>, + <a class="xref" href="impala_max_num_runtime_filters.html">MAX_NUM_RUNTIME_FILTERS</a>, + <a class="xref" href="impala_runtime_bloom_filter_size.html">RUNTIME_BLOOM_FILTER_SIZE</a>, + <a class="xref" href="impala_runtime_filter_wait_time_ms.html">RUNTIME_FILTER_WAIT_TIME_MS</a>, + and <a class="xref" href="impala_disable_row_runtime_filtering.html">DISABLE_ROW_RUNTIME_FILTERING</a>. + <span class="ph">See + <a class="xref" href="impala_runtime_filter_mode.html#runtime_filter_mode">RUNTIME_FILTER_MODE</a>, + <a class="xref" href="impala_max_num_runtime_filters.html#max_num_runtime_filters">MAX_NUM_RUNTIME_FILTERS</a>, + <a class="xref" href="impala_runtime_bloom_filter_size.html#runtime_bloom_filter_size">RUNTIME_BLOOM_FILTER_SIZE</a>, + <a class="xref" href="impala_runtime_filter_wait_time_ms.html#runtime_filter_wait_time_ms">RUNTIME_FILTER_WAIT_TIME_MS</a>, and + <a class="xref" href="impala_disable_row_runtime_filtering.html#disable_row_runtime_filtering">DISABLE_ROW_RUNTIME_FILTERING</a> + for details. 
+ </span> + </p> + </li> + <li class="li"> + <p class="p"> + More efficient use of the HDFS caching feature, to avoid + hotspots and bottlenecks that could occur if heavily used + cached data blocks were always processed by the same host. + By default, Impala now randomizes which host processes each cached + HDFS data block, when cached replicas are available on multiple hosts. + (Remember to use the <code class="ph codeph">WITH REPLICATION</code> clause with the + <code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">ALTER TABLE</code> statement + when enabling HDFS caching for a table or partition, to cache the same + data blocks across multiple hosts.) + The new query option <code class="ph codeph">SCHEDULE_RANDOM_REPLICA</code> + + lets you fine-tune the interaction with HDFS caching even more. + <span class="ph">See <a class="xref" href="impala_perf_hdfs_caching.html#hdfs_caching">Using HDFS Caching with Impala (Impala 2.1 or higher only)</a> for details.</span> + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">TRUNCATE TABLE</code> statement now accepts an <code class="ph codeph">IF EXISTS</code> + clause, making <code class="ph codeph">TRUNCATE TABLE</code> easier to use in setup or ETL scripts where the table might or + might not exist. + <span class="ph">See <a class="xref" href="impala_truncate_table.html#truncate_table">TRUNCATE TABLE Statement (Impala 2.3 or higher only)</a> for details.</span> + </p> + </li> + <li class="li"> + <div class="p"> + Improved performance and reliability for the <code class="ph codeph">DECIMAL</code> data type: + <ul class="ul"> + <li class="li"> + <p class="p"> + Using <code class="ph codeph">DECIMAL</code> values in a <code class="ph codeph">GROUP BY</code> clause now + triggers the native code generation optimization, speeding up queries that + group by values such as prices. 
+ </p> + </li> + <li class="li"> + <p class="p"> + Checking for overflow in <code class="ph codeph">DECIMAL</code> + multiplication is now substantially faster, making <code class="ph codeph">DECIMAL</code> + a more practical data type in some use cases where formerly <code class="ph codeph">DECIMAL</code> + was much slower than <code class="ph codeph">FLOAT</code> or <code class="ph codeph">DOUBLE</code>. + </p> + </li> + <li class="li"> + <p class="p"> + Multiplying a mixture of <code class="ph codeph">DECIMAL</code> + and <code class="ph codeph">FLOAT</code> or <code class="ph codeph">DOUBLE</code> values now returns + <code class="ph codeph">DOUBLE</code> rather than <code class="ph codeph">DECIMAL</code>. This change avoids + some cases where an intermediate value would underflow or overflow and become + <code class="ph codeph">NULL</code> unexpectedly. + </p> + </li> + </ul> + <span class="ph">See <a class="xref" href="impala_decimal.html">DECIMAL Data Type (Impala 3.0 or higher only)</a> for details.</span> + </div> + </li> + <li class="li"> + <p class="p"> + For UDFs written in Java, or Hive UDFs reused for Impala, + Impala now allows parameters and return values to be primitive types. + Formerly, these were required to be one of the <span class="q">"Writable"</span> + object types. + <span class="ph">See <a class="xref" href="impala_udf.html#udfs_hive">Using Hive UDFs with Impala</a> for details.</span> + </p> + </li> + <li class="li"> + <p class="p"> + Performance improvements for HDFS I/O. Impala now caches HDFS file handles to avoid the + overhead of repeatedly opening the same file. + </p> + </li> + + + <li class="li"> + <p class="p"> + Performance improvements for queries involving nested complex types. + Certain basic query types, such as counting the elements of a complex column, + now use an optimized code path. 
+ </p> + </li> + + <li class="li"> + <p class="p"> + Improvements to the memory reservation mechanism for the Impala + admission control feature. You can specify more settings, such + as the timeout period and maximum aggregate memory used, for each + resource pool instead of globally for the Impala instance. The + default limit for concurrent queries (the <span class="ph uicontrol">max requests</span> + setting) is now unlimited instead of 200. + </p> + </li> + + <li class="li"> + <p class="p"> + Performance improvements related to code generation. + Even in queries where code generation is not performed + for some phases of execution (such as reading data from + Parquet tables), Impala can still use code generation in + other parts of the query, such as evaluating + functions in the <code class="ph codeph">WHERE</code> clause. + </p> + </li> + <li class="li"> + <p class="p"> + Performance improvements for queries using aggregation functions + on high-cardinality columns. + Formerly, Impala could do unnecessary extra work to produce intermediate + results for operations such as <code class="ph codeph">DISTINCT</code> or <code class="ph codeph">GROUP BY</code> + on columns that were unique or had few duplicate values. + Now, Impala decides at run time whether it is more efficient to + do an initial aggregation phase and pass along a smaller set of intermediate data, + or to pass raw intermediate data back to the next phase of query processing to be aggregated there. + This feature is known as <dfn class="term">streaming pre-aggregation</dfn>. + In case of performance regression, this feature can be turned off + using the <code class="ph codeph">DISABLE_STREAMING_PREAGGREGATIONS</code> query option. 
+ <span class="ph">See <a class="xref" href="impala_disable_streaming_preaggregations.html#disable_streaming_preaggregations">DISABLE_STREAMING_PREAGGREGATIONS Query Option (Impala 2.5 or higher only)</a> for details.</span> + </p> + </li> + <li class="li"> + <p class="p"> + Spill-to-disk feature now always recommended. In earlier releases, the spill-to-disk feature + could be turned off using a pair of configuration settings, + <code class="ph codeph">enable_partitioned_aggregation=false</code> and + <code class="ph codeph">enable_partitioned_hash_join=false</code>. + The latest improvements in the spill-to-disk mechanism, and related features that + interact with it, make this feature robust enough that disabling it is now + no longer needed or supported. In particular, some new features in <span class="keyword">Impala 2.5</span> + and higher do not work when the spill-to-disk feature is disabled. + </p> + </li> + <li class="li"> + <p class="p"> + Improvements to scripting capability for the <span class="keyword cmdname">impala-shell</span> command, + through user-specified substitution variables that can appear in statements processed + by <span class="keyword cmdname">impala-shell</span>: + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + The <code class="ph codeph">--var</code> command-line option lets you pass key-value pairs to + <span class="keyword cmdname">impala-shell</span>. The shell can substitute the values + into queries before executing them, where the query text contains the notation + <code class="ph codeph">${var:<var class="keyword varname">varname</var>}</code>. For example, you might prepare a SQL file + containing a set of DDL statements and queries containing variables for + database and table names, and then pass the applicable names as part of the + <code class="ph codeph">impala-shell -f <var class="keyword varname">filename</var></code> command. 
+ <span class="ph">See <a class="xref" href="impala_shell_running_commands.html#shell_running_commands">Running Commands and SQL Statements in impala-shell</a> for details.</span> + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">SET</code> and <code class="ph codeph">UNSET</code> commands within the + <span class="keyword cmdname">impala-shell</span> interpreter now work with user-specified + substitution variables, as well as the built-in query options. + The two kinds of variables are divided in the <code class="ph codeph">SET</code> output. + As with variables defined by the <code class="ph codeph">--var</code> command-line option, + you refer to the user-specified substitution variables in queries by using + the notation <code class="ph codeph">${var:<var class="keyword varname">varname</var>}</code> + in the query text. Because the substitution variables are processed by + <span class="keyword cmdname">impala-shell</span> instead of the <span class="keyword cmdname">impalad</span> + backend, you cannot define your own substitution variables through the + <code class="ph codeph">SET</code> statement in a JDBC or ODBC application. + <span class="ph">See <a class="xref" href="impala_set.html#set">SET Statement</a> for details.</span> + </p> + </li> + </ul> + </li> + <li class="li"> + <p class="p"> + Performance improvements for query startup. Impala better parallelizes certain work + when coordinating plan distribution between <span class="keyword cmdname">impalad</span> instances, which improves + startup time for queries involving tables with many partitions on large clusters, + or complicated queries with many plan fragments. + </p> + </li> + <li class="li"> + <p class="p"> + Performance and scalability improvements for tables with many partitions. 
+ The memory requirements on the coordinator node are reduced, making it substantially + faster and less resource-intensive + to do joins involving several tables with thousands of partitions each. + </p> + </li> + <li class="li"> + <p class="p"> + Whitelisting for access to internal APIs. For applications that need direct access + to Impala APIs, without going through the HiveServer2 or Beeswax interfaces, you can + specify a list of Kerberos users who are allowed to call those APIs. By default, the + <code class="ph codeph">impala</code> and <code class="ph codeph">hdfs</code> users are the only ones authorized + for this kind of access. + Any users not explicitly authorized through the <code class="ph codeph">internal_principals_whitelist</code> + configuration setting are blocked from accessing the APIs. This setting applies to all the + Impala-related daemons, although currently it is primarily used for HDFS to control the + behavior of the catalog server. + </p> + </li> + <li class="li"> + <p class="p"> + Improvements to Impala integration and usability for Hue. (The code changes + are actually on the Hue side.) + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + The list of tables now refreshes dynamically. + </p> + </li> + </ul> + </li> + <li class="li"> + <p class="p"> + Usability improvements for case-insensitive queries. + You can now use the operators <code class="ph codeph">ILIKE</code> and <code class="ph codeph">IREGEXP</code> + to perform case-insensitive wildcard matches or regular expression matches, + rather than explicitly converting column values with <code class="ph codeph">UPPER</code> + or <code class="ph codeph">LOWER</code>. 
+ <span class="ph">See <a class="xref" href="impala_operators.html#ilike">ILIKE Operator</a> and <a class="xref" href="impala_operators.html#iregexp">IREGEXP Operator</a> for details.</span> + </p> + </li> + <li class="li"> + <p class="p"> + Performance and reliability improvements for DDL and insert operations on partitioned tables with a large + number of partitions. Impala only re-evaluates metadata for partitions that are affected by + a DDL operation, not all partitions in the table. While a DDL or insert statement is in progress, + other Impala statements that attempt to modify metadata for the same table wait until the first one + finishes. + </p> + </li> + <li class="li"> + <p class="p"> + Reliability improvements for the <code class="ph codeph">LOAD DATA</code> statement. + Previously, this statement would fail if the source HDFS directory + contained any subdirectories at all. Now, the statement ignores + any hidden subdirectories, for example <span class="ph filepath">_impala_insert_staging</span>. + </p> + </li> + <li class="li"> + <p class="p"> + A new operator, <code class="ph codeph">IS [NOT] DISTINCT FROM</code>, lets you compare values + and always get a <code class="ph codeph">true</code> or <code class="ph codeph">false</code> result, + even if one or both of the values are <code class="ph codeph">NULL</code>. + The <code class="ph codeph">IS NOT DISTINCT FROM</code> operator, or its equivalent + <code class="ph codeph">&lt;=&gt;</code> notation, improves the efficiency of join queries that + treat key values that are <code class="ph codeph">NULL</code> in both tables as equal. + <span class="ph">See <a class="xref" href="impala_operators.html#is_distinct_from">IS DISTINCT FROM Operator</a> for details.</span> + </p> + </li> + <li class="li"> + <p class="p"> + Security enhancements for the <span class="keyword cmdname">impala-shell</span> command. 
+ A new option, <code class="ph codeph">--ldap_password_cmd</code>, lets you specify + a command to retrieve the LDAP password. The resulting password is + then used to authenticate the <span class="keyword cmdname">impala-shell</span> command + with the LDAP server. + <span class="ph">See <a class="xref" href="impala_shell_options.html">impala-shell Configuration Options</a> for details.</span> + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">CREATE TABLE AS SELECT</code> statement now accepts a + <code class="ph codeph">PARTITIONED BY</code> clause, which lets you create a + partitioned table and insert data into it with a single statement. + <span class="ph">See <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a> for details.</span> + </p> + </li> + <li class="li"> + <p class="p"> + User-defined functions (UDFs and UDAFs) written in C++ now persist automatically + when the <span class="keyword cmdname">catalogd</span> daemon is restarted. You no longer + have to run the <code class="ph codeph">CREATE FUNCTION</code> statements again after a restart. + </p> + </li> + <li class="li"> + <p class="p"> + User-defined functions (UDFs) written in Java can now persist + when the <span class="keyword cmdname">catalogd</span> daemon is restarted, and can be shared + transparently between Impala and Hive. You must do a one-time operation to recreate these + UDFs using new <code class="ph codeph">CREATE FUNCTION</code> syntax, without a signature for arguments + or the return value. Afterwards, you no longer have to run the <code class="ph codeph">CREATE FUNCTION</code> + statements again after a restart. + Although Impala does not have visibility into the UDFs that implement the + Hive built-in functions, user-created Hive UDFs are now automatically available + for calling through Impala. 
+ <span class="ph">See <a class="xref" href="impala_create_function.html#create_function">CREATE FUNCTION Statement</a> for details.</span> + </p> + </li> + <li class="li"> + + <p class="p"> + Reliability enhancements for memory management. Some aggregation and join queries + that formerly might have failed with an out-of-memory error due to memory contention, + now can succeed using the spill-to-disk mechanism. + </p> + </li> + <li class="li"> + + <p class="p"> + The <code class="ph codeph">SHOW DATABASES</code> statement now returns two columns rather than one. + The second column includes the associated comment string, if any, for each database. + Adjust any application code that examines the list of databases and assumes the + result set contains only a single column. + <span class="ph">See <a class="xref" href="impala_show.html#show_databases">SHOW DATABASES</a> for details.</span> + </p> + </li> + <li class="li"> + <p class="p"> + A new optimization speeds up aggregation operations that involve only the partition key + columns of partitioned tables. For example, a query such as <code class="ph codeph">SELECT COUNT(DISTINCT k), MIN(k), MAX(k) FROM t1</code> + can avoid reading any data files if <code class="ph codeph">T1</code> is a partitioned table and <code class="ph codeph">K</code> + is one of the partition key columns. Because this technique can produce different results in cases + where HDFS files in a partition are manually deleted or are empty, you must enable the optimization + by setting the query option <code class="ph codeph">OPTIMIZE_PARTITION_KEY_SCANS</code>. 
+ <span class="ph">See <a class="xref" href="impala_optimize_partition_key_scans.html">OPTIMIZE_PARTITION_KEY_SCANS Query Option (Impala 2.5 or higher only)</a> for details.</span> + </p> + </li> + + <li class="li"> + <p class="p"> + The <code class="ph codeph">DESCRIBE</code> statement can now display metadata about a database, using the + syntax <code class="ph codeph">DESCRIBE DATABASE <var class="keyword varname">db_name</var></code>. + <span class="ph">See <a class="xref" href="impala_describe.html#describe">DESCRIBE Statement</a> for details.</span> + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">uuid()</code> built-in function generates an + alphanumeric value that you can use as a guaranteed unique identifier. + The uniqueness applies even across tables, for cases where an ascending + numeric sequence is not suitable. + <span class="ph">See <a class="xref" href="impala_misc_functions.html#misc_functions">Impala Miscellaneous Functions</a> for details.</span> + </p> + </li> + </ul> + + </div> + </article> + + + + <article class="topic concept nested1" aria-labelledby="ariaid-title11" id="new_features__new_features_240"> + + <h2 class="title topictitle2" id="ariaid-title11">New Features in <span class="keyword">Impala 2.4</span></h2> + + <div class="body conbody"> + + <ul class="ul"> + <li class="li"> + <p class="p"> + Impala can be used on the DSSD D5 Storage Appliance. + From a user perspective, the Impala features are the same as in <span class="keyword">Impala 2.3</span>. + </p> + </li> + </ul> + + </div> + </article> + + + + + + <article class="topic concept nested1" aria-labelledby="ariaid-title12" id="new_features__new_features_230"> + + <h2 class="title topictitle2" id="ariaid-title12">New Features in <span class="keyword">Impala 2.3</span></h2> + + <div class="body conbody"> + + <p class="p"> + The following are the major new features in Impala 2.3.x. 
This major release + contains improvements to SQL syntax (particularly new support for complex types), performance, + manageability, and security. + </p> + + <ul class="ul"> + + <li class="li"> + <p class="p"> + Complex data types: <code class="ph codeph">STRUCT</code>, <code class="ph codeph">ARRAY</code>, and <code class="ph codeph">MAP</code>. These + types can encode multiple named fields, positional items, or key-value pairs within a single column. + You can combine these types to produce nested types with arbitrarily deep nesting, + such as an <code class="ph codeph">ARRAY</code> of <code class="ph codeph">STRUCT</code> values, + a <code class="ph codeph">MAP</code> where each key-value pair is an <code class="ph codeph">ARRAY</code> of other <code class="ph codeph">MAP</code> values, + and so on. Currently, complex data types are only supported for the Parquet file format. + <span class="ph">See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for usage details and <a class="xref" href="impala_array.html#array">ARRAY Complex Type (Impala 2.3 or higher only)</a>, <a class="xref" href="impala_struct.html#struct">STRUCT Complex Type (Impala 2.3 or higher only)</a>, and <a class="xref" href="impala_map.html#map">MAP Complex Type (Impala 2.3 or higher only)</a> for syntax.</span> + </p> + </li> + + <li class="li"> + <p class="p"> + Column-level authorization lets you define access to particular columns within a table, + rather than the entire table. This feature lets you reduce the reliance on creating views to + set up authorization schemes for subsets of information. 
+ See <span class="xref">the documentation for Apache Sentry</span> for background details, and + <a class="xref" href="impala_grant.html#grant">GRANT Statement (Impala 2.0 or higher only)</a> and <a class="xref" href="impala_revoke.html#revoke">REVOKE Statement (Impala 2.0 or higher only)</a> for Impala-specific syntax. 
+ </p> + </li> + + <li class="li"> + <p class="p"> + The <code class="ph codeph">TRUNCATE TABLE</code> statement removes all the data from a table without removing the table itself. + <span class="ph">See <a class="xref" href="impala_truncate_table.html#truncate_table">TRUNCATE TABLE Statement (Impala 2.3 or higher only)</a> for details.</span> + </p> + </li> + + <li class="li" id="new_features_230__IMPALA-2015"> + <p class="p"> + Nested loop join queries. Some join queries that formerly required equality comparisons can now use + operators such as <code class="ph codeph">&lt;</code> or <code class="ph codeph">&gt;=</code>. This same join mechanism is used + internally to optimize queries that retrieve values from complex type columns. + <span class="ph">See <a class="xref" href="impala_joins.html#joins">Joins in Impala SELECT Statements</a> for details about Impala join queries.</span> + </p> + </li> + + <li class="li"> + <p class="p"> + Reduced memory usage and improved performance and robustness for the spill-to-disk feature. + <span class="ph">See <a class="xref" href="impala_scalability.html#spill_to_disk">SQL Operations that Spill to Disk</a> for details about this feature.</span> + </p> + </li> + + <li class="li"> + <p class="p"> + Performance improvements for querying Parquet data files containing multiple row groups + and multiple data blocks: + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> For files written by Hive, SparkSQL, and other Parquet MR writers + and spanning multiple HDFS blocks, Impala now scans the extra + data blocks locally when possible, rather than using remote + reads. </p> + </li> + <li class="li"> + <p class="p"> + Impala queries benefit from the improved alignment of row groups with HDFS blocks for Parquet + files written by Hive, MapReduce, and other components. (Impala itself never writes + multiblock Parquet files, so the alignment change does not apply to Parquet files produced by Impala.) 
+ These Parquet writers now add padding to Parquet files that they write to align row groups with HDFS blocks. + The <code class="ph codeph">parquet.writer.max-padding</code> setting specifies the maximum number of bytes, by default + 8 megabytes, that can be added to the file between row groups to fill the gap at the end of one block + so that the next row group starts at the beginning of the next block. + If the gap is larger than this size, the writer attempts to fit another entire row group in the remaining space. + Include this setting in the <span class="ph filepath">hive-site</span> configuration file to influence Parquet files written by Hive, + or the <span class="ph filepath">hdfs-site</span> configuration file to influence Parquet files written by all non-Impala components. + </p> + </li> + </ul> + <p class="p"> + See <a class="xref" href="impala_parquet.html#parquet">Using the Parquet File Format with Impala Tables</a> for instructions about using Parquet data files + with Impala. + </p> + </li> + + <li class="li" id="new_features_230__IMPALA-1660"> + <p class="p"> + Many new built-in scalar functions, for convenience and enhanced portability of SQL that uses common industry extensions. 
+ </p> + + <p class="p"> + Math functions<span class="ph"> (see <a class="xref" href="impala_math_functions.html#math_functions">Impala Mathematical Functions</a> for details)</span>: + </p> + <ul class="ul"> + <li class="li"> + <code class="ph codeph">ATAN2</code> + </li> + + <li class="li"> + <code class="ph codeph">COSH</code> + </li> + + <li class="li"> + <code class="ph codeph">COT</code> + </li> + + <li class="li"> + <code class="ph codeph">DCEIL</code> + </li> + + <li class="li"> + <code class="ph codeph">DEXP</code> + </li> + + <li class="li"> + <code class="ph codeph">DFLOOR</code> + </li> + + <li class="li"> + <code class="ph codeph">DLOG10</code> + </li> + + <li class="li"> + <code class="ph codeph">DPOW</code> + </li> + + <li class="li"> + <code class="ph codeph">DROUND</code> + </li> + + <li class="li"> + <code class="ph codeph">DSQRT</code> + </li> + + <li class="li"> + <code class="ph codeph">DTRUNC</code> + </li> + + <li class="li"> + <code class="ph codeph">FACTORIAL</code>, and corresponding <code class="ph codeph">!</code> operator + </li> + + <li class="li"> + <code class="ph codeph">FPOW</code> + </li> + + <li class="li"> + <code class="ph codeph">RADIANS</code> + </li> + + <li class="li"> + <code class="ph codeph">RANDOM</code> + </li> + + <li class="li"> + <code class="ph codeph">SINH</code> + </li> + + <li class="li"> + <code class="ph codeph">TANH</code> + </li> + </ul> + + <p class="p"> + String functions<span class="ph"> (see <a class="xref" href="impala_string_functions.html#string_functions">Impala String Functions</a> for details)</span>: + </p> + <ul class="ul"> + <li class="li"> + <code class="ph codeph">BTRIM</code> + </li> + <li class="li"> + <code class="ph codeph">CHR</code> + </li> + <li class="li"> + <code class="ph codeph">REGEXP_LIKE</code> + </li> + <li class="li"> + <code class="ph codeph">SPLIT_PART</code> + </li> + </ul> + + <p class="p"> + Date and time functions<span class="ph"> (see <a class="xref" 
href="impala_datetime_functions.html#datetime_functions">Impala Date and Time Functions</a> for details)</span>: + </p> + <ul class="ul"> + <li class="li"> + <code class="ph codeph">INT_MONTHS_BETWEEN</code> + </li> + <li class="li"> + <code class="ph codeph">MONTHS_BETWEEN</code> + </li> + <li class="li"> + <code class="ph codeph">TIMEOFDAY</code> + </li> + <li class="li"> + <code class="ph codeph">TIMESTAMP_CMP</code> + </li> + </ul> + + <p class="p"> + Bit manipulation functions<span class="ph"> (see <a class="xref" href="impala_bit_functions.html#bit_functions">Impala Bit Functions</a> for details)</span>: + </p> + <ul class="ul"> + <li class="li"> + <code class="ph codeph">BITAND</code> + </li> + + <li class="li"> + <code class="ph codeph">BITNOT</code> + </li> + + <li class="li"> + <code class="ph codeph">BITOR</code> + </li> + + <li class="li"> + <code class="ph codeph">BITXOR</code> + </li> + + <li class="li"> + <code class="ph codeph">COUNTSET</code> + </li> + + <li class="li"> + <code class="ph codeph">GETBIT</code> + </li> + + <li class="li"> + <code class="ph codeph">ROTATELEFT</code> + </li> + + <li class="li"> + <code class="ph codeph">ROTATERIGHT</code> + </li> + + <li class="li"> + <code class="ph codeph">SETBIT</code> + </li> + + <li class="li"> + <code class="ph codeph">SHIFTLEFT</code> + </li> + + <li class="li"> + <code class="ph codeph">SHIFTRIGHT</code> + </li> + </ul> + <p class="p"> + Type conversion functions<span class="ph"> (see <a class="xref" href="impala_conversion_functions.html#conversion_functions">Impala Type Conversion Functions</a> for details)</span>: + </p> + <ul class="ul"> + <li class="li"> + <code class="ph codeph">TYPEOF</code> + </li> + </ul> + <p class="p"> + The <code class="ph codeph">effective_user()</code> function<span class="ph"> (see <a class="xref" href="impala_misc_functions.html#misc_functions">Impala Miscellaneous Functions</a> for details)</span>. 
+ </p> + </li> + + <li class="li" id="new_features_230__IMPALA-2081"> + <p class="p"> + New built-in analytic
<TRUNCATED>