http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_new_features.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_new_features.xml b/docs/topics/impala_new_features.xml new file mode 100644 index 0000000..4da811f --- /dev/null +++ b/docs/topics/impala_new_features.xml @@ -0,0 +1,4015 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="ver" id="new_features"> + + <title><ph audience="standalone">New Features in Apache Impala (incubating)</ph><ph audience="integrated">What's New in Apache Impala (incubating)</ph></title> + + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Release Notes"/> + <data name="Category" value="New Features"/> + <data name="Category" value="What's New"/> + <data name="Category" value="Getting Started"/> + <data name="Category" value="Upgrading"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + This release of Impala contains the following changes and enhancements from previous releases. + </p> + + <p outputclass="toc inpage"/> + + </conbody> + +<!-- All 2.7.x new features go under here --> + + <concept rev="2.7.0" id="new_features_270"> + + <title>New Features in Impala 2.7.x / CDH 5.9.x</title> + + <conbody> + + <ul id="feature_list"> + <li> + <p> + Performance improvements: + </p> + <ul> + <li> + <p rev="IMPALA-3206 CDH-43744"> + [<xref href="https://issues.cloudera.org/browse/IMPALA-3206" scope="external" format="html">IMPALA-3206</xref>] + Speedup for queries against <codeph>DECIMAL</codeph> columns in Avro tables. + The code that parses <codeph>DECIMAL</codeph> values from Avro now uses + native code generation. 
+ </p> + </li> + <li> + <p rev="IMPALA-3674"> + [<xref href="https://issues.cloudera.org/browse/IMPALA-3674" scope="external" format="html">IMPALA-3674</xref>] + Improved efficiency in LLVM code generation can reduce codegen time, especially + for short queries. + </p> + </li> + <!-- Not actually a new feature, it's more a tip about when to expect remote reads and how to minimize them. To go somewhere in the performance / best practices / Parquet info. + <li> + <p rev="IMPALA-3885 CDH-43793"> + [<xref href="https://issues.cloudera.org/browse/IMPALA-3885" scope="external" format="html">IMPALA-3885</xref>] + Parquet files with multiple blocks can now be processed + without remote reads. + </p> + </li> + --> + <li> + <p rev="IMPALA-2979 CDH-43739"> [<xref + href="https://issues.cloudera.org/browse/IMPALA-2979" scope="external" + format="html">IMPALA-2979</xref>] Improvements to scheduling on worker nodes, + enabled by the <codeph>REPLICA_PREFERENCE</codeph> query option. + See <xref + href="impala_replica_preference.xml#replica_preference"/> for details. + </p> + </li> + </ul> + </li> + <li audience="Cloudera"> + <p rev="IMPALA-3210 CDH-43736"><!-- Patch didn't make it into in <keyword keyref="impala27_full"/> --> + [<xref href="https://issues.cloudera.org/browse/IMPALA-3210" scope="external" format="html">IMPALA-3210</xref>] + The analytic functions <codeph>FIRST_VALUE()</codeph> and <codeph>LAST_VALUE()</codeph> + accept a new clause, <codeph>IGNORE NULLS</codeph>. + See <xref href="impala_analytic_functions.xml#first_value"/> + and <xref href="impala_analytic_functions.xml#last_value"/> + for details. + </p> + </li> + <li> + <p rev="IMPALA-1683 CDH-43732"> + [<xref href="https://issues.cloudera.org/browse/IMPALA-1683" scope="external" format="html">IMPALA-1683</xref>] + The <codeph>REFRESH</codeph> statement can be applied to a single partition, + rather than the entire table. 
See <xref href="impala_refresh.xml#refresh"/> + and <xref href="impala_partitioning.xml#partition_refresh"/> for details. + </p> + </li> + <li> + <p> + Improvements to the Impala web user interface: + </p> + <ul> + <li> + <p rev="IMPALA-2767 CDH-43748"> + [<xref href="https://issues.cloudera.org/browse/IMPALA-2767" scope="external" format="html">IMPALA-2767</xref>] + You can now force a session to expire by clicking a link in the web UI, + on the <uicontrol>/sessions</uicontrol> tab. + </p> + </li> + <li> + <p rev="IMPALA-3715 CDH-43743"> + [<xref href="https://issues.cloudera.org/browse/IMPALA-3715" scope="external" format="html">IMPALA-3715</xref>] + The <uicontrol>/memz</uicontrol> tab includes more information about + Impala memory usage. + </p> + </li> + <li> + <p rev="IMPALA-3716 CDH-43741"> + [<xref href="https://issues.cloudera.org/browse/IMPALA-3716" scope="external" format="html">IMPALA-3716</xref>] + The <uicontrol>Details</uicontrol> page for a query now includes + a <uicontrol>Memory</uicontrol> tab. + </p> + </li> + </ul> + </li> + <li> + <p rev="IMPALA-3499 CDH-43740"> + [<xref href="https://issues.cloudera.org/browse/IMPALA-3499" scope="external" format="html">IMPALA-3499</xref>] + Scalability improvements to the catalog server. Impala handles internal communication + more efficiently for tables with large numbers of columns and partitions, where the + size of the metadata exceeds 2 GiB. + </p> + </li> + <li> + <p rev="IMPALA-3677 CDH-43745"> + [<xref href="https://issues.cloudera.org/browse/IMPALA-3677" scope="external" format="html">IMPALA-3677</xref>] + You can send a <codeph>SIGUSR1</codeph> signal to any Impala-related daemon to write a + Breakpad minidump. For advanced troubleshooting, you can now produce a minidump + without triggering a crash. See <xref href="impala_breakpad.xml#breakpad"/> for + details about the Breakpad minidump feature. 
+ </p> + </li> + <li> + <p rev="IMPALA-3687 CDH-43731"> + [<xref href="https://issues.cloudera.org/browse/IMPALA-3687" scope="external" format="html">IMPALA-3687</xref>] + The schema reconciliation rules for Avro tables have changed slightly + for <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> columns. Now, if + the definition of such a column is changed in the Avro schema file, + the column retains its <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> + type as specified in the SQL definition, but the column name and comment + from the Avro schema file take precedence. + See <xref href="impala_avro.xml#avro_create_table"/> for details about + column definitions in Avro tables. + </p> + </li> + <li audience="Cloudera"><!-- Patch didn't make it into in <keyword keyref="impala27_full"/> --> + <p rev="IMPALA-1654 CDH-43747"> + [<xref href="https://issues.cloudera.org/browse/IMPALA-1654" scope="external" format="html">IMPALA-1654</xref>] + Several kinds of DDL operations + can now work on a range of partitions. The partitions can be specified + using operators such as <codeph>&lt;</codeph>, <codeph>&gt;=</codeph>, and + <codeph>!=</codeph> rather than just an equality predicate applying to a single + partition. + This new feature extends the syntax of + several clauses + of the <codeph>ALTER TABLE</codeph> statement + (<codeph>DROP PARTITION</codeph>, <codeph>SET [UN]CACHED</codeph>, + <codeph>SET FILEFORMAT | SERDEPROPERTIES | TBLPROPERTIES</codeph>), + the <codeph>SHOW FILES</codeph> statement, and the + <codeph>COMPUTE INCREMENTAL STATS</codeph> statement. + It does not apply to statements that are defined to only apply to a single + partition, such as <codeph>LOAD DATA</codeph>, <codeph>ALTER TABLE ... ADD PARTITION</codeph>, + <codeph>SET LOCATION</codeph>, and <codeph>INSERT</codeph> with a static + partitioning clause. 
+ </p> + </li> + <li> + <p rev="IMPALA-3575 CDH-43742"> [<xref + href="https://issues.cloudera.org/browse/IMPALA-3575" + scope="external" format="html">IMPALA-3575</xref>] Some network + operations now have additional timeout and retry settings. The extra + configuration helps avoid failed queries for transient network + problems, to avoid hangs when a sender or receiver fails in the + middle of a network transmission, and to make cancellation requests + more reliable despite network issues. </p> + </li> + </ul> + + </conbody> + </concept> +<!-- All 2.6.x new features go under here --> + + <concept rev="2.6.0" id="new_features_260"> + + <title>New Features in Impala 2.6.x / CDH 5.8.x</title> + + <conbody> + + <!-- <note conref="../shared/impala_common.xml#common/only_cdh5_260" /> --> + + <ul> + <li> + <p> + Improvements to Impala support for the Amazon S3 filesystem: + </p> + <ul> + <li> + <p rev="IMPALA-1878 CDH-33310"> + Impala can now write to S3 tables through the <codeph>INSERT</codeph> + or <codeph>LOAD DATA</codeph> statements. + See <xref href="impala_s3.xml#s3"/> for general information about + using Impala with S3. + </p> + </li> + <li> + <p rev="IMPALA-3452 CDH-39913"> + A new query option, <codeph>S3_SKIP_INSERT_STAGING</codeph>, lets you + trade off between fast <codeph>INSERT</codeph> performance and + slower <codeph>INSERT</codeph>s that are more consistent if a + problem occurs during the statement. The new behavior is enabled by default. + See <xref href="impala_s3_skip_insert_staging.xml#s3_skip_insert_staging"/> for details + about this option. + </p> + </li> + </ul> + </li> + <li> + <p rev="CDH-41184"> + Performance improvements for the runtime filtering feature: + </p> + <ul> + <li> + <p rev="CDH-41184 IMPALA-3333"> + The default for the <codeph>RUNTIME_FILTER_MODE</codeph> + query option is changed to <codeph>GLOBAL</codeph> (the highest setting). 
+ See <xref href="impala_runtime_filter_mode.xml#runtime_filter_mode"/> for + details about this option. + </p> + </li> + <li rev="CDH-41184 IMPALA-3007"> + <p> + The <codeph>RUNTIME_BLOOM_FILTER_SIZE</codeph> setting is now only used + as a fallback if statistics are not available; otherwise, Impala + uses the statistics to estimate the appropriate size to use for each filter. + See <xref href="impala_runtime_bloom_filter_size.xml#runtime_bloom_filter_size"/> for + details about this option. + </p> + </li> + <li rev="CDH-41184 IMPALA-3480"> + <p> + New query options <codeph>RUNTIME_FILTER_MIN_SIZE</codeph> and + <codeph>RUNTIME_FILTER_MAX_SIZE</codeph> let you fine-tune + the sizes of the Bloom filter structures used for runtime filtering. + If the filter size derived from Impala internal estimates or from + the <codeph>RUNTIME_BLOOM_FILTER_SIZE</codeph> setting falls outside the size + range specified by these options, any too-small filter size is adjusted + to the minimum, and any too-large filter size is adjusted to the maximum. + See <xref href="impala_runtime_filter_min_size.xml#runtime_filter_min_size"/> + and <xref href="impala_runtime_filter_max_size.xml#runtime_filter_max_size"/> + for details about these options. + </p> + </li> + <li rev="CDH-41184 IMPALA-2956"> + <p> + Runtime filter propagation now applies to all the + operands of <codeph>UNION</codeph> and <codeph>UNION ALL</codeph> + operators. + </p> + </li> + <li rev="CDH-41184 IMPALA-3077"> + <p> + Runtime filters can now be produced during join queries even + when the join processing activates the spill-to-disk mechanism. + </p> + </li> + </ul> + See <xref href="impala_runtime_filtering.xml#runtime_filtering"/> for + general information about the runtime filtering feature. + </li> + <!-- Have to look closer at resource management / admission control to see if + there are any ripple effects from this default change. 
--> + <li> + <p rev="IMPALA-3199"> + Admission control and dynamic resource pools are enabled by default. + See <xref href="impala_admission.xml#admission_control"/> for details + about admission control. + </p> + </li> + <!-- Below here are features that are pretty well taken care of already; + some of them didn't need much if any doc in the first place. --> + <li> + <p rev="IMPALA-3369"> + Impala can now manually set column statistics, + using the <codeph>ALTER TABLE</codeph> statement with a + <codeph>SET COLUMN STATS</codeph> clause. + See <xref href="impala_perf_stats.xml#perf_column_stats_manual"/> for details. + </p> + </li> + <li> + <p rev="CDH-40238 CDH-39818 IMPALA-3490 IMPALA-3581 IMPALA-2686"> + Impala can now write lightweight <q>minidump</q> files, rather + than large core files, to save diagnostic information when + any of the Impala-related daemons crash. This feature uses the + open source <codeph>breakpad</codeph> framework. + See <xref href="impala_breakpad.xml#breakpad"/> for details. + </p> + </li> + <li> + <p> + New query options improve interoperability with Parquet files: + <ul> + <li> + <p rev="IMPALA-2835 CDH-33330"> + The <codeph>PARQUET_FALLBACK_SCHEMA_RESOLUTION</codeph> query option + lets Impala locate columns within Parquet files based on + column name rather than ordinal position. + This enhancement improves interoperability with applications + that write Parquet files with a different order or subset of + columns than are used in the Impala table. + See <xref href="impala_parquet_fallback_schema_resolution.xml#parquet_fallback_schema_resolution"/> + for details. 
+ </p> + </li> + <li> + <p rev="IMPALA-2069"> + The <codeph>PARQUET_ANNOTATE_STRINGS_UTF8</codeph> query option + makes Impala include the <codeph>UTF-8</codeph> annotation + metadata for <codeph>STRING</codeph>, <codeph>CHAR</codeph>, + and <codeph>VARCHAR</codeph> columns in Parquet files created + by <codeph>INSERT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> + statements. + See <xref href="impala_parquet_annotate_strings_utf8.xml#parquet_annotate_strings_utf8"/> + for details. + </p> + </li> + </ul> + See <xref href="impala_parquet.xml#parquet"/> for general information about working + with Parquet files. + </p> + </li> + <li> + <p> + Improvements to security and reduction in overhead for secure clusters: + </p> + <ul> + <li> + <p rev="IMPALA-1928"> + Overall performance improvements for secure clusters. + (TPC-H queries on a secure cluster were benchmarked + at roughly 3x as fast as the previous release.) + </p> + </li> + <li> + <p rev="IMPALA-2660 CDH-40241"> + Impala now recognizes the <codeph>auth_to_local</codeph> setting, + specified through the HDFS configuration setting + <codeph>hadoop.security.auth_to_local</codeph>. + This feature is disabled by default; to enable it, + specify <codeph>--load_auth_to_local_rules=true</codeph> + in the <cmdname>impalad</cmdname> configuration settings. + See <xref href="impala_kerberos.xml#auth_to_local"/> for details. + </p> + </li> + <li> + <p rev="IMPALA-2599"> + Timing improvements in the mechanism for the <cmdname>impalad</cmdname> + daemon to acquire Kerberos tickets. This feature spreads out the overhead + on the KDC during Impala startup, especially for large clusters. + </p> + </li> + <li> + <p rev="IMPALA-3554"> + For Kerberized clusters, the Catalog service now uses + the Kerberos principal instead of the operating system user that runs + the <cmdname>catalogd</cmdname> daemon. 
+ This eliminates the requirement to configure a <codeph>hadoop.user.group.static.mapping.overrides</codeph> + setting to put the OS user into the Sentry administrative group, on clusters where the principal + and the OS user name for this user are different. + </p> + </li> + </ul> + </li> + <li> + <p rev="IMPALA-3286"> + Overall performance improvements for join queries, by using a prefetching mechanism + while building the in-memory hash table to evaluate join predicates. + See <xref href="impala_prefetch_mode.xml#prefetch_mode"/> for the query option + to control this optimization. + </p> + </li> + <li> + <p rev="IMPALA-3397 CDH-40097"> + The <cmdname>impala-shell</cmdname> interpreter has a new command, + <codeph>SOURCE</codeph>, that lets you run a set of SQL statements + or other <cmdname>impala-shell</cmdname> commands stored in a file. + You can run additional <codeph>SOURCE</codeph> commands from inside + a file, to set up flexible sequences of statements for use cases + such as schema setup, ETL, or reporting. + See <xref href="impala_shell_commands.xml#shell_commands"/> for details + and <xref href="impala_shell_running_commands.xml#shell_running_commands"/> + for examples. + </p> + </li> + <li> + <p rev="IMPALA-1772 CDH-38381"> + The <codeph>millisecond()</codeph> built-in function lets you extract + the fractional seconds part of a <codeph>TIMESTAMP</codeph> value. + See <xref href="impala_datetime_functions.xml#datetime_functions"/> for details. + </p> + </li> + <li> + <p rev="IMPALA-3092"> + If an Avro table is created without column definitions in the + <codeph>CREATE TABLE</codeph> statement, and columns are later + added through <codeph>ALTER TABLE</codeph>, the resulting + table is now queryable. Missing values from the newly added + columns now default to <codeph>NULL</codeph>. + See <xref href="impala_avro.xml#avro"/> for general details about + working with Avro files. 
+ </p> + </li> + <li> + <p> + The mechanism for interpreting <codeph>DECIMAL</codeph> literals is + improved, no longer going through an intermediate conversion step + to <codeph>DOUBLE</codeph>: + <ul> + <li> + <p rev="IMPALA-3163"> + Casting a <codeph>DECIMAL</codeph> value to <codeph>TIMESTAMP</codeph> + produces a more precise + value for the <codeph>TIMESTAMP</codeph> than formerly. + </p> + </li> + <li> + <p rev="IMPALA-3439"> + Certain function calls involving <codeph>DECIMAL</codeph> literals + now succeed, when formerly they failed due to lack of a function + signature with a <codeph>DOUBLE</codeph> argument. + </p> + </li> + <li> + <p rev=""> + Faster runtime performance for <codeph>DECIMAL</codeph> constant + values, through improved native code generation for all combinations + of precision and scale. + </p> + </li> + </ul> + See <xref href="impala_decimal.xml#decimal"/> for details about the <codeph>DECIMAL</codeph> type. + </p> + </li> + <li> + <p rev="IMPALA-3155"> + Improved type accuracy for <codeph>CASE</codeph> return values. + If all <codeph>WHEN</codeph> clauses of the <codeph>CASE</codeph> + expression are of <codeph>CHAR</codeph> type, the final result + is also <codeph>CHAR</codeph> instead of being converted to + <codeph>STRING</codeph>. + See <xref href="impala_conditional_functions.xml#conditional_functions"/> + for details about the <codeph>CASE</codeph> function. + </p> + </li> + <li> + <p rev="IMPALA-3232"> + Uncorrelated queries using the <codeph>NOT EXISTS</codeph> operator + are now supported. Formerly, the <codeph>NOT EXISTS</codeph> + operator was only available for correlated subqueries. + </p> + </li> + <li> + <p rev="IMPALA-2736"> + Improved performance for reading Parquet files. + </p> + </li> + <li> + <p rev="IMPALA-3375"> + Improved performance for <term>top-N</term> queries, that is, + those including both <codeph>ORDER BY</codeph> and + <codeph>LIMIT</codeph> clauses. 
+ </p> + </li> + <!-- JIRA still in open state as of 5.8 / 2.6, commenting out. + <li> + <p rev="IMPALA-3471"> + A top-N query can now also activate the spill-to-disk mechanism if + a host runs low on memory while evaluating it. For example, using + large <codeph>LIMIT</codeph> and/or <codeph>OFFSET</codeph> clauses + adds some memory overhead that could cause spilling. + </p> + </li> + --> + <li> + <p rev="IMPALA-1740"> + Impala optionally skips an arbitrary number of header lines from text input + files on HDFS based on the <codeph>skip.header.line.count</codeph> value + in the <codeph>TBLPROPERTIES</codeph> field of the table metadata. + See <xref href="impala_txtfile.xml#text_data_files"/> for details. + </p> + </li> + <li> + <p rev="IMPALA-2336"> + Trailing comments are now allowed in queries processed by + the <cmdname>impala-shell</cmdname> options <codeph>-q</codeph> + and <codeph>-f</codeph>. + </p> + </li> + <li> + <p rev="IMPALA-2844"> + Impala can run <codeph>COUNT</codeph> queries for RCFile tables + that include complex type columns. + See <xref href="impala_complex_types.xml#complex_types"/> for + general information about working with complex types, + and <xref href="impala_array.xml#array"/>, + <xref href="impala_map.xml#map"/>, and <xref href="impala_struct.xml#struct"/> + for syntax details of each type. + </p> + </li> + </ul> + + </conbody> + </concept> + +<!-- All 2.5.x new features go under here --> + + <concept rev="2.5.0" id="new_features_250"> + + <title>New Features in Impala 2.5.x / CDH 5.7.x</title> + + <conbody> + + <note conref="../shared/impala_common.xml#common/only_cdh5_250" /> + + <ul> + <li><!-- Spec: https://docs.google.com/document/d/1ambtYJ1t05iITCVIrN6N1A-e7PZBSetBPgjy8SLzJrA/edit#heading=h.vcftzwlpn845 --> + <p rev="CDH-33292 IMPALA-2552 IMPALA-3054"> + Dynamic partition pruning. 
When a query refers to a partition key column in a <codeph>WHERE</codeph> + clause, and the exact set of column values is not known until the query is executed, + Impala evaluates the predicate and skips the I/O for entire partitions that are not needed. + For example, if a table was partitioned by year, Impala would apply this technique to a query + such as <codeph>SELECT c1 FROM partitioned_table WHERE year = (SELECT MAX(year) FROM other_table)</codeph>. + <ph audience="standalone">See <xref href="impala_partitioning.xml#dynamic_partition_pruning"/> for details.</ph> + </p> + <p> + The dynamic partition pruning optimization technique lets Impala avoid reading + data files from partitions that are not part of the result set, even when + that determination cannot be made in advance. This technique is especially valuable + when performing join queries involving partitioned tables. For example, if a join + query includes an <codeph>ON</codeph> clause and a <codeph>WHERE</codeph> clause + that refer to the same columns, the query can find the set of column values that + match the <codeph>WHERE</codeph> clause, and only scan the associated partitions + when evaluating the <codeph>ON</codeph> clause. + </p> + <p> + Dynamic partition pruning is controlled by the same settings as the runtime filtering feature. + By default, this feature is enabled at a medium level, because the maximum setting can use + slightly more memory for queries than in previous releases. + To fully enable this feature, set the query option <codeph>RUNTIME_FILTER_MODE=GLOBAL</codeph>. + </p> + </li> + <li><!-- Spec: https://docs.google.com/document/d/1ambtYJ1t05iITCVIrN6N1A-e7PZBSetBPgjy8SLzJrA/edit#heading=h.vcftzwlpn845 --> + <p rev="IMPALA-2419 IMPALA-3001 IMPALA-3008 IMPALA-3039 IMPALA-3046 IMPALA-3054"> + Runtime filtering. This is a wide-ranging set of optimizations that are especially valuable for join queries. 
+ Using the same technique as with dynamic partition pruning, + Impala uses the predicates from <codeph>WHERE</codeph> and <codeph>ON</codeph> clauses + to determine the subset of column values from one of the joined tables that could possibly be part of the + result set. Impala sends a compact representation of the filter condition to the hosts in the cluster, + instead of the full set of values or the entire table. + <ph audience="PDF">See <xref href="impala_runtime_filtering.xml#runtime_filtering"/> for details.</ph> + </p> + <p> + By default, this feature is enabled at a medium level, because the maximum setting can use + slightly more memory for queries than in previous releases. + To fully enable this feature, set the query option <codeph>RUNTIME_FILTER_MODE=GLOBAL</codeph>. + <ph audience="PDF">See <xref href="impala_runtime_filter_mode.xml#runtime_filter_mode"/> for details.</ph> + </p> + <p> + This feature involves some new query options: + <xref audience="standalone" href="impala_runtime_filter_mode.xml">RUNTIME_FILTER_MODE</xref><codeph audience="integrated">RUNTIME_FILTER_MODE</codeph>, + <xref audience="standalone" href="impala_max_num_runtime_filters.xml">MAX_NUM_RUNTIME_FILTERS</xref><codeph audience="integrated">MAX_NUM_RUNTIME_FILTERS</codeph>, + <xref audience="standalone" href="impala_runtime_bloom_filter_size.xml">RUNTIME_BLOOM_FILTER_SIZE</xref><codeph audience="integrated">RUNTIME_BLOOM_FILTER_SIZE</codeph>, + <xref audience="standalone" href="impala_runtime_filter_wait_time_ms.xml">RUNTIME_FILTER_WAIT_TIME_MS</xref><codeph audience="integrated">RUNTIME_FILTER_WAIT_TIME_MS</codeph>, + and <xref audience="standalone" href="impala_disable_row_runtime_filtering.xml">DISABLE_ROW_RUNTIME_FILTERING</xref><codeph audience="integrated">DISABLE_ROW_RUNTIME_FILTERING</codeph>. 
+ <ph audience="PDF">See + <xref href="impala_runtime_filter_mode.xml#runtime_filter_mode">RUNTIME_FILTER_MODE</xref>, + <xref href="impala_max_num_runtime_filters.xml#max_num_runtime_filters">MAX_NUM_RUNTIME_FILTERS</xref>, + <xref href="impala_runtime_bloom_filter_size.xml#runtime_bloom_filter_size">RUNTIME_BLOOM_FILTER_SIZE</xref>, + <xref href="impala_runtime_filter_wait_time_ms.xml#runtime_filter_wait_time_ms">RUNTIME_FILTER_WAIT_TIME_MS</xref>, and + <xref href="impala_disable_row_runtime_filtering.xml#disable_row_runtime_filtering">DISABLE_ROW_RUNTIME_FILTERING</xref> + for details. + </ph> + </p> + </li> + <li> + <p rev="IMPALA-2696"> + More efficient use of the HDFS caching feature, to avoid + hotspots and bottlenecks that could occur if heavily used + cached data blocks were always processed by the same host. + By default, Impala now randomizes which host processes each cached + HDFS data block, when cached replicas are available on multiple hosts. + (Remember to use the <codeph>WITH REPLICATION</codeph> clause with the + <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statement + when enabling HDFS caching for a table or partition, to cache the same + data blocks across multiple hosts.) + The new query option <codeph>SCHEDULE_RANDOM_REPLICA</codeph> + <!-- and <codeph>REPLICA_PREFERENCE</codeph> --> + lets you fine-tune the interaction with HDFS caching even more. + <ph audience="PDF">See <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/> for details.</ph> + </p> + </li> + <li> + <p rev="IMPALA-2641"> + The <codeph>TRUNCATE TABLE</codeph> statement now accepts an <codeph>IF EXISTS</codeph> + clause, making <codeph>TRUNCATE TABLE</codeph> easier to use in setup or ETL scripts where the table might or + might not exist. 
+ <ph audience="PDF">See <xref href="impala_truncate_table.xml#truncate_table"/> for details.</ph> + </p> + </li> + <li> + <p rev="IMPALA-2681 IMPALA-2688 IMPALA-2749"> + Improved performance and reliability for the <codeph>DECIMAL</codeph> data type: + <ul> + <li> + <p rev="IMPALA-2681"> + Using <codeph>DECIMAL</codeph> values in a <codeph>GROUP BY</codeph> clause now + triggers the native code generation optimization, speeding up queries that + group by values such as prices. + </p> + </li> + <li> + <p rev="IMPALA-2688"> + Checking for overflow in <codeph>DECIMAL</codeph> + multiplication is now substantially faster, making <codeph>DECIMAL</codeph> + a more practical data type in some use cases where formerly <codeph>DECIMAL</codeph> + was much slower than <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph>. + </p> + </li> + <li> + <p rev="IMPALA-2749"> + Multiplying a mixture of <codeph>DECIMAL</codeph> + and <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> values now returns a + <codeph>DOUBLE</codeph> rather than a <codeph>DECIMAL</codeph>. This change avoids + some cases where an intermediate value would underflow or overflow and become + <codeph>NULL</codeph> unexpectedly. + </p> + </li> + </ul> + <ph audience="PDF">See <xref href="impala_decimal.xml"/> for details.</ph> + </p> + </li> + <li> + <p rev="IMPALA-2382"> + For UDFs written in Java, or Hive UDFs reused for Impala, + Impala now allows parameters and return values to be primitive types. + Formerly, these values were required to be one of the <q>Writable</q> + object types. + <ph audience="PDF">See <xref href="impala_udf.xml#udfs_hive"/> for details.</ph> + </p> + </li> + <!-- CDH-33298 is for scoping internationalization / UTF-8 / Unicode support. That work is pushed out to 5.8. + <li> + <p rev="CDH-33298"> + Improvements to internationalization support. + Now Impala can process data that uses the UTF-8 character encoding. 
+ </p> + </li> + --> + <li> + <p rev="IMPALA-1588"><!-- This is from 2015, so perhaps it's really in an earlier release. --> + Performance improvements for HDFS I/O. Impala now caches HDFS file handles to avoid the + overhead of repeatedly opening the same file. + </p> + </li> + + <!-- Kudu didn't make it into 2.5 / 5.7 release, so no DELETE or UPDATE statement. --> + <li> + <p><!-- Is there a JIRA for that one? Alex? --> + Performance improvements for queries involving nested complex types. + Certain basic query types, such as counting the elements of a complex column, + now use an optimized code path. + </p> + </li> + <li> + <p rev="IMPALA-3044 IMPALA-2538 IMPALA-1168 CDH-33289 CDH-34603"> + Improvements to the memory reservation mechanism for the Impala + admission control feature. You can specify more settings, such + as the timeout period and maximum aggregate memory used, for each + resource pool instead of globally for the Impala instance. The + default limit for concurrent queries (the <uicontrol>max requests</uicontrol> + setting) is now unlimited instead of 200. + The Cloudera Manager user interface for admission control has been + reworked, with the settings available under the + <uicontrol>Dynamic Resource Pools</uicontrol> window. + </p> + </li> + <li> + <p rev="IMPALA-1755"> + Performance improvements related to code generation. + Even in queries where code generation is not performed + for some phases of execution (such as reading data from + Parquet tables), Impala can still use code generation in + other parts of the query, such as evaluating + functions in the <codeph>WHERE</codeph> clause. + </p> + </li> + <li> + <p rev="IMPALA-1305"> + Performance improvements for queries using aggregation functions + on high-cardinality columns. 
+ Formerly, Impala could do unnecessary extra work to produce intermediate + results for operations such as <codeph>DISTINCT</codeph> or <codeph>GROUP BY</codeph> + on columns that were unique or had few duplicate values. + Now, Impala decides at run time whether it is more efficient to + do an initial aggregation phase and pass along a smaller set of intermediate data, + or to pass raw intermediate data back to the next phase of query processing to be aggregated there. + This feature is known as <term>streaming pre-aggregation</term>. + In case of performance regression, this feature can be turned off + using the <codeph>DISABLE_STREAMING_PREAGGREGATIONS</codeph> query option. + <ph audience="PDF">See <xref href="impala_disable_streaming_preaggregations.xml#disable_streaming_preaggregations"/> for details.</ph> + </p> + </li> + <li> + <p> + Spill-to-disk feature now always recommended. In earlier releases, the spill-to-disk feature + could be turned off using a pair of configuration settings, + <codeph>enable_partitioned_aggregation=false</codeph> and + <codeph>enable_partitioned_hash_join=false</codeph>. + The latest improvements in the spill-to-disk mechanism, and related features that + interact with it, make this feature robust enough that disabling it is + no longer needed or supported. In particular, some new features in <keyword keyref="impala25_full"/> + and higher do not work when the spill-to-disk feature is disabled. + </p> + </li> + <li> + <p rev="IMPALA-1067"> + Improvements to scripting capability for the <cmdname>impala-shell</cmdname> command, + through user-specified substitution variables that can appear in statements processed + by <cmdname>impala-shell</cmdname>: + </p> + <ul> + <li rev="IMPALA-2179"> + <p> + The <codeph>--var</codeph> command-line option lets you pass key-value pairs to + <cmdname>impala-shell</cmdname>. 
The shell can substitute the values + into queries before executing them, where the query text contains the notation + <codeph>${var:<varname>varname</varname>}</codeph>. For example, you might prepare a SQL file + containing a set of DDL statements and queries containing variables for + database and table names, and then pass the applicable names as part of the + <codeph>impala-shell -f <varname>filename</varname></codeph> command. + <ph audience="PDF">See <xref href="impala_shell_running_commands.xml#shell_running_commands"/> for details.</ph> + </p> + </li> + <li rev="IMPALA-2180"> + <p> + The <codeph>SET</codeph> and <codeph>UNSET</codeph> commands within the + <cmdname>impala-shell</cmdname> interpreter now work with user-specified + substitution variables, as well as the built-in query options. + The two kinds of variables are divided in the <codeph>SET</codeph> output. + As with variables defined by the <codeph>--var</codeph> command-line option, + you refer to the user-specified substitution variables in queries by using + the notation <codeph>${var:<varname>varname</varname>}</codeph> + in the query text. Because the substitution variables are processed by + <cmdname>impala-shell</cmdname> instead of the <cmdname>impalad</cmdname> + backend, you cannot define your own substitution variables through the + <codeph>SET</codeph> statement in a JDBC or ODBC application. + <ph audience="PDF">See <xref href="impala_set.xml#set"/> for details.</ph> + </p> + </li> + </ul> + </li> + <li> + <p rev="IMPALA-1599"> + Performance improvements for query startup. Impala better parallelizes certain work + when coordinating plan distribution between <cmdname>impalad</cmdname> instances, which improves + startup time for queries involving tables with many partitions on large clusters, + or complicated queries with many plan fragments. + </p> + </li> + <li> + <p rev="IMPALA-2560"> + Performance and scalability improvements for tables with many partitions. 
+ The memory requirements on the coordinator node are reduced, making it substantially + faster and less resource-intensive + to do joins involving several tables with thousands of partitions each. + </p> + </li> + <li> + <p rev="IMPALA-3095"> + Whitelisting for access to internal APIs. For applications that need direct access + to Impala APIs, without going through the HiveServer2 or Beeswax interfaces, you can + specify a list of Kerberos users who are allowed to call those APIs. By default, the + <codeph>impala</codeph> and <codeph>hdfs</codeph> users are the only ones authorized + for this kind of access. + Any users not explicitly authorized through the <codeph>internal_principals_whitelist</codeph> + configuration setting are blocked from accessing the APIs. This setting applies to all the + Impala-related daemons, although currently it is primarily used for HDFS to control the + behavior of the catalog server. + </p> + </li> + <li> + <p rev="CDH-37009 CDH-30378"> + Improvements to Impala integration and usability for Hue. (The code changes + are actually on the Hue side.) + </p> + <ul> + <li> + <p rev="CDH-37009"> + The list of tables now refreshes dynamically. + </p> + </li> + </ul> + </li> + <li> + <p rev="IMPALA-1787"> + Usability improvements for case-insensitive queries. + You can now use the operators <codeph>ILIKE</codeph> and <codeph>IREGEXP</codeph> + to perform case-insensitive wildcard matches or regular expression matches, + rather than explicitly converting column values with <codeph>UPPER</codeph> + or <codeph>LOWER</codeph>. + <ph audience="PDF">See <xref href="impala_operators.xml#ilike"/> and <xref href="impala_operators.xml#iregexp"/> for details.</ph> + </p> + </li> + <li> + <p rev="IMPALA-1480"> + Performance and reliability improvements for DDL and insert operations on partitioned tables with a large + number of partitions. 
Impala only re-evaluates metadata for partitions that are affected by + a DDL operation, not all partitions in the table. While a DDL or insert statement is in progress, + other Impala statements that attempt to modify metadata for the same table wait until the first one + finishes. + </p> + </li> + <li> + <p rev="IMPALA-2867"> + Reliability improvements for the <codeph>LOAD DATA</codeph> statement. + Previously, this statement would fail if the source HDFS directory + contained any subdirectories at all. Now, the statement ignores + any hidden subdirectories, for example <filepath>_impala_insert_staging</filepath>. + </p> + </li> + <li> + <p rev="IMPALA-2147"> + A new operator, <codeph>IS [NOT] DISTINCT FROM</codeph>, lets you compare values + and always get a <codeph>true</codeph> or <codeph>false</codeph> result, + even if one or both of the values are <codeph>NULL</codeph>. + The <codeph>IS NOT DISTINCT FROM</codeph> operator, or its equivalent + <codeph>&lt;=&gt;</codeph> notation, improves the efficiency of join queries that + treat key values that are <codeph>NULL</codeph> in both tables as equal. + <ph audience="PDF">See <xref href="impala_operators.xml#is_distinct_from"/> for details.</ph> + </p> + </li> + <li> + <p rev="IMPALA-1934"> + Security enhancements for the <cmdname>impala-shell</cmdname> command. + A new option, <codeph>--ldap_password_cmd</codeph>, lets you specify + a command to retrieve the LDAP password. The resulting password is + then used to authenticate the <cmdname>impala-shell</cmdname> command + with the LDAP server. + <ph audience="PDF">See <xref href="impala_shell_options.xml"/> for details.</ph> + </p> + </li> + <li> + <p> + The <codeph>CREATE TABLE AS SELECT</codeph> statement now accepts a + <codeph>PARTITIONED BY</codeph> clause, which lets you create a + partitioned table and insert data into it with a single statement.
+ <ph audience="PDF">See <xref href="impala_create_table.xml#create_table"/> for details.</ph> + </p> + </li> + <li> + <p rev="IMPALA-1748 CDH-38369"> + User-defined functions (UDFs and UDAFs) written in C++ now persist automatically + when the <cmdname>catalogd</cmdname> daemon is restarted. You no longer + have to run the <codeph>CREATE FUNCTION</codeph> statements again after a restart. + </p> + </li> + <li> + <p rev="IMPALA-2843 CDH-39148"> + User-defined functions (UDFs) written in Java can now persist + when the <cmdname>catalogd</cmdname> daemon is restarted, and can be shared + transparently between Impala and Hive. You must do a one-time operation to recreate these + UDFs using new <codeph>CREATE FUNCTION</codeph> syntax, without a signature for arguments + or the return value. Afterwards, you no longer have to run the <codeph>CREATE FUNCTION</codeph> + statements again after a restart. + Although Impala does not have visibility into the UDFs that implement the + Hive built-in functions, user-created Hive UDFs are now automatically available + for calling through Impala. + <ph audience="PDF">See <xref href="impala_create_function.xml#create_function"/> for details.</ph> + </p> + </li> + <li> + <!-- Listed as fixed in 2.6.0. Is this item inappropriate or did it actually come from a different JIRA? --> + <p rev="IMPALA-2728"> + Reliability enhancements for memory management. Some aggregation and join queries + that formerly might have failed with an out-of-memory error due to memory contention, + now can succeed using the spill-to-disk mechanism. + </p> + </li> + <li> + <!-- Same blurb is under Incompatible Changes. Turn into a conref. --> + <p rev="IMPALA-2070"> + The <codeph>SHOW DATABASES</codeph> statement now returns two columns rather than one. + The second column includes the associated comment string, if any, for each database. 
+ Adjust any application code that examines the list of databases and assumes the + result set contains only a single column. + <ph audience="PDF">See <xref href="impala_show.xml#show_databases"/> for details.</ph> + </p> + </li> + <li> + <p rev="IMPALA-2499"> + A new optimization speeds up aggregation operations that involve only the partition key + columns of partitioned tables. For example, a query such as <codeph>SELECT COUNT(DISTINCT k), MIN(k), MAX(k) FROM t1</codeph> + can avoid reading any data files if <codeph>T1</codeph> is a partitioned table and <codeph>K</codeph> + is one of the partition key columns. Because this technique can produce different results in cases + where HDFS files in a partition are manually deleted or are empty, you must enable the optimization + by setting the query option <codeph>OPTIMIZE_PARTITION_KEY_SCANS</codeph>. + <ph audience="PDF">See <xref href="impala_optimize_partition_key_scans.xml"/> for details.</ph> + </p> + </li> + <li audience="Cloudera"><!-- All the other undocumented query options are not really new features for this release, so hiding this whole bullet. --> + <p> + Other new query options: + </p> + <ul> + <li audience="Cloudera"><!-- Actually from a long way back, just never documented. Not sure if appropriate to keep internal-only or expose. --> + <codeph>DISABLE_OUTERMOST_TOPN</codeph> + </li> + <li audience="Cloudera"><!-- Actually from a long way back, just never documented. Not sure if appropriate to keep internal-only or expose. --> + <codeph>RM_INITIAL_MEM</codeph> + </li> + <li audience="Cloudera"><!-- Seems to be related to writing sequence files, a capability not externalized at this time. --> + <codeph>SEQ_COMPRESSION_MODE</codeph> + </li> + <li audience="Cloudera"><!-- Actually, was only used for working around one JIRA. Being deprecated now in Impala 2.3 via IMPALA-2963. 
--> + <codeph>DISABLE_CACHED_READS</codeph> + </li> + </ul> + </li> + <li> + <p rev="IMPALA-2196"> + The <codeph>DESCRIBE</codeph> statement can now display metadata about a database, using the + syntax <codeph>DESCRIBE DATABASE <varname>db_name</varname></codeph>. + <ph audience="PDF">See <xref href="impala_describe.xml#describe"/> for details.</ph> + </p> + </li> + <li> + <p rev="IMPALA-1477"> + The <codeph>uuid()</codeph> built-in function generates an + alphanumeric value that you can use as a guaranteed unique identifier. + The uniqueness applies even across tables, for cases where an ascending + numeric sequence is not suitable. + <ph audience="PDF">See <xref href="impala_misc_functions.xml#misc_functions"/> for details.</ph> + </p> + </li> + </ul> + + </conbody> + </concept> + +<!-- All 2.4.x new features go under here --> + + <concept rev="2.4.0" id="new_features_240"> + + <title>New Features in Impala 2.4.x / CDH 5.6.x</title> + + <conbody> + + <note conref="../shared/impala_common.xml#common/only_cdh5_240" /> + + <ul> + <li> + <p> + Impala can be used on the DSSD D5 Storage Appliance. + From a user perspective, the Impala features are the same as in CDH 5.5 / Impala 2.3. + </p> + </li> + </ul> + + </conbody> + </concept> + +<!-- All 2.3.x subsections go under here --> + +<!-- Actually for 2.3 / 5.5, let's get away from doing a separate subhead for each maintenance release, + because in the normal course of events there will be nothing to add here until 5.6. If something new + needs to get noted, just add a new bullet with wording to indicate which 5.5.x release it applies to. --> + + <concept rev="2.3.0" id="new_features_230"> + + <title>New Features in Impala 2.3.x / CDH 5.5.x</title> + + <conbody> + + <note conref="../shared/impala_common.xml#common/only_cdh5_23x" /> + + <p> + The following are the major new features in Impala 2.3.x. 
This major release, available as part of CDH + 5.5.x, contains improvements to SQL syntax (particularly new support for complex types), performance, + manageability, and security. + </p> + + <ul> + + <li> + <p> + Complex data types: <codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, and <codeph>MAP</codeph>. These + types can encode multiple named fields, positional items, or key-value pairs within a single column. + You can combine these types to produce nested types with arbitrarily deep nesting, + such as an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> values, + a <codeph>MAP</codeph> where each key-value pair is an <codeph>ARRAY</codeph> of other <codeph>MAP</codeph> values, + and so on. Currently, complex data types are only supported for the Parquet file format. + <ph audience="PDF">See <xref href="impala_complex_types.xml#complex_types"/> for usage details and <xref href="impala_array.xml#array"/>, <xref href="impala_struct.xml#struct"/>, and <xref href="impala_map.xml#map"/> for syntax.</ph> + </p> + </li> + + <li rev="collevelauth"> + <p> + Column-level authorization lets you define access to particular columns within a table, + rather than the entire table. This feature lets you reduce the reliance on creating views to + set up authorization schemes for subsets of information. + <ph audience="integrated">See <xref href="sg_hive_sql.xml#concept_c2q_4qx_p4/col_level_auth_sentry"/> for background details, and <xref href="impala_grant.xml#grant"/> and <xref href="impala_revoke.xml#revoke"/> for Impala-specific syntax.</ph> + </p> + </li> + + <li rev="IMPALA-1139"> + <p> + The <codeph>TRUNCATE TABLE</codeph> statement removes all the data from a table without removing the table itself. + <ph audience="PDF">See <xref href="impala_truncate_table.xml#truncate_table"/> for details.</ph> + </p> + </li> + + <li id="IMPALA-2015"> + <p> + Nested loop join queries.
Some join queries that formerly required equality comparisons can now use + operators such as <codeph>&lt;</codeph> or <codeph>&gt;=</codeph>. This same join mechanism is used + internally to optimize queries that retrieve values from complex type columns. + <ph audience="PDF">See <xref href="impala_joins.xml#joins"/> for details about Impala join queries.</ph> + </p> + </li> + + <li id="CDH-28141"> + <p> + Reduced memory usage and improved performance and robustness for spill-to-disk feature. + <ph audience="PDF">See <xref href="impala_scalability.xml#spill_to_disk"/> for details about this feature.</ph> + </p> + </li> + + <li rev="IMPALA-1881 CDH-34620"> + <p> + Performance improvements for querying Parquet data files containing multiple row groups + and multiple data blocks: + </p> + <ul> + <li> + <p> For files written by Hive, SparkSQL, and other Parquet MR writers + and spanning multiple HDFS blocks, Impala now scans the extra + data blocks locally when possible, rather than using remote + reads. </p> + </li> + <li> + <p> + Impala queries benefit from the improved alignment of row groups with HDFS blocks for Parquet + files written by Hive, MapReduce, and other components in <ph rev="upstream">CDH 5.5</ph> and higher. (Impala itself never writes + multiblock Parquet files, so the alignment change does not apply to Parquet files produced by Impala.) + These Parquet writers now add padding to Parquet files that they write to align row groups with HDFS blocks. + The <codeph>parquet.writer.max-padding</codeph> setting specifies the maximum number of bytes, by default + 8 megabytes, that can be added to the file between row groups to fill the gap at the end of one block + so that the next row group starts at the beginning of the next block. + If the gap is larger than this size, the writer attempts to fit another entire row group in the remaining space.
+ Include this setting in the <filepath>hive-site</filepath> configuration file to influence Parquet files written by Hive, + or the <filepath>hdfs-site</filepath> configuration file to influence Parquet files written by all non-Impala components. + </p> + </li> + </ul> + <p audience="PDF"> + See <xref href="impala_parquet.xml#parquet"/> for instructions about using Parquet data files + with Impala, and + <xref audience="integrated" href="cdh_ig_parquet.xml#parquet_format"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_parquet.html" scope="external" format="html"/> + for instructions for + other components that can read and write Parquet files. + </p> + </li> + + <li id="IMPALA-1660"> + <p> + Many new built-in scalar functions, for convenience and enhanced portability of SQL that uses common industry extensions. + </p> + + <p rev="IMPALA-1771"> + Math functions<ph audience="PDF"> (see <xref href="impala_math_functions.xml#math_functions"/> for details)</ph>: + </p> + <ul> + <li> + <codeph>ATAN2</codeph> + </li> + + <li> + <codeph>COSH</codeph> + </li> + + <li> + <codeph>COT</codeph> + </li> + + <li> + <codeph>DCEIL</codeph> + </li> + + <li> + <codeph>DEXP</codeph> + </li> + + <li> + <codeph>DFLOOR</codeph> + </li> + + <li> + <codeph>DLOG10</codeph> + </li> + + <li> + <codeph>DPOW</codeph> + </li> + + <li> + <codeph>DROUND</codeph> + </li> + + <li> + <codeph>DSQRT</codeph> + </li> + + <li> + <codeph>DTRUNC</codeph> + </li> + + <li> + <codeph>FACTORIAL</codeph>, and corresponding <codeph>!</codeph> operator + </li> + + <li> + <codeph>FPOW</codeph> + </li> + + <li> + <codeph>RADIANS</codeph> + </li> + + <li> + <codeph>RANDOM</codeph> + </li> + + <li> + <codeph>SINH</codeph> + </li> + + <li> + <codeph>TANH</codeph> + </li> + </ul> + + <p> + String functions<ph audience="PDF"> (see <xref href="impala_string_functions.xml#string_functions"/> for details)</ph>: + </p> + <ul> + <li> + <codeph>BTRIM</codeph> + </li> 
+ <li> + <codeph>CHR</codeph> + </li> + <li> + <codeph>REGEXP_LIKE</codeph> + </li> + <li> + <codeph>SPLIT_PART</codeph> + </li> + </ul> + + <p> + Date and time functions<ph audience="PDF"> (see <xref href="impala_datetime_functions.xml#datetime_functions"/> for details)</ph>: + </p> + <ul> + <li> + <codeph>INT_MONTHS_BETWEEN</codeph> + </li> + <li> + <codeph>MONTHS_BETWEEN</codeph> + </li> + <li> + <codeph>TIMEOFDAY</codeph> + </li> + <li> + <codeph>TIMESTAMP_CMP</codeph> + </li> + </ul> + + <p> + Bit manipulation functions<ph audience="PDF"> (see <xref href="impala_bit_functions.xml#bit_functions"/> for details)</ph>: + </p> + <ul> + <li> + <codeph>BITAND</codeph> + </li> + + <li> + <codeph>BITNOT</codeph> + </li> + + <li> + <codeph>BITOR</codeph> + </li> + + <li> + <codeph>BITXOR</codeph> + </li> + + <li> + <codeph>COUNTSET</codeph> + </li> + + <li> + <codeph>GETBIT</codeph> + </li> + + <li> + <codeph>ROTATELEFT</codeph> + </li> + + <li> + <codeph>ROTATERIGHT</codeph> + </li> + + <li> + <codeph>SETBIT</codeph> + </li> + + <li> + <codeph>SHIFTLEFT</codeph> + </li> + + <li> + <codeph>SHIFTRIGHT</codeph> + </li> + </ul> + <p> + Type conversion functions<ph audience="PDF"> (see <xref href="impala_conversion_functions.xml#conversion_functions"/> for details)</ph>: + </p> + <ul> + <li> + <codeph>TYPEOF</codeph> + </li> + </ul> + <p> + The <codeph>effective_user()</codeph> function<ph audience="PDF"> (see <xref href="impala_misc_functions.xml#misc_functions"/> for details)</ph>. + </p> + </li> + + <li id="IMPALA-2081"> + <p> + New built-in analytic functions: <codeph>PERCENT_RANK</codeph>, <codeph>NTILE</codeph>, + <codeph>CUME_DIST</codeph>. + <ph audience="PDF">See <xref href="impala_analytic_functions.xml#analytic_functions"/> for details.</ph> + </p> + </li> + + <li id="IMPALA-595"> + <p> + The <codeph>DROP DATABASE</codeph> statement now works for a non-empty database. 
+ When you specify the optional <codeph>CASCADE</codeph> clause, any tables in the + database are dropped before the database itself is removed. + <ph audience="PDF">See <xref href="impala_drop_database.xml#drop_database"/> for details.</ph> + </p> + </li> + + <li> + <p> + The <codeph>DROP TABLE</codeph> and <codeph>ALTER TABLE DROP PARTITION</codeph> statements have a new optional keyword, <codeph>PURGE</codeph>. + This keyword causes Impala to immediately remove the relevant HDFS data files rather than sending them to the HDFS trashcan. + This feature can help to avoid out-of-space errors on storage devices, and to avoid files being left behind in case of + a problem with the HDFS trashcan, such as the trashcan not being configured or being in a different HDFS encryption zone + than the data files. + <ph audience="PDF">See <xref href="impala_drop_table.xml#drop_table"/> and <xref href="impala_alter_table.xml#alter_table"/> for syntax.</ph> + </p> + </li> + + <li id="IMPALA-80"> + <p> + The <cmdname>impala-shell</cmdname> command has a new feature for live progress reporting. This feature + is enabled through the <codeph>--live_progress</codeph> and <codeph>--live_summary</codeph> + command-line options, or during a session through the <codeph>LIVE_SUMMARY</codeph> and + <codeph>LIVE_PROGRESS</codeph> query options. + <ph audience="PDF">See <xref href="impala_live_progress.xml#live_progress"/> and <xref href="impala_live_summary.xml#live_summary"/> for details.</ph> + </p> + </li> + + <li> + <p> + The <cmdname>impala-shell</cmdname> command also now displays a random <q>tip of the day</q> when it starts. + </p> + </li> + + <li id="IMPALA-1413"> + <p> + The <cmdname>impala-shell</cmdname> option <codeph>-f</codeph> now recognizes a special filename + <codeph>-</codeph> to accept input from stdin. 
+ <ph audience="PDF">See <xref href="impala_shell_options.xml#shell_options"/> for details about the options for running <cmdname>impala-shell</cmdname> in non-interactive mode.</ph> + </p> + </li> + + <li id="IMPALA-1963"> + <p> + Format strings for the <codeph>unix_timestamp()</codeph> function can now include numeric timezone offsets. + <ph audience="PDF">See <xref href="impala_datetime_functions.xml#datetime_functions"/> for details.</ph> + </p> + </li> + + <li id="CDH-27547"> + <p> + Impala can now run a specified command to obtain the password to decrypt a private-key PEM file, + rather than having the private-key file be unencrypted on disk. + <ph audience="PDF">See <xref href="impala_ssl.xml#ssl"/> for details.</ph> + </p> + </li> + + <li id="IMPALA-859"> + <p> + Impala components now can use SSL for more of their internal communication. SSL is used for + communication between all three Impala-related daemons when the configuration option + <codeph>ssl_server_certificate</codeph> is enabled. SSL is used for communication with client + applications when the configuration option <codeph>ssl_client_ca_certificate</codeph> is enabled. + <ph audience="PDF">See <xref href="impala_ssl.xml#ssl"/> for details.</ph> + </p> + <p> + Currently, you can only use one of server-to-server TLS/SSL encryption or Kerberos authentication. + This limitation is tracked by the issue + <xref href="https://issues.cloudera.org/browse/IMPALA-2598" scope="external" format="html">IMPALA-2598</xref>. + </p> + </li> + + <li id="IMPALA-1829"> + <p> + Improved flexibility for intermediate data types in user-defined aggregate functions (UDAFs). 
+ <ph audience="PDF">See <xref href="impala_udf.xml#udafs"/> for details.</ph> + </p> + </li> + + </ul> + + <p> + In CDH 5.5.2 / Impala 2.3.2, the bug fix for <xref href="https://issues.cloudera.org/browse/IMPALA-2598" scope="external" format="html">IMPALA-2598</xref> + removes the restriction on using both Kerberos and SSL for internal communication between Impala components. + </p> + +<!-- End of new feature list for 2.3 / 5.5. --> + + </conbody> + + </concept> + +<!-- All 2.2.x subsections go under here --> + +<!-- Removing all the 5.4.x release subtopics for which there wasn't anything to say. + Same convention as used in 5.5.x, 5.6.x, 5.7.x, 5.8.x. Only have one subtopic for + the .0. + <concept rev="5.4.10" id="new_features_2210"> + + <title>New Features in Impala 2.2.10 / CDH 5.4.10</title> + + <conbody> + + <p> + No new features. This point release is exclusively a bug fix release. + </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_22x" /> + + </conbody> + + </concept> + + <concept rev="5.4.9" id="new_features_229"> + + <title>New Features in Impala 2.2.9 / CDH 5.4.9</title> + + <conbody> + + <p> + No new features. This point release is exclusively a bug fix release. + </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_22x" /> + + </conbody> + + </concept> + + <concept rev="5.4.8" id="new_features_228"> + + <title>New Features in Impala 2.2.8 / CDH 5.4.8</title> + + <conbody> + + <p> + No new features. This point release is exclusively a bug fix release. + </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_22x" /> + + </conbody> + + </concept> + + <concept rev="5.4.7" id="new_features_227"> + + <title>New Features in Impala 2.2.7 / CDH 5.4.7</title> + + <conbody> + + <p> + No new features. This point release is exclusively a bug fix release. 
+ </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_22x" /> + + </conbody> + + </concept> + + <concept audience="Cloudera" rev="5.4.6" id="new_features_226"> + + <title>New Features in Impala 2.2.6 / CDH 5.4.6</title> + + <conbody> + + <p> + No new features. This point release is exclusively a bug fix release. + </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_22x" /> + + </conbody> + + </concept> + + <concept rev="5.4.5" id="new_features_225"> + + <title>New Features in Impala 2.2.x for CDH 5.4.5</title> + + <conbody> + + <p> + No new features. This point release is exclusively a bug fix release. + </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_22x" /> + + </conbody> + + </concept> +--> + + <concept rev="5.4.3" id="new_features_223"> + + <title>New Features in Impala 2.2.x for CDH 5.4.3 and 5.4.4</title> + + <conbody> + + <p> + No new features added to the Impala code. The certification of Impala with EMC Isilon under CDH 5.4.4 means + that now you can query data stored on Isilon storage devices through Impala. See + <xref audience="integrated" href="cm_mc_isilon_service.xml"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_isilon_service.html" scope="external" format="html"/> + for details. The same level of Impala is included with both CDH + 5.4.3 and 5.4.4. +<!-- This point release is exclusively a bug fix release. --> + </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_22x" /> + + </conbody> + + </concept> + +<!-- I let the 5.4.3/5.4.3 subtopic above remain in existence, but now back to hiding specific 5.4.x subtopics + after the .0 one that has the actual new features. + <concept audience="Cloudera" rev="5.4.2" id="new_features_222"> + + <title>New Features in Impala 2.2.x for CDH 5.4.2</title> + + <conbody> + + <p> + No new features. This point release is exclusively a bug fix release. 
+ </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_22x" /> + + </conbody> + + </concept> + + <concept rev="5.4.x" id="new_features_54x"> + + <title>New Features in Impala for CDH 5.4.x</title> + + <conbody> + + <p> + See <xref href="impala_new_features.xml#new_features_220"/> for the most recent set of new Impala features. + CDH maintenance releases such as 5.4.1, 5.4.2, and so on are exclusively bug fix releases, + therefore there are no new features for the 5.4.x series after 5.4.0. + </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_22x" /> + + </conbody> + + </concept> +--> + + <concept rev="2.2.0" id="new_features_220"> + + <title>New Features in Impala 2.2.x / CDH 5.4.x</title> + + <conbody> + + <note conref="../shared/impala_common.xml#common/only_cdh5_220" /> + + <p> + The following are the major new features in Impala 2.2.x. This release, available as part of CDH + 5.4.x, contains improvements to performance, manageability, security, and SQL syntax. + </p> + + <ul> + <li> + <p> + Several improvements to date and time features enable higher interoperability with Hive and other + database systems, provide more flexibility for handling time zones, and future-proof the handling of + <codeph>TIMESTAMP</codeph> values: + </p> + <ul> + <li> + <p> + The <codeph>WITH REPLICATION</codeph> clause for the <codeph>CREATE TABLE</codeph> and + <codeph>ALTER TABLE</codeph> statements lets you control the replication factor for + HDFS caching for a specific table or partition. By default, each cached block is + only present on a single host, which can lead to CPU contention if the same host + processes each cached block. Increasing the replication factor lets Impala choose + different hosts to process different cached blocks, to better distribute the CPU load. 
+ </p> + </li> + <li> + <p> + Startup flags for the <cmdname>impalad</cmdname> daemon enable a higher level of compatibility with + <codeph>TIMESTAMP</codeph> values written by Hive, and more flexibility for working with date and + time data using the local time zone instead of UTC. To enable these features, set the + <cmdname>impalad</cmdname> startup flags + <codeph>-use_local_tz_for_unix_timestamp_conversions=true</codeph> and + <codeph>-convert_legacy_hive_parquet_utc_timestamps=true</codeph>. + </p> + + <p> + The <codeph>-use_local_tz_for_unix_timestamp_conversions</codeph> setting controls how the + <codeph>unix_timestamp()</codeph>, <codeph>from_unixtime()</codeph>, and <codeph>now()</codeph> + functions handle time zones. By default (when this setting is turned off), Impala considers all + <codeph>TIMESTAMP</codeph> values to be in the UTC time zone when converting to or from Unix time + values. When this setting is enabled, Impala treats <codeph>TIMESTAMP</codeph> values passed to or + returned from these functions to be in the local time zone. When this setting is enabled, take + particular care that all hosts in the cluster have the same timezone settings, to avoid + inconsistent results depending on which host reads or writes <codeph>TIMESTAMP</codeph> data. + </p> + + <p> + The <codeph>-convert_legacy_hive_parquet_utc_timestamps</codeph> setting causes Impala to convert + <codeph>TIMESTAMP</codeph> values to the local time zone when it reads them from Parquet files + written by Hive. This setting only applies to data using the Parquet file format, where Impala can + use metadata in the files to reliably determine that the files were written by Hive. If in the + future Hive changes the way it writes <codeph>TIMESTAMP</codeph> data in Parquet, Impala will + automatically handle that new <codeph>TIMESTAMP</codeph> encoding. 
+ </p> + + <p> + See <xref href="impala_timestamp.xml#timestamp"/> for details about time zone handling and the + configuration options for Impala / Hive compatibility with Parquet format. + </p> + </li> + + <li> + <p conref="../shared/impala_common.xml#common/y2k38" /> + + <p> + See <xref href="impala_datetime_functions.xml#datetime_functions"/> for the current function + signatures. + </p> + </li> + </ul> + </li> + + <li> + <p> + The <codeph>SHOW FILES</codeph> statement lets you view the names and sizes of the files that make up + an entire table or a specific partition. See <xref href="impala_show.xml#show_files"/> for details. + </p> + </li> + + <li> + <p> + Impala can now run queries against Parquet data containing columns with complex or nested types, as + long as the query only refers to columns with scalar types. + </p> + </li> + + <li> + <p> + Performance improvements for queries that include <codeph>IN()</codeph> operators and involve + partitioned tables. + </p> + </li> + + <li> +<!-- Same text for this item in impala_fixed_issues.xml. Could turn into a conref. --> + <p> + The new <codeph>-max_log_files</codeph> configuration option specifies how many log files to keep at + each severity level. The default value is 10, meaning that Impala preserves the latest 10 log files for + each severity level (<codeph>INFO</codeph>, <codeph>WARNING</codeph>, and <codeph>ERROR</codeph>) for + each Impala-related daemon (<cmdname>impalad</cmdname>, <cmdname>statestored</cmdname>, and + <cmdname>catalogd</cmdname>). Impala checks to see if any old logs need to be removed based on the + interval specified in the <codeph>logbufsecs</codeph> setting, every 5 seconds by default. See + <xref href="impala_logging.xml#logs_rotate"/> for details. + </p> + </li> + + <li> + <p> + Redaction of sensitive data from Impala log files. 
This feature protects details such as credit card + numbers or tax IDs from administrators who see the text of SQL statements in the course of monitoring + and troubleshooting a Hadoop cluster. See <xref href="impala_logging.xml#redaction"/> for background + information for Impala users, and + <xref audience="integrated" href="sg_redaction.xml#log_redact"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/sg_redaction.html" scope="external" format="html">the CDH 5 Security Guide</xref> + for usage details. + </p> + </li> + + <li> + <p> + Lineage information is available for data created or queried by Impala. This feature lets you track who + has accessed data through Impala SQL statements, down to the level of specific columns, and how data + has been propagated between tables. See <xref href="impala_lineage.xml#lineage"/> for background + information for Impala users, + <xref audience="integrated" href="datamgmt_impala_lineage_log.xml"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/datamgmt_impala_lineage_log.html" scope="external" format="html"/> + for usage details, and + <xref audience="integrated" href="cn_iu_lineage.xml" /><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cn_iu_lineage.html" scope="external" format="html"/> + for how to interpret the lineage + information. + </p> + </li> + + <li> + <p> + Impala tables and partitions can now be located on the Amazon Simple Storage Service (S3) filesystem, + for convenience in cases where data is already located in S3 and you prefer to query it in-place. + Queries might have lower performance than when the data files reside on HDFS, because Impala uses some + HDFS-specific optimizations. Impala can query data in S3, but cannot write to S3.
Therefore, statements + such as <codeph>INSERT</codeph> and <codeph>LOAD DATA</codeph> are not available when the destination + table or partition is in S3. See <xref href="impala_s3.xml#s3"/> for details. + </p> + + <note conref="../shared/impala_common.xml#common/s3_caveat" /> + </li> + + <li> + <!-- Only want the link out of the release notes to appear for HTML + (N.B. audience="PDF" means hide from PDF), and only in the HTML for the + integrated build where the topic is available for link resolution. --> + <p> + Improved support for HDFS encryption. The <codeph>LOAD DATA</codeph> statement now works when the + source directory and destination table are in different encryption zones. <ph audience="integrated"><ph audience="PDF">See + <xref href="cdh_sg_component_kms.xml#impala_encryption"/> for details about using HDFS encryption with + Impala.</ph></ph> + </p> + </li> + + <li> + <p> + Additional arithmetic function <codeph>mod()</codeph>. See + <xref href="impala_math_functions.xml#math_functions"/> for details. + </p> + </li> + + <li> + <p> + Flexibility to interpret <codeph>TIMESTAMP</codeph> values using the UTC time zone (the traditional + Impala behavior) or using the local time zone (for compatibility with <codeph>TIMESTAMP</codeph> values + produced by Hive). + </p> + </li> + + <li> + <p> + Enhanced support for ETL using tools such as Flume. Impala ignores temporary files typically produced + by these tools (filenames with suffixes <codeph>.copying</codeph> and <codeph>.tmp</codeph>). + </p> + </li> + + <li> + <p> + The CPU requirement for Impala, which had become more restrictive in Impala 2.0.x and 2.1.x, has now + been relaxed. + </p> + + <p conref="../shared/impala_common.xml#common/cpu_prereq" /> + </li> + + <li> + <p> + Enhanced support for <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> types in the <codeph>COMPUTE + STATS</codeph> statement. 
+ </p> + </li> + + <li rev="CDH-26073"> + <p> + The amount of memory required during setup for <q>spill to disk</q> operations is greatly reduced. This + enhancement reduces the chance of a memory-intensive join or aggregation query failing with an + out-of-memory error. + </p> + </li> + + <li> + <p> + Several new conditional functions provide enhanced compatibility when porting code that uses industry + extensions. The new functions are: <codeph>isfalse()</codeph>, <codeph>isnotfalse()</codeph>, + <codeph>isnottrue()</codeph>, <codeph>istrue()</codeph>, <codeph>nonnullvalue()</codeph>, and + <codeph>nullvalue()</codeph>. See <xref href="impala_conditional_functions.xml#conditional_functions"/> + for details. + </p> + </li> + + <li> + <p> + The Impala debug web UI can now display a visual representation of the query plan. On the + <uicontrol>/queries</uicontrol> tab, select <uicontrol>Details</uicontrol> for a particular query. The + <uicontrol>Details</uicontrol> page includes a <uicontrol>Plan</uicontrol> tab with a plan diagram that + you can zoom in or out of by scrolling with the mouse wheel or trackpad. + </p> + </li> + </ul> + +<!-- End of new feature list for 5.4. --> + + </conbody> + + </concept> + +<!-- All 2.1.x subsections go under here --> + + <concept rev="2.1.8" id="new_features_218"> + + <title>New Features in Impala 2.1.8 / CDH 5.3.10</title> + + <conbody> + + <p> + This point release is exclusively a bug fix release. + </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_21x"/> + + </conbody> + + </concept> + + <concept rev="2.1.7" id="new_features_217"> + + <title>New Features in Impala 2.1.7 / CDH 5.3.9</title> + + <conbody> + + <p> + This point release is exclusively a bug fix release.
+ </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_21x"/> + + </conbody> + + </concept> + + <concept rev="2.1.6" id="new_features_216"> + + <title>New Features in Impala 2.1.6 / CDH 5.3.8</title> + + <conbody> + + <p> + This point release is exclusively a bug fix release. + </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_21x"/> + + </conbody> + + </concept> + + <concept rev="2.1.5" id="new_features_215"> + + <title>New Features in Impala 2.1.5 / CDH 5.3.6</title> + + <conbody> + + <p> + This point release is exclusively a bug fix release. + </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_21x"/> + + </conbody> + + </concept> + + <concept rev="2.1.4" id="new_features_214"> + + <title>New Features in Impala 2.1.4 / CDH 5.3.4</title> + + <conbody> + + <p> + No new features. This point release is exclusively a bug fix release. + <ph conref="../shared/impala_common.xml#common/impala_214_redux"/> + </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_21x"/> + + </conbody> + + </concept> + + <concept rev="2.1.3" id="new_features_213"> + + <title>New Features in Impala 2.1.3 / CDH 5.3.3</title> + + <conbody> + + <p> + No new features. This point release is exclusively a bug fix release. + </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_213" /> + + </conbody> + + </concept> + + <concept rev="2.1.2" id="new_features_212"> + + <title>New Features in Impala 2.1.2 / CDH 5.3.2</title> + + <conbody> + + <p> + No new features. This point release is exclusively a bug fix release. + </p> + + <note conref="../shared/impala_common.xml#common/only_cdh5_212" /> + + </conbody> + + </concept> + + <concept rev="2.1.1" id="new_features_211"> + + <title>New Features in Impala 2.1.1 / CDH 5.3.1</title> + + <conbody> + + <p> + No new features. This point release is exclusively a bug fix release. 
+ </p> + + </conbody> + + </concept> + + <concept rev="2.1.0" id="new_features_210"> + + <title>New Features in Impala 2.1.0 / CDH 5.3.0</title> + + <conbody> + + <p> + This release contains the following enhancements to query performance and system scalability: + </p> + + <ul> + <li> + <p> + Impala can now collect statistics for individual partitions in a partitioned table, rather than + processing the entire table for each <codeph>COMPUTE STATS</codeph> statement. This feature is known as + incremental statistics, and is controlled by the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax. + (You can still use the original <codeph>COMPUTE STATS</codeph> statement for nonpartitioned tables or + partitioned tables that are unchanging or whose contents are replaced all at once.) See + <xref href="impala_compute_stats.xml#compute_stats"/> and + <xref href="impala_perf_stats.xml#perf_stats"/> for details. + </p> + </li> + + <li> + <p> + Optimization for small queries lets Impala run queries that process very few rows without the + unnecessary overhead of parallelizing and generating native code. Reducing this overhead lets Impala + clear small queries quickly, keeping YARN resources and admission control slots available for + data-intensive queries. The number of rows that qualifies a query as <q>small</q> is controlled by the + <codeph>EXEC_SINGLE_NODE_ROWS_THRESHOLD</codeph> query option. See + <xref href="impala_exec_single_node_rows_threshold.xml#exec_single_node_rows_threshold"/> for details. + </p> + </li> + + <li> + <p> + An enhancement to the statestore component lets it transmit heartbeat information independently of + broadcasting metadata updates. This optimization improves reliability of health checking on large + clusters with many tables and partitions. + </p> + </li> + + <li> + <p> + The memory requirement for querying gzip-compressed text is reduced.
Now Impala decompresses the data + as it is read, rather than reading the entire gzipped file and decompressing it in memory. + </p> + </li> + </ul> + + </conbody> + + </concept> + +<!-- All 2.0.x subsections go under here --> + + <concept rev="2.0.5" id="new_features_205"> + + <title>New Features in Impala 2.0.5 / CDH 5.2.6</title> + + <conbody> + + <p> + No new features. This point release is exclusively a bug fix release. + </p> + + <note conref=
<TRUNCATED>
