Repository: incubator-impala

Updated Branches:
  refs/heads/doc_prototype 0a7372454 -> 0124ae32f
Upgrade to latest version of impala_common.xml. Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/0124ae32 Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/0124ae32 Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/0124ae32 Branch: refs/heads/doc_prototype Commit: 0124ae32fe6a402252bf5a90fb3ce88100a4495a Parents: 0a73724 Author: John Russell <[email protected]> Authored: Mon Oct 31 12:24:39 2016 -0700 Committer: John Russell <[email protected]> Committed: Mon Oct 31 12:24:39 2016 -0700 ---------------------------------------------------------------------- docs/shared/impala_common.xml | 955 +++++++++++++++++++++++++++++++++---- 1 file changed, 854 insertions(+), 101 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/0124ae32/docs/shared/impala_common.xml ---------------------------------------------------------------------- diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml index 37ebc34..f281318 100644 --- a/docs/shared/impala_common.xml +++ b/docs/shared/impala_common.xml @@ -1,5 +1,6 @@ -<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> -<concept xmlns:ditaarch="http://dita.oasis-open.org/architecture/2005/" id="common" ditaarch:DITAArchVersion="1.2" domains="(topic concept) (topic hi-d) (topic ut-d) (topic indexing-d) (topic hazard-d) (topic abbrev-d) (topic pr-d) (topic sw-d) (topic ui-d) " xml:lang="en-US"> +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="common"> <title>Reusable Text, Paragraphs, List Items, and Other Elements for Impala</title> @@ -17,6 +18,75 @@ '#common/id_within_the_file', rather than a 3-part reference with an intervening, variable concept ID. </p> + <section id="concepts"> + + <title>Conceptual Content</title> + + <p> + Overview and conceptual information for Impala as a whole. + </p> + + <!-- Reconcile the 'advantages' and 'benefits' elements; be mindful of where each is used. --> + + <p id="impala_advantages"> + The following are some of the key advantages of Impala: + + <ul> + <li> + Impala integrates with the existing CDH ecosystem, meaning data can be stored, shared, and accessed using + the various solutions included with CDH. This also avoids data silos and minimizes expensive data movement. + </li> + + <li> + Impala provides access to data stored in CDH without requiring the Java skills required for MapReduce jobs. + Impala can access data directly from the HDFS file system. Impala also provides a SQL front-end to access + data in the HBase database system, <ph rev="2.2.0">or in the Amazon Simple Storage System (S3)</ph>. + </li> + + <li> + Impala returns results typically within seconds or a few minutes, rather than the many minutes or hours + that are often required for Hive queries to complete. + </li> + + <li> + Impala is pioneering the use of the Parquet file format, a columnar storage layout that is optimized for + large-scale queries typical in data warehouse scenarios. + </li> + </ul> + </p> + + <p id="impala_benefits"> + Impala provides: + + <ul> + <li> + Familiar SQL interface that data scientists and analysts already know. + </li> + + <li> + Ability to query high volumes of data (<q>big data</q>) in Apache Hadoop. 
+ </li> + + <li> + Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective + commodity hardware. + </li> + + <li> + Ability to share data files between different components with no copy or export/import step; for example, + to write with Pig, transform with Hive and query with Impala. Impala can read from and write to Hive + tables, enabling simple data interchange using Impala for analytics on Hive-produced data. + </li> + + <li> + Single system for big data processing and analytics, so customers can avoid costly modeling and ETL just + for analytics. + </li> + </ul> + </p> + + </section> + <section id="sentry"> <title>Sentry-Related Content</title> @@ -27,6 +97,33 @@ nested topics at the end of this file. </p> + <p rev="IMPALA-2660 CDH-40241" id="auth_to_local_instructions"> + In CDH 5.8 / Impala 2.6 and higher, Impala recognizes the <codeph>auth_to_local</codeph> setting, + specified through the HDFS configuration setting + <codeph>hadoop.security.auth_to_local</codeph> + or the Cloudera Manager setting + <uicontrol>Additional Rules to Map Kerberos Principals to Short Names</uicontrol>. + This feature is disabled by default, to avoid an unexpected change in security-related behavior. + To enable it: + <ul> + <li> + <p> + For clusters not managed by Cloudera Manager, specify <codeph>--load_auth_to_local_rules=true</codeph> + in the <cmdname>impalad</cmdname> and <cmdname>catalogd</cmdname>configuration settings. + </p> + </li> + <li> + <p> + For clusters managed by Cloudera Manager, select the + <uicontrol>Use HDFS Rules to Map Kerberos Principals to Short Names</uicontrol> + checkbox to enable the service-wide <codeph>load_auth_to_local_rules</codeph> configuration setting. + Then restart the Impala service. + </p> + </li> + </ul> + See <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/sg_auth_to_local_isolate.html" scope="external" format="html">Using Auth-to-Local Rules to Isolate Cluster Users</xref> for general information about this feature. + </p> + <note id="authentication_vs_authorization"> Regardless of the authentication mechanism used, Impala always creates HDFS directories and data files owned by the same user (typically <codeph>impala</codeph>). To implement user-level access to different @@ -79,7 +176,9 @@ Search: Solr Server -> Advanced -> HiveServer2 Logging Safety Valve <p> Especially during the transition from CM 4 to CM 5, we'll use some stock phraseology to talk about fields - and such. + and such. Also there are some task steps etc. to conref under the Impala Service page that are easier + to keep track of here instead of in cm_common_elements.xml. (Although as part of Apache work, anything + CM might naturally move out of this file.) </p> <p> @@ -88,6 +187,11 @@ Search: Solr Server -> Advanced -> HiveServer2 Logging Safety Valve Snippet</uicontrol>. 
</ph> </p> + <ul> + <li id="go_impala_service">Go to the Impala service.</li> + <li id="restart_impala_service">Restart the Impala service.</li> + </ul> + </section> <section id="citi"> @@ -207,6 +311,14 @@ select concat('abc','mno','xyz');</codeblock> <title>Background Info for REFRESH, INVALIDATE METADATA, and General Metadata Discussion</title> + <p id="invalidate_then_refresh" rev="DOCS-1013"> + Because <codeph>REFRESH <varname>table_name</varname></codeph> only works for tables that the current + Impala node is already aware of, when you create a new table in the Hive shell, enter + <codeph>INVALIDATE METADATA <varname>new_table</varname></codeph> before you can see the new table in + <cmdname>impala-shell</cmdname>. Once the table is known by Impala, you can issue <codeph>REFRESH + <varname>table_name</varname></codeph> after you add data files for that table. + </p> + <p id="refresh_vs_invalidate"> <codeph>INVALIDATE METADATA</codeph> and <codeph>REFRESH</codeph> are counterparts: <codeph>INVALIDATE METADATA</codeph> waits to reload the metadata when needed for a subsequent query, but reloads all the @@ -242,6 +354,144 @@ select concat('abc','mno','xyz');</codeblock> they are primarily used in new SQL syntax topics underneath that parent topic. </p> +<codeblock id="parquet_fallback_schema_resolution_example"><![CDATA[ +create database schema_evolution; +use schema_evolution; +create table t1 (c1 int, c2 boolean, c3 string, c4 timestamp) + stored as parquet; +insert into t1 values + (1, true, 'yes', now()), + (2, false, 'no', now() + interval 1 day); + +select * from t1; ++----+-------+-----+-------------------------------+ +| c1 | c2 | c3 | c4 | ++----+-------+-----+-------------------------------+ +| 1 | true | yes | 2016-06-28 14:53:26.554369000 | +| 2 | false | no | 2016-06-29 14:53:26.554369000 | ++----+-------+-----+-------------------------------+ + +desc formatted t1; +... +| Location: | /user/hive/warehouse/schema_evolution.db/t1 | +... + +-- Make T2 have the same data file as in T1, including 2 +-- unused columns and column order different than T2 expects. +load data inpath '/user/hive/warehouse/schema_evolution.db/t1' + into table t2; ++----------------------------------------------------------+ +| summary | ++----------------------------------------------------------+ +| Loaded 1 file(s). Total files in destination location: 1 | ++----------------------------------------------------------+ + +-- 'position' is the default setting. +-- Impala cannot read the Parquet file if the column order does not match. +set PARQUET_FALLBACK_SCHEMA_RESOLUTION=position; +PARQUET_FALLBACK_SCHEMA_RESOLUTION set to position + +select * from t2; +WARNINGS: +File 'schema_evolution.db/t2/45331705_data.0.parq' +has an incompatible Parquet schema for column 'schema_evolution.t2.c4'. +Column type: TIMESTAMP, Parquet schema: optional int32 c1 [i:0 d:1 r:0] + +File 'schema_evolution.db/t2/45331705_data.0.parq' +has an incompatible Parquet schema for column 'schema_evolution.t2.c4'. +Column type: TIMESTAMP, Parquet schema: optional int32 c1 [i:0 d:1 r:0] + +-- With the 'name' setting, Impala can read the Parquet data files +-- despite mismatching column order. 
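+-- (Assumption, for readers following along: this example presumes T2 was created
+--  beforehand with a schema such as "create table t2 (c4 timestamp, c2 boolean)
+--  stored as parquet;" -- fewer columns than T1, in a different order. That
+--  CREATE TABLE statement is not shown above.)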
+set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; +PARQUET_FALLBACK_SCHEMA_RESOLUTION set to name + +select * from t2; ++-------------------------------+-------+ +| c4 | c2 | ++-------------------------------+-------+ +| 2016-06-28 14:53:26.554369000 | true | +| 2016-06-29 14:53:26.554369000 | false | ++-------------------------------+-------+ +]]> +</codeblock> + + <note rev="IMPALA-3334" id="one_but_not_true"> + In CDH 5.7.0 / Impala 2.5.0, only the value 1 enables the option, and the value + <codeph>true</codeph> is not recognized. This limitation is + tracked by the issue + <xref href="https://issues.cloudera.org/browse/IMPALA-3334" scope="external" format="html">IMPALA-3334</xref>, + which shows the releases where the problem is fixed. + </note> + + <p rev="IMPALA-3732" id="avro_2gb_strings"> + The Avro specification allows string values up to 2**64 bytes in length. + Impala queries for Avro tables use 32-bit integers to hold string lengths. + In CDH 5.7 / Impala 2.5 and higher, Impala truncates <codeph>CHAR</codeph> + and <codeph>VARCHAR</codeph> values in Avro tables to (2**31)-1 bytes. + If a query encounters a <codeph>STRING</codeph> value longer than (2**31)-1 + bytes in an Avro table, the query fails. In earlier releases, + encountering such long values in an Avro table could cause a crash. + </p> + + <p rev="2.6.0 IMPALA-3369" id="set_column_stats_example"> + You specify a case-insensitive symbolic name for the kind of statistics: + <codeph>numDVs</codeph>, <codeph>numNulls</codeph>, <codeph>avgSize</codeph>, <codeph>maxSize</codeph>. + The key names and values are both quoted. This operation applies to an entire table, + not a specific partition. For example: +<codeblock> +create table t1 (x int, s string); +insert into t1 values (1, 'one'), (2, 'two'), (2, 'deux'); +show column stats t1; ++--------+--------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------+--------+------------------+--------+----------+----------+ +| x | INT | -1 | -1 | 4 | 4 | +| s | STRING | -1 | -1 | -1 | -1 | ++--------+--------+------------------+--------+----------+----------+ +alter table t1 set column stats x ('numDVs'='2','numNulls'='0'); +alter table t1 set column stats s ('numdvs'='3','maxsize'='4'); +show column stats t1; ++--------+--------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------+--------+------------------+--------+----------+----------+ +| x | INT | 2 | 0 | 4 | 4 | +| s | STRING | 3 | -1 | 4 | -1 | ++--------+--------+------------------+--------+----------+----------+ +</codeblock> + </p> + +<codeblock id="set_numrows_example">create table analysis_data stored as parquet as select * from raw_data; +Inserted 1000000000 rows in 181.98s +compute stats analysis_data; +insert into analysis_data select * from smaller_table_we_forgot_before; +Inserted 1000000 rows in 15.32s +-- Now there are 1001000000 rows. We can update this single data point in the stats. +alter table analysis_data set tblproperties('numRows'='1001000000', 'STATS_GENERATED_VIA_STATS_TASK'='true');</codeblock> + +<codeblock id="set_numrows_partitioned_example">-- If the table originally contained 1 million rows, and we add another partition with 30 thousand rows, +-- change the numRows property for the partition and the overall table. 
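+-- (Illustrative sketch only: assumes PARTITIONED_DATA was created with
+--  PARTITIONED BY (year INT, month INT); substitute your own partition key
+--  values and row counts.)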
+alter table partitioned_data partition(year=2009, month=4) set tblproperties ('numRows'='30000', 'STATS_GENERATED_VIA_STATS_TASK'='true'); +alter table partitioned_data set tblproperties ('numRows'='1030000', 'STATS_GENERATED_VIA_STATS_TASK'='true');</codeblock> + + <p id="int_overflow_behavior"> + Impala does not return column overflows as <codeph>NULL</codeph>, so that customers can distinguish + between <codeph>NULL</codeph> data and overflow conditions similar to how they do so with traditional + database systems. Impala returns the largest or smallest value in the range for the type. For example, + valid values for a <codeph>tinyint</codeph> range from -128 to 127. In Impala, a <codeph>tinyint</codeph> + with a value of -200 returns -128 rather than <codeph>NULL</codeph>. A <codeph>tinyint</codeph> with a + value of 200 returns 127. + </p> + + <p rev="2.5.0" id="partition_key_optimization"> + If you frequently run aggregate functions such as <codeph>MIN()</codeph>, <codeph>MAX()</codeph>, and + <codeph>COUNT(DISTINCT)</codeph> on partition key columns, consider enabling the <codeph>OPTIMIZE_PARTITION_KEY_SCANS</codeph> + query option, which optimizes such queries. This feature is available in CDH 5.7 / Impala 2.5 and higher. + See <xref href="../topics/impala_optimize_partition_key_scans.xml"/> + for the kinds of queries that this option applies to, and slight differences in how partitions are + evaluated when this query option is enabled. + </p> + <p id="live_reporting_details"> The output from this query option is printed to standard error. The output is only displayed in interactive mode, that is, not when the <codeph>-q</codeph> or <codeph>-f</codeph> options are used. @@ -252,6 +502,18 @@ select concat('abc','mno','xyz');</codeblock> work in real time, see <xref href="https://asciinema.org/a/1rv7qippo0fe7h5k1b6k4nexk" scope="external" format="html">this animated demo</xref>. </p> + <p rev="2.5.0" id="runtime_filter_mode_blurb"> + Because the runtime filtering feature is enabled by default only for local processing, + the other filtering-related query options have the greatest effect when used in + combination with the setting <codeph>RUNTIME_FILTER_MODE=GLOBAL</codeph>. + </p> + + <p rev="2.5.0" id="runtime_filtering_option_caveat"> + Because the runtime filtering feature applies mainly to resource-intensive + and long-running queries, only adjust this query option when tuning long-running queries + involving some combination of large partitioned tables and joins involving large tables. + </p> + <p rev="2.3.0" id="impala_shell_progress_reports_compute_stats_caveat"> The <codeph>LIVE_PROGRESS</codeph> and <codeph>LIVE_SUMMARY</codeph> query options currently do not produce any output during <codeph>COMPUTE STATS</codeph> operations. @@ -357,6 +619,14 @@ drop database temp; for example when programmatically generating SQL statements where a regular function call might be easier to construct. </p> + <p rev="2.3.0" id="current_timezone_tip"> + To determine the time zone of the server you are connected to, in CDH 5.5 / Impala 2.3 and + higher you can call the <codeph>timeofday()</codeph> function, which includes the time zone + specifier in its return value. Remember that with cloud computing, the server you interact + with might be in a different time zone than you are, or different sessions might connect to + servers in different time zones, or a cluster might include servers in more than one time zone. 
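+      For example, a quick check from <cmdname>impala-shell</cmdname> (illustrative output only;
+      the value returned depends on the time zone of the server you are connected to):
+<codeblock>select timeofday();
++------------------------------+
+| timeofday()                  |
++------------------------------+
+| Mon Oct 31 12:24:39 2016 PDT |
++------------------------------+
+</codeblock>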
+ </p> + <p rev="2.2.0" id="timezone_conversion_caveat"> The way this function deals with time zones when converting to or from <codeph>TIMESTAMP</codeph> values is affected by the <codeph>-use_local_tz_for_unix_timestamp_conversions</codeph> startup flag for the @@ -364,20 +634,97 @@ drop database temp; how Impala handles time zone considerations for the <codeph>TIMESTAMP</codeph> data type. </p> - <note rev="2.2.0" id="s3_caveat" type="important"> + <p rev="2.6.0 CDH-39913 IMPALA-3558" id="s3_drop_table_purge"> + For best compatibility with the S3 write support in CDH 5.8 / Impala 2.6 + and higher: + <ul> + <li>Use native Hadoop techniques to create data files in S3 for querying through Impala.</li> + <li>Use the <codeph>PURGE</codeph> clause of <codeph>DROP TABLE</codeph> when dropping internal (managed) tables.</li> + </ul> + By default, when you drop an internal (managed) table, the data files are + moved to the HDFS trashcan. This operation is expensive for tables that + reside on the Amazon S3 filesystem. Therefore, for S3 tables, prefer to use + <codeph>DROP TABLE <varname>table_name</varname> PURGE</codeph> rather than the default <codeph>DROP TABLE</codeph> statement. + The <codeph>PURGE</codeph> clause makes Impala delete the data files immediately, + skipping the HDFS trashcan. + For the <codeph>PURGE</codeph> clause to work effectively, you must originally create the + data files on S3 using one of the tools from the Hadoop ecosystem, such as + <codeph>hadoop fs -cp</codeph>, or <codeph>INSERT</codeph> in Impala or Hive. + </p> + + <p rev="2.6.0 CDH-39913 IMPALA-1878" id="s3_dml_performance"> + Because of differences between S3 and traditional filesystems, DML operations + for S3 tables can take longer than for tables on HDFS. For example, both the + <codeph>LOAD DATA</codeph> statement and the final stage of the <codeph>INSERT</codeph> + and <codeph>CREATE TABLE AS SELECT</codeph> statements involve moving files from one directory + to another. (In the case of <codeph>INSERT</codeph> and <codeph>CREATE TABLE AS SELECT</codeph>, + the files are moved from a temporary staging directory to the final destination directory.) + Because S3 does not support a <q>rename</q> operation for existing objects, in these cases Impala + actually copies the data files from one location to another and then removes the original files. + In CDH 5.8 / Impala 2.6, the <codeph>S3_SKIP_INSERT_STAGING</codeph> query option provides a way + to speed up <codeph>INSERT</codeph> statements for S3 tables and partitions, with the tradeoff + that a problem during statement execution could leave data in an inconsistent state. + It does not apply to <codeph>INSERT OVERWRITE</codeph> or <codeph>LOAD DATA</codeph> statements. + See <xref href="../topics/impala_s3_skip_insert_staging.xml#s3_skip_insert_staging"/> for details. + </p> + + <p rev="2.6.0 CDH-40329 IMPALA-3453" id="s3_block_splitting"> + In CDH 5.8 / Impala 2.6 and higher, Impala queries are optimized for files stored in Amazon S3. + For Impala tables that use the file formats Parquet, RCFile, SequenceFile, + Avro, and uncompressed text, the setting <codeph>fs.s3a.block.size</codeph> + in the <filepath>core-site.xml</filepath> configuration file determines + how Impala divides the I/O work of reading the data files. This configuration + setting is specified in bytes. By default, this + value is 33554432 (32 MB), meaning that Impala parallelizes S3 read operations on the files + as if they were made up of 32 MB blocks. 
For example, if your S3 queries primarily access + Parquet files written by MapReduce or Hive, increase <codeph>fs.s3a.block.size</codeph> + to 134217728 (128 MB) to match the row group size of those files. If most S3 queries involve + Parquet files written by Impala, increase <codeph>fs.s3a.block.size</codeph> + to 268435456 (256 MB) to match the row group size produced by Impala. + </p> + + <note rev="2.6.0 CDH-39913 IMPALA-1878" id="s3_production" type="important"> <p> - Impala query support for Amazon S3 is included in CDH 5.4.0, but is not currently supported or recommended for production use. - If you're interested in this feature, try it out in a test environment until we address the issues and limitations needed for production-readiness. + In CDH 5.8 / Impala 2.6 and higher, Impala supports both queries (<codeph>SELECT</codeph>) + and DML (<codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, <codeph>CREATE TABLE AS SELECT</codeph>) + for data residing on Amazon S3. With the inclusion of write support, + <!-- and configuration settings for more secure S3 key management, --> + the Impala support for S3 is now considered ready for production use. </p> </note> - <p rev="2.2.0" id="s3_dml"> - Currently, Impala cannot insert or load data into a table or partition that resides in the Amazon - Simple Storage Service (S3). - Bring data into S3 using the normal S3 transfer mechanisms, then use Impala to query the S3 data. - See <xref href="../topics/impala_s3.xml#s3"/> for details about using Impala with S3. + <note rev="2.2.0" id="s3_caveat" type="important"> + <p> Impala query support for Amazon S3 is included in CDH 5.4.0, but is + not currently supported or recommended for production use. To try this + feature, use it in a test environment until Cloudera resolves + currently existing issues and limitations to make it ready for + production use. </p> + </note> + + <p rev="2.6.0 CDH-39913 IMPALA-1878" id="s3_ddl"> + In CDH 5.8 / Impala 2.6 and higher, Impala DDL statements such as + <codeph>CREATE DATABASE</codeph>, <codeph>CREATE TABLE</codeph>, <codeph>DROP DATABASE CASCADE</codeph>, + <codeph>DROP TABLE</codeph>, and <codeph>ALTER TABLE [ADD|DROP] PARTITION</codeph> can create or remove folders + as needed in the Amazon S3 system. Prior to CDH 5.8 / Impala 2.6, you had to create folders yourself and point + Impala database, tables, or partitions at them, and manually remove folders when no longer needed. + See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala. </p> + <p rev="2.6.0 CDH-39913 IMPALA-1878" id="s3_dml"> + In CDH 5.8 / Impala 2.6 and higher, the Impala DML statements (<codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, + and <codeph>CREATE TABLE AS SELECT</codeph>) can write data into a table or partition that resides in the + Amazon Simple Storage Service (S3). + The syntax of the DML statements is the same as for any other tables, because the S3 location for tables and + partitions is specified by an <codeph>s3a://</codeph> prefix in the + <codeph>LOCATION</codeph> attribute of + <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements. + If you bring data into S3 using the normal S3 transfer mechanisms instead of Impala DML statements, + issue a <codeph>REFRESH</codeph> statement for the table before using Impala to query the S3 data. + </p> + + <!-- Formerly part of s3_dml element. Moved out to avoid a circular link in the S3 topic itelf. 
--> + <!-- See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala. --> + <p rev="2.2.0" id="s3_metadata"> The <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements also cache metadata for tables where the data resides in the Amazon Simple Storage Service (S3). @@ -433,11 +780,21 @@ drop database temp; specification, and specify constant values for all the partition key columns. </p> - <p id="udf_persistence_restriction"> - Currently, Impala UDFs and UDAs are not persisted in the metastore database. Information - about these functions is held in the memory of the <cmdname>catalogd</cmdname> daemon. You must reload them - by running the <codeph>CREATE FUNCTION</codeph> statements again each time you restart the - <cmdname>catalogd</cmdname> daemon. + <p id="udf_persistence_restriction" rev="2.5.0 IMPALA-1748"> + In CDH 5.7 / Impala 2.5 and higher, Impala UDFs and UDAs written in C++ are persisted in the metastore database. + Java UDFs are also persisted, if they were created with the new <codeph>CREATE FUNCTION</codeph> syntax for Java UDFs, + where the Java function argument and return types are omitted. + Java-based UDFs created with the old <codeph>CREATE FUNCTION</codeph> syntax do not persist across restarts + because they are held in the memory of the <cmdname>catalogd</cmdname> daemon. + Until you re-create such Java UDFs using the new <codeph>CREATE FUNCTION</codeph> syntax, + you must reload those Java-based UDFs by running the original <codeph>CREATE FUNCTION</codeph> statements again each time + you restart the <cmdname>catalogd</cmdname> daemon. + Prior to CDH 5.7 / Impala 2.5, the requirement to reload functions after a restart applied to both C++ and Java functions. + </p> + + <p id="current_user_caveat" rev="CDH-36552"> + The Hive <codeph>current_user()</codeph> function cannot be + called from a Java UDF through Impala. </p> <note id="add_partition_set_location"> @@ -513,6 +870,15 @@ select c_first_name, c_last_name from customer where lower(trim(c_last_name)) re select c_first_name, c_last_name from customer where lower(trim(c_last_name)) rlike '^de.*'; </codeblock> + <p id="case_insensitive_comparisons_tip" rev="2.5.0 IMPALA-1787"> + In CDH 5.7 / Impala 2.5 and higher, you can simplify queries that + use many <codeph>UPPER()</codeph> and <codeph>LOWER()</codeph> calls + to do case-insensitive comparisons, by using the <codeph>ILIKE</codeph> + or <codeph>IREGEXP</codeph> operators instead. See + <xref href="../topics/impala_operators.xml#ilike"/> and + <xref href="../topics/impala_operators.xml#iregexp"/> for details. + </p> + <p id="show_security"> When authorization is enabled, the output of the <codeph>SHOW</codeph> statement is limited to those objects for which you have some privilege. There might be other database, tables, and so on, but their @@ -522,8 +888,16 @@ select c_first_name, c_last_name from customer where lower(trim(c_last_name)) rl privileges for specific kinds of objects. </p> + <p id="infinity_and_nan" rev="IMPALA-3267"> + Infinity and NaN can be specified in text data files as <codeph>inf</codeph> and <codeph>nan</codeph> + respectively, and Impala interprets them as these special values. They can also be produced by certain + arithmetic expressions; for example, <codeph>pow(-1, 0.5)</codeph> returns <codeph>Infinity</codeph> and + <codeph>1/0</codeph> returns <codeph>NaN</codeph>. 
Or you can cast the literal values, such as <codeph>CAST('nan' AS + DOUBLE)</codeph> or <codeph>CAST('inf' AS DOUBLE)</codeph>. + </p> + <p rev="2.0.0" id="user_kerberized"> - In Impala 2.0 and later, <codeph>user()</codeph> returns the the full Kerberos principal string, such as + In Impala 2.0 and later, <codeph>user()</codeph> returns the full Kerberos principal string, such as <codeph>[email protected]</codeph>, in a Kerberized environment. </p> @@ -597,6 +971,49 @@ DROP VIEW db2.v1; to all be different values. </p> + <p rev="2.5.0 IMPALA-3054" id="spill_to_disk_vs_dynamic_partition_pruning"> + When the spill-to-disk feature is activated for a join node within a query, Impala does not + produce any runtime filters for that join operation on that host. Other join nodes within + the query are not affected. + </p> + +<codeblock id="simple_dpp_example"> +create table yy (s string) partitioned by (year int) stored as parquet; +insert into yy partition (year) values ('1999', 1999), ('2000', 2000), + ('2001', 2001), ('2010',2010); +compute stats yy; + +create table yy2 (s string) partitioned by (year int) stored as parquet; +insert into yy2 partition (year) values ('1999', 1999), ('2000', 2000), + ('2001', 2001); +compute stats yy2; + +-- The query reads an unknown number of partitions, whose key values are only +-- known at run time. The 'runtime filters' lines show how the information about +-- the partitions is calculated in query fragment 02, and then used in query +-- fragment 00 to decide which partitions to skip. +explain select s from yy2 where year in (select year from yy where year between 2000 and 2005); ++----------------------------------------------------------+ +| Explain String | ++----------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=16.00MB VCores=2 | +| | +| 04:EXCHANGE [UNPARTITIONED] | +| | | +| 02:HASH JOIN [LEFT SEMI JOIN, BROADCAST] | +| | hash predicates: year = year | +| | <b>runtime filters: RF000 <- year</b> | +| | | +| |--03:EXCHANGE [BROADCAST] | +| | | | +| | 01:SCAN HDFS [dpp.yy] | +| | partitions=2/4 files=2 size=468B | +| | | +| 00:SCAN HDFS [dpp.yy2] | +| partitions=2/3 files=2 size=468B | +| <b>runtime filters: RF000 -> year</b> | ++----------------------------------------------------------+ +</codeblock> <p id="order_by_scratch_dir"> By default, intermediate files used during large sort, join, aggregation, or analytic function operations are stored in the directory <filepath>/tmp/impala-scratch</filepath> . These files are removed when the @@ -703,6 +1120,10 @@ DROP VIEW db2.v1; <b>Type:</b> string </p> + <p id="type_integer"> + <b>Type:</b> integer + </p> + <p id="default_false"> <b>Default:</b> <codeph>false</codeph> </p> @@ -711,6 +1132,10 @@ DROP VIEW db2.v1; <b>Default:</b> <codeph>false</codeph> (shown as 0 in output of <codeph>SET</codeph> statement) </p> + <p id="default_true_1"> + <b>Default:</b> <codeph>true</codeph> (shown as 1 in output of <codeph>SET</codeph> statement) + </p> + <p id="odd_return_type_string"> Currently, the return value is always a <codeph>STRING</codeph>. The return type is subject to change in future releases. Always use <codeph>CAST()</codeph> to convert the result to whichever data type is @@ -777,10 +1202,22 @@ show functions in _impala_builtins like '*<varname>substring</varname>*'; for <codeph>DECIMAL</codeph> columns and Impala uses the statistics to optimize query performance. 
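      For example (a minimal sketch; the table name <codeph>dec_t</codeph> and the figures produced are hypothetical):
<codeblock>create table dec_t (price decimal(9,2), qty int) stored as parquet;
-- ...load or insert some data...
compute stats dec_t;
show column stats dec_t;
-- After COMPUTE STATS, the #Distinct Values and #Nulls columns show computed
-- figures for the DECIMAL column instead of the -1 placeholders.
</codeblock>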
</p> + <p rev="CDH-35866" id="hive_column_stats_caveat"> + If you run the Hive statement <codeph>ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS</codeph>, + Impala can only use the resulting column statistics if the table is unpartitioned. + Impala cannot use Hive-generated column statistics for a partitioned table. + </p> + <p id="datetime_function_chaining"> <codeph>unix_timestamp()</codeph> and <codeph>from_unixtime()</codeph> are often used in combination to convert a <codeph>TIMESTAMP</codeph> value into a particular string format. For example: -<codeblock xml:space="preserve">select from_unixtime(unix_timestamp(now() + interval 3 days), 'yyyy/MM/dd HH:mm'); +<codeblock xml:space="preserve">select from_unixtime(unix_timestamp(now() + interval 3 days), + 'yyyy/MM/dd HH:mm') as yyyy_mm_dd_hh_mm; ++------------------+ +| yyyy_mm_dd_hh_mm | ++------------------+ +| 2016/06/03 11:38 | ++------------------+ </codeblock> </p> @@ -803,12 +1240,19 @@ show functions in _impala_builtins like '*<varname>substring</varname>*'; statement. </p> - <note rev="1.4.0" id="compute_stats_nulls"> - Prior to Impala 1.4.0, <codeph>COMPUTE STATS</codeph> counted the number of <codeph>NULL</codeph> values in - each column and recorded that figure in the metastore database. Because Impala does not currently make use - of the <codeph>NULL</codeph> count during query planning, Impala 1.4.0 and higher speeds up the - <codeph>COMPUTE STATS</codeph> statement by skipping this <codeph>NULL</codeph> counting. - </note> + <note rev="1.4.0" id="compute_stats_nulls"> Prior to Impala 1.4.0, + <codeph>COMPUTE STATS</codeph> counted the number of + <codeph>NULL</codeph> values in each column and recorded that figure + in the metastore database. Because Impala does not currently use the + <codeph>NULL</codeph> count during query planning, Impala 1.4.0 and + higher speeds up the <codeph>COMPUTE STATS</codeph> statement by + skipping this <codeph>NULL</codeph> counting. </note> + + <p id="regular_expression_whole_string"> + The regular expression must match the entire value, not just occur somewhere inside it. Use <codeph>.*</codeph> at the beginning, + the end, or both if you only need to match characters anywhere in the middle. Thus, the <codeph>^</codeph> and <codeph>$</codeph> + atoms are often redundant, although you might already have them in your expression strings that you reuse from elsewhere. + </p> <p rev="1.3.1" id="regexp_matching"> In Impala 1.3.1 and higher, the <codeph>REGEXP</codeph> and <codeph>RLIKE</codeph> operators now match a @@ -871,6 +1315,22 @@ show functions in _impala_builtins like '*<varname>substring</varname>*'; character used as a delimiter by some data formats. </note> + <p id="sqoop_blurb"> + <b>Sqoop considerations:</b> + </p> + + <p id="sqoop_timestamp_caveat" rev="IMPALA-2111 CDH-37399"> If you use Sqoop to + convert RDBMS data to Parquet, be careful with interpreting any + resulting values from <codeph>DATE</codeph>, <codeph>DATETIME</codeph>, + or <codeph>TIMESTAMP</codeph> columns. The underlying values are + represented as the Parquet <codeph>INT64</codeph> type, which is + represented as <codeph>BIGINT</codeph> in the Impala table. The Parquet + values represent the time in milliseconds, while Impala interprets + <codeph>BIGINT</codeph> as the time in seconds. 
Therefore, if you have + a <codeph>BIGINT</codeph> column in a Parquet table that was imported + this way from Sqoop, divide the values by 1000 when interpreting as the + <codeph>TIMESTAMP</codeph> type.</p> + <p id="command_line_blurb"> <b>Command-line equivalent:</b> </p> @@ -889,17 +1349,31 @@ show functions in _impala_builtins like '*<varname>substring</varname>*'; <ul id="complex_types_restrictions"> <li> - Columns with this data type can only be used in tables or partitions with the Parquet file format. + <p> + Columns with this data type can only be used in tables or partitions with the Parquet file format. + </p> </li> <li> - Columns with this data type cannot be used as partition key columns in a partitioned table. + <p> + Columns with this data type cannot be used as partition key columns in a partitioned table. + </p> </li> <li> - The <codeph>COMPUTE STATS</codeph> statement does not produce any statistics for columns of this data type. + <p> + The <codeph>COMPUTE STATS</codeph> statement does not produce any statistics for columns of this data type. + </p> + </li> + <li rev="CDH-35868"> + <p id="complex_types_max_length"> + The maximum length of the column definition for any complex type, including declarations for any nested types, + is 4000 characters. + </p> </li> <li> - See <xref href="../topics/impala_complex_types.xml#complex_types_limits"/> for a full list of limitations - and associated guidelines about complex type columns. + <p> + See <xref href="../topics/impala_complex_types.xml#complex_types_limits"/> for a full list of limitations + and associated guidelines about complex type columns. + </p> </li> </ul> @@ -939,6 +1413,10 @@ show functions in _impala_builtins like '*<varname>substring</varname>*'; the complex types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>) available in CDH 5.5 / Impala 2.3 and higher, currently, Impala can query these types only in Parquet tables. + <ph rev="IMPALA-2844"> + The one exception to the preceding rule is <codeph>COUNT(*)</codeph> queries on RCFile tables that include complex types. + Such queries are allowed in CDH 5.8 / Impala 2.6 and higher. 
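+      For example (a sketch only; <codeph>events_rc</codeph> stands for a hypothetical RCFile
+      table that includes a complex type column):
+<codeblock>-- Allowed in CDH 5.8 / Impala 2.6 and higher, because COUNT(*) does not need
+-- to materialize any values from the complex type column.
+select count(*) from events_rc;
+</codeblock>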
+ </ph> </p> <p rev="2.3.0" id="complex_types_caveat_no_operator"> @@ -1067,23 +1545,23 @@ select r_name, count(r_nations.item.n_nationkey) as count, sum(r_nations.item.n_nationkey) as sum, - avg(r_nations.item.n_nationkey) as average, + avg(r_nations.item.n_nationkey) as avg, min(r_nations.item.n_name) as minimum, max(r_nations.item.n_name) as maximum, - ndv(r_nations.item.n_nationkey) as distinct_values + ndv(r_nations.item.n_nationkey) as distinct_vals from region, region.r_nations as r_nations group by r_name order by r_name; -+-------------+-------+-----+---------+-----------+----------------+-----------------+ -| r_name | count | sum | average | minimum | maximum | distinct_values | -+-------------+-------+-----+---------+-----------+----------------+-----------------+ -| AFRICA | 5 | 50 | 10 | ALGERIA | MOZAMBIQUE | 5 | -| AMERICA | 5 | 47 | 9.4 | ARGENTINA | UNITED STATES | 5 | -| ASIA | 5 | 68 | 13.6 | CHINA | VIETNAM | 5 | -| EUROPE | 5 | 77 | 15.4 | FRANCE | UNITED KINGDOM | 5 | -| MIDDLE EAST | 5 | 58 | 11.6 | EGYPT | SAUDI ARABIA | 5 | -+-------------+-------+-----+---------+-----------+----------------+-----------------+ ++-------------+-------+-----+------+-----------+----------------+---------------+ +| r_name | count | sum | avg | minimum | maximum | distinct_vals | ++-------------+-------+-----+------+-----------+----------------+---------------+ +| AFRICA | 5 | 50 | 10 | ALGERIA | MOZAMBIQUE | 5 | +| AMERICA | 5 | 47 | 9.4 | ARGENTINA | UNITED STATES | 5 | +| ASIA | 5 | 68 | 13.6 | CHINA | VIETNAM | 5 | +| EUROPE | 5 | 77 | 15.4 | FRANCE | UNITED KINGDOM | 5 | +| MIDDLE EAST | 5 | 58 | 11.6 | EGYPT | SAUDI ARABIA | 5 | ++-------------+-------+-----+------+-----------+----------------+---------------+ </codeblock> </p> @@ -1232,7 +1710,7 @@ arrdelay = 5 depdelay = -2 origin = CMH dest = IND -distince = 182 +distance = 182 cancelled = 0 diverted = 0 @@ -1261,7 +1739,7 @@ arrdelay = 5 depdelay = -2 origin = CMH dest = IND -distince = 182 +distance = 182 cancelled = 0 diverted = 0 @@ -1332,6 +1810,19 @@ flight_num: INT32 SNAPPY DO:83456393 FPO:83488603 SZ:10216514/11474301 This function cannot be used in an analytic context. That is, the <codeph>OVER()</codeph> clause is not allowed at all with this function. </p> + <p rev="CDH-40418" id="analytic_partition_pruning_caveat"> + In queries involving both analytic functions and partitioned tables, partition pruning only occurs for columns named in the <codeph>PARTITION BY</codeph> + clause of the analytic function call. For example, if an analytic function query has a clause such as <codeph>WHERE year=2016</codeph>, + the way to make the query prune all other <codeph>YEAR</codeph> partitions is to include <codeph>PARTITION BY year</codeph>in the analytic function call; + for example, <codeph>OVER (PARTITION BY year,<varname>other_columns</varname> <varname>other_analytic_clauses</varname>)</codeph>. +<!-- + These examples illustrate the technique: +<codeblock> + +</codeblock> +--> + </p> + <p id="impala_parquet_encodings_caveat"> Impala can query Parquet files that use the <codeph>PLAIN</codeph>, <codeph>PLAIN_DICTIONARY</codeph>, <codeph>BIT_PACKED</codeph>, and <codeph>RLE</codeph> encodings. 
@@ -1475,10 +1966,10 @@ flight_num: INT32 SNAPPY DO:83456393 FPO:83488603 SZ:10216514/11474301 <b>Amazon S3 considerations:</b> </p> - <p id="isilon_blurb" rev="5.4.3"> + <p id="isilon_blurb" rev="2.2.3"> <b>Isilon considerations:</b> </p> - <p id="isilon_block_size_caveat" rev="5.4.3"> + <p id="isilon_block_size_caveat" rev="2.2.3"> Because the EMC Isilon storage devices use a global value for the block size rather than a configurable value for each file, the <codeph>PARQUET_FILE_SIZE</codeph> query option has no effect when Impala inserts data into a table or partition @@ -1536,6 +2027,16 @@ flight_num: INT32 SNAPPY DO:83456393 FPO:83488603 SZ:10216514/11474301 each value. </p> + <p rev="2.7.0" id="added_in_270"> + <b>Added in:</b> CDH 5.9.0 (Impala 2.7.0) + </p> + <p rev="2.6.0" id="added_in_260"> + <b>Added in:</b> CDH 5.8.0 (Impala 2.6.0) + </p> + <p rev="2.5.0" id="added_in_250"> + <b>Added in:</b> CDH 5.7.0 (Impala 2.5.0) + </p> + <p rev="2.3.0" id="added_in_230"> <b>Added in:</b> CDH 5.5.0 (Impala 2.3.0) </p> @@ -1569,11 +2070,11 @@ flight_num: INT32 SNAPPY DO:83456393 FPO:83488603 SZ:10216514/11474301 <b>Added in:</b> Impala 1.1.1 </p> - <p id="added_in_210"> + <p id="added_in_210" rev="2.1.0"> <b>Added in:</b> CDH 5.3.0 (Impala 2.1.0) </p> - <p id="added_in_220"> + <p id="added_in_220" rev="2.2.0"> <b>Added in:</b> CDH 5.4.0 (Impala 2.2.0) </p> @@ -1841,14 +2342,16 @@ select max(height), avg(height) from census_data where age > 20; When Impala processes a cached data block, where the cache replication factor is greater than 1, Impala randomly selects a host that has a cached copy of that data block. This optimization avoids excessive CPU usage on a single host when the same cached data block is processed multiple times. + Cloudera recommends specifying a value greater than or equal to the HDFS block replication factor. </p> <!-- This same text is conref'ed in the #views and the #partition_pruning topics. --> - <p id="partitions_and_views"> - If a view applies to a partitioned table, any partition pruning is determined by the clauses in the - original query. Impala does not prune additional columns if the query on the view includes extra - <codeph>WHERE</codeph> clauses referencing the partition key columns. + <p id="partitions_and_views" rev="CDH-36224"> + If a view applies to a partitioned table, any partition pruning considers the clauses on both + the original query and any additional <codeph>WHERE</codeph> predicates in the query that refers to the view. + Prior to Impala 1.4, only the <codeph>WHERE</codeph> clauses on the original query from the + <codeph>CREATE VIEW</codeph> statement were used for partition pruning. </p> <p id="describe_formatted_view"> @@ -1857,39 +2360,39 @@ select max(height), avg(height) from census_data where age > 20; <codeblock xml:space="preserve">[localhost:21000] > create view v1 as select * from t1; [localhost:21000] > describe formatted v1; Query finished, fetching results ... 
-+------------------------------+------------------------------+----------------------+ -| name | type | comment | -+------------------------------+------------------------------+----------------------+ -| # col_name | data_type | comment | -| | NULL | NULL | -| x | int | None | -| y | int | None | -| s | string | None | -| | NULL | NULL | -| # Detailed Table Information | NULL | NULL | -| Database: | views | NULL | -| Owner: | cloudera | NULL | -| CreateTime: | Mon Jul 08 15:56:27 EDT 2013 | NULL | -| LastAccessTime: | UNKNOWN | NULL | -| Protect Mode: | None | NULL | -| Retention: | 0 | NULL | -<b>| Table Type: | VIRTUAL_VIEW | NULL |</b> -| Table Parameters: | NULL | NULL | -| | transient_lastDdlTime | 1373313387 | -| | NULL | NULL | -| # Storage Information | NULL | NULL | -| SerDe Library: | null | NULL | -| InputFormat: | null | NULL | -| OutputFormat: | null | NULL | -| Compressed: | No | NULL | -| Num Buckets: | 0 | NULL | -| Bucket Columns: | [] | NULL | -| Sort Columns: | [] | NULL | -| | NULL | NULL | -| # View Information | NULL | NULL | -<b>| View Original Text: | SELECT * FROM t1 | NULL | -| View Expanded Text: | SELECT * FROM t1 | NULL |</b> -+------------------------------+------------------------------+----------------------+ ++------------------------------+------------------------------+------------+ +| name | type | comment | ++------------------------------+------------------------------+------------+ +| # col_name | data_type | comment | +| | NULL | NULL | +| x | int | None | +| y | int | None | +| s | string | None | +| | NULL | NULL | +| # Detailed Table Information | NULL | NULL | +| Database: | views | NULL | +| Owner: | cloudera | NULL | +| CreateTime: | Mon Jul 08 15:56:27 EDT 2013 | NULL | +| LastAccessTime: | UNKNOWN | NULL | +| Protect Mode: | None | NULL | +| Retention: | 0 | NULL | +<b>| Table Type: | VIRTUAL_VIEW | NULL |</b> +| Table Parameters: | NULL | NULL | +| | transient_lastDdlTime | 1373313387 | +| | NULL | NULL | +| # Storage Information | NULL | NULL | +| SerDe Library: | null | NULL | +| InputFormat: | null | NULL | +| OutputFormat: | null | NULL | +| Compressed: | No | NULL | +| Num Buckets: | 0 | NULL | +| Bucket Columns: | [] | NULL | +| Sort Columns: | [] | NULL | +| | NULL | NULL | +| # View Information | NULL | NULL | +<b>| View Original Text: | SELECT * FROM t1 | NULL | +| View Expanded Text: | SELECT * FROM t1 | NULL |</b> ++------------------------------+------------------------------+------------+ </codeblock> </p> @@ -1935,7 +2438,7 @@ Query finished, fetching results ... </p> <p id="impala_mission_statement"> - Impala provides high-performance, low-latency SQL queries on data stored in popular Apache Hadoop + The Apache Impala (incubating) project provides high-performance, low-latency SQL queries on data stored in popular Apache Hadoop file formats. The fast response for queries enables interactive exploration and fine-tuning of analytic queries, rather than long batch jobs traditionally associated with SQL-on-Hadoop technologies. (You will often see the term <q>interactive</q> applied to these kinds of fast queries with human-scale response @@ -2011,6 +2514,48 @@ Query finished, fetching results ... </ol> </p> + <p id="skip_header_lines" rev="IMPALA-1740 2.6.0"> + In CDH 5.8 / Impala 2.6 and higher, Impala can optionally + skip an arbitrary number of header lines from text input files on HDFS + based on the <codeph>skip.header.line.count</codeph> value in the + <codeph>TBLPROPERTIES</codeph> field of the table metadata. 
For example: +<codeblock>create table header_line(first_name string, age int) + row format delimited fields terminated by ','; + +-- Back in the shell, load data into the table with commands such as: +-- cat >data.csv +-- Name,Age +-- Alice,25 +-- Bob,19 +-- hdfs dfs -put data.csv /user/hive/warehouse/header_line + +refresh header_line; + +-- Initially, the Name,Age header line is treated as a row of the table. +select * from header_line limit 10; ++------------+------+ +| first_name | age | ++------------+------+ +| Name | NULL | +| Alice | 25 | +| Bob | 19 | ++------------+------+ + +alter table header_line set tblproperties('skip.header.line.count'='1'); + +-- Once the table property is set, queries skip the specified number of lines +-- at the beginning of each text data file. Therefore, all the files in the table +-- should follow the same convention for header lines. +select * from header_line limit 10; ++------------+-----+ +| first_name | age | ++------------+-----+ +| Alice | 25 | +| Bob | 19 | ++------------+-----+ +</codeblock> + </p> + <!-- This list makes the impala_features.xml file obsolete. It was only ever there for conrefs. --> <p id="feature_list"> @@ -2111,6 +2656,30 @@ Query finished, fetching results ... Snippets related to installation, upgrading, prerequisites. </p> + <note id="core_dump_considerations"> + <ul> + <li> + <p> + The location of core dump files may vary according to your operating system configuration. + </p> + </li> + <li> + <p> + Other security settings may prevent Impala from writing core dumps even when this option is enabled. + </p> + </li> + <li rev="CDH-34070"> + <p> + On systems managed by Cloudera Manager, the default location for core dumps is on a temporary + filesystem, which can lead to out-of-space issues if the core dumps are large, frequent, or + not removed promptly. To specify an alternative location for the core dumps, filter the + Impala configuration settings to find the <codeph>core_dump_dir</codeph> option, which is + available in Cloudera Manager 5.4.3 and higher. This option lets you specify a different directory + for core dumps for each of the Impala-related daemons. + </p> + </li> + </ul> + </note> <p id="cpu_prereq" rev="2.2.0"> The prerequisite for CPU architecture has been relaxed in Impala 2.2.0 and higher. From this release onward, Impala works on CPUs that have the SSSE3 instruction set. The SSE4 instruction set is no longer @@ -2135,15 +2704,21 @@ sudo pip-python install ssl</codeblock> </p> <note type="warning" id="impala_kerberos_ssl_caveat"> - Currently, you can enable Kerberos authentication between Impala internal components, + Prior to CDH 5.5.2 / Impala 2.3.2, you could enable Kerberos authentication between Impala internal components, or SSL encryption between Impala internal components, but not both at the same time. - Impala does not start if both of these settings are enabled. - This limitation only applies to the Impala-to-Impala communication settings; you can still use both - Kerberos and SSL when connecting to Impala through <cmdname>impala-shell</cmdname>, JDBC, or ODBC. + This restriction has now been lifted. See <xref href="https://issues.cloudera.org/browse/IMPALA-2598" scope="external" format="html">IMPALA-2598</xref> - to track the resolution of this issue. + to see the maintenance releases for different levels of CDH where the fix has been published. 
</note> + <p id="hive_jdbc_ssl_kerberos_caveat"> + Prior to CDH 5.7 / Impala 2.5, the Hive JDBC driver did not support connections that use both Kerberos authentication + and SSL encryption. If your cluster is running an older release that has this restriction, + to use both of these security features with Impala through a JDBC application, + use the <xref href="http://www.cloudera.com/content/www/en-us/downloads.html.html" scope="external" format="html">Cloudera JDBC Connector</xref> + as the JDBC driver. + </p> + <note rev="1.2" id="cdh4_cdh5_upgrade"> Because Impala 1.2.2 works with CDH 4, while the Impala that comes with the CDH 5 beta is version 1.2.0, upgrading from CDH 4 to the CDH 5 beta actually reverts to an earlier Impala version. The beta release of @@ -2218,7 +2793,7 @@ sudo pip-python install ssl</codeblock> </li> </ul> - <note id="compute_stats_parquet"> + <note id="compute_stats_parquet" rev="IMPALA-488"> Currently, a known issue (<xref href="https://issues.cloudera.org/browse/IMPALA-488" scope="external" format="html">IMPALA-488</xref>) could cause excessive memory usage during a <codeph>COMPUTE STATS</codeph> operation on a Parquet table. As @@ -2232,13 +2807,78 @@ sudo pip-python install ssl</codeblock> <section id="admin_conrefs"> <title>Administration</title> + + <p id="statestored_catalogd_ha_blurb" rev="CDH-39624"> + Most considerations for load balancing and high availability apply to the <cmdname>impalad</cmdname> daemon. + The <cmdname>statestored</cmdname> and <cmdname>catalogd</cmdname> daemons do not have special + requirements for high availability, because problems with those daemons do not result in data loss. + If those daemons become unavailable due to an outage on a particular + host, you can stop the Impala service, delete the <uicontrol>Impala StateStore</uicontrol> and + <uicontrol>Impala Catalog Server</uicontrol> roles, add the roles on a different host, and restart the + Impala service. + </p> + + <p id="hdfs_caching_encryption_caveat" rev="IMPALA-3679"> + Due to a limitation of HDFS, zero-copy reads are not supported with + encryption. Cloudera recommends not using HDFS caching for Impala data + files in encryption zones. The queries fall back to the normal read + path during query execution, which might cause some performance overhead. + </p> + + <note id="llama_query_options_obsolete"> + <p> + This query option is no longer supported, because it affects interaction + between Impala and Llama. + The use of the Llama component for integrated + resource management within YARN is no longer supported with CDH 5.5 / + Impala 2.3 and higher. + </p> + </note> <note id="impala_llama_obsolete"> - Though Impala can be used together with YARN via simple configuration of Static Service Pools in Cloudera Manager, - the use of the general-purpose component Llama for integrated resource management within YARN is no longer supported - with CDH 5.5 / Impala 2.3 and higher. + <p> + The use of the Llama component for integrated + resource management within YARN is no longer supported with CDH 5.5 / + Impala 2.3 and higher. + </p> + <p> + For clusters running Impala alongside + other data management components, you define static service pools to define the resources + available to Impala and other components. Then within the area allocated for Impala, + you can create dynamic service pools, each with its own settings for the Impala admission control feature. 
+ </p> </note> + <note id="max_memory_default_limit_caveat"> If you specify <uicontrol>Max + Memory</uicontrol> for an Impala dynamic resource pool, you must also + specify the <uicontrol>Default Query Memory Limit</uicontrol>. + <uicontrol>Max Memory</uicontrol> relies on the <uicontrol>Default + Query Memory Limit</uicontrol> to produce a reliable estimate of + overall memory consumption for a query. </note> + + + <p id="admission_control_mem_limit_interaction"> + For example, consider the following scenario: + <ul> + <li> The cluster is running <cmdname>impalad</cmdname> daemons on five + DataNodes. </li> + <li> A dynamic resource pool has <uicontrol>Max Memory</uicontrol> set + to 100 GB. </li> + <li> The <uicontrol>Default Query Memory Limit</uicontrol> for the + pool is 10 GB. Therefore, any query running in this pool could use + up to 50 GB of memory (default query memory limit * number of Impala + nodes). </li> + <li> The maximum number of queries that Impala executes concurrently + within this dynamic resource pool is two, which is the most that + could be accomodated within the 100 GB <uicontrol>Max + Memory</uicontrol> cluster-wide limit. </li> + <li> There is no memory penalty if queries use less memory than the + <uicontrol>Default Query Memory Limit</uicontrol> per-host setting + or the <uicontrol>Max Memory</uicontrol> cluster-wide limit. These + values are only used to estimate how many queries can be run + concurrently within the resource constraints for the pool. </li> + </ul> + </p> <note id="impala_llama_caveat">When using YARN with Impala, Cloudera recommends using the static partitioning technique (through a static service pool) rather than the combination of YARN and Llama. YARN is a @@ -2263,12 +2903,10 @@ sudo pip-python install ssl</codeblock> the connection has been closed. </note> - <p id="impala_mr"> - For a detailed example of configuring a cluster to share resources between Impala queries and MapReduce - jobs, see - <xref href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Installation-Guide/cmig_impala_res_mgmt.html" scope="external" format="html">Setting - up a Multi-tenant Cluster for Impala and MapReduce</xref> - </p> + <p id="impala_mr"> For a detailed information about configuring a cluster to share resources + between Impala queries and MapReduce jobs, see <xref + href="../topics/admin_howto_multitenancy.xml#howto_multitenancy"/> and <xref + href="../topics/impala_howto_rm.xml#howto_impala_rm"/>.</p> <note id="llama_beta" type="warning"> In CDH 5.0.0, the Llama component is in beta. It is intended for evaluation of resource management in test @@ -2293,8 +2931,33 @@ sudo pip-python install ssl</codeblock> There are no new bug fixes, new features, or incompatible changes. </p> - <note id="only_cdh5_230"> - Impala 2.3.0 is available as part of CDH 5.5.0 and is not available for CDH 4. +<!-- This next one is not actually used. --> + <note id="only_cdh5_260"> + Impala 2.6.x is available as part of CDH 5.8.x. + </note> + + <note id="only_cdh5_250"> + Impala 2.5.x is available as part of CDH 5.7.x and is not available for CDH 4. + Cloudera does not intend to release future versions of Impala for CDH 4 outside patch and maintenance releases if required. + Given the end-of-maintenance status for CDH 4, Cloudera recommends all customers to migrate to a recent CDH 5 release. + </note> + +<!-- These next 2 for Impala 2.4 / CDH 5.6 are not actually used. Trying to move away from the repetitive "don't use CDH 4" notes. 
--> + + <note id="only_cdh5_24x"> + Impala 2.4.x is available as part of CDH 5.6.x and is not available for CDH 4. + Cloudera does not intend to release future versions of Impala for CDH 4 outside patch and maintenance releases if required. + Given the end-of-maintenance status for CDH 4, Cloudera recommends all customers to migrate to a recent CDH 5 release. + </note> + + <note id="only_cdh5_240"> + Impala 2.4.0 is available as part of CDH 5.6.0 and is not available for CDH 4. + Cloudera does not intend to release future versions of Impala for CDH 4 outside patch and maintenance releases if required. + Given the end-of-maintenance status for CDH 4, Cloudera recommends all customers to migrate to a recent CDH 5 release. + </note> + + <note id="only_cdh5_23x"> + Impala 2.3.x is available as part of CDH 5.5.x and is not available for CDH 4. Cloudera does not intend to release future versions of Impala for CDH 4 outside patch and maintenance releases if required. Given the end-of-maintenance status for CDH 4, Cloudera recommends all customers to migrate to a recent CDH 5 release. </note> @@ -2409,6 +3072,17 @@ sudo pip-python install ssl</codeblock> Impala 1.4.1 is only available as part of CDH 5.1.2, not under CDH 4. </note> + <note id="standalone_release_notes_blurb"> + Starting in April 2016, future release note updates are being consolidated + in a single location to avoid duplication of stale or incomplete information. + You can view online the Impala + <xref href="http://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_new_features.html" scope="external" format="html">New Features</xref>, + <xref href="http://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_incompatible_changes.html" scope="external" format="html">Incompatible Changes</xref>, + <xref href="http://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_known_issues.html" scope="external" format="html">Known Issues</xref>, and + <xref href="http://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_fixed_issues.html" scope="external" format="html">Fixed Issues</xref>. + You can view or print all of these by downloading <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/impala.html" scope="external" format="html">the latest Impala PDF</xref>. + </note> + <!-- The only significant text in this paragraph is inside the <ph> tags. Those are conref'ed into sentences similar in form to the ones below. --> @@ -2472,6 +3146,85 @@ sudo pip-python install ssl</codeblock> </section> + <section id="relnotes"> + + <title>Release Notes</title> + + <p> + These are notes associated with a particular JIRA issue. They typically will be conref'ed + both in the release notes and someplace in the main body as a limitation or warning or similar. + </p> + + <p id="IMPALA-3662" rev="IMPALA-3662"> + The initial release of CDH 5.7 / Impala 2.5 sometimes has a higher peak memory usage than in previous releases + while reading Parquet files. 
+ The following query options might help to reduce memory consumption in the Parquet scanner: + <ul> + <li> + Reduce the number of scanner threads, for example: <codeph>set num_scanner_threads=30</codeph> + </li> + <li> + Reduce the batch size, for example: <codeph>set batch_size=512</codeph> + </li> + <li> + Increase the memory limit, for example: <codeph>set mem_limit=64g</codeph> + </li> + </ul> + You can track the status of the fix for this issue at + <xref href="https://issues.cloudera.org/browse/IMPALA-3662" scope="external" format="html">IMPALA-3662</xref>. + </p> + + <p id="increase_catalogd_heap_size" rev="CDH-40801 TSB-168"> + For schemas with large numbers of tables, partitions, and data files, the <cmdname>catalogd</cmdname> + daemon might encounter an out-of-memory error. To increase the memory limit for the + <cmdname>catalogd</cmdname> daemon: + + <ol> + <li> + <p> + Check current memory usage for the <cmdname>catalogd</cmdname> daemon by running the + following commands on the host where that daemon runs on your cluster: + </p> + <codeblock> + jcmd <varname>catalogd_pid</varname> VM.flags + jmap -heap <varname>catalogd_pid</varname> + </codeblock> + </li> + <li> + <p> + Decide on a large enough value for the <cmdname>catalogd</cmdname> heap. + You express it as an environment variable value as follows: + </p> + <codeblock> + JAVA_TOOL_OPTIONS="-Xmx8g" + </codeblock> + </li> + <li> + <p rev="OPSAPS-26483"> + On systems managed by Cloudera Manager, include this value in the configuration field + <uicontrol>Java Heap Size of Catalog Server in Bytes</uicontrol> (Cloudera Manager 5.7 and higher), or + <uicontrol>Impala Catalog Server Environment Advanced Configuration Snippet (Safety Valve)</uicontrol> + (prior to Cloudera Manager 5.7). + Then restart the Impala service. + </p> + </li> + <li> + <p> + On systems not managed by Cloudera Manager, put this environment variable setting into the + startup script for the <cmdname>catalogd</cmdname> daemon, then restart the <cmdname>catalogd</cmdname> + daemon. + </p> + </li> + <li> + <p> + Use the same <cmdname>jcmd</cmdname> and <cmdname>jmap</cmdname> commands as earlier to + verify that the new settings are in effect. + </p> + </li> + </ol> + </p> + </section> + </conbody> </concept>