http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_perf_stats.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_perf_stats.xml b/docs/topics/impala_perf_stats.xml index 7ed878f..5e12881 100644 --- a/docs/topics/impala_perf_stats.xml +++ b/docs/topics/impala_perf_stats.xml @@ -3,7 +3,16 @@ <concept id="perf_stats"> <title>Table and Column Statistics</title> - + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Concepts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> <conbody> @@ -16,7 +25,1007 @@ with, and how to produce them and keep them up to date. </p> - + <note> + <p rev="1.2.2"> + Originally, Impala relied on the Hive mechanism for collecting statistics, through the Hive <codeph>ANALYZE + TABLE</codeph> statement which initiates a MapReduce job. For better user-friendliness and reliability, + Impala implements its own <codeph>COMPUTE STATS</codeph> statement in Impala 1.2.2 and higher, along with the + <codeph>DROP STATS</codeph>, <codeph>SHOW TABLE STATS</codeph>, and <codeph>SHOW COLUMN STATS</codeph> + statements. + </p> + </note> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="perf_table_stats"> + + <title id="table_stats">Overview of Table Statistics</title> + <prolog> + <metadata> + <data name="Category" value="Concepts"/> + </metadata> + </prolog> + + <conbody> + +<!-- Hive background info: https://cwiki.apache.org/Hive/statsdev.html --> + + <p> + The Impala query planner can make use of statistics about entire tables and partitions. + This information includes physical characteristics such as the number of rows, number of data files, + the total size of the data files, and the file format. For partitioned tables, the numbers + are calculated per partition, and as totals for the whole table. + This metadata is stored in the metastore database, and can be updated by either Impala or Hive. + If a number is not available, the value -1 is used as a placeholder. + Some numbers, such as number and total sizes of data files, are always kept up to date because + they can be calculated cheaply, as part of gathering HDFS block metadata. + </p> + + <p> + The following example shows table stats for an unpartitioned Parquet table. + The values for the number and sizes of files are always available. + Initially, the number of rows is not known, because it requires a potentially expensive + scan through the entire table, and so that value is displayed as -1. + The <codeph>COMPUTE STATS</codeph> statement fills in any unknown table stats values. + </p> + +<codeblock> +show table stats parquet_snappy; ++-------+--------+---------+--------------+-------------------+---------+-------------------+... +| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |... ++-------+--------+---------+--------------+-------------------+---------+-------------------+... +| -1 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false |... ++-------+--------+---------+--------------+-------------------+---------+-------------------+... + +compute stats parquet_snappy; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 6 column(s). 
| ++-----------------------------------------+ + + +show table stats parquet_snappy; ++------------+--------+---------+--------------+-------------------+---------+-------------------+... +| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |... ++------------+--------+---------+--------------+-------------------+---------+-------------------+... +| 1000000000 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false |... ++------------+--------+---------+--------------+-------------------+---------+-------------------+... +</codeblock> + + <p> + Impala performs some optimizations using this metadata on its own, and other optimizations by + using a combination of table and column statistics. + </p> + + <p rev="1.2.1"> + To check that table statistics are available for a table, and see the details of those statistics, use the + statement <codeph>SHOW TABLE STATS <varname>table_name</varname></codeph>. See + <xref href="impala_show.xml#show"/> for details. + </p> + + <p> + If you use the Hive-based methods of gathering statistics, see + <xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the + Hive wiki</xref> for information about the required configuration on the Hive side. Cloudera recommends + using the Impala <codeph>COMPUTE STATS</codeph> statement to avoid potential configuration and scalability + issues with the statistics-gathering process. + </p> + + <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> + </conbody> + </concept> + + <concept id="perf_column_stats"> + + <title id="column_stats">Overview of Column Statistics</title> + + <conbody> + +<!-- Cloudera+Hive background information: http://blog.cloudera.com/blog/2012/08/column-statistics-in-hive/ --> + + <p> + The Impala query planner can make use of statistics about individual columns when that metadata is + available in the metastore database. This technique is most valuable for columns compared across tables in + <xref href="impala_perf_joins.xml#perf_joins">join queries</xref>, to help estimate how many rows the query + will retrieve from each table. <ph rev="2.0.0"> These statistics are also important for correlated + subqueries using the <codeph>EXISTS()</codeph> or <codeph>IN()</codeph> operators, which are processed + internally the same way as join queries.</ph> + </p> + + <p> + The following example shows column stats for an unpartitioned Parquet table. + The values for the maximum and average sizes of some types are always available, + because those figures are constant for numeric and other fixed-size types. + Initially, the number of distinct values is not known, because it requires a potentially expensive + scan through the entire table, and so that value is displayed as -1. + The same applies to maximum and average sizes of variable-sized types, such as <codeph>STRING</codeph>. + The <codeph>COMPUTE STATS</codeph> statement fills in most unknown column stats values. + (It does not record the number of <codeph>NULL</codeph> values, because currently Impala + does not use that figure for query optimization.) 
+ </p> + +<codeblock> +show column stats parquet_snappy; ++-------------+----------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++-------------+----------+------------------+--------+----------+----------+ +| id | BIGINT | -1 | -1 | 8 | 8 | +| val | INT | -1 | -1 | 4 | 4 | +| zerofill | STRING | -1 | -1 | -1 | -1 | +| name | STRING | -1 | -1 | -1 | -1 | +| assertion | BOOLEAN | -1 | -1 | 1 | 1 | +| location_id | SMALLINT | -1 | -1 | 2 | 2 | ++-------------+----------+------------------+--------+----------+----------+ + +compute stats parquet_snappy; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 6 column(s). | ++-----------------------------------------+ + +show column stats parquet_snappy; ++-------------+----------+------------------+--------+----------+-------------------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++-------------+----------+------------------+--------+----------+-------------------+ +| id | BIGINT | 183861280 | -1 | 8 | 8 | +| val | INT | 139017 | -1 | 4 | 4 | +| zerofill | STRING | 101761 | -1 | 6 | 6 | +| name | STRING | 145636240 | -1 | 22 | 13.00020027160645 | +| assertion | BOOLEAN | 2 | -1 | 1 | 1 | +| location_id | SMALLINT | 339 | -1 | 2 | 2 | ++-------------+----------+------------------+--------+----------+-------------------+ +</codeblock> + + <note> + <p> + For column statistics to be effective in Impala, you also need to have table statistics for the + applicable tables, as described in <xref href="impala_perf_stats.xml#perf_table_stats"/>. When you use + the Impala <codeph>COMPUTE STATS</codeph> statement, both table and column statistics are automatically + gathered at the same time, for all columns in the table. + </p> + <p conref="../shared/impala_common.xml#common/decimal_no_stats"/> + </note> + + <note conref="../shared/impala_common.xml#common/compute_stats_nulls"/> + +<!-- Hive-based instructions are considered obsolete since the introduction of the Impala COMPUTE STATS statement. + <p> + Add settings like the following to the <filepath>hive-site.xml</filepath> + configuration file, in the Hive configuration directory, on every node where you run + <codeph>ANALYZE TABLE</codeph> statements through the + <codeph>hive</codeph> shell. The + <codeph>hive.stats.ndv.error</codeph> setting represents the standard error when + estimating the number of distinct values for a column. The value of 5.0 is recommended as a tradeoff between the + accuracy of the gathered statistics and the resource usage of the stats-gathering process. + </p> + +<codeblock><![CDATA[<property> + <name>hive.stats.ndv.error</name> + <value>5.0</value> +</property>]]></codeblock> + + <p> + 5.0 is a relatively low value that devotes substantial computational resources to the statistics-gathering + process. To reduce the resource usage, you could increase this value; to make the statistics even more precise, + you could lower it. + </p> + + <p> + The syntax for gathering column statistics uses the <codeph>ANALYZE TABLE ... + COMPUTE STATISTICS</codeph> clause, with an additional <codeph>FOR + COLUMNS</codeph> clause. For partitioned tables, you can gather statistics for specific partitions by including + a clause <codeph>PARTITION + (<varname>col1=val1</varname>,<varname>col2=val2</varname>, + ...)</codeph>; but you cannot include the partitioning columns in the + <codeph>FOR COLUMNS</codeph> clause. 
Also, you cannot use fully qualified table + names, so issue a <codeph>USE</codeph> command first to switch to the + appropriate database. For example: + </p> + +<codeblock>USE <varname>database_name</varname>; +ANALYZE TABLE <varname>table_name</varname> COMPUTE STATISTICS FOR COLUMNS <varname>column_list</varname>; +ANALYZE TABLE <varname>table_name</varname> PARTITION (<varname>partition_specs</varname>) COMPUTE STATISTICS FOR COLUMNS <varname>column_list</varname>;</codeblock> +--> + + <p rev="1.2.1"> + To check whether column statistics are available for a particular set of columns, use the <codeph>SHOW + COLUMN STATS <varname>table_name</varname></codeph> statement, or check the extended + <codeph>EXPLAIN</codeph> output for a query against that table that refers to those columns. See + <xref href="impala_show.xml#show"/> and <xref href="impala_explain.xml#explain"/> for details. + </p> + + <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> + </conbody> + </concept> + + <concept id="perf_stats_partitions"> + <title id="stats_partitions">How Table and Column Statistics Work for Partitioned Tables</title> + <conbody> + + <p> + When you use Impala for <q>big data</q>, you are highly likely to use partitioning + for your biggest tables, the ones representing data that can be logically divided + based on dates, geographic regions, or similar criteria. The table and column statistics + are especially useful for optimizing queries on such tables. For example, a query involving + one year might involve substantially more or less data than a query involving a different year, + or a range of several years. Each query might be optimized differently as a result. + </p> + + <p> + The following examples show how table and column stats work with a partitioned table. + The table for this example is partitioned by year, month, and day. + For simplicity, the sample data consists of 5 partitions, all from the same year and month. + Table stats are collected independently for each partition. (In fact, the + <codeph>SHOW PARTITIONS</codeph> statement displays exactly the same information as + <codeph>SHOW TABLE STATS</codeph> for a partitioned table.) Column stats apply to + the entire table, not to individual partitions. Because the partition key column values + are represented as HDFS directories, their characteristics are typically known in advance, + even when the values for non-key columns are shown as -1. + </p> + +<codeblock> +show partitions year_month_day; ++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... +| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |... ++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... +| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... +| Total | | | -1 | 5 | 12.58MB | 0B | | |... ++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... + +show table stats year_month_day; ++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... 
+| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |... ++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... +| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... +| Total | | | -1 | 5 | 12.58MB | 0B | | |... ++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... + +show column stats year_month_day; ++-----------+---------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++-----------+---------+------------------+--------+----------+----------+ +| id | INT | -1 | -1 | 4 | 4 | +| val | INT | -1 | -1 | 4 | 4 | +| zfill | STRING | -1 | -1 | -1 | -1 | +| name | STRING | -1 | -1 | -1 | -1 | +| assertion | BOOLEAN | -1 | -1 | 1 | 1 | +| year | INT | 1 | 0 | 4 | 4 | +| month | INT | 1 | 0 | 4 | 4 | +| day | INT | 5 | 0 | 4 | 4 | ++-----------+---------+------------------+--------+----------+----------+ + +compute stats year_month_day; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 5 partition(s) and 5 column(s). | ++-----------------------------------------+ + +show table stats year_month_day; ++-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+... +| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |... ++-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+... +| 2013 | 12 | 1 | 93606 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 2 | 94158 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 3 | 94122 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 4 | 93559 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 5 | 93845 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... +| Total | | | 469290 | 5 | 12.58MB | 0B | | |... ++-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+... + +show column stats year_month_day; ++-----------+---------+------------------+--------+----------+-------------------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++-----------+---------+------------------+--------+----------+-------------------+ +| id | INT | 511129 | -1 | 4 | 4 | +| val | INT | 364853 | -1 | 4 | 4 | +| zfill | STRING | 311430 | -1 | 6 | 6 | +| name | STRING | 471975 | -1 | 22 | 13.00160026550293 | +| assertion | BOOLEAN | 2 | -1 | 1 | 1 | +| year | INT | 1 | 0 | 4 | 4 | +| month | INT | 1 | 0 | 4 | 4 | +| day | INT | 5 | 0 | 4 | 4 | ++-----------+---------+------------------+--------+----------+-------------------+ +</codeblock> + + <note> + Partitioned tables can grow so large that scanning the entire table, as the <codeph>COMPUTE STATS</codeph> + statement does, is impractical just to update the statistics for a new partition. The standard + <codeph>COMPUTE STATS</codeph> statement might take hours, or even days. That situation is where you switch + to using incremental statistics, a feature available in CDH 5.3 / Impala 2.1 and higher. 
+ See <xref href="impala_perf_stats.xml#perf_stats_incremental"/> for details about this feature + and the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax. + </note> + + <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> + </conbody> + </concept> + + <concept rev="2.1.0" id="perf_stats_incremental"> + + <title id="incremental_stats">Overview of Incremental Statistics</title> + + <conbody> + + <p> + In Impala 2.1.0 and higher, you can use the syntax <codeph>COMPUTE INCREMENTAL STATS</codeph> and + <codeph>DROP INCREMENTAL STATS</codeph>. The <codeph>INCREMENTAL</codeph> clauses work with incremental + statistics, a specialized feature for partitioned tables that are large or frequently updated with new + partitions. + </p> + + <p> + When you compute incremental statistics for a partitioned table, by default Impala only processes those + partitions that do not yet have incremental statistics. By processing only newly added partitions, you can + keep statistics up to date for large partitioned tables, without incurring the overhead of reprocessing the + entire table each time. + </p> + + <p> + You can also compute or drop statistics for a single partition by including a <codeph>PARTITION</codeph> + clause in the <codeph>COMPUTE INCREMENTAL STATS</codeph> or <codeph>DROP INCREMENTAL STATS</codeph> + statement. + </p> + + <p> + The metadata for incremental statistics is handled differently from the original style of statistics: + </p> + + <ul> + <li> + <p> + If you have an existing partitioned table for which you have already computed statistics, issuing + <codeph>COMPUTE INCREMENTAL STATS</codeph> without a partition clause causes Impala to rescan the + entire table. Once the incremental statistics are computed, any future <codeph>COMPUTE INCREMENTAL + STATS</codeph> statements only scan any new partitions and any partitions where you performed + <codeph>DROP INCREMENTAL STATS</codeph>. + </p> + </li> + + <li> + <p> + The <codeph>SHOW TABLE STATS</codeph> and <codeph>SHOW PARTITIONS</codeph> statements now include an + additional column showing whether incremental statistics are available for each column. A partition + could already be covered by the original type of statistics based on a prior <codeph>COMPUTE + STATS</codeph> statement, as indicated by a value other than <codeph>-1</codeph> under the + <codeph>#Rows</codeph> column. Impala query planning uses either kind of statistics when available. + </p> + </li> + + <li> + <p> + <codeph>COMPUTE INCREMENTAL STATS</codeph> takes more time than <codeph>COMPUTE STATS</codeph> for the + same volume of data. Therefore it is most suitable for tables with large data volume where new + partitions are added frequently, making it impractical to run a full <codeph>COMPUTE STATS</codeph> + operation for each new partition. For unpartitioned tables, or partitioned tables that are loaded once + and not updated with new partitions, use the original <codeph>COMPUTE STATS</codeph> syntax. + </p> + </li> + + <li> + <p> + <codeph>COMPUTE INCREMENTAL STATS</codeph> uses some memory in the <cmdname>catalogd</cmdname> process, + proportional to the number of partitions and number of columns in the applicable table. The memory + overhead is approximately 400 bytes for each column in each partition. This memory is reserved in the + <cmdname>catalogd</cmdname> daemon, the <cmdname>statestored</cmdname> daemon, and in each instance of + the <cmdname>impalad</cmdname> daemon. 
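+ </p>
+ <p>
+ As a rough sizing sketch, using the approximate 400-byte figure above and
+ hypothetical table dimensions:
+ </p>
+<codeblock>
+-- Back-of-the-envelope estimate, not an exact formula:
+-- 20,000 partitions x 100 columns x 400 bytes = 800,000,000 bytes (~800 MB)
+-- of metadata held in catalogd, statestored, and each impalad instance.
+</codeblock>
+ <p>
+ Keep this overhead in mind before enabling incremental stats on very wide,
+ heavily partitioned tables.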
+ </p> + </li> + + <li> + <p> + In cases where new files are added to an existing partition, issue a <codeph>REFRESH</codeph> statement + for the table, followed by a <codeph>DROP INCREMENTAL STATS</codeph> and <codeph>COMPUTE INCREMENTAL + STATS</codeph> sequence for the changed partition. + </p> + </li> + + <li> + <p> + The <codeph>DROP INCREMENTAL STATS</codeph> statement operates only on a single partition at a time. To + remove statistics (whether incremental or not) from all partitions of a table, issue a <codeph>DROP + STATS</codeph> statement with no <codeph>INCREMENTAL</codeph> or <codeph>PARTITION</codeph> clauses. + </p> + </li> + </ul> + + <p> + The following considerations apply to incremental statistics when the structure of an existing table is + changed (known as <term>schema evolution</term>): + </p> + + <ul> + <li> + <p> + If you use an <codeph>ALTER TABLE</codeph> statement to drop a column, the existing statistics remain + valid and <codeph>COMPUTE INCREMENTAL STATS</codeph> does not rescan any partitions. + </p> + </li> + + <li> + <p> + If you use an <codeph>ALTER TABLE</codeph> statement to add a column, Impala rescans all partitions and + fills in the appropriate column-level values the next time you run <codeph>COMPUTE INCREMENTAL + STATS</codeph>. + </p> + </li> + + <li> + <p> + If you use an <codeph>ALTER TABLE</codeph> statement to change the data type of a column, Impala + rescans all partitions and fills in the appropriate column-level values the next time you run + <codeph>COMPUTE INCREMENTAL STATS</codeph>. + </p> + </li> + + <li> + <p> + If you use an <codeph>ALTER TABLE</codeph> statement to change the file format of a table, the existing + statistics remain valid and a subsequent <codeph>COMPUTE INCREMENTAL STATS</codeph> does not rescan any + partitions. + </p> + </li> + </ul> + + <p> + See <xref href="impala_compute_stats.xml#compute_stats"/> and + <xref href="impala_drop_stats.xml#drop_stats"/> for syntax details. + </p> + </conbody> + </concept> + + <concept id="perf_stats_computing"> + <title>Generating Table and Column Statistics (COMPUTE STATS Statement)</title> + <conbody> + + <p> + To gather table statistics after loading data into a table or partition, you typically use the + <codeph>COMPUTE STATS</codeph> statement. This statement is available in Impala 1.2.2 and higher. + It gathers both table statistics and column statistics for all columns in a single operation. + For large partitioned tables, where you frequently need to update statistics and it is impractical + to scan the entire table each time, use the syntax <codeph>COMPUTE INCREMENTAL STATS</codeph>, + which is available in CDH 5.3 / Impala 2.1 and higher. + </p> + + <p> + If you use Hive as part of your ETL workflow, you can also use Hive to generate table and + column statistics. You might need to do extra configuration within Hive itself, the metastore, + or even set up a separate database to hold Hive-generated statistics. You might need to run + multiple statements to generate all the necessary statistics. Therefore, prefer the + Impala <codeph>COMPUTE STATS</codeph> statement where that technique is practical. + For details about collecting statistics through Hive, see + <xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the Hive wiki</xref>. + </p> + + <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> + +<!-- Commenting out over-detailed Hive instructions as part of stats reorg. 
+ <li> + Issue an <codeph>ANALYZE TABLE</codeph> statement in Hive, for the entire table or a specific partition. +<codeblock>ANALYZE TABLE <varname>tablename</varname> [PARTITION(<varname>partcol1</varname>[=<varname>val1</varname>], <varname>partcol2</varname>[=<varname>val2</varname>], ...)] COMPUTE STATISTICS [NOSCAN];</codeblock> + For example, to gather statistics for a non-partitioned table: +<codeblock>ANALYZE TABLE customer COMPUTE STATISTICS;</codeblock> + To gather statistics for a <codeph>store</codeph> table partitioned by state and city, and both of its + partitions: +<codeblock>ANALYZE TABLE store PARTITION(s_state, s_county) COMPUTE STATISTICS;</codeblock> + To gather statistics for the <codeph>store</codeph> table and only the partitions for California: +<codeblock>ANALYZE TABLE store PARTITION(s_state='CA', s_county) COMPUTE STATISTICS;</codeblock> + </li> + + <li> + Load the data through the <codeph>INSERT OVERWRITE</codeph> statement in Hive, while the Hive setting + <b>hive.stats.autogather</b> is enabled. + </li> + + </ul> +--> + + <p rev="2.0.1"> +<!-- Additional info as a result of IMPALA-1420 --> +<!-- Keep checking if https://issues.apache.org/jira/browse/HIVE-8648 ever gets fixed and when that fix makes it into a CDH release. --> + For your very largest tables, you might find that <codeph>COMPUTE STATS</codeph> or even <codeph>COMPUTE INCREMENTAL STATS</codeph> + take so long to scan the data that it is impractical to use them regularly. In such a case, after adding a partition or inserting new data, + you can update just the number of rows property through an <codeph>ALTER TABLE</codeph> statement. + See <xref href="impala_perf_stats.xml#perf_table_stats_manual"/> for details. + Because the column statistics might be left in a stale state, do not use this technique as a replacement + for <codeph>COMPUTE STATS</codeph>. Only use this technique if all other means of collecting statistics are impractical, or as a + low-overhead operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> operations. + </p> + + </conbody> + </concept> + + <concept rev="2.1.0" id="perf_stats_checking"> + + <title>Detecting Missing Statistics</title> + + <conbody> + + <p> + You can check whether a specific table has statistics using the <codeph>SHOW TABLE STATS</codeph> statement + (for any table) or the <codeph>SHOW PARTITIONS</codeph> statement (for a partitioned table). Both + statements display the same information. If a table or a partition does not have any statistics, the + <codeph>#Rows</codeph> field contains <codeph>-1</codeph>. Once you compute statistics for the table or + partition, the <codeph>#Rows</codeph> field changes to an accurate value. + </p> + + <p> + The following example shows a table that initially does not have any statistics. The <codeph>SHOW TABLE + STATS</codeph> statement displays different values for <codeph>#Rows</codeph> before and after the + <codeph>COMPUTE STATS</codeph> operation. 
+ </p> + +<codeblock>[localhost:21000] > create table no_stats (x int); +[localhost:21000] > show table stats no_stats; ++-------+--------+------+--------------+--------+-------------------+ +| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | ++-------+--------+------+--------------+--------+-------------------+ +| -1 | 0 | 0B | NOT CACHED | TEXT | false | ++-------+--------+------+--------------+--------+-------------------+ +[localhost:21000] > compute stats no_stats; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 1 column(s). | ++-----------------------------------------+ +[localhost:21000] > show table stats no_stats; ++-------+--------+------+--------------+--------+-------------------+ +| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | ++-------+--------+------+--------------+--------+-------------------+ +| 0 | 0 | 0B | NOT CACHED | TEXT | false | ++-------+--------+------+--------------+--------+-------------------+ +</codeblock> + + <p> + The following example shows a similar progression with a partitioned table. Initially, + <codeph>#Rows</codeph> is <codeph>-1</codeph>. After a <codeph>COMPUTE STATS</codeph> operation, + <codeph>#Rows</codeph> changes to an accurate value. Any newly added partition starts with no statistics, + meaning that you must collect statistics after adding a new partition. + </p> + +<codeblock>[localhost:21000] > create table no_stats_partitioned (x int) partitioned by (year smallint); +[localhost:21000] > show table stats no_stats_partitioned; ++-------+-------+--------+------+--------------+--------+-------------------+ +| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | ++-------+-------+--------+------+--------------+--------+-------------------+ +| Total | -1 | 0 | 0B | 0B | | | ++-------+-------+--------+------+--------------+--------+-------------------+ +[localhost:21000] > show partitions no_stats_partitioned; ++-------+-------+--------+------+--------------+--------+-------------------+ +| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | ++-------+-------+--------+------+--------------+--------+-------------------+ +| Total | -1 | 0 | 0B | 0B | | | ++-------+-------+--------+------+--------------+--------+-------------------+ +[localhost:21000] > alter table no_stats_partitioned add partition (year=2013); +[localhost:21000] > compute stats no_stats_partitioned; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 1 column(s). 
| ++-----------------------------------------+ +[localhost:21000] > alter table no_stats_partitioned add partition (year=2014); +[localhost:21000] > show partitions no_stats_partitioned; ++-------+-------+--------+------+--------------+--------+-------------------+ +| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | ++-------+-------+--------+------+--------------+--------+-------------------+ +| 2013 | 0 | 0 | 0B | NOT CACHED | TEXT | false | +| 2014 | -1 | 0 | 0B | NOT CACHED | TEXT | false | +| Total | 0 | 0 | 0B | 0B | | | ++-------+-------+--------+------+--------------+--------+-------------------+ +</codeblock> + + <note> + Because the default <codeph>COMPUTE STATS</codeph> statement creates and updates statistics for all + partitions in a table, if you expect to frequently add new partitions, use the <codeph>COMPUTE INCREMENTAL + STATS</codeph> syntax instead, which lets you compute stats for a single specified partition, or only for + those partitions that do not already have incremental stats. + </note> + + <p> + If checking each individual table is impractical, due to a large number of tables or views that hide the + underlying base tables, you can also check for missing statistics for a particular query. Use the + <codeph>EXPLAIN</codeph> statement to preview query efficiency before actually running the query. Use the + query profile output available through the <codeph>PROFILE</codeph> command in + <cmdname>impala-shell</cmdname> or the web UI to verify query execution and timing after running the query. + Both the <codeph>EXPLAIN</codeph> plan and the <codeph>PROFILE</codeph> output display a warning if any + tables or partitions involved in the query do not have statistics. + </p> + +<codeblock>[localhost:21000] > create table no_stats (x int); +[localhost:21000] > explain select count(*) from no_stats; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. | +| incremental_stats.no_stats | +| | +| 03:AGGREGATE [FINALIZE] | +| | output: count:merge(*) | +| | | +| 02:EXCHANGE [UNPARTITIONED] | +| | | +| 01:AGGREGATE | +| | output: count(*) | +| | | +| 00:SCAN HDFS [incremental_stats.no_stats] | +| partitions=1/1 files=0 size=0B | ++------------------------------------------------------------------------------------+ +</codeblock> + + <p> + Because Impala uses the <term>partition pruning</term> technique when possible to only evaluate certain + partitions, if you have a partitioned table with statistics for some partitions and not others, whether or + not the <codeph>EXPLAIN</codeph> statement shows the warning depends on the actual partitions used by the + query. For example, you might see warnings or not for different queries against the same table: + </p> + +<codeblock>-- No warning because all the partitions for the year 2012 have stats. +EXPLAIN SELECT ... FROM t1 WHERE year = 2012; + +-- Missing stats warning because one or more partitions in this range +-- do not have stats. +EXPLAIN SELECT ... 
FROM t1 WHERE year BETWEEN 2006 AND 2009; +</codeblock> + + <p> + To confirm if any partitions at all in the table are missing statistics, you might explain a query that + scans the entire table, such as <codeph>SELECT COUNT(*) FROM <varname>table_name</varname></codeph>. + </p> + </conbody> + </concept> + + <concept rev="2.1.0" id="perf_stats_collecting"> + + <title>Keeping Statistics Up to Date</title> + + <conbody> + + <p> + When the contents of a table or partition change significantly, recompute the stats for the relevant table + or partition. The degree of change that qualifies as <q>significant</q> varies, depending on the absolute + and relative sizes of the tables. Typically, if you add more than 30% more data to a table, it is + worthwhile to recompute stats, because the differences in number of rows and number of distinct values + might cause Impala to choose a different join order when that table is used in join queries. This guideline + is most important for the largest tables. For example, adding 30% new data to a table containing 1 TB has a + greater effect on join order than adding 30% to a table containing only a few megabytes, and the larger + table has a greater effect on query performance if Impala chooses a suboptimal join order as a result of + outdated statistics. + </p> + + <p> + If you reload a complete new set of data for a table, but the number of rows and number of distinct values + for each column is relatively unchanged from before, you do not need to recompute stats for the table. + </p> + + <p> + If the statistics for a table are out of date, and the table's large size makes it impractical to recompute + new stats immediately, you can use the <codeph>DROP STATS</codeph> statement to remove the obsolete + statistics, making it easier to identify tables that need a new <codeph>COMPUTE STATS</codeph> operation. + </p> + + <p> + For a large partitioned table, consider using the incremental stats feature available in Impala 2.1.0 and + higher, as explained in <xref href="impala_perf_stats.xml#perf_stats_incremental"/>. If you add a new + partition to a table, it is worthwhile to recompute incremental stats, because the operation only scans the + data for that one new partition. + </p> + </conbody> + </concept> + +<!-- Might deserve its own conceptual topic at some point. --> + + <concept audience="Cloudera" rev="1.2.2" id="perf_stats_joins"> + + <title>How Statistics Are Used in Join Queries</title> + + <conbody> + + <p></p> </conbody> </concept> +<!-- Might deserve its own conceptual topic at some point. --> + + <concept audience="Cloudera" rev="1.2.2" id="perf_stats_inserts"> + + <title>How Statistics Are Used in INSERT Operations</title> + + <conbody> + + <p conref="../shared/impala_common.xml#common/insert_hints"/> + </conbody> + </concept> + + <concept rev="1.2.2" id="perf_table_stats_manual"> + + <title>Setting the NUMROWS Value Manually through ALTER TABLE</title> + + <conbody> + + <p> + The most crucial piece of data in all the statistics is the number of rows in the table (for an + unpartitioned or partitioned table) and for each partition (for a partitioned table). The <codeph>COMPUTE STATS</codeph> + statement always gathers statistics about all columns, as well as overall table statistics. 
If it is not + practical to do a full <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> + operation after adding a partition or inserting data, or if you can see that Impala would produce a more + efficient plan if the number of rows was different, you can manually set the number of rows through an + <codeph>ALTER TABLE</codeph> statement: + </p> + +<codeblock> +-- Set total number of rows. Applies to both unpartitioned and partitioned tables. +alter table <varname>table_name</varname> set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true'); + +-- Set total number of rows for a specific partition. Applies to partitioned tables only. +-- You must specify all the partition key columns in the PARTITION clause. +alter table <varname>table_name</varname> partition (<varname>keycol1</varname>=<varname>val1</varname>,<varname>keycol2</varname>=<varname>val2</varname>...) set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true'); +</codeblock> + + <p> + This statement avoids re-scanning any data files. (The requirement to include the <codeph>STATS_GENERATED_VIA_STATS_TASK</codeph> property is relatively new, as a + result of the issue <xref href="https://issues.apache.org/jira/browse/HIVE-8648" scope="external" format="html">HIVE-8648</xref> + for the Hive metastore.) + </p> + +<codeblock conref="../shared/impala_common.xml#common/set_numrows_example"/> + + <p> + For a partitioned table, update both the per-partition number of rows and the number of rows for the whole + table: + </p> + +<codeblock conref="../shared/impala_common.xml#common/set_numrows_partitioned_example"/> + + <p> + In practice, the <codeph>COMPUTE STATS</codeph> statement, or <codeph>COMPUTE INCREMENTAL STATS</codeph> + for a partitioned table, should be fast and convenient enough that this technique is only useful for the very + largest partitioned tables. + <!-- + It is most useful as a workaround for in case of performance issues where you might adjust the <codeph>numRows</codeph> value higher + or lower to produce the ideal join order. + --> + <!-- Following wording is duplicated from earlier. Consider conref'ing. --> + Because the column statistics might be left in a stale state, do not use this technique as a replacement + for <codeph>COMPUTE STATS</codeph>. Only use this technique if all other means of collecting statistics are impractical, or as a + low-overhead operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> operations. + </p> + </conbody> + </concept> + + <concept id="perf_column_stats_manual" rev="2.6.0 IMPALA-3369"> + <title>Setting Column Stats Manually through ALTER TABLE</title> + <conbody> + <p> + In CDH 5.8 / Impala 2.6 and higher, you can also use the <codeph>SET COLUMN STATS</codeph> + clause of <codeph>ALTER TABLE</codeph> to manually set or change column statistics. + Only use this technique in cases where it is impractical to run + <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> + frequently enough to keep up with data changes for a huge table. 
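+ </p>
+
+ <p>
+ The following is a minimal sketch, using a hypothetical table and column name.
+ The statistics are specified as a list of key-value string literals; the
+ supported keys include <codeph>numDVs</codeph>, <codeph>numNulls</codeph>,
+ <codeph>avgSize</codeph>, and <codeph>maxSize</codeph>:
+ </p>
+
+<codeblock>
+-- Hypothetical example: claim roughly 300,000 distinct values and a
+-- maximum size of 13 bytes for one STRING column.
+alter table sales_data set column stats customer_name ('numDVs'='300000','maxSize'='13');
+</codeblock>
+
+ <p>
+ As when setting <codeph>numRows</codeph> manually, treat such values as
+ provisional until it is practical to run a full <codeph>COMPUTE STATS</codeph>.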
+ </p> + <p conref="../shared/impala_common.xml#common/set_column_stats_example"/> + </conbody> + </concept> + + <concept rev="1.2.2" id="perf_stats_examples"> + + <title>Examples of Using Table and Column Statistics with Impala</title> + + <conbody> + + <p> + The following examples walk through a sequence of <codeph>SHOW TABLE STATS</codeph>, <codeph>SHOW COLUMN + STATS</codeph>, <codeph>ALTER TABLE</codeph>, and <codeph>SELECT</codeph> and <codeph>INSERT</codeph> + statements to illustrate various aspects of how Impala uses statistics to help optimize queries. + </p> + + <p> + This example shows table and column statistics for the <codeph>STORE</codeph> column used in the + <xref href="http://www.tpc.org/tpcds/" scope="external" format="html">TPC-DS benchmarks for decision + support</xref> systems. It is a tiny table holding data for 12 stores. Initially, before any statistics are + gathered by a <codeph>COMPUTE STATS</codeph> statement, most of the numeric fields show placeholder values + of -1, indicating that the figures are unknown. The figures that are filled in are values that are easily + countable or deducible at the physical level, such as the number of files, total data size of the files, + and the maximum and average sizes for data types that have a constant size such as <codeph>INT</codeph>, + <codeph>FLOAT</codeph>, and <codeph>TIMESTAMP</codeph>. + </p> + +<codeblock>[localhost:21000] > show table stats store; ++-------+--------+--------+--------+ +| #Rows | #Files | Size | Format | ++-------+--------+--------+--------+ +| -1 | 1 | 3.08KB | TEXT | ++-------+--------+--------+--------+ +Returned 1 row(s) in 0.03s +[localhost:21000] > show column stats store; ++--------------------+-----------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------------------+-----------+------------------+--------+----------+----------+ +| s_store_sk | INT | -1 | -1 | 4 | 4 | +| s_store_id | STRING | -1 | -1 | -1 | -1 | +| s_rec_start_date | TIMESTAMP | -1 | -1 | 16 | 16 | +| s_rec_end_date | TIMESTAMP | -1 | -1 | 16 | 16 | +| s_closed_date_sk | INT | -1 | -1 | 4 | 4 | +| s_store_name | STRING | -1 | -1 | -1 | -1 | +| s_number_employees | INT | -1 | -1 | 4 | 4 | +| s_floor_space | INT | -1 | -1 | 4 | 4 | +| s_hours | STRING | -1 | -1 | -1 | -1 | +| s_manager | STRING | -1 | -1 | -1 | -1 | +| s_market_id | INT | -1 | -1 | 4 | 4 | +| s_geography_class | STRING | -1 | -1 | -1 | -1 | +| s_market_desc | STRING | -1 | -1 | -1 | -1 | +| s_market_manager | STRING | -1 | -1 | -1 | -1 | +| s_division_id | INT | -1 | -1 | 4 | 4 | +| s_division_name | STRING | -1 | -1 | -1 | -1 | +| s_company_id | INT | -1 | -1 | 4 | 4 | +| s_company_name | STRING | -1 | -1 | -1 | -1 | +| s_street_number | STRING | -1 | -1 | -1 | -1 | +| s_street_name | STRING | -1 | -1 | -1 | -1 | +| s_street_type | STRING | -1 | -1 | -1 | -1 | +| s_suite_number | STRING | -1 | -1 | -1 | -1 | +| s_city | STRING | -1 | -1 | -1 | -1 | +| s_county | STRING | -1 | -1 | -1 | -1 | +| s_state | STRING | -1 | -1 | -1 | -1 | +| s_zip | STRING | -1 | -1 | -1 | -1 | +| s_country | STRING | -1 | -1 | -1 | -1 | +| s_gmt_offset | FLOAT | -1 | -1 | 4 | 4 | +| s_tax_percentage | FLOAT | -1 | -1 | 4 | 4 | ++--------------------+-----------+------------------+--------+----------+----------+ +Returned 29 row(s) in 0.04s</codeblock> + + <p> + With the Hive <codeph>ANALYZE TABLE</codeph> statement for column statistics, you had to specify each + column for which to gather 
statistics. The Impala <codeph>COMPUTE STATS</codeph> statement automatically + gathers statistics for all columns, because it reads through the entire table relatively quickly and can + efficiently compute the values for all the columns. This example shows how after running the + <codeph>COMPUTE STATS</codeph> statement, statistics are filled in for both the table and all its columns: + </p> + +<codeblock>[localhost:21000] > compute stats store; ++------------------------------------------+ +| summary | ++------------------------------------------+ +| Updated 1 partition(s) and 29 column(s). | ++------------------------------------------+ +Returned 1 row(s) in 1.88s +[localhost:21000] > show table stats store; ++-------+--------+--------+--------+ +| #Rows | #Files | Size | Format | ++-------+--------+--------+--------+ +| 12 | 1 | 3.08KB | TEXT | ++-------+--------+--------+--------+ +Returned 1 row(s) in 0.02s +[localhost:21000] > show column stats store; ++--------------------+-----------+------------------+--------+----------+-------------------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------------------+-----------+------------------+--------+----------+-------------------+ +| s_store_sk | INT | 12 | -1 | 4 | 4 | +| s_store_id | STRING | 6 | -1 | 16 | 16 | +| s_rec_start_date | TIMESTAMP | 4 | -1 | 16 | 16 | +| s_rec_end_date | TIMESTAMP | 3 | -1 | 16 | 16 | +| s_closed_date_sk | INT | 3 | -1 | 4 | 4 | +| s_store_name | STRING | 8 | -1 | 5 | 4.25 | +| s_number_employees | INT | 9 | -1 | 4 | 4 | +| s_floor_space | INT | 10 | -1 | 4 | 4 | +| s_hours | STRING | 2 | -1 | 8 | 7.083300113677979 | +| s_manager | STRING | 7 | -1 | 15 | 12 | +| s_market_id | INT | 7 | -1 | 4 | 4 | +| s_geography_class | STRING | 1 | -1 | 7 | 7 | +| s_market_desc | STRING | 10 | -1 | 94 | 55.5 | +| s_market_manager | STRING | 7 | -1 | 16 | 14 | +| s_division_id | INT | 1 | -1 | 4 | 4 | +| s_division_name | STRING | 1 | -1 | 7 | 7 | +| s_company_id | INT | 1 | -1 | 4 | 4 | +| s_company_name | STRING | 1 | -1 | 7 | 7 | +| s_street_number | STRING | 9 | -1 | 3 | 2.833300113677979 | +| s_street_name | STRING | 12 | -1 | 11 | 6.583300113677979 | +| s_street_type | STRING | 8 | -1 | 9 | 4.833300113677979 | +| s_suite_number | STRING | 11 | -1 | 9 | 8.25 | +| s_city | STRING | 2 | -1 | 8 | 6.5 | +| s_county | STRING | 1 | -1 | 17 | 17 | +| s_state | STRING | 1 | -1 | 2 | 2 | +| s_zip | STRING | 2 | -1 | 5 | 5 | +| s_country | STRING | 1 | -1 | 13 | 13 | +| s_gmt_offset | FLOAT | 1 | -1 | 4 | 4 | +| s_tax_percentage | FLOAT | 5 | -1 | 4 | 4 | ++--------------------+-----------+------------------+--------+----------+-------------------+ +Returned 29 row(s) in 0.04s</codeblock> + + <p> + The following example shows how statistics are represented for a partitioned table. In this case, we have + set up a table to hold the world's most trivial census data, a single <codeph>STRING</codeph> field, + partitioned by a <codeph>YEAR</codeph> column. The table statistics include a separate entry for each + partition, plus final totals for the numeric fields. The column statistics include some easily deducible + facts for the partitioning column, such as the number of distinct values (the number of partition + subdirectories). +<!-- and the number of <codeph>NULL</codeph> values (none in this case). 
--> + </p> + +<codeblock>localhost:21000] > describe census; ++------+----------+---------+ +| name | type | comment | ++------+----------+---------+ +| name | string | | +| year | smallint | | ++------+----------+---------+ +Returned 2 row(s) in 0.02s +[localhost:21000] > show table stats census; ++-------+-------+--------+------+---------+ +| year | #Rows | #Files | Size | Format | ++-------+-------+--------+------+---------+ +| 2000 | -1 | 0 | 0B | TEXT | +| 2004 | -1 | 0 | 0B | TEXT | +| 2008 | -1 | 0 | 0B | TEXT | +| 2010 | -1 | 0 | 0B | TEXT | +| 2011 | 0 | 1 | 22B | TEXT | +| 2012 | -1 | 1 | 22B | TEXT | +| 2013 | -1 | 1 | 231B | PARQUET | +| Total | 0 | 3 | 275B | | ++-------+-------+--------+------+---------+ +Returned 8 row(s) in 0.02s +[localhost:21000] > show column stats census; ++--------+----------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------+----------+------------------+--------+----------+----------+ +| name | STRING | -1 | -1 | -1 | -1 | +| year | SMALLINT | 7 | -1 | 2 | 2 | ++--------+----------+------------------+--------+----------+----------+ +Returned 2 row(s) in 0.02s</codeblock> + + <p> + The following example shows how the statistics are filled in by a <codeph>COMPUTE STATS</codeph> statement + in Impala. + </p> + +<codeblock>[localhost:21000] > compute stats census; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 3 partition(s) and 1 column(s). | ++-----------------------------------------+ +Returned 1 row(s) in 2.16s +[localhost:21000] > show table stats census; ++-------+-------+--------+------+---------+ +| year | #Rows | #Files | Size | Format | ++-------+-------+--------+------+---------+ +| 2000 | -1 | 0 | 0B | TEXT | +| 2004 | -1 | 0 | 0B | TEXT | +| 2008 | -1 | 0 | 0B | TEXT | +| 2010 | -1 | 0 | 0B | TEXT | +| 2011 | 4 | 1 | 22B | TEXT | +| 2012 | 4 | 1 | 22B | TEXT | +| 2013 | 1 | 1 | 231B | PARQUET | +| Total | 9 | 3 | 275B | | ++-------+-------+--------+------+---------+ +Returned 8 row(s) in 0.02s +[localhost:21000] > show column stats census; ++--------+----------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------+----------+------------------+--------+----------+----------+ +| name | STRING | 4 | -1 | 5 | 4.5 | +| year | SMALLINT | 7 | -1 | 2 | 2 | ++--------+----------+------------------+--------+----------+----------+ +Returned 2 row(s) in 0.02s</codeblock> + + <p rev="1.4.0"> + For examples showing how some queries work differently when statistics are available, see + <xref href="impala_perf_joins.xml#perf_joins_examples"/>. You can see how Impala executes a query + differently in each case by observing the <codeph>EXPLAIN</codeph> output before and after collecting + statistics. Measure the before and after query times, and examine the throughput numbers in before and + after <codeph>SUMMARY</codeph> or <codeph>PROFILE</codeph> output, to verify how much the improved plan + speeds up performance. + </p> + </conbody> + </concept> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_performance.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_performance.xml b/docs/topics/impala_performance.xml index 3d454ca..e58270e 100644 --- a/docs/topics/impala_performance.xml +++ b/docs/topics/impala_performance.xml @@ -4,7 +4,18 @@ <title>Tuning Impala for Performance</title> <titlealts audience="PDF"><navtitle>Performance Tuning</navtitle></titlealts> - + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Databases"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Developers"/> + <!-- Like Impala Administration, this page has a fair bit of info already, but it could benefit from wiki-style embedded of intro text from those other pages. --> + <data name="Category" value="Stub Pages"/> + </metadata> + </prolog> <conbody> @@ -13,8 +24,168 @@ tuning, monitoring, and benchmarking Impala queries and other SQL operations. </p> - + <p> + This section also describes techniques for maximizing Impala scalability. Scalability is tied to performance: + it means that performance remains high as the system workload increases. For example, reducing the disk I/O + performed by a query can speed up an individual query, and at the same time improve scalability by making it + practical to run more queries simultaneously. Sometimes, an optimization technique improves scalability more + than performance. For example, reducing memory usage for a query might not change the query performance much, + but might improve scalability by allowing more Impala queries or other kinds of jobs to run at the same time + without running out of memory. + </p> + + <note> + <p> + Before starting any performance tuning or benchmarking, make sure your system is configured with all the + recommended minimum hardware requirements from <xref href="impala_prereqs.xml#prereqs_hardware"/> and + software settings from <xref href="impala_config_performance.xml#config_performance"/>. + </p> + </note> + + <ul> + <li> + <xref href="impala_partitioning.xml#partitioning"/>. This technique physically divides the data based on + the different values in frequently queried columns, allowing queries to skip reading a large percentage of + the data in a table. + </li> + + <li> + <xref href="impala_perf_joins.xml#perf_joins"/>. Joins are the main class of queries that you can tune at + the SQL level, as opposed to changing physical factors such as the file format or the hardware + configuration. The related topics <xref href="impala_perf_stats.xml#perf_column_stats"/> and + <xref href="impala_perf_stats.xml#perf_table_stats"/> are also important primarily for join performance. + </li> + + <li> + <xref href="impala_perf_stats.xml#perf_table_stats"/> and + <xref href="impala_perf_stats.xml#perf_column_stats"/>. Gathering table and column statistics, using the + <codeph>COMPUTE STATS</codeph> statement, helps Impala automatically optimize the performance for join + queries, without requiring changes to SQL query statements. (This process is greatly simplified in Impala + 1.2.2 and higher, because the <codeph>COMPUTE STATS</codeph> statement gathers both kinds of statistics in + one operation, and does not require any setup and configuration as was previously necessary for the + <codeph>ANALYZE TABLE</codeph> statement in Hive.) 
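+ For example, a single statement (shown here with a hypothetical table name)
+ collects the table statistics plus statistics for every column:
+<codeblock>compute stats store_sales;</codeblock>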
+ </li> + + <li> + <xref href="impala_perf_testing.xml#performance_testing"/>. Do some post-setup testing to ensure Impala is + using optimal settings for performance, before conducting any benchmark tests. + </li> + + <li> + <xref href="impala_perf_benchmarking.xml#perf_benchmarks"/>. The configuration and sample data that you use + for initial experiments with Impala is often not appropriate for doing performance tests. + </li> + + <li> + <xref href="impala_perf_resources.xml#mem_limits"/>. The more memory Impala can utilize, the better query + performance you can expect. In a cluster running other kinds of workloads as well, you must make tradeoffs + to make sure all Hadoop components have enough memory to perform well, so you might cap the memory that + Impala can use. + </li> + + <li rev="1.2" audience="Cloudera"> + <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>. Impala can use the HDFS caching feature to pin + frequently accessed data in memory, reducing disk I/O. + </li> + + <li rev="2.2.0"> + <xref href="impala_s3.xml#s3"/>. Queries against data stored in the Amazon Simple Storage Service (S3) + have different performance characteristics than when the data is stored in HDFS. + </li> + </ul> + + <p outputclass="toc"/> + + <p conref="../shared/impala_common.xml#common/cookbook_blurb"/> + + </conbody> + +<!-- Empty/hidden stub sections that might be worth expanding later. --> + + <concept id="perf_network" audience="Cloudera"> + + <title>Network Traffic</title> + + <conbody/> + </concept> + + <concept id="perf_partition_schema" audience="Cloudera"> + + <title>Designing Partitioned Tables</title> + + <conbody/> + </concept> + + <concept id="perf_partition_query" audience="Cloudera"> + + <title>Queries on Partitioned Tables</title> + + <conbody/> + </concept> + + <concept id="perf_monitoring" audience="Cloudera"> - </conbody> + <title>Monitoring Performance through the Impala Web Interface</title> + <prolog> + <metadata> + <data name="Category" value="Monitoring"/> + </metadata> + </prolog> + + <conbody/> + </concept> + + <concept id="perf_query_coord" audience="Cloudera"> + + <title>Query Coordination</title> + + <conbody/> + </concept> + + <concept id="perf_bottlenecks" audience="Cloudera"> + + <title>Performance Bottlenecks</title> + + <conbody/> + </concept> + + <concept id="perf_long_queries" audience="Cloudera"> + + <title>Managing Long-Running Queries</title> + + <conbody/> + </concept> + + <concept id="perf_load" audience="Cloudera"> + + <title>Performance Considerations for Loading Data</title> + + <conbody/> + </concept> + + <concept id="perf_file_formats" audience="Cloudera"> + + <title>Performance Considerations for File Formats</title> + + <conbody/> </concept> + <concept id="perf_compression" audience="Cloudera"> + + <title>Performance Considerations for Compression</title> + <prolog> + <metadata> + <data name="Category" value="Compression"/> + </metadata> + </prolog> + + <conbody/> + </concept> + + <concept id="perf_codegen" audience="Cloudera"> + + <title>Native Code Generation</title> + + <conbody/> + </concept> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_porting.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_porting.xml b/docs/topics/impala_porting.xml index c9c8e52..3800713 100644 --- a/docs/topics/impala_porting.xml +++ b/docs/topics/impala_porting.xml @@ -3,6 +3,7 @@ <concept id="porting"> <title>Porting SQL from Other Database Systems to 
Impala</title> + <titlealts audience="PDF"><navtitle>Porting SQL</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_prereqs.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_prereqs.xml b/docs/topics/impala_prereqs.xml index bb10263..0587c6f 100644 --- a/docs/topics/impala_prereqs.xml +++ b/docs/topics/impala_prereqs.xml @@ -5,7 +5,22 @@ <title>Impala Requirements</title> <titlealts audience="PDF"><navtitle>Requirements</navtitle></titlealts> - + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Requirements"/> + <data name="Category" value="Planning"/> + <data name="Category" value="Installing"/> + <data name="Category" value="Upgrading"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + <!-- Another instance of a topic pulled into the map twice, resulting in a second HTML page with a *1.html filename. --> + <data name="Category" value="Duplicate Topics"/> + <!-- Using a separate category, 'Multimap', to flag those pages that are duplicate because of multiple DITA map references. --> + <data name="Category" value="Multimap"/> + </metadata> + </prolog> <conbody> @@ -16,7 +31,330 @@ described in the following sections. </p> - + <p outputclass="toc inpage"/> + </conbody> + + <concept id="product_compatibility_matrix"> + + <title>Product Compatibility Matrix</title> + + <conbody> + + <p> The ultimate source of truth about compatibility between various + versions of CDH, Cloudera Manager, and various CDH components is the <ph + audience="integrated"><xref + href="rn_consolidated_pcm.xml" + >Product Compatibility Matrix for CDH and Cloudera + Manager</xref></ph><ph audience="standalone">online <xref + href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html" + format="html" scope="external">Product Compatibility + Matrix</xref></ph>. </p> + + <p> + For Impala, see the + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/pcm_impala.html" scope="external" format="html">Impala + compatibility matrix page</xref>. + </p> + </conbody> + </concept> + + <concept id="prereqs_os"> + + <title>Supported Operating Systems</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">software requirements</indexterm> + <indexterm audience="Cloudera">Red Hat Enterprise Linux</indexterm> + <indexterm audience="Cloudera">RHEL</indexterm> + <indexterm audience="Cloudera">CentOS</indexterm> + <indexterm audience="Cloudera">SLES</indexterm> + <indexterm audience="Cloudera">Ubuntu</indexterm> + <indexterm audience="Cloudera">SUSE</indexterm> + <indexterm audience="Cloudera">Debian</indexterm> The relevant supported operating systems + and versions for Impala are the same as for the corresponding CDH 4 and CDH 5 platforms. 
+        For details, see the <cite>Supported Operating Systems</cite> page for <xref
+          href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Requirements-and-Supported-Versions/cdhrsv_topic_1.html"
+          scope="external" format="html">CDH 4</xref> or <!-- Original URL: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Requirements-and-Supported-Versions/cdh_rsv_os.html -->
+        <!-- Expected post-reorg URL, doesn't work: <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_rsv_os.html" scope="external" format="html">CDH 5</xref>. -->
+        <ph audience="integrated"><xref href="rn_consolidated_pcm.xml#cdh_cm_supported_os">CDH
+          5</xref></ph><ph audience="standalone"><xref
+          href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html#cdh_cm_supported_os"
+          scope="external" format="html">CDH 5</xref></ph>. </p>
+    </conbody>
+  </concept>
+
+  <concept id="prereqs_hive">
+
+    <title>Hive Metastore and Related Configuration</title>
+    <prolog>
+      <metadata>
+        <data name="Category" value="Metastore"/>
+        <data name="Category" value="Hive"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">Hive</indexterm>
+        <indexterm audience="Cloudera">MySQL</indexterm>
+        <indexterm audience="Cloudera">PostgreSQL</indexterm>
+        Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking
+        metadata about schema objects such as tables and columns. The following components are prerequisites for
+        Impala:
+      </p>
+
+      <ul>
+        <li>
+          MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive.
+          <note>
+            <p>
+              Installing and configuring a Hive metastore is an Impala requirement. Impala does not work without
+              the metastore database. For the process of installing and configuring the metastore, see
+              <xref href="impala_install.xml#install"/>.
+            </p>
+            <p>
+              Always configure a <b>Hive metastore service</b> rather than connecting directly to the metastore
+              database. The Hive metastore service is required to interoperate between possibly different levels of
+              metastore APIs used by CDH and Impala, and avoids known issues with connecting directly to the
+              metastore database. The Hive metastore service is set up for you by default if you install through
+              Cloudera Manager 4.5 or higher.
+            </p>
+            <p>
+              A summary of the metastore installation process is as follows:
+            </p>
+            <ul>
+              <li>
+                Install a MySQL or PostgreSQL database. Start the database if it is not started after installation.
+              </li>
+
+              <li>
+                Download the
+                <xref href="http://www.mysql.com/products/connector/" scope="external" format="html">MySQL
+                connector</xref> or the
+                <xref href="http://jdbc.postgresql.org/download.html" scope="external" format="html">PostgreSQL
+                connector</xref> and place it in the <codeph>/usr/share/java/</codeph> directory.
+              </li>
+
+              <li>
+                Use the appropriate command-line tool for your database to create the metastore database,
+                as shown in the example following this list.
+              </li>
+
+              <li>
+                Use the appropriate command-line tool for your database to grant privileges on the metastore
+                database to the <codeph>hive</codeph> user.
+              </li>
+
+              <li>
+                Modify <codeph>hive-site.xml</codeph> to include information matching your particular database: its
+                URL, username, and password. You will copy the <codeph>hive-site.xml</codeph> file to the Impala
+                Configuration Directory later in the Impala installation process.
+              </li>
+            </ul>
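+            <p>
+              For example, with a MySQL metastore, the database creation and grant steps might look like the
+              following sketch. The database name, user, host, and password shown here are placeholders, not
+              required values; substitute your own.
+            </p>
+
+<codeblock>$ mysql -u root -p
+mysql> CREATE DATABASE metastore;
+mysql> CREATE USER 'hive'@'metastorehost' IDENTIFIED BY 'mypassword';
+mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'metastorehost';
+mysql> FLUSH PRIVILEGES;</codeblock>
+          </note>
+        </li>
+
+        <li>
+          <b>Optional:</b> Hive.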
+          Although only the Hive metastore database is required for Impala to function, you
+          might install Hive on some client machines to create and load data into tables that use certain file
+          formats. See <xref href="impala_file_formats.xml#file_formats"/> for details. Hive does not need to be
+          installed on the same DataNodes as Impala; it just needs access to the same metastore database.
+        </li>
+      </ul>
+    </conbody>
+  </concept>
+
+  <concept id="prereqs_java">
+
+    <title>Java Dependencies</title>
+    <prolog>
+      <metadata>
+        <data name="Category" value="Java"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">Java</indexterm>
+        <indexterm audience="Cloudera">impala-dependencies.jar</indexterm>
+        Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop
+        components:
+      </p>
+
+      <ul>
+        <li>
+          The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause issues, typically
+          resulting in a failure at <cmdname>impalad</cmdname> startup. In particular, the JamVM used by default on
+          certain levels of Ubuntu systems can cause <cmdname>impalad</cmdname> to fail to start.
+          <!-- To do:
+          Could say something here about JDK 6 vs. JDK 7 in CDH 5. Since we didn't specify the JDK version before,
+          don't know the impact from the user perspective so not calling it out at the moment.
+          -->
+        </li>
+
+        <li>
+          Internally, the <cmdname>impalad</cmdname> daemon relies on the <codeph>JAVA_HOME</codeph> environment
+          variable to locate the system Java libraries. Make sure the <cmdname>impalad</cmdname> service runs in
+          an environment where this variable is set correctly.
+        </li>
+
+        <li>
+          All Java dependencies are packaged in the <codeph>impala-dependencies.jar</codeph> file, which is located
+          at <codeph>/usr/lib/impala/lib/</codeph>. These map to everything that is built under
+          <codeph>fe/target/dependency</codeph>.
+        </li>
+      </ul>
+    </conbody>
+  </concept>
+
+  <concept id="prereqs_network">
+
+    <title>Networking Configuration Requirements</title>
+    <prolog>
+      <metadata>
+        <data name="Category" value="Network"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">network configuration</indexterm>
+        For best performance, Impala attempts to complete tasks on local data rather than using network
+        connections to work with remote data. To support this goal, Impala matches
+        the <b>hostname</b> provided to each Impala daemon with the <b>IP address</b> of each DataNode by
+        resolving the hostname flag to an IP address. For Impala to work with local data, use a single IP interface
+        for the DataNode and the Impala daemon on each machine, and ensure that the Impala daemon's hostname flag
+        resolves to the IP address of the DataNode. For single-homed machines, this is usually automatic, but for
+        multi-homed machines, ensure that the Impala daemon's hostname resolves to the correct interface. Impala
+        tries to detect the correct hostname at startup, and prints the derived hostname at the start of the log
+        in a message of the form:
+      </p>
+
+<codeblock>Using hostname: impala-daemon-1.cloudera.com</codeblock>
+
+      <p>
+        In the majority of cases, this automatic detection works correctly. If you need to explicitly set the
+        hostname, do so by setting the <codeph>--hostname</codeph> flag.
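+        For example, if you start the daemon directly rather than through the service scripts, you might pass
+        the flag on the command line. (A sketch only; substitute the hostname of the interface you want Impala
+        to use.)
+<codeblock>$ impalad --hostname=impala-daemon-1.cloudera.com</codeblock>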
+      </p>
+    </conbody>
+  </concept>
+
+  <concept id="prereqs_hardware">
+
+    <title>Hardware Requirements</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">hardware requirements</indexterm>
+        <indexterm audience="Cloudera">capacity</indexterm>
+        <indexterm audience="Cloudera">RAM</indexterm>
+        <indexterm audience="Cloudera">memory</indexterm>
+        <indexterm audience="Cloudera">CPU</indexterm>
+        <indexterm audience="Cloudera">processor</indexterm>
+        <indexterm audience="Cloudera">Intel</indexterm>
+        <indexterm audience="Cloudera">AMD</indexterm>
+        During join operations, portions of data from each joined table are loaded into memory. Data sets can be
+        very large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate
+        performing.
+      </p>
+
+      <p>
+        While requirements vary according to data set size, the following is generally recommended:
+      </p>
+
+      <ul>
+        <li rev="2.0.0">
+          CPU - Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in newer
+          processors. You can verify SSSE3 support as shown in the example following this list.
+          <note>
+            This required level of processor is the same as in Impala version 1.x. The Impala 2.0 and 2.1 releases
+            had a stricter requirement for the SSE4.1 instruction set, which has now been relaxed.
+          </note>
+<!--
+          For best performance use:
+          <ul>
+            <li>
+              Intel - Nehalem (released 2008) or later processors.
+            </li>
+
+            <li>
+              AMD - Bulldozer (released 2011) or later processors.
+            </li>
+          </ul>
+-->
+        </li>
+
+        <li rev="1.2">
+          Memory - 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query
+          processing on a particular node exceed the amount of memory available to Impala on that node, the query
+          writes temporary work data to disk, which can lead to long query times. Note that because the work is
+          parallelized, and intermediate results for aggregate queries are typically smaller than the original
+          data, Impala can query and join tables that are much larger than the memory available on an individual
+          node.
+        </li>
+
+        <li>
+          Storage - DataNodes with 12 or more disks each. Disk I/O speed is often the limiting factor for overall
+          Impala performance. Ensure that you have sufficient disk space to store the data Impala will be
+          querying.
+        </li>
+      </ul>
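+
+      <p>
+        For example, on Linux you can check whether a processor supports SSSE3 by looking for the
+        <codeph>ssse3</codeph> flag in <filepath>/proc/cpuinfo</filepath>. (A quick sketch, not an official
+        diagnostic; a nonzero count means the instruction set is available.)
+      </p>
+
+<codeblock>$ grep -c ssse3 /proc/cpuinfo</codeblock>
+    </conbody>
+  </concept>
+
+  <concept id="prereqs_account">
+
+    <title>User Account Requirements</title>
+    <prolog>
+      <metadata>
+        <data name="Category" value="Users"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">impala user</indexterm>
+        <indexterm audience="Cloudera">impala group</indexterm>
+        <indexterm audience="Cloudera">root user</indexterm>
+        Impala creates and uses a user and group named <codeph>impala</codeph>. Do not delete this account or group
+        and do not modify the account's or group's permissions and rights. Ensure no existing systems obstruct the
+        functioning of these accounts and groups. For example, if you have scripts that delete user accounts not in
+        a white-list, add these accounts to the list of permitted accounts.
+      </p>
+
+<!-- Taking out because no longer applicable in CDH 5.5 and up. -->
+      <p id="impala_hdfs_group" rev="1.2" audience="Cloudera">
+        For the resource management feature to work (in combination with CDH 5 and the YARN and Llama components),
+        the <codeph>impala</codeph> user must be a member of the <codeph>hdfs</codeph> group. This setup is
+        performed automatically during a new install, but not when upgrading from earlier Impala releases to Impala
+        1.2.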
+        If you are upgrading a node to CDH 5 that already had Impala 1.1 or 1.0 installed, manually add the
+        <codeph>impala</codeph> user to the <codeph>hdfs</codeph> group.
+      </p>
+
+      <p>
+        For correct file deletion during <codeph>DROP TABLE</codeph> operations, Impala must be able to move files
+        to the HDFS trashcan. You might need to create an HDFS directory <filepath>/user/impala</filepath>,
+        writable by the <codeph>impala</codeph> user, so that the trashcan can be created. Otherwise, data files
+        might remain behind after a <codeph>DROP TABLE</codeph> statement.
+      </p>
+
+      <p>
+        Impala should not run as root. Impala achieves best performance using direct reads, which the root user
+        is not permitted to use, so running Impala as root negatively affects performance.
+      </p>
+
+      <p>
+        By default, any user can connect to Impala and access all the associated databases and tables. You can
+        enable authorization and authentication based on the Linux OS user who connects to the Impala server, and
+        the associated groups for that user. See <xref href="impala_security.xml#security"/> for details. These
+        security features do not change the underlying file permission requirements; the <codeph>impala</codeph>
+        user still needs to be able to access the data files.
+      </p>
+    </conbody>
+  </concept>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_processes.xml
----------------------------------------------------------------------

diff --git a/docs/topics/impala_processes.xml b/docs/topics/impala_processes.xml
index be8a9b8..05f2274 100644
--- a/docs/topics/impala_processes.xml
+++ b/docs/topics/impala_processes.xml
@@ -3,16 +3,132 @@
 <concept id="processes">
 
   <title>Starting Impala</title>
-
+  <prolog>
+    <metadata>
+      <data name="Category" value="Starting and Stopping"/>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Administrators"/>
+      <data name="Category" value="Operators"/>
+    </metadata>
+  </prolog>
 
   <conbody>
 
-    <p>
-
+    <p rev="1.2">
+      <indexterm audience="Cloudera">state store</indexterm>
+      <indexterm audience="Cloudera">starting services</indexterm>
+      <indexterm audience="Cloudera">services</indexterm>
       To activate Impala if it is installed but not yet started:
     </p>
-
+    <ol>
+      <li>
+        Set any necessary configuration options for the Impala services. See
+        <xref href="impala_config_options.xml#config_options"/> for details.
+      </li>
+
+      <li>
+        Start one instance of the Impala statestore. The statestore helps Impala to distribute work efficiently,
+        and to continue running in the event of availability problems for other Impala nodes. If the statestore
+        becomes unavailable, Impala continues to function.
+      </li>
+
+      <li>
+        Start one instance of the Impala catalog service.
+      </li>
+
+      <li>
+        Start the main Impala service on one or more DataNodes, ideally on all DataNodes to maximize local
+        processing and avoid network traffic due to remote reads.
+      </li>
+    </ol>
+
+    <p>
+      Once Impala is running, you can conduct interactive experiments using the instructions in
+      <xref href="impala_tutorial.xml#tutorial"/> and try <xref href="impala_impala_shell.xml#impala_shell"/>.
+    </p>
+
+    <p outputclass="toc inpage"/>
+  </conbody>
+
+  <concept id="starting_via_cm">
+
+    <title>Starting Impala through Cloudera Manager</title>
+
+    <conbody>
+
+      <p>
+        If you installed Impala with Cloudera Manager, use Cloudera Manager to start and stop services.
+        The Cloudera Manager GUI is a convenient way to check that all services are running, to set configuration
+        options using form fields in a browser, and to spot potential issues such as low disk space before they
+        become serious. Cloudera Manager automatically starts all the Impala-related services as a group, in the
+        correct order. See
+        <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_start_stop_service.html" scope="external" format="html">the
+        Cloudera Manager Documentation</xref> for details.
+      </p>
+
+      <note>
+        <p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/>
+      </note>
     </conbody>
   </concept>
 
+  <concept id="starting_via_cmdline">
+
+    <title>Starting Impala from the Command Line</title>
+
+    <conbody>
+
+      <p>
+        To start the Impala statestore and Impala from the command line or a script, you can either use the
+        <cmdname>service</cmdname> command or you can start the daemons directly through the
+        <cmdname>impalad</cmdname>, <cmdname>statestored</cmdname>, and <cmdname>catalogd</cmdname> executables,
+        as shown at the end of this section.
+      </p>
+
+      <p>
+        Start the Impala statestore and then start <cmdname>impalad</cmdname> instances. You can modify the values
+        the service initialization scripts use when starting the statestore and Impala by editing
+        <codeph>/etc/default/impala</codeph>.
+      </p>
+
+      <p>
+        Start the statestore service using a command similar to the following:
+      </p>
+
+      <p>
+<codeblock>$ sudo service impala-state-store start</codeblock>
+      </p>
+
+      <p rev="1.2">
+        Start the catalog service using a command similar to the following:
+      </p>
+
+<codeblock rev="1.2">$ sudo service impala-catalog start</codeblock>
+
+      <p>
+        Start the Impala service on each DataNode using a command similar to the following:
+      </p>
+
+      <p>
+<codeblock>$ sudo service impala-server start</codeblock>
+      </p>
+
+      <note>
+        <p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/>
+      </note>
+
+      <p>
+        If any of the services fail to start, review:
+        <ul>
+          <li>
+            <xref href="impala_logging.xml#logs_debug"/>
+          </li>
+
+          <li>
+            <xref href="impala_troubleshooting.xml#troubleshooting"/>
+          </li>
+        </ul>
+      </p>
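+
+      <p>
+        As an alternative to the <cmdname>service</cmdname> commands shown above, you can invoke the daemons
+        directly. The following is only a sketch: it assumes the executables are on your <codeph>PATH</codeph>
+        and that default flags are acceptable. A production setup would pass the same flags that
+        <codeph>/etc/default/impala</codeph> supplies, and would run the daemons under a process supervisor
+        rather than as background jobs.
+      </p>
+
+<codeblock>$ sudo -u impala statestored &amp;
+$ sudo -u impala catalogd &amp;
+$ sudo -u impala impalad &amp;</codeblock>
+    </conbody>
+  </concept>
+</concept>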
