[34/51] [partial] incubator-impala git commit: IMPALA-4181 [DOCS] Publish rendered Impala documentation to ASF site

jbapple Wed, 12 Apr 2017 11:25:51 -0700

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_explain.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_explain.html 
b/docs/build/html/topics/impala_explain.html
new file mode 100644
index 0000000..473a94d
--- /dev/null
+++ b/docs/build/html/topics/impala_explain.html
@@ -0,0 +1,291 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_langref_sql.html"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="version" 
content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta 
name="DC.Format" content="XHTML"><meta name="DC.Identifier" 
content="explain"><link rel="stylesheet" type="text/css" 
href="../commonltr.css"><title>EXPLAIN Statement</title></head><body 
id="explain"><main role="main"><article role="article" 
aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">EXPLAIN Statement</h1>
+  
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      
+      Returns the execution plan for a statement, showing the low-level 
mechanisms that Impala will use to read the
+      data, divide the work among nodes in the cluster, and transmit 
intermediate and final results across the
+      network. Use <code class="ph codeph">explain</code> followed by a 
complete <code class="ph codeph">SELECT</code> query. For example:
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Syntax:</strong>
+      </p>
+
+<pre class="pre codeblock"><code>EXPLAIN { <var class="keyword 
varname">select_query</var> | <var class="keyword varname">ctas_stmt</var> | 
<var class="keyword varname">insert_stmt</var> }
+</code></pre>
+
+    <p class="p">
+      The <var class="keyword varname">select_query</var> is a <code class="ph 
codeph">SELECT</code> statement, optionally prefixed by a
+      <code class="ph codeph">WITH</code> clause. See <a class="xref" 
href="impala_select.html#select">SELECT Statement</a> for details.
+    </p>
+
+    <p class="p">
+      The <var class="keyword varname">insert_stmt</var> is an <code class="ph 
codeph">INSERT</code> statement that inserts into or overwrites an
+      existing table. It can use either the <code class="ph codeph">INSERT ... 
SELECT</code> or <code class="ph codeph">INSERT ...
+      VALUES</code> syntax. See <a class="xref" 
href="impala_insert.html#insert">INSERT Statement</a> for details.
+    </p>
+
+    <p class="p">
+      The <var class="keyword varname">ctas_stmt</var> is a <code class="ph 
codeph">CREATE TABLE</code> statement using the <code class="ph codeph">AS
+      SELECT</code> clause, typically abbreviated as a <span 
class="q">"CTAS"</span> operation. See
+      <a class="xref" href="impala_create_table.html#create_table">CREATE 
TABLE Statement</a> for details.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Usage notes:</strong>
+      </p>
+
+    <p class="p">
+      You can interpret the output to judge whether the query is performing 
efficiently, and adjust the query
+      and/or the schema if not. For example, you might change the tests in the 
<code class="ph codeph">WHERE</code> clause, add
+      hints to make join operations more efficient, introduce subqueries, 
change the order of tables in a join, add
+      or change partitioning for a table, collect column statistics and/or 
table statistics in Hive, or any other
+      performance tuning steps.
+    </p>
+
+    <p class="p">
+      The <code class="ph codeph">EXPLAIN</code> output reminds you if table 
or column statistics are missing from any table
+      involved in the query. These statistics are important for optimizing 
queries involving large tables or
+      multi-table joins. See <a class="xref" 
href="impala_compute_stats.html#compute_stats">COMPUTE STATS Statement</a> for 
how to gather statistics,
+      and <a class="xref" href="impala_perf_stats.html#perf_stats">Table and 
Column Statistics</a> for how to use this information for query tuning.
+    </p>
+
+    <div class="p">
+        Read the <code class="ph codeph">EXPLAIN</code> plan from bottom to 
top:
+        <ul class="ul">
+          <li class="li">
+            The last part of the plan shows the low-level details such as the 
expected amount of data that will be
+            read, where you can judge the effectiveness of your partitioning 
strategy and estimate how long it will
+            take to scan a table based on total data size and the size of the 
cluster.
+          </li>
+
+          <li class="li">
+            As you work your way up, next you see the operations that will be 
parallelized and performed on each
+            Impala node.
+          </li>
+
+          <li class="li">
+            At the higher levels, you see how data flows when intermediate 
result sets are combined and transmitted
+            from one node to another.
+          </li>
+
+          <li class="li">
+            See <a class="xref" 
href="../shared/../topics/impala_explain_level.html#explain_level">EXPLAIN_LEVEL
 Query Option</a> for details about the
+            <code class="ph codeph">EXPLAIN_LEVEL</code> query option, which 
lets you customize how much detail to show in the
+            <code class="ph codeph">EXPLAIN</code> plan depending on whether 
you are doing high-level or low-level tuning,
+            dealing with logical or physical aspects of the query.
+          </li>
+        </ul>
+      </div>
+
+    <p class="p">
+      If you come from a traditional database background and are not familiar 
with data warehousing, keep in mind
+      that Impala is optimized for full table scans across very large tables. 
The structure and distribution of
+      this data is typically not suitable for the kind of indexing and 
single-row lookups that are common in OLTP
+      environments. Seeing a query scan entirely through a large table is 
common, not necessarily an indication of
+      an inefficient query. Of course, if you can reduce the volume of scanned 
data by orders of magnitude, for
+      example by using a query that affects only certain partitions within a 
partitioned table, then you might be
+      able to optimize a query so that it executes in seconds rather than 
minutes.
+    </p>
+
+    <p class="p">
+      For more information and examples to help you interpret <code class="ph 
codeph">EXPLAIN</code> output, see
+      <a class="xref" href="impala_explain_plan.html#perf_explain">Using the 
EXPLAIN Plan for Performance Tuning</a>.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Extended EXPLAIN output:</strong>
+    </p>
+
+    <p class="p">
+      For performance tuning of complex queries, and capacity planning (such 
as using the admission control and
+      resource management features), you can enable more detailed and 
informative output for the
+      <code class="ph codeph">EXPLAIN</code> statement. In the <span 
class="keyword cmdname">impala-shell</span> interpreter, issue the command
+      <code class="ph codeph">SET EXPLAIN_LEVEL=<var class="keyword 
varname">level</var></code>, where <var class="keyword varname">level</var> is 
an integer
+      from 0 to 3 or corresponding mnemonic values <code class="ph 
codeph">minimal</code>, <code class="ph codeph">standard</code>,
+      <code class="ph codeph">extended</code>, or <code class="ph 
codeph">verbose</code>.
+    </p>
+
+    <p class="p">
+      When extended <code class="ph codeph">EXPLAIN</code> output is enabled, 
<code class="ph codeph">EXPLAIN</code> statements print
+      information about estimated memory requirements, minimum number of 
virtual cores, and so on.
+      
+    </p>
+
+    <p class="p">
+      See <a class="xref" 
href="impala_explain_level.html#explain_level">EXPLAIN_LEVEL Query Option</a> 
for details and examples.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Examples:</strong>
+      </p>
+
+    <p class="p">
+      This example shows how the standard <code class="ph 
codeph">EXPLAIN</code> output moves from the lowest (physical) level to
+      the higher (logical) levels. The query begins by scanning a certain 
amount of data; each node performs an
+      aggregation operation (evaluating <code class="ph 
codeph">COUNT(*)</code>) on some subset of data that is local to that
+      node; the intermediate results are transmitted back to the coordinator 
node (labelled here as the
+      <code class="ph codeph">EXCHANGE</code> node); lastly, the intermediate 
results are summed to display the final result.
+    </p>
+
+<pre class="pre codeblock" 
id="explain__explain_plan_simple"><code>[impalad-host:21000] &gt; explain 
select count(*) from customer_address;
++----------------------------------------------------------+
+| Explain String                                           |
++----------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=42.00MB VCores=1 |
+|                                                          |
+| 03:AGGREGATE [MERGE FINALIZE]                            |
+| |  output: sum(count(*))                                 |
+| |                                                        |
+| 02:EXCHANGE [PARTITION=UNPARTITIONED]                    |
+| |                                                        |
+| 01:AGGREGATE                                             |
+| |  output: count(*)                                      |
+| |                                                        |
+| 00:SCAN HDFS [default.customer_address]                  |
+|    partitions=1/1 size=5.25MB                            |
++----------------------------------------------------------+
+</code></pre>
+
+    <p class="p">
+      These examples show how the extended <code class="ph 
codeph">EXPLAIN</code> output becomes more accurate and informative as
+      statistics are gathered by the <code class="ph codeph">COMPUTE 
STATS</code> statement. Initially, much of the information
+      about data size and distribution is marked <span 
class="q">"unavailable"</span>. Impala can determine the raw data size, but
+      not the number of rows or number of distinct values for each column 
without additional analysis. The
+      <code class="ph codeph">COMPUTE STATS</code> statement performs this 
analysis, so a subsequent <code class="ph codeph">EXPLAIN</code>
+      statement has additional information to use in deciding how to optimize 
the distributed query.
+    </p>
+
+    
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; set 
explain_level=extended;
+EXPLAIN_LEVEL set to extended
+[localhost:21000] &gt; explain select x from t1;
+[localhost:21000] &gt; explain select x from t1;
++----------------------------------------------------------+
+| Explain String                                           |
++----------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=32.00MB VCores=1 |
+|                                                          |
+| 01:EXCHANGE [PARTITION=UNPARTITIONED]                    |
+| |  hosts=1 per-host-mem=unavailable                      |
+<strong class="ph b">| |  tuple-ids=0 row-size=4B cardinality=unavailable      
 |</strong>
+| |                                                        |
+| 00:SCAN HDFS [default.t2, PARTITION=RANDOM]              |
+|    partitions=1/1 size=36B                               |
+<strong class="ph b">|    table stats: unavailable                             
 |</strong>
+<strong class="ph b">|    column stats: unavailable                            
 |</strong>
+|    hosts=1 per-host-mem=32.00MB                          |
+<strong class="ph b">|    tuple-ids=0 row-size=4B cardinality=unavailable      
 |</strong>
++----------------------------------------------------------+
+</code></pre>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; compute stats t1;
++-----------------------------------------+
+| summary                                 |
++-----------------------------------------+
+| Updated 1 partition(s) and 1 column(s). |
++-----------------------------------------+
+[localhost:21000] &gt; explain select x from t1;
++----------------------------------------------------------+
+| Explain String                                           |
++----------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=64.00MB VCores=1 |
+|                                                          |
+| 01:EXCHANGE [PARTITION=UNPARTITIONED]                    |
+| |  hosts=1 per-host-mem=unavailable                      |
+| |  tuple-ids=0 row-size=4B cardinality=0                 |
+| |                                                        |
+| 00:SCAN HDFS [default.t1, PARTITION=RANDOM]              |
+|    partitions=1/1 size=36B                               |
+<strong class="ph b">|    table stats: 0 rows total                            
 |</strong>
+<strong class="ph b">|    column stats: all                                    
 |</strong>
+|    hosts=1 per-host-mem=64.00MB                          |
+<strong class="ph b">|    tuple-ids=0 row-size=4B cardinality=0                
 |</strong>
++----------------------------------------------------------+
+</code></pre>
+
+    <p class="p">
+        <strong class="ph b">Security considerations:</strong>
+      </p>
+    <p class="p">
+        If these statements in your environment contain sensitive literal 
values such as credit card numbers or tax
+        identifiers, Impala can redact this sensitive information when 
displaying the statements in log files and
+        other administrative contexts. See <span class="xref">the 
documentation for your Apache Hadoop distribution</span> for details.
+      </p>
+
+    <p class="p">
+        <strong class="ph b">Cancellation:</strong> Cannot be cancelled.
+      </p>
+
+    <p class="p">
+        <strong class="ph b">HDFS permissions:</strong>
+      </p>
+    <p class="p">
+      
+      The user ID that the <span class="keyword cmdname">impalad</span> daemon 
runs under,
+      typically the <code class="ph codeph">impala</code> user, must have read
+      and execute permissions for all applicable directories in all source 
tables
+      for the query that is being explained.
+      (A <code class="ph codeph">SELECT</code> operation could read files from 
multiple different HDFS directories
+      if the source table is partitioned.)
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Kudu considerations:</strong>
+      </p>
+    <p class="p">
+      The <code class="ph codeph">EXPLAIN</code> statement displays equivalent 
plan
+      information for queries against Kudu tables as for queries
+      against HDFS-based tables.
+    </p>
+
+    <div class="p">
+      To see which predicates Impala can <span class="q">"push down"</span> to 
Kudu for
+      efficient evaluation, without transmitting unnecessary rows back
+      to Impala, look for the <code class="ph codeph">kudu predicates</code> 
item in
+      the scan phase of the query. The label <code class="ph codeph">kudu 
predicates</code>
+      indicates a condition that can be evaluated efficiently on the Kudu
+      side. The label <code class="ph codeph">predicates</code> in a <code 
class="ph codeph">SCAN KUDU</code>
+      node indicates a condition that is evaluated by Impala.
+      For example, in a table with primary key column <code class="ph 
codeph">X</code>
+      and non-primary key column <code class="ph codeph">Y</code>, you can see 
that
+      some operators in the <code class="ph codeph">WHERE</code> clause are 
evaluated
+      immediately by Kudu and others are evaluated later by Impala:
+<pre class="pre codeblock"><code>
+EXPLAIN SELECT x,y from kudu_table WHERE
+  x = 1 AND x NOT IN (2,3) AND y = 1
+  AND x IS NOT NULL AND x &gt; 0;
++----------------
+| Explain String
++----------------
+...
+| 00:SCAN KUDU [jrussell.hash_only]
+|    predicates: x IS NOT NULL, x NOT IN (2, 3)
+|    kudu predicates: x = 1, x &gt; 0, y = 1
+</code></pre>
+      Only binary predicates and <code class="ph codeph">IN</code> predicates 
containing
+      literal values that exactly match the types in the Kudu table, and do not
+      require any casting, can be pushed to Kudu.
+    </div>
+
+    <p class="p">
+        <strong class="ph b">Related information:</strong>
+      </p>
+    <p class="p">
+      <a class="xref" href="impala_select.html#select">SELECT Statement</a>,
+      <a class="xref" href="impala_insert.html#insert">INSERT Statement</a>,
+      <a class="xref" href="impala_create_table.html#create_table">CREATE 
TABLE Statement</a>,
+      <a class="xref" 
href="impala_explain_plan.html#explain_plan">Understanding Impala Query 
Performance - EXPLAIN Plans and Query Profiles</a>
+    </p>
+
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div 
class="parentlink"><strong>Parent topic:</strong> <a class="link" 
href="../topics/impala_langref_sql.html">Impala SQL 
Statements</a></div></div></nav></article></main></body></html>
\ No newline at end of file


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_explain_level.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_explain_level.html 
b/docs/build/html/topics/impala_explain_level.html
new file mode 100644
index 0000000..c9f527b
--- /dev/null
+++ b/docs/build/html/topics/impala_explain_level.html
@@ -0,0 +1,342 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_query_options.html"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="version" 
content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta 
name="DC.Format" content="XHTML"><meta name="DC.Identifier" 
content="explain_level"><link rel="stylesheet" type="text/css" 
href="../commonltr.css"><title>EXPLAIN_LEVEL Query Option</title></head><body 
id="explain_level"><main role="main"><article role="article" 
aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">EXPLAIN_LEVEL Query 
Option</h1>
+  
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      
+      Controls the amount of detail provided in the output of the <code 
class="ph codeph">EXPLAIN</code> statement. The basic
+      output can help you identify high-level performance issues such as 
scanning a higher volume of data or more
+      partitions than you expect. The higher levels of detail show how 
intermediate results flow between nodes and
+      how different SQL operations such as <code class="ph codeph">ORDER 
BY</code>, <code class="ph codeph">GROUP BY</code>, joins, and
+      <code class="ph codeph">WHERE</code> clauses are implemented within a 
distributed query.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Type:</strong> <code class="ph 
codeph">STRING</code> or <code class="ph codeph">INT</code>
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Default:</strong> <code class="ph codeph">1</code>
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Arguments:</strong>
+    </p>
+
+    <p class="p">
+      The allowed range of numeric values for this option is 0 to 3:
+    </p>
+
+    <ul class="ul">
+      <li class="li">
+        <code class="ph codeph">0</code> or <code class="ph 
codeph">MINIMAL</code>: A barebones list, one line per operation. Primarily 
useful
+        for checking the join order in very long queries where the regular 
<code class="ph codeph">EXPLAIN</code> output is too
+        long to read easily.
+      </li>
+
+      <li class="li">
+        <code class="ph codeph">1</code> or <code class="ph 
codeph">STANDARD</code>: The default level of detail, showing the logical way 
that
+        work is split up for the distributed query.
+      </li>
+
+      <li class="li">
+        <code class="ph codeph">2</code> or <code class="ph 
codeph">EXTENDED</code>: Includes additional detail about how the query planner
+        uses statistics in its decision-making process, to understand how a 
query could be tuned by gathering
+        statistics, using query hints, adding or removing predicates, and so 
on.
+      </li>
+
+      <li class="li">
+        <code class="ph codeph">3</code> or <code class="ph 
codeph">VERBOSE</code>: The maximum level of detail, showing how work is split 
up
+        within each node into <span class="q">"query fragments"</span> that 
are connected in a pipeline. This extra detail is
+        primarily useful for low-level performance testing and tuning within 
Impala itself, rather than for
+        rewriting the SQL code at the user level.
+      </li>
+    </ul>
+
+    <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+      Prior to Impala 1.3, the allowed argument range for <code class="ph 
codeph">EXPLAIN_LEVEL</code> was 0 to 1: level 0 had
+      the mnemonic <code class="ph codeph">NORMAL</code>, and level 1 was 
<code class="ph codeph">VERBOSE</code>. In Impala 1.3 and higher,
+      <code class="ph codeph">NORMAL</code> is not a valid mnemonic value, and 
<code class="ph codeph">VERBOSE</code> still applies to the
+      highest level of detail but now corresponds to level 3. You might need 
to adjust the values if you have any
+      older <code class="ph codeph">impala-shell</code> script files that set 
the <code class="ph codeph">EXPLAIN_LEVEL</code> query option.
+    </div>
+
+    <p class="p">
+      Changing the value of this option controls the amount of detail in the 
output of the <code class="ph codeph">EXPLAIN</code>
+      statement. The extended information from level 2 or 3 is especially 
useful during performance tuning, when
+      you need to confirm whether the work for the query is distributed the 
way you expect, particularly for the
+      most resource-intensive operations such as join queries against large 
tables, queries against tables with
+      large numbers of partitions, and insert operations for Parquet tables. 
The extended information also helps to
+      check estimated resource usage when you use the admission control or 
resource management features explained
+      in <a class="xref" 
href="impala_resource_management.html#resource_management">Resource Management 
for Impala</a>. See
+      <a class="xref" href="impala_explain.html#explain">EXPLAIN Statement</a> 
for the syntax of the <code class="ph codeph">EXPLAIN</code> statement, and
+      <a class="xref" href="impala_explain_plan.html#perf_explain">Using the 
EXPLAIN Plan for Performance Tuning</a> for details about how to use the 
extended information.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Usage notes:</strong>
+      </p>
+
+    <p class="p">
+      As always, read the <code class="ph codeph">EXPLAIN</code> output from 
bottom to top. The lowest lines represent the
+      initial work of the query (scanning data files), the lines in the middle 
represent calculations done on each
+      node and how intermediate results are transmitted from one node to 
another, and the topmost lines represent
+      the final results being sent back to the coordinator node.
+    </p>
+
+    <p class="p">
+      The numbers in the left column are generated internally during the 
initial planning phase and do not
+      represent the actual order of operations, so it is not significant if 
they appear out of order in the
+      <code class="ph codeph">EXPLAIN</code> output.
+    </p>
+
+    <p class="p">
+      At all <code class="ph codeph">EXPLAIN</code> levels, the plan contains 
a warning if any tables in the query are missing
+      statistics. Use the <code class="ph codeph">COMPUTE STATS</code> 
statement to gather statistics for each table and suppress
+      this warning. See <a class="xref" 
href="impala_perf_stats.html#perf_stats">Table and Column Statistics</a> for 
details about how the statistics help
+      query performance.
+    </p>
+
+    <p class="p">
+      The <code class="ph codeph">PROFILE</code> command in <span 
class="keyword cmdname">impala-shell</span> always starts with an explain plan
+      showing full detail, the same as with <code class="ph 
codeph">EXPLAIN_LEVEL=3</code>. <span class="ph">After the explain
+      plan comes the executive summary, the same output as produced by the 
<code class="ph codeph">SUMMARY</code> command in
+      <span class="keyword cmdname">impala-shell</span>.</span>
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Examples:</strong>
+      </p>
+
+    <p class="p">
+      These examples use a trivial, empty table to illustrate how the 
essential aspects of query planning are shown
+      in <code class="ph codeph">EXPLAIN</code> output:
+    </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; create table t1 (x 
int, s string);
+[localhost:21000] &gt; set explain_level=1;
+[localhost:21000] &gt; explain select count(*) from t1;
++------------------------------------------------------------------------+
+| Explain String                                                         |
++------------------------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=10.00MB VCores=1               |
+| WARNING: The following tables are missing relevant table and/or column |
+|   statistics.                                                          |
+| explain_plan.t1                                                        |
+|                                                                        |
+| 03:AGGREGATE [MERGE FINALIZE]                                          |
+| |  output: sum(count(*))                                               |
+| |                                                                      |
+| 02:EXCHANGE [PARTITION=UNPARTITIONED]                                  |
+| |                                                                      |
+| 01:AGGREGATE                                                           |
+| |  output: count(*)                                                    |
+| |                                                                      |
+| 00:SCAN HDFS [explain_plan.t1]                                         |
+|    partitions=1/1 size=0B                                              |
++------------------------------------------------------------------------+
+[localhost:21000] &gt; explain select * from t1;
++------------------------------------------------------------------------+
+| Explain String                                                         |
++------------------------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
+| WARNING: The following tables are missing relevant table and/or column |
+|   statistics.                                                          |
+| explain_plan.t1                                                        |
+|                                                                        |
+| 01:EXCHANGE [PARTITION=UNPARTITIONED]                                  |
+| |                                                                      |
+| 00:SCAN HDFS [explain_plan.t1]                                         |
+|    partitions=1/1 size=0B                                              |
++------------------------------------------------------------------------+
+[localhost:21000] &gt; set explain_level=2;
+[localhost:21000] &gt; explain select * from t1;
++------------------------------------------------------------------------+
+| Explain String                                                         |
++------------------------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
+| WARNING: The following tables are missing relevant table and/or column |
+|   statistics.                                                          |
+| explain_plan.t1                                                        |
+|                                                                        |
+| 01:EXCHANGE [PARTITION=UNPARTITIONED]                                  |
+| |  hosts=0 per-host-mem=unavailable                                    |
+| |  tuple-ids=0 row-size=19B cardinality=unavailable                    |
+| |                                                                      |
+| 00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM]                       |
+|    partitions=1/1 size=0B                                              |
+|    table stats: unavailable                                            |
+|    column stats: unavailable                                           |
+|    hosts=0 per-host-mem=0B                                             |
+|    tuple-ids=0 row-size=19B cardinality=unavailable                    |
++------------------------------------------------------------------------+
+[localhost:21000] &gt; set explain_level=3;
+[localhost:21000] &gt; explain select * from t1;
++------------------------------------------------------------------------+
+| Explain String                                                         |
++------------------------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
+<strong class="ph b">| WARNING: The following tables are missing relevant 
table and/or column |</strong>
+<strong class="ph b">|   statistics.                                           
               |</strong>
+<strong class="ph b">| explain_plan.t1                                         
               |</strong>
+|                                                                        |
+| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED]                            |
+|   01:EXCHANGE [PARTITION=UNPARTITIONED]                                |
+|      hosts=0 per-host-mem=unavailable                                  |
+|      tuple-ids=0 row-size=19B cardinality=unavailable                  |
+|                                                                        |
+| F00:PLAN FRAGMENT [PARTITION=RANDOM]                                   |
+|   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, PARTITION=UNPARTITIONED] |
+|   00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM]                     |
+|      partitions=1/1 size=0B                                            |
+<strong class="ph b">|      table stats: unavailable                           
               |</strong>
+<strong class="ph b">|      column stats: unavailable                          
               |</strong>
+|      hosts=0 per-host-mem=0B                                           |
+|      tuple-ids=0 row-size=19B cardinality=unavailable                  |
++------------------------------------------------------------------------+
+</code></pre>
+
+    <p class="p">
+      As the warning message demonstrates, most of the information needed for 
Impala to do efficient query
+      planning, and for you to understand the performance characteristics of 
the query, requires running the
+      <code class="ph codeph">COMPUTE STATS</code> statement for the table:
+    </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; compute stats t1;
++-----------------------------------------+
+| summary                                 |
++-----------------------------------------+
+| Updated 1 partition(s) and 2 column(s). |
++-----------------------------------------+
+[localhost:21000] &gt; explain select * from t1;
++------------------------------------------------------------------------+
+| Explain String                                                         |
++------------------------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
+|                                                                        |
+| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED]                            |
+|   01:EXCHANGE [PARTITION=UNPARTITIONED]                                |
+|      hosts=0 per-host-mem=unavailable                                  |
+|      tuple-ids=0 row-size=20B cardinality=0                            |
+|                                                                        |
+| F00:PLAN FRAGMENT [PARTITION=RANDOM]                                   |
+|   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, PARTITION=UNPARTITIONED] |
+|   00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM]                     |
+|      partitions=1/1 size=0B                                            |
+<strong class="ph b">|      table stats: 0 rows total                          
               |</strong>
+<strong class="ph b">|      column stats: all                                  
               |</strong>
+|      hosts=0 per-host-mem=0B                                           |
+|      tuple-ids=0 row-size=20B cardinality=0                            |
++------------------------------------------------------------------------+
+</code></pre>
+
+    <p class="p">
+      Joins and other complicated, multi-part queries are the ones where you 
most commonly need to examine the
+      <code class="ph codeph">EXPLAIN</code> output and customize the amount 
of detail in the output. This example shows the
+      default <code class="ph codeph">EXPLAIN</code> output for a three-way 
join query, then the equivalent output with a
+      <code class="ph codeph">[SHUFFLE]</code> hint to change the join 
mechanism between the first two tables from a broadcast
+      join to a shuffle join.
+    </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; set explain_level=1;
+[localhost:21000] &gt; explain select one.*, two.*, three.* from t1 one, t1 
two, t1 three where one.x = two.x and two.x = three.x;
++---------------------------------------------------------+
+| Explain String                                          |
++---------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 |
+|                                                         |
+| 07:EXCHANGE [PARTITION=UNPARTITIONED]                   |
+| |                                                       |
+<strong class="ph b">| 04:HASH JOIN [INNER JOIN, BROADCAST]                    
|</strong>
+| |  hash predicates: two.x = three.x                     |
+| |                                                       |
+<strong class="ph b">| |--06:EXCHANGE [BROADCAST]                              
|</strong>
+| |  |                                                    |
+| |  02:SCAN HDFS [explain_plan.t1 three]                 |
+| |     partitions=1/1 size=0B                            |
+| |                                                       |
+<strong class="ph b">| 03:HASH JOIN [INNER JOIN, BROADCAST]                    
|</strong>
+| |  hash predicates: one.x = two.x                       |
+| |                                                       |
+<strong class="ph b">| |--05:EXCHANGE [BROADCAST]                              
|</strong>
+| |  |                                                    |
+| |  01:SCAN HDFS [explain_plan.t1 two]                   |
+| |     partitions=1/1 size=0B                            |
+| |                                                       |
+| 00:SCAN HDFS [explain_plan.t1 one]                      |
+|    partitions=1/1 size=0B                               |
++---------------------------------------------------------+
+[localhost:21000] &gt; explain select one.*, two.*, three.*
+                  &gt; from t1 one join [shuffle] t1 two join t1 three
+                  &gt; where one.x = two.x and two.x = three.x;
++---------------------------------------------------------+
+| Explain String                                          |
++---------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 |
+|                                                         |
+| 08:EXCHANGE [PARTITION=UNPARTITIONED]                   |
+| |                                                       |
+<strong class="ph b">| 04:HASH JOIN [INNER JOIN, BROADCAST]                    
|</strong>
+| |  hash predicates: two.x = three.x                     |
+| |                                                       |
+<strong class="ph b">| |--07:EXCHANGE [BROADCAST]                              
|</strong>
+| |  |                                                    |
+| |  02:SCAN HDFS [explain_plan.t1 three]                 |
+| |     partitions=1/1 size=0B                            |
+| |                                                       |
+<strong class="ph b">| 03:HASH JOIN [INNER JOIN, PARTITIONED]                  
|</strong>
+| |  hash predicates: one.x = two.x                       |
+| |                                                       |
+<strong class="ph b">| |--06:EXCHANGE [PARTITION=HASH(two.x)]                  
|</strong>
+| |  |                                                    |
+| |  01:SCAN HDFS [explain_plan.t1 two]                   |
+| |     partitions=1/1 size=0B                            |
+| |                                                       |
+<strong class="ph b">| 05:EXCHANGE [PARTITION=HASH(one.x)]                     
|</strong>
+| |                                                       |
+| 00:SCAN HDFS [explain_plan.t1 one]                      |
+|    partitions=1/1 size=0B                               |
++---------------------------------------------------------+
+</code></pre>
+
+    <p class="p">
+      For a join involving many different tables, the default <code class="ph 
codeph">EXPLAIN</code> output might stretch over
+      several pages, and the only details you care about might be the join 
order and the mechanism (broadcast or
+      shuffle) for joining each pair of tables. In that case, you might set 
<code class="ph codeph">EXPLAIN_LEVEL</code> to its
+      lowest value of 0, to focus on just the join order and join mechanism 
for each stage. The following example
+      shows how the rows from the first and second joined tables are hashed 
and divided among the nodes of the
+      cluster for further filtering; then the entire contents of the third 
table are broadcast to all nodes for the
+      final stage of join processing.
+    </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; set explain_level=0;
+[localhost:21000] &gt; explain select one.*, two.*, three.*
+                  &gt; from t1 one join [shuffle] t1 two join t1 three
+                  &gt; where one.x = two.x and two.x = three.x;
++---------------------------------------------------------+
+| Explain String                                          |
++---------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 |
+|                                                         |
+| 08:EXCHANGE [PARTITION=UNPARTITIONED]                   |
+<strong class="ph b">| 04:HASH JOIN [INNER JOIN, BROADCAST]                    
|</strong>
+<strong class="ph b">| |--07:EXCHANGE [BROADCAST]                              
|</strong>
+| |  02:SCAN HDFS [explain_plan.t1 three]                 |
+<strong class="ph b">| 03:HASH JOIN [INNER JOIN, PARTITIONED]                  
|</strong>
+<strong class="ph b">| |--06:EXCHANGE [PARTITION=HASH(two.x)]                  
|</strong>
+| |  01:SCAN HDFS [explain_plan.t1 two]                   |
+<strong class="ph b">| 05:EXCHANGE [PARTITION=HASH(one.x)]                     
|</strong>
+| 00:SCAN HDFS [explain_plan.t1 one]                      |
++---------------------------------------------------------+
+</code></pre>
+
+
+
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div 
class="parentlink"><strong>Parent topic:</strong> <a class="link" 
href="../topics/impala_query_options.html">Query Options for the SET 
Statement</a></div></div></nav></article></main></body></html>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_explain_plan.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_explain_plan.html 
b/docs/build/html/topics/impala_explain_plan.html
new file mode 100644
index 0000000..bcd0855
--- /dev/null
+++ b/docs/build/html/topics/impala_explain_plan.html
@@ -0,0 +1,592 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_performance.html"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" 
content="Impala"><meta name="version" content="Impala 2.8.x"><meta 
name="version" content="Impala 2.8.x"><meta name="version" content="Impala 
2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" 
content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta 
name="DC.Identifier" content="explain_plan"><link rel="stylesheet" 
type="text/css" href="../commonltr.css"><title>Understanding Impala Query 
Performance - EXPLAIN Plans and Query Profiles</title>
 </head><body id="explain_plan"><main role="main"><article role="article" 
aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Understanding Impala Query 
Performance - EXPLAIN Plans and Query Profiles</h1>
+  
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      To understand the high-level performance considerations for Impala 
queries, read the output of the
+      <code class="ph codeph">EXPLAIN</code> statement for the query. You can 
get the <code class="ph codeph">EXPLAIN</code> plan without
+      actually running the query itself.
+    </p>
+
+    <p class="p">
+      For an overview of the physical performance characteristics for a query, 
issue the <code class="ph codeph">SUMMARY</code>
+      statement in <span class="keyword cmdname">impala-shell</span> 
immediately after executing a query. This condensed information
+      shows which phases of execution took the most time, and how the 
estimates for memory usage and number of rows
+      at each phase compare to the actual values.
+    </p>
+
+    <p class="p">
+      To understand the detailed performance characteristics for a query, 
issue the <code class="ph codeph">PROFILE</code>
+      statement in <span class="keyword cmdname">impala-shell</span> 
immediately after executing a query. This low-level information
+      includes physical details about memory, CPU, I/O, and network usage, and 
thus is only available after the
+      query is actually run.
+    </p>
+
+    <p class="p toc inpage"></p>
+
+    <p class="p">
+      Also, see <a class="xref" 
href="impala_hbase.html#hbase_performance">Performance Considerations for the 
Impala-HBase Integration</a>
+      and <a class="xref" href="impala_s3.html#s3_performance">Understanding 
and Tuning Impala Query Performance for S3 Data</a>
+      for examples of interpreting
+      <code class="ph codeph">EXPLAIN</code> plans for queries against HBase 
tables
+      <span class="ph">and data stored in the Amazon Simple Storage System 
(S3)</span>.
+    </p>
+  </div>
+
+  <nav role="navigation" class="related-links"><div class="familylinks"><div 
class="parentlink"><strong>Parent topic:</strong> <a class="link" 
href="../topics/impala_performance.html">Tuning Impala for 
Performance</a></div></div></nav><article class="topic concept nested1" 
aria-labelledby="ariaid-title2" id="explain_plan__perf_explain">
+
+    <h2 class="title topictitle2" id="ariaid-title2">Using the EXPLAIN Plan 
for Performance Tuning</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        The <code class="ph codeph"><a class="xref" 
href="impala_explain.html#explain">EXPLAIN</a></code> statement gives you an 
outline
+        of the logical steps that a query will perform, such as how the work 
will be distributed among the nodes
+        and how intermediate results will be combined to produce the final 
result set. You can see these details
+        before actually running the query. You can use this information to 
check that the query will not operate in
+        some very unexpected or inefficient way.
+      </p>
+
+
+
+<pre class="pre codeblock"><code>[impalad-host:21000] &gt; explain select 
count(*) from customer_address;
++----------------------------------------------------------+
+| Explain String                                           |
++----------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=42.00MB VCores=1 |
+|                                                          |
+| 03:AGGREGATE [MERGE FINALIZE]                            |
+| |  output: sum(count(*))                                 |
+| |                                                        |
+| 02:EXCHANGE [PARTITION=UNPARTITIONED]                    |
+| |                                                        |
+| 01:AGGREGATE                                             |
+| |  output: count(*)                                      |
+| |                                                        |
+| 00:SCAN HDFS [default.customer_address]                  |
+|    partitions=1/1 size=5.25MB                            |
++----------------------------------------------------------+
+</code></pre>
+
+      <div class="p">
+        Read the <code class="ph codeph">EXPLAIN</code> plan from bottom to 
top:
+        <ul class="ul">
+          <li class="li">
+            The last part of the plan shows the low-level details such as the 
expected amount of data that will be
+            read, where you can judge the effectiveness of your partitioning 
strategy and estimate how long it will
+            take to scan a table based on total data size and the size of the 
cluster.
+          </li>
+
+          <li class="li">
+            As you work your way up, next you see the operations that will be 
parallelized and performed on each
+            Impala node.
+          </li>
+
+          <li class="li">
+            At the higher levels, you see how data flows when intermediate 
result sets are combined and transmitted
+            from one node to another.
+          </li>
+
+          <li class="li">
+            See <a class="xref" 
href="../shared/../topics/impala_explain_level.html#explain_level">EXPLAIN_LEVEL
 Query Option</a> for details about the
+            <code class="ph codeph">EXPLAIN_LEVEL</code> query option, which 
lets you customize how much detail to show in the
+            <code class="ph codeph">EXPLAIN</code> plan depending on whether 
you are doing high-level or low-level tuning,
+            dealing with logical or physical aspects of the query.
+          </li>
+        </ul>
+      </div>
+
+      <p class="p">
+        The <code class="ph codeph">EXPLAIN</code> plan is also printed at the 
beginning of the query profile report described in
+        <a class="xref" href="#perf_profile">Using the Query Profile for 
Performance Tuning</a>, for convenience in examining both the logical and 
physical aspects of the
+        query side-by-side.
+      </p>
+
+      <p class="p">
+        The amount of detail displayed in the <code class="ph 
codeph">EXPLAIN</code> output is controlled by the
+        <a class="xref" 
href="impala_explain_level.html#explain_level">EXPLAIN_LEVEL</a> query option. 
You typically
+        increase this setting from <code class="ph codeph">normal</code> to 
<code class="ph codeph">verbose</code> (or from <code class="ph codeph">0</code>
+        to <code class="ph codeph">1</code>) when doublechecking the presence 
of table and column statistics during performance
+        tuning, or when estimating query resource usage in conjunction with 
the resource management features.
+      </p>
+
+      
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title3" 
id="explain_plan__perf_summary">
+
+    <h2 class="title topictitle2" id="ariaid-title3">Using the SUMMARY Report 
for Performance Tuning</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        The <code class="ph codeph"><a class="xref" 
href="impala_shell_commands.html#shell_commands">SUMMARY</a></code> command 
within
+        the <span class="keyword cmdname">impala-shell</span> interpreter 
gives you an easy-to-digest overview of the timings for the
+        different phases of execution for a query. Like the <code class="ph 
codeph">EXPLAIN</code> plan, it is easy to see
+        potential performance bottlenecks. Like the <code class="ph 
codeph">PROFILE</code> output, it is available after the
+        query is run and so displays actual timing numbers.
+      </p>
+
+      <p class="p">
+        The <code class="ph codeph">SUMMARY</code> report is also printed at 
the beginning of the query profile report described
+        in <a class="xref" href="#perf_profile">Using the Query Profile for 
Performance Tuning</a>, for convenience in examining high-level and low-level 
aspects of the query
+        side-by-side.
+      </p>
+
+      <p class="p">
+        For example, here is a query involving an aggregate function, on a 
single-node VM. The different stages of
+        the query and their timings are shown (rolled up for all nodes), along 
with estimated and actual values
+        used in planning the query. In this case, the <code class="ph 
codeph">AVG()</code> function is computed for a subset of
+        data on each node (stage 01) and then the aggregated results from all 
nodes are combined at the end (stage
+        03). You can see which stages took the most time, and whether any 
estimates were substantially different
+        than the actual data distribution. (When examining the time values, be 
sure to consider the suffixes such
+        as <code class="ph codeph">us</code> for microseconds and <code 
class="ph codeph">ms</code> for milliseconds, rather than just looking
+        for the largest numbers.)
+      </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; select 
avg(ss_sales_price) from store_sales where ss_coupon_amt = 0;
++---------------------+
+| avg(ss_sales_price) |
++---------------------+
+| 37.80770926328327   |
++---------------------+
+[localhost:21000] &gt; summary;
++--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
+| Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem 
| Est. Peak Mem | Detail          |
++--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
+| 03:AGGREGATE | 1      | 1.03ms   | 1.03ms   | 1     | 1          | 48.00 KB 
| -1 B          | MERGE FINALIZE  |
+| 02:EXCHANGE  | 1      | 0ns      | 0ns      | 1     | 1          | 0 B      
| -1 B          | UNPARTITIONED   |
+| 01:AGGREGATE | 1      | 30.79ms  | 30.79ms  | 1     | 1          | 80.00 KB 
| 10.00 MB      |                 |
+| 00:SCAN HDFS | 1      | 5.45s    | 5.45s    | 2.21M | -1         | 64.05 MB 
| 432.00 MB     | tpc.store_sales |
++--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
+</code></pre>
+
+      <p class="p">
+        Notice how the longest initial phase of the query is measured in 
seconds (s), while later phases working on
+        smaller intermediate results are measured in milliseconds (ms) or even 
nanoseconds (ns).
+      </p>
+
+      <p class="p">
+        Here is an example from a more complicated query, as it would appear 
in the <code class="ph codeph">PROFILE</code>
+        output:
+      </p>
+
+<pre class="pre codeblock"><code>Operator              #Hosts   Avg Time   Max 
Time    #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail
+------------------------------------------------------------------------------------------------------------------------
+09:MERGING-EXCHANGE        1   79.738us   79.738us        5           5        
 0        -1.00 B  UNPARTITIONED
+05:TOP-N                   3   84.693us   88.810us        5           5  12.00 
KB       120.00 B
+04:AGGREGATE               3    5.263ms    6.432ms        5           5  44.00 
KB       10.00 MB  MERGE FINALIZE
+08:AGGREGATE               3   16.659ms   27.444ms   52.52K     600.12K   3.20 
MB       15.11 MB  MERGE
+07:EXCHANGE                3    2.644ms      5.1ms   52.52K     600.12K        
 0              0  HASH(o_orderpriority)
+03:AGGREGATE               3  342.913ms  966.291ms   52.52K     600.12K  10.80 
MB       15.11 MB
+02:HASH JOIN               3    2s165ms    2s171ms  144.87K     600.12K  13.63 
MB      941.01 KB  INNER JOIN, BROADCAST
+|--06:EXCHANGE             3    8.296ms    8.692ms   57.22K      15.00K        
 0              0  BROADCAST
+|  01:SCAN HDFS            2    1s412ms    1s978ms   57.22K      15.00K  24.21 
MB      176.00 MB  tpch.orders o
+00:SCAN HDFS               3    8s032ms    8s558ms    3.79M     600.12K  32.29 
MB      264.00 MB  tpch.lineitem l
+</code></pre>
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title4" 
id="explain_plan__perf_profile">
+
+    <h2 class="title topictitle2" id="ariaid-title4">Using the Query Profile 
for Performance Tuning</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        The <code class="ph codeph">PROFILE</code> statement, available in the 
<span class="keyword cmdname">impala-shell</span> interpreter,
+        produces a detailed low-level report showing how the most recent query 
was executed. Unlike the
+        <code class="ph codeph">EXPLAIN</code> plan described in <a 
class="xref" href="#perf_explain">Using the EXPLAIN Plan for Performance 
Tuning</a>, this information is only available
+        after the query has finished. It shows physical details such as the 
number of bytes read, maximum memory
+        usage, and so on for each node. You can use this information to 
determine if the query is I/O-bound or
+        CPU-bound, whether some network condition is imposing a bottleneck, 
whether a slowdown is affecting some
+        nodes but not others, and to check that recommended configuration 
settings such as short-circuit local
+        reads are in effect.
+      </p>
+
+      <p class="p">
+        By default, time values in the profile output reflect the wall-clock 
time taken by an operation.
+        For values denoting system time or user time, the measurement unit is 
reflected in the metric
+        name, such as <code class="ph codeph">ScannerThreadsSysTime</code> or 
<code class="ph codeph">ScannerThreadsUserTime</code>.
+        For example, a multi-threaded I/O operation might show a small figure 
for wall-clock time,
+        while the corresponding system time is larger, representing the sum of 
the CPU time taken by each thread.
+        Or a wall-clock time figure might be larger because it counts time 
spent waiting, while
+        the corresponding system and user time figures only measure the time 
while the operation
+        is actively using CPU cycles.
+      </p>
+
+      <p class="p">
+        The <a class="xref" href="impala_explain_plan.html#perf_explain"><code 
class="ph codeph">EXPLAIN</code> plan</a> is also printed
+        at the beginning of the query profile report, for convenience in 
examining both the logical and physical
+        aspects of the query side-by-side. The
+        <a class="xref" 
href="impala_explain_level.html#explain_level">EXPLAIN_LEVEL</a> query option 
also controls the
+        verbosity of the <code class="ph codeph">EXPLAIN</code> output printed 
by the <code class="ph codeph">PROFILE</code> command.
+      </p>
+
+      
+
+      <p class="p">
+        Here is an example of a query profile, from a relatively 
straightforward query on a single-node
+        pseudo-distributed cluster to keep the output relatively brief.
+      </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; profile;
+Query Runtime Profile:
+Query (id=6540a03d4bee0691:4963d6269b210ebd):
+  Summary:
+    Session ID: ea4a197f1c7bf858:c74e66f72e3a33ba
+    Session Type: BEESWAX
+    Start Time: 2013-12-02 17:10:30.263067000
+    End Time: 2013-12-02 17:10:50.932044000
+    Query Type: QUERY
+    Query State: FINISHED
+    Query Status: OK
+    Impala Version: impalad version 1.2.1 RELEASE (build 
edb5af1bcad63d410bc5d47cc203df3a880e9324)
+    User: doc_demo
+    Network Address: 127.0.0.1:49161
+    Default Db: stats_testing
+    Sql Statement: select t1.s, t2.s from t1 join t2 on (t1.id = t2.parent)
+    Plan:
+----------------
+Estimated Per-Host Requirements: Memory=2.09GB VCores=2
+
+PLAN FRAGMENT 0
+  PARTITION: UNPARTITIONED
+
+  4:EXCHANGE
+     cardinality: unavailable
+     per-host memory: unavailable
+     tuple ids: 0 1
+
+PLAN FRAGMENT 1
+  PARTITION: RANDOM
+
+  STREAM DATA SINK
+    EXCHANGE ID: 4
+    UNPARTITIONED
+
+  2:HASH JOIN
+  |  join op: INNER JOIN (BROADCAST)
+  |  hash predicates:
+  |    t1.id = t2.parent
+  |  cardinality: unavailable
+  |  per-host memory: 2.00GB
+  |  tuple ids: 0 1
+  |
+  |----3:EXCHANGE
+  |       cardinality: unavailable
+  |       per-host memory: 0B
+  |       tuple ids: 1
+  |
+  0:SCAN HDFS
+     table=stats_testing.t1 #partitions=1/1 size=33B
+     table stats: unavailable
+     column stats: unavailable
+     cardinality: unavailable
+     per-host memory: 32.00MB
+     tuple ids: 0
+
+PLAN FRAGMENT 2
+  PARTITION: RANDOM
+
+  STREAM DATA SINK
+    EXCHANGE ID: 3
+    UNPARTITIONED
+
+  1:SCAN HDFS
+     table=stats_testing.t2 #partitions=1/1 size=960.00KB
+     table stats: unavailable
+     column stats: unavailable
+     cardinality: unavailable
+     per-host memory: 96.00MB
+     tuple ids: 1
+----------------
+    Query Timeline: 20s670ms
+       - Start execution: 2.559ms (2.559ms)
+       - Planning finished: 23.587ms (21.27ms)
+       - Rows available: 666.199ms (642.612ms)
+       - First row fetched: 668.919ms (2.719ms)
+       - Unregister query: 20s668ms (20s000ms)
+  ImpalaServer:
+     - ClientFetchWaitTimer: 19s637ms
+     - RowMaterializationTimer: 167.121ms
+  Execution Profile 6540a03d4bee0691:4963d6269b210ebd:(Active: 837.815ms, % 
non-child: 0.00%)
+    Per Node Peak Memory Usage: impala-1.example.com:22000(7.42 MB)
+     - FinalizationTimer: 0ns
+    Coordinator Fragment:(Active: 195.198ms, % non-child: 0.00%)
+      MemoryUsage(500.0ms): 16.00 KB, 7.42 MB, 7.33 MB, 7.10 MB, 6.94 MB, 6.71 
MB, 6.56 MB, 6.40 MB, 6.17 MB, 6.02 MB, 5.79 MB, 5.63 MB, 5.48 MB, 5.25 MB, 
5.09 MB, 4.86 MB, 4.71 MB, 4.47 MB, 4.32 MB, 4.09 MB, 3.93 MB, 3.78 MB, 3.55 
MB, 3.39 MB, 3.16 MB, 3.01 MB, 2.78 MB, 2.62 MB, 2.39 MB, 2.24 MB, 2.08 MB, 
1.85 MB, 1.70 MB, 1.54 MB, 1.31 MB, 1.16 MB, 948.00 KB, 790.00 KB, 553.00 KB, 
395.00 KB, 237.00 KB
+      ThreadUsage(500.0ms): 1
+       - AverageThreadTokens: 1.00
+       - PeakMemoryUsage: 7.42 MB
+       - PrepareTime: 36.144us
+       - RowsProduced: 98.30K (98304)
+       - TotalCpuTime: 20s449ms
+       - TotalNetworkWaitTime: 191.630ms
+       - TotalStorageWaitTime: 0ns
+      CodeGen:(Active: 150.679ms, % non-child: 77.19%)
+         - CodegenTime: 0ns
+         - CompileTime: 139.503ms
+         - LoadTime: 10.7ms
+         - ModuleFileSize: 95.27 KB
+      EXCHANGE_NODE (id=4):(Active: 194.858ms, % non-child: 99.83%)
+         - BytesReceived: 2.33 MB
+         - ConvertRowBatchTime: 2.732ms
+         - DataArrivalWaitTime: 191.118ms
+         - DeserializeRowBatchTimer: 14.943ms
+         - FirstBatchArrivalWaitTime: 191.117ms
+         - PeakMemoryUsage: 7.41 MB
+         - RowsReturned: 98.30K (98304)
+         - RowsReturnedRate: 504.49 K/sec
+         - SendersBlockedTimer: 0ns
+         - SendersBlockedTotalTimer(*): 0ns
+    Averaged Fragment 1:(Active: 442.360ms, % non-child: 0.00%)
+      split sizes:  min: 33.00 B, max: 33.00 B, avg: 33.00 B, stddev: 0.00
+      completion times: min:443.720ms  max:443.720ms  mean: 443.720ms  
stddev:0ns
+      execution rates: min:74.00 B/sec  max:74.00 B/sec  mean:74.00 B/sec  
stddev:0.00 /sec
+      num instances: 1
+       - AverageThreadTokens: 1.00
+       - PeakMemoryUsage: 6.06 MB
+       - PrepareTime: 7.291ms
+       - RowsProduced: 98.30K (98304)
+       - TotalCpuTime: 784.259ms
+       - TotalNetworkWaitTime: 388.818ms
+       - TotalStorageWaitTime: 3.934ms
+      CodeGen:(Active: 312.862ms, % non-child: 70.73%)
+         - CodegenTime: 2.669ms
+         - CompileTime: 302.467ms
+         - LoadTime: 9.231ms
+         - ModuleFileSize: 95.27 KB
+      DataStreamSender (dst_id=4):(Active: 80.63ms, % non-child: 18.10%)
+         - BytesSent: 2.33 MB
+         - NetworkThroughput(*): 35.89 MB/sec
+         - OverallThroughput: 29.06 MB/sec
+         - PeakMemoryUsage: 5.33 KB
+         - SerializeBatchTime: 26.487ms
+         - ThriftTransmitTime(*): 64.814ms
+         - UncompressedRowBatchSize: 6.66 MB
+      HASH_JOIN_NODE (id=2):(Active: 362.25ms, % non-child: 3.92%)
+         - BuildBuckets: 1.02K (1024)
+         - BuildRows: 98.30K (98304)
+         - BuildTime: 12.622ms
+         - LoadFactor: 0.00
+         - PeakMemoryUsage: 6.02 MB
+         - ProbeRows: 3
+         - ProbeTime: 3.579ms
+         - RowsReturned: 98.30K (98304)
+         - RowsReturnedRate: 271.54 K/sec
+        EXCHANGE_NODE (id=3):(Active: 344.680ms, % non-child: 77.92%)
+           - BytesReceived: 1.15 MB
+           - ConvertRowBatchTime: 2.792ms
+           - DataArrivalWaitTime: 339.936ms
+           - DeserializeRowBatchTimer: 9.910ms
+           - FirstBatchArrivalWaitTime: 199.474ms
+           - PeakMemoryUsage: 156.00 KB
+           - RowsReturned: 98.30K (98304)
+           - RowsReturnedRate: 285.20 K/sec
+           - SendersBlockedTimer: 0ns
+           - SendersBlockedTotalTimer(*): 0ns
+      HDFS_SCAN_NODE (id=0):(Active: 13.616us, % non-child: 0.00%)
+         - AverageHdfsReadThreadConcurrency: 0.00
+         - AverageScannerThreadConcurrency: 0.00
+         - BytesRead: 33.00 B
+         - BytesReadLocal: 33.00 B
+         - BytesReadShortCircuit: 33.00 B
+         - NumDisksAccessed: 1
+         - NumScannerThreadsStarted: 1
+         - PeakMemoryUsage: 46.00 KB
+         - PerReadThreadRawHdfsThroughput: 287.52 KB/sec
+         - RowsRead: 3
+         - RowsReturned: 3
+         - RowsReturnedRate: 220.33 K/sec
+         - ScanRangesComplete: 1
+         - ScannerThreadsInvoluntaryContextSwitches: 26
+         - ScannerThreadsTotalWallClockTime: 55.199ms
+           - DelimiterParseTime: 2.463us
+           - MaterializeTupleTime(*): 1.226us
+           - ScannerThreadsSysTime: 0ns
+           - ScannerThreadsUserTime: 42.993ms
+         - ScannerThreadsVoluntaryContextSwitches: 1
+         - TotalRawHdfsReadTime(*): 112.86us
+         - TotalReadThroughput: 0.00 /sec
+    Averaged Fragment 2:(Active: 190.120ms, % non-child: 0.00%)
+      split sizes:  min: 960.00 KB, max: 960.00 KB, avg: 960.00 KB, stddev: 
0.00
+      completion times: min:191.736ms  max:191.736ms  mean: 191.736ms  
stddev:0ns
+      execution rates: min:4.89 MB/sec  max:4.89 MB/sec  mean:4.89 MB/sec  
stddev:0.00 /sec
+      num instances: 1
+       - AverageThreadTokens: 0.00
+       - PeakMemoryUsage: 906.33 KB
+       - PrepareTime: 3.67ms
+       - RowsProduced: 98.30K (98304)
+       - TotalCpuTime: 403.351ms
+       - TotalNetworkWaitTime: 34.999ms
+       - TotalStorageWaitTime: 108.675ms
+      CodeGen:(Active: 162.57ms, % non-child: 85.24%)
+         - CodegenTime: 3.133ms
+         - CompileTime: 148.316ms
+         - LoadTime: 12.317ms
+         - ModuleFileSize: 95.27 KB
+      DataStreamSender (dst_id=3):(Active: 70.620ms, % non-child: 37.14%)
+         - BytesSent: 1.15 MB
+         - NetworkThroughput(*): 23.30 MB/sec
+         - OverallThroughput: 16.23 MB/sec
+         - PeakMemoryUsage: 5.33 KB
+         - SerializeBatchTime: 22.69ms
+         - ThriftTransmitTime(*): 49.178ms
+         - UncompressedRowBatchSize: 3.28 MB
+      HDFS_SCAN_NODE (id=1):(Active: 118.839ms, % non-child: 62.51%)
+         - AverageHdfsReadThreadConcurrency: 0.00
+         - AverageScannerThreadConcurrency: 0.00
+         - BytesRead: 960.00 KB
+         - BytesReadLocal: 960.00 KB
+         - BytesReadShortCircuit: 960.00 KB
+         - NumDisksAccessed: 1
+         - NumScannerThreadsStarted: 1
+         - PeakMemoryUsage: 869.00 KB
+         - PerReadThreadRawHdfsThroughput: 130.21 MB/sec
+         - RowsRead: 98.30K (98304)
+         - RowsReturned: 98.30K (98304)
+         - RowsReturnedRate: 827.20 K/sec
+         - ScanRangesComplete: 15
+         - ScannerThreadsInvoluntaryContextSwitches: 34
+         - ScannerThreadsTotalWallClockTime: 189.774ms
+           - DelimiterParseTime: 15.703ms
+           - MaterializeTupleTime(*): 3.419ms
+           - ScannerThreadsSysTime: 1.999ms
+           - ScannerThreadsUserTime: 44.993ms
+         - ScannerThreadsVoluntaryContextSwitches: 118
+         - TotalRawHdfsReadTime(*): 7.199ms
+         - TotalReadThroughput: 0.00 /sec
+    Fragment 1:
+      Instance 6540a03d4bee0691:4963d6269b210ebf 
(host=impala-1.example.com:22000):(Active: 442.360ms, % non-child: 0.00%)
+        Hdfs split stats (&lt;volume id&gt;:&lt;# splits&gt;/&lt;split 
lengths&gt;): 0:1/33.00 B
+        MemoryUsage(500.0ms): 69.33 KB
+        ThreadUsage(500.0ms): 1
+         - AverageThreadTokens: 1.00
+         - PeakMemoryUsage: 6.06 MB
+         - PrepareTime: 7.291ms
+         - RowsProduced: 98.30K (98304)
+         - TotalCpuTime: 784.259ms
+         - TotalNetworkWaitTime: 388.818ms
+         - TotalStorageWaitTime: 3.934ms
+        CodeGen:(Active: 312.862ms, % non-child: 70.73%)
+           - CodegenTime: 2.669ms
+           - CompileTime: 302.467ms
+           - LoadTime: 9.231ms
+           - ModuleFileSize: 95.27 KB
+        DataStreamSender (dst_id=4):(Active: 80.63ms, % non-child: 18.10%)
+           - BytesSent: 2.33 MB
+           - NetworkThroughput(*): 35.89 MB/sec
+           - OverallThroughput: 29.06 MB/sec
+           - PeakMemoryUsage: 5.33 KB
+           - SerializeBatchTime: 26.487ms
+           - ThriftTransmitTime(*): 64.814ms
+           - UncompressedRowBatchSize: 6.66 MB
+        HASH_JOIN_NODE (id=2):(Active: 362.25ms, % non-child: 3.92%)
+          ExecOption: Build Side Codegen Enabled, Probe Side Codegen Enabled, 
Hash Table Built Asynchronously
+           - BuildBuckets: 1.02K (1024)
+           - BuildRows: 98.30K (98304)
+           - BuildTime: 12.622ms
+           - LoadFactor: 0.00
+           - PeakMemoryUsage: 6.02 MB
+           - ProbeRows: 3
+           - ProbeTime: 3.579ms
+           - RowsReturned: 98.30K (98304)
+           - RowsReturnedRate: 271.54 K/sec
+          EXCHANGE_NODE (id=3):(Active: 344.680ms, % non-child: 77.92%)
+             - BytesReceived: 1.15 MB
+             - ConvertRowBatchTime: 2.792ms
+             - DataArrivalWaitTime: 339.936ms
+             - DeserializeRowBatchTimer: 9.910ms
+             - FirstBatchArrivalWaitTime: 199.474ms
+             - PeakMemoryUsage: 156.00 KB
+             - RowsReturned: 98.30K (98304)
+             - RowsReturnedRate: 285.20 K/sec
+             - SendersBlockedTimer: 0ns
+             - SendersBlockedTotalTimer(*): 0ns
+        HDFS_SCAN_NODE (id=0):(Active: 13.616us, % non-child: 0.00%)
+          Hdfs split stats (&lt;volume id&gt;:&lt;# splits&gt;/&lt;split 
lengths&gt;): 0:1/33.00 B
+          Hdfs Read Thread Concurrency Bucket: 0:0% 1:0%
+          File Formats: TEXT/NONE:1
+          ExecOption: Codegen enabled: 1 out of 1
+           - AverageHdfsReadThreadConcurrency: 0.00
+           - AverageScannerThreadConcurrency: 0.00
+           - BytesRead: 33.00 B
+           - BytesReadLocal: 33.00 B
+           - BytesReadShortCircuit: 33.00 B
+           - NumDisksAccessed: 1
+           - NumScannerThreadsStarted: 1
+           - PeakMemoryUsage: 46.00 KB
+           - PerReadThreadRawHdfsThroughput: 287.52 KB/sec
+           - RowsRead: 3
+           - RowsReturned: 3
+           - RowsReturnedRate: 220.33 K/sec
+           - ScanRangesComplete: 1
+           - ScannerThreadsInvoluntaryContextSwitches: 26
+           - ScannerThreadsTotalWallClockTime: 55.199ms
+             - DelimiterParseTime: 2.463us
+             - MaterializeTupleTime(*): 1.226us
+             - ScannerThreadsSysTime: 0ns
+             - ScannerThreadsUserTime: 42.993ms
+           - ScannerThreadsVoluntaryContextSwitches: 1
+           - TotalRawHdfsReadTime(*): 112.86us
+           - TotalReadThroughput: 0.00 /sec
+    Fragment 2:
+      Instance 6540a03d4bee0691:4963d6269b210ec0 
(host=impala-1.example.com:22000):(Active: 190.120ms, % non-child: 0.00%)
+        Hdfs split stats (&lt;volume id&gt;:&lt;# splits&gt;/&lt;split 
lengths&gt;): 0:15/960.00 KB
+         - AverageThreadTokens: 0.00
+         - PeakMemoryUsage: 906.33 KB
+         - PrepareTime: 3.67ms
+         - RowsProduced: 98.30K (98304)
+         - TotalCpuTime: 403.351ms
+         - TotalNetworkWaitTime: 34.999ms
+         - TotalStorageWaitTime: 108.675ms
+        CodeGen:(Active: 162.57ms, % non-child: 85.24%)
+           - CodegenTime: 3.133ms
+           - CompileTime: 148.316ms
+           - LoadTime: 12.317ms
+           - ModuleFileSize: 95.27 KB
+        DataStreamSender (dst_id=3):(Active: 70.620ms, % non-child: 37.14%)
+           - BytesSent: 1.15 MB
+           - NetworkThroughput(*): 23.30 MB/sec
+           - OverallThroughput: 16.23 MB/sec
+           - PeakMemoryUsage: 5.33 KB
+           - SerializeBatchTime: 22.69ms
+           - ThriftTransmitTime(*): 49.178ms
+           - UncompressedRowBatchSize: 3.28 MB
+        HDFS_SCAN_NODE (id=1):(Active: 118.839ms, % non-child: 62.51%)
+          Hdfs split stats (&lt;volume id&gt;:&lt;# splits&gt;/&lt;split 
lengths&gt;): 0:15/960.00 KB
+          Hdfs Read Thread Concurrency Bucket: 0:0% 1:0%
+          File Formats: TEXT/NONE:15
+          ExecOption: Codegen enabled: 15 out of 15
+           - AverageHdfsReadThreadConcurrency: 0.00
+           - AverageScannerThreadConcurrency: 0.00
+           - BytesRead: 960.00 KB
+           - BytesReadLocal: 960.00 KB
+           - BytesReadShortCircuit: 960.00 KB
+           - NumDisksAccessed: 1
+           - NumScannerThreadsStarted: 1
+           - PeakMemoryUsage: 869.00 KB
+           - PerReadThreadRawHdfsThroughput: 130.21 MB/sec
+           - RowsRead: 98.30K (98304)
+           - RowsReturned: 98.30K (98304)
+           - RowsReturnedRate: 827.20 K/sec
+           - ScanRangesComplete: 15
+           - ScannerThreadsInvoluntaryContextSwitches: 34
+           - ScannerThreadsTotalWallClockTime: 189.774ms
+             - DelimiterParseTime: 15.703ms
+             - MaterializeTupleTime(*): 3.419ms
+             - ScannerThreadsSysTime: 1.999ms
+             - ScannerThreadsUserTime: 44.993ms
+           - ScannerThreadsVoluntaryContextSwitches: 118
+           - TotalRawHdfsReadTime(*): 7.199ms
+           - TotalReadThroughput: 0.00 /sec</code></pre>
+    </div>
+  </article>
+</article></main></body></html>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_faq.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_faq.html 
b/docs/build/html/topics/impala_faq.html
new file mode 100644
index 0000000..b85bb8a
--- /dev/null
+++ b/docs/build/html/topics/impala_faq.html
@@ -0,0 +1,21 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta 
name="prodname" content="Impala"><meta name="version" content="Impala 
2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" 
content="XHTML"><meta name="DC.Identifier" content="faq"><link rel="stylesheet" 
type="text/css" href="../commonltr.css"><title>Impala Frequently Asked 
Questions</title></head><body id="faq"><main role="main"><article 
role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Impala Frequently Asked 
Questions</h1>
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      This section lists frequently asked questions for Apache Impala 
(incubating),
+      the interactive SQL engine for Hadoop.
+    </p>
+
+    <p class="p">
+      This section is under construction.
+    </p>
+
+  </div>
+
+</article></main></body></html>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_file_formats.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_file_formats.html 
b/docs/build/html/topics/impala_file_formats.html
new file mode 100644
index 0000000..d9ccbca
--- /dev/null
+++ b/docs/build/html/topics/impala_file_formats.html
@@ -0,0 +1,236 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_txtfile.html"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_parquet.html"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_avro.html"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_rcfile.html"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_seqfile.html"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="version" 
content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta 
name="DC.Format" content="XHTML"><meta name="DC.Identifier" 
content="file_formats"><link rel="stylesheet" type="text/css" 
href="../commonltr.css"><title>How Impala Works with Hado
 op File Formats</title></head><body id="file_formats"><main 
role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">How Impala Works with 
Hadoop File Formats</h1>
+  
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      
+      
+      Impala supports several familiar file formats used in Apache Hadoop. 
Impala can load and query data files
+      produced by other Hadoop components such as Pig or MapReduce, and data 
files produced by Impala can be used
+      by other components also. The following sections discuss the procedures, 
limitations, and performance
+      considerations for using each file format with Impala.
+    </p>
+
+    <p class="p">
+      The file format used for an Impala table has significant performance 
consequences. Some file formats include
+      compression support that affects the size of data on the disk and, 
consequently, the amount of I/O and CPU
+      resources required to deserialize data. The amounts of I/O and CPU 
resources required can be a limiting
+      factor in query performance since querying often begins with moving and 
decompressing data. To reduce the
+      potential impact of this part of the process, data is often compressed. 
By compressing data, a smaller total
+      number of bytes are transferred from disk to memory. This reduces the 
amount of time taken to transfer the
+      data, but a tradeoff occurs when the CPU decompresses the content.
+    </p>
+
+    <p class="p">
+      Impala can query files encoded with most of the popular file formats and 
compression codecs used in Hadoop.
+      Impala can create and insert data into tables that use some file formats 
but not others; for file formats
+      that Impala cannot write to, create the table in Hive, issue the <code 
class="ph codeph">INVALIDATE METADATA <var class="keyword 
varname">table_name</var></code>
+      statement in <code class="ph codeph">impala-shell</code>, and query the 
table through Impala. File formats can be
+      structured, in which case they may include metadata and built-in 
compression. Supported formats include:
+    </p>
+
+    <table class="table"><caption><span class="table--title-label">Table 1. 
</span><span class="title">File Format Support in 
Impala</span></caption><colgroup><col style="width:10%"><col 
style="width:10%"><col style="width:20%"><col style="width:30%"><col 
style="width:30%"></colgroup><thead class="thead">
+          <tr class="row">
+            <th class="entry nocellnorowborder" id="file_formats__entry__1">
+              File Type
+            </th>
+            <th class="entry nocellnorowborder" id="file_formats__entry__2">
+              Format
+            </th>
+            <th class="entry nocellnorowborder" id="file_formats__entry__3">
+              Compression Codecs
+            </th>
+            <th class="entry nocellnorowborder" id="file_formats__entry__4">
+              Impala Can CREATE?
+            </th>
+            <th class="entry nocellnorowborder" id="file_formats__entry__5">
+              Impala Can INSERT?
+            </th>
+          </tr>
+        </thead><tbody class="tbody">
+          <tr class="row" id="file_formats__parquet_support">
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__1 ">
+              <a class="xref" href="impala_parquet.html#parquet">Parquet</a>
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__2 ">
+              Structured
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__3 ">
+              Snappy, gzip; currently Snappy by default
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__4 ">
+              Yes.
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__5 ">
+              Yes: <code class="ph codeph">CREATE TABLE</code>, <code 
class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, and 
query.
+            </td>
+          </tr>
+          <tr class="row" id="file_formats__txtfile_support">
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__1 ">
+              <a class="xref" href="impala_txtfile.html#txtfile">Text</a>
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__2 ">
+              Unstructured
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__3 ">
+              LZO, gzip, bzip2, Snappy
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__4 ">
+              Yes. For <code class="ph codeph">CREATE TABLE</code> with no 
<code class="ph codeph">STORED AS</code> clause, the default file
+              format is uncompressed text, with values separated by ASCII 
<code class="ph codeph">0x01</code> characters
+              (typically represented as Ctrl-A).
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__5 ">
+              Yes: <code class="ph codeph">CREATE TABLE</code>, <code 
class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, and 
query.
+              If LZO compression is used, you must create the table and load 
data in Hive. If other kinds of
+              compression are used, you must load data through <code class="ph 
codeph">LOAD DATA</code>, Hive, or manually in
+              HDFS.
+
+
+            </td>
+          </tr>
+          <tr class="row" id="file_formats__avro_support">
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__1 ">
+              <a class="xref" href="impala_avro.html#avro">Avro</a>
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__2 ">
+              Structured
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__3 ">
+              Snappy, gzip, deflate, bzip2
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__4 ">
+              Yes, in Impala 1.4.0 and higher. Before that, create the table 
using Hive.
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__5 ">
+              No. Import data by using <code class="ph codeph">LOAD 
DATA</code> on data files already in the right format, or use
+              <code class="ph codeph">INSERT</code> in Hive followed by <code 
class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> 
in Impala.
+            </td>
+
+          </tr>
+          <tr class="row" id="file_formats__rcfile_support">
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__1 ">
+              <a class="xref" href="impala_rcfile.html#rcfile">RCFile</a>
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__2 ">
+              Structured
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__3 ">
+              Snappy, gzip, deflate, bzip2
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__4 ">
+              Yes.
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__5 ">
+              No. Import data by using <code class="ph codeph">LOAD 
DATA</code> on data files already in the right format, or use
+              <code class="ph codeph">INSERT</code> in Hive followed by <code 
class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> 
in Impala.
+            </td>
+
+          </tr>
+          <tr class="row" id="file_formats__sequencefile_support">
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__1 ">
+              <a class="xref" 
href="impala_seqfile.html#seqfile">SequenceFile</a>
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__2 ">
+              Structured
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__3 ">
+              Snappy, gzip, deflate, bzip2
+            </td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__4 ">Yes.</td>
+            <td class="entry nocellnorowborder" 
headers="file_formats__entry__5 ">
+              No. Import data by using <code class="ph codeph">LOAD 
DATA</code> on data files already in the right format, or use
+              <code class="ph codeph">INSERT</code> in Hive followed by <code 
class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> 
in Impala.
+            </td>
+
+          </tr>
+        </tbody></table>
+
+    <p class="p">
+      Impala can only query the file formats listed in the preceding table.
+      In particular, Impala does not support the ORC file format.
+    </p>
+
+    <p class="p">
+      Impala supports the following compression codecs:
+    </p>
+
+    <ul class="ul">
+      <li class="li">
+        Snappy. Recommended for its effective balance between compression 
ratio and decompression speed. Snappy
+        compression is very fast, but gzip provides greater space savings. 
Supported for text files in Impala 2.0
+        and higher.
+
+      </li>
+
+      <li class="li">
+        Gzip. Recommended when achieving the highest level of compression (and 
therefore greatest disk-space
+        savings) is desired. Supported for text files in Impala 2.0 and higher.
+      </li>
+
+      <li class="li">
+        Deflate. Not supported for text files.
+      </li>
+
+      <li class="li">
+        Bzip2. Supported for text files in Impala 2.0 and higher.
+
+      </li>
+
+      <li class="li">
+        <p class="p"> LZO, for text files only. Impala can query
+          LZO-compressed text tables, but currently cannot create them or 
insert
+          data into them; perform these operations in Hive. </p>
+      </li>
+    </ul>
+  </div>
+
+  <nav role="navigation" class="related-links"><ul class="ullinks"><li 
class="link ulchildlink"><strong><a href="../topics/impala_txtfile.html">Using 
Text Data Files with Impala Tables</a></strong><br></li><li class="link 
ulchildlink"><strong><a href="../topics/impala_parquet.html">Using the Parquet 
File Format with Impala Tables</a></strong><br></li><li class="link 
ulchildlink"><strong><a href="../topics/impala_avro.html">Using the Avro File 
Format with Impala Tables</a></strong><br></li><li class="link 
ulchildlink"><strong><a href="../topics/impala_rcfile.html">Using the RCFile 
File Format with Impala Tables</a></strong><br></li><li class="link 
ulchildlink"><strong><a href="../topics/impala_seqfile.html">Using the 
SequenceFile File Format with Impala 
Tables</a></strong><br></li></ul></nav><article class="topic concept nested1" 
aria-labelledby="ariaid-title2" id="file_formats__file_format_choosing">
+
+    <h2 class="title topictitle2" id="ariaid-title2">Choosing the File Format 
for a Table</h2>
+  
+
+    <div class="body conbody">
+
+      <p class="p">
+        Different file formats and compression codecs work better for 
different data sets. While Impala typically
+        provides performance gains regardless of file format, choosing the 
proper format for your data can yield
+        further performance improvements. Use the following considerations to 
decide which combination of file
+        format and compression to use for a particular table:
+      </p>
+
+      <ul class="ul">
+        <li class="li">
+          If you are working with existing files that are already in a 
supported file format, use the same format
+          for the Impala table where practical. If the original format does 
not yield acceptable query performance
+          or resource usage, consider creating a new Impala table with 
different file format or compression
+          characteristics, and doing a one-time conversion by copying the data 
to the new table using the
+          <code class="ph codeph">INSERT</code> statement. Depending on the 
file format, you might run the
+          <code class="ph codeph">INSERT</code> statement in <code class="ph 
codeph">impala-shell</code> or in Hive.
+        </li>
+
+        <li class="li">
+          Text files are convenient to produce through many different tools, 
and are human-readable for ease of
+          verification and debugging. Those characteristics are why text is 
the default format for an Impala
+          <code class="ph codeph">CREATE TABLE</code> statement. When 
performance and resource usage are the primary
+          considerations, use one of the other file formats and consider using 
compression. A typical workflow
+          might involve bringing data into an Impala table by copying CSV or 
TSV files into the appropriate data
+          directory, and then using the <code class="ph codeph">INSERT ... 
SELECT</code> syntax to copy the data into a table
+          using a different, more compact file format.
+        </li>
+
+        <li class="li">
+          If your architecture involves storing data to be queried in memory, 
do not compress the data. There is no
+          I/O savings since the data does not need to be moved from disk, but 
there is a CPU cost to decompress the
+          data.
+        </li>
+      </ul>
+    </div>
+  </article>
+</article></main></body></html>
\ No newline at end of file

[34/51] [partial] incubator-impala git commit: IMPALA-4181 [DOCS] Publish rendered Impala documentation to ASF site

Reply via email to