[16/51] [partial] incubator-impala git commit: IMPALA-4181 [DOCS] Publish rendered Impala documentation to ASF site

jbapple Wed, 12 Apr 2017 11:26:42 -0700

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_partitioning.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_partitioning.html 
b/docs/build/html/topics/impala_partitioning.html
new file mode 100644
index 0000000..b361083
--- /dev/null
+++ b/docs/build/html/topics/impala_partitioning.html
@@ -0,0 +1,653 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta 
name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta 
name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta 
name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta 
name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta 
name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta 
name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta 
name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta 
name="version" content="Impala 2.8.x"><meta name="version" content="Impala 
2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version"
  content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta 
name="version" content="Impala 2.8.x"><meta name="version" content="Impala 
2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" 
content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta 
name="version" content="Impala 2.8.x"><meta name="version" content="Impala 
2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" 
content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta 
name="DC.Format" content="XHTML"><meta name="DC.Identifier" 
content="partitioning"><link rel="stylesheet" type="text/css" 
href="../commonltr.css"><title>Partitioning for Impala 
Tables</title></head><body id="partitioning"><main role="main"><article 
role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Partitioning for Impala 
Tables</h1>
+
+  
+
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      
+      By default, all the data files for a table are located in a single 
directory. Partitioning is a technique for physically dividing the
+      data during loading, based on values from one or more columns, to speed 
up queries that test those columns. For example, with a
+      <code class="ph codeph">school_records</code> table partitioned on a 
<code class="ph codeph">year</code> column, there is a separate data directory 
for each
+      different year value, and all the data for that year is stored in a data 
file in that directory. A query that includes a
+      <code class="ph codeph">WHERE</code> condition such as <code class="ph 
codeph">YEAR=1966</code>, <code class="ph codeph">YEAR IN (1989,1999)</code>, 
or <code class="ph codeph">YEAR BETWEEN
+      1984 AND 1989</code> can examine only the data files from the 
appropriate directory or directories, greatly reducing the amount of
+      data to read and test.
+    </p>
+
+    <p class="p toc inpage"></p>
+
+    <p class="p">
+      See <a class="xref" 
href="impala_tutorial.html#tut_external_partition_data">Attaching an External 
Partitioned Table to an HDFS Directory Structure</a> for an example that 
illustrates the syntax for creating partitioned
+      tables, the underlying directory structure in HDFS, and how to attach a 
partitioned Impala external table to data files stored
+      elsewhere in HDFS.
+    </p>
+
+    <p class="p">
+      Parquet is a popular format for partitioned Impala tables because it is 
well suited to handle huge data volumes. See
+      <a class="xref" href="impala_parquet.html#parquet_performance">Query 
Performance for Impala Parquet Tables</a> for performance considerations for 
partitioned Parquet tables.
+    </p>
+
+    <p class="p">
+      See <a class="xref" href="impala_literals.html#null">NULL</a> for 
details about how <code class="ph codeph">NULL</code> values are represented in 
partitioned tables.
+    </p>
+
+    <p class="p">
+      See <a class="xref" href="impala_s3.html#s3">Using Impala with the 
Amazon S3 Filesystem</a> for details about setting up tables where some or all 
partitions reside on the Amazon Simple
+      Storage Service (S3).
+    </p>
+
+  </div>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title2" 
id="partitioning__partitioning_choosing">
+
+    <h2 class="title topictitle2" id="ariaid-title2">When to Use Partitioned 
Tables</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        Partitioning is typically appropriate for:
+      </p>
+
+      <ul class="ul">
+        <li class="li">
+          Tables that are very large, where reading the entire data set takes 
an impractical amount of time.
+        </li>
+
+        <li class="li">
+          Tables that are always or almost always queried with conditions on 
the partitioning columns. In our example of a table partitioned
+          by year, <code class="ph codeph">SELECT COUNT(*) FROM school_records 
WHERE year = 1985</code> is efficient, only examining a small fraction of
+          the data; but <code class="ph codeph">SELECT COUNT(*) FROM 
school_records</code> has to process a separate data file for each year, 
resulting in
+          more overall work than in an unpartitioned table. You would probably 
not partition this way if you frequently queried the table
+          based on last name, student ID, and so on without testing the year.
+        </li>
+
+        <li class="li">
+          Columns that have reasonable cardinality (number of different 
values). If a column only has a small number of values, for example
+          <code class="ph codeph">Male</code> or <code class="ph 
codeph">Female</code>, you do not gain much efficiency by eliminating only 
about 50% of the data to
+          read for each query. If a column has only a few rows matching each 
value, the number of directories to process can become a
+          limiting factor, and the data file in each directory could be too 
small to take advantage of the Hadoop mechanism for transmitting
+          data in multi-megabyte blocks. For example, you might partition 
census data by year, store sales data by year and month, and web
+          traffic data by year, month, and day. (Some users with high volumes 
of incoming data might even partition down to the individual
+          hour and minute.)
+        </li>
+
+        <li class="li">
+          Data that already passes through an extract, transform, and load 
(ETL) pipeline. The values of the partitioning columns are
+          stripped from the original data files and represented by directory 
names, so loading data into a partitioned table involves some
+          sort of transformation or preprocessing.
+        </li>
+      </ul>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title3" 
id="partitioning__partition_sql">
+
+    <h2 class="title topictitle2" id="ariaid-title3">SQL Statements for 
Partitioned Tables</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        In terms of Impala SQL syntax, partitioning affects these statements:
+      </p>
+
+      <ul class="ul">
+        <li class="li">
+          <code class="ph codeph"><a class="xref" 
href="impala_create_table.html#create_table">CREATE TABLE</a></code>: you 
specify a <code class="ph codeph">PARTITIONED
+          BY</code> clause when creating the table to identify names and data 
types of the partitioning columns. These columns are not
+          included in the main list of columns for the table.
+        </li>
+
+        <li class="li">
+          In <span class="keyword">Impala 2.5</span> and higher, you can also 
use the <code class="ph codeph">PARTITIONED BY</code> clause in a <code 
class="ph codeph">CREATE TABLE AS
+          SELECT</code> statement. This syntax lets you use a single statement 
to create a partitioned table, copy data into it, and
+          create new partitions based on the values in the inserted data.
+        </li>
+
+        <li class="li">
+          <code class="ph codeph"><a class="xref" 
href="impala_alter_table.html#alter_table">ALTER TABLE</a></code>: you can add 
or drop partitions, to work with
+          different portions of a huge data set. You can designate the HDFS 
directory that holds the data files for a specific partition.
+          With data partitioned by date values, you might <span class="q">"age 
out"</span> data that is no longer relevant.
+          <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+        If you are creating a partition for the first time and specifying its 
location, for maximum efficiency, use
+        a single <code class="ph codeph">ALTER TABLE</code> statement 
including both the <code class="ph codeph">ADD PARTITION</code> and
+        <code class="ph codeph">LOCATION</code> clauses, rather than separate 
statements with <code class="ph codeph">ADD PARTITION</code> and
+        <code class="ph codeph">SET LOCATION</code> clauses.
+      </div>
+        </li>
+
+        <li class="li">
+          <code class="ph codeph"><a class="xref" 
href="impala_insert.html#insert">INSERT</a></code>: When you insert data into a 
partitioned table, you identify
+          the partitioning columns. One or more values from each inserted row 
are not stored in data files, but instead determine the
+          directory where that row value is stored. You can also specify which 
partition to load a set of data into, with <code class="ph codeph">INSERT
+          OVERWRITE</code> statements; you can replace the contents of a 
specific partition but you cannot append data to a specific
+          partition.
+          <p class="p">
+        By default, if an <code class="ph codeph">INSERT</code> statement 
creates any new subdirectories underneath a partitioned
+        table, those subdirectories are assigned default HDFS permissions for 
the <code class="ph codeph">impala</code> user. To
+        make each subdirectory have the same permissions as its parent 
directory in HDFS, specify the
+        <code class="ph codeph">--insert_inherit_permissions</code> startup 
option for the <span class="keyword cmdname">impalad</span> daemon.
+      </p>
+        </li>
+
+        <li class="li">
+          Although the syntax of the <code class="ph codeph"><a class="xref" 
href="impala_select.html#select">SELECT</a></code> statement is the same 
whether or
+          not the table is partitioned, the way queries interact with 
partitioned tables can have a dramatic impact on performance and
+          scalability. The mechanism that lets queries skip certain partitions 
during a query is known as partition pruning; see
+          <a class="xref" 
href="impala_partitioning.html#partition_pruning">Partition Pruning for 
Queries</a> for details.
+        </li>
+
+        <li class="li">
+          In Impala 1.4 and later, there is a <code class="ph codeph">SHOW 
PARTITIONS</code> statement that displays information about each partition in a
+          table. See <a class="xref" href="impala_show.html#show">SHOW 
Statement</a> for details.
+        </li>
+      </ul>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title4" 
id="partitioning__partition_static_dynamic">
+
+    <h2 class="title topictitle2" id="ariaid-title4">Static and Dynamic 
Partitioning Clauses</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        Specifying all the partition columns in a SQL statement is called <dfn 
class="term">static partitioning</dfn>, because the statement affects a
+        single predictable partition. For example, you use static partitioning 
with an <code class="ph codeph">ALTER TABLE</code> statement that affects
+        only one partition, or with an <code class="ph codeph">INSERT</code> 
statement that inserts all values into the same partition:
+      </p>
+
+<pre class="pre codeblock"><code>insert into t1 <strong class="ph 
b">partition(x=10, y='a')</strong> select c1 from some_other_table;
+</code></pre>
+
+      <p class="p">
+        When you specify some partition key columns in an <code class="ph 
codeph">INSERT</code> statement, but leave out the values, Impala determines
+        which partition to insert. This technique is called <dfn 
class="term">dynamic partitioning</dfn>:
+      </p>
+
+<pre class="pre codeblock"><code>insert into t1 <strong class="ph 
b">partition(x, y='b')</strong> select c1, c2 from some_other_table;
+-- Create new partition if necessary based on variable year, month, and day; 
insert a single value.
+insert into weather <strong class="ph b">partition (year, month, day)</strong> 
select 'cloudy',2014,4,21;
+-- Create new partition if necessary for specified year and month but variable 
day; insert a single value.
+insert into weather <strong class="ph b">partition (year=2014, month=04, 
day)</strong> select 'sunny',22;
+</code></pre>
+
+      <p class="p">
+        The more key columns you specify in the <code class="ph 
codeph">PARTITION</code> clause, the fewer columns you need in the <code 
class="ph codeph">SELECT</code>
+        list. The trailing columns in the <code class="ph 
codeph">SELECT</code> list are substituted in order for the partition key 
columns with no
+        specified value.
+      </p>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title5" 
id="partitioning__partition_refresh">
+
+    <h2 class="title topictitle2" id="ariaid-title5">Refreshing a Single 
Partition</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        The <code class="ph codeph">REFRESH</code> statement is typically used 
with partitioned tables when new data files are loaded into a partition by
+        some non-Impala mechanism, such as a Hive or Spark job. The <code 
class="ph codeph">REFRESH</code> statement makes Impala aware of the new data
+        files so that they can be used in Impala queries. Because partitioned 
tables typically contain a high volume of data, the
+        <code class="ph codeph">REFRESH</code> operation for a full 
partitioned table can take significant time.
+      </p>
+
+      <p class="p">
+        In <span class="keyword">Impala 2.7</span> and higher, you can include 
a <code class="ph codeph">PARTITION (<var class="keyword 
varname">partition_spec</var>)</code> clause in the
+        <code class="ph codeph">REFRESH</code> statement so that only a single 
partition is refreshed. For example, <code class="ph codeph">REFRESH big_table 
PARTITION
+        (year=2017, month=9, day=30)</code>. The partition spec must include 
all the partition key columns. See
+        <a class="xref" href="impala_refresh.html#refresh">REFRESH 
Statement</a> for more details and examples of <code class="ph 
codeph">REFRESH</code> syntax and usage.
+      </p>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title6" 
id="partitioning__partition_permissions">
+
+    <h2 class="title topictitle2" id="ariaid-title6">Permissions for Partition 
Subdirectories</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        By default, if an <code class="ph codeph">INSERT</code> statement 
creates any new subdirectories underneath a partitioned
+        table, those subdirectories are assigned default HDFS permissions for 
the <code class="ph codeph">impala</code> user. To
+        make each subdirectory have the same permissions as its parent 
directory in HDFS, specify the
+        <code class="ph codeph">--insert_inherit_permissions</code> startup 
option for the <span class="keyword cmdname">impalad</span> daemon.
+      </p>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title7" 
id="partitioning__partition_pruning">
+
+    <h2 class="title topictitle2" id="ariaid-title7">Partition Pruning for 
Queries</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        Partition pruning refers to the mechanism where a query can skip 
reading the data files corresponding to one or more partitions. If
+        you can arrange for queries to prune large numbers of unnecessary 
partitions from the query execution plan, the queries use fewer
+        resources and are thus proportionally faster and more scalable.
+      </p>
+
+      <p class="p">
+        For example, if a table is partitioned by columns <code class="ph 
codeph">YEAR</code>, <code class="ph codeph">MONTH</code>, and <code class="ph 
codeph">DAY</code>, then
+        <code class="ph codeph">WHERE</code> clauses such as <code class="ph 
codeph">WHERE year = 2013</code>, <code class="ph codeph">WHERE year &lt; 
2010</code>, or <code class="ph codeph">WHERE
+        year BETWEEN 1995 AND 1998</code> allow Impala to skip the data files 
in all partitions outside the specified range. Likewise,
+        <code class="ph codeph">WHERE year = 2013 AND month BETWEEN 1 AND 
3</code> could prune even more partitions, reading the data files for only a
+        portion of one year.
+      </p>
+
+      <p class="p toc inpage"></p>
+
+    </div>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title8" 
id="partition_pruning__partition_pruning_checking">
+
+      <h3 class="title topictitle3" id="ariaid-title8">Checking if Partition 
Pruning Happens for a Query</h3>
+
+      <div class="body conbody">
+
+        <p class="p">
+          To check the effectiveness of partition pruning for a query, check 
the <code class="ph codeph">EXPLAIN</code> output for the query before
+          running it. For example, this example shows a table with 3 
partitions, where the query only reads 1 of them. The notation
+          <code class="ph codeph">#partitions=1/3</code> in the <code 
class="ph codeph">EXPLAIN</code> plan confirms that Impala can do the 
appropriate partition
+          pruning.
+        </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; insert into census 
partition (year=2010) values ('Smith'),('Jones');
+[localhost:21000] &gt; insert into census partition (year=2011) values 
('Smith'),('Jones'),('Doe');
+[localhost:21000] &gt; insert into census partition (year=2012) values 
('Smith'),('Doe');
+[localhost:21000] &gt; select name from census where year=2010;
++-------+
+| name  |
++-------+
+| Smith |
+| Jones |
++-------+
+[localhost:21000] &gt; explain select name from census <strong class="ph 
b">where year=2010</strong>;
++------------------------------------------------------------------+
+| Explain String                                                   |
++------------------------------------------------------------------+
+| PLAN FRAGMENT 0                                                  |
+|   PARTITION: UNPARTITIONED                                       |
+|                                                                  |
+|   1:EXCHANGE                                                     |
+|                                                                  |
+| PLAN FRAGMENT 1                                                  |
+|   PARTITION: RANDOM                                              |
+|                                                                  |
+|   STREAM DATA SINK                                               |
+|     EXCHANGE ID: 1                                               |
+|     UNPARTITIONED                                                |
+|                                                                  |
+|   0:SCAN HDFS                                                    |
+|      table=predicate_propagation.census <strong class="ph 
b">#partitions=1/3</strong> size=12B |
++------------------------------------------------------------------+</code></pre>
+
+        <p class="p">
+          For a report of the volume of data that was actually read and 
processed at each stage of the query, check the output of the
+          <code class="ph codeph">SUMMARY</code> command immediately after 
running the query. For a more detailed analysis, look at the output of the
+          <code class="ph codeph">PROFILE</code> command; it includes this 
same summary report near the start of the profile output.
+        </p>
+
+      </div>
+
+    </article>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title9" 
id="partition_pruning__partition_pruning_sql">
+
+      <h3 class="title topictitle3" id="ariaid-title9">What SQL Constructs 
Work with Partition Pruning</h3>
+
+      <div class="body conbody">
+
+        <p class="p">
+          
+          Impala can even do partition pruning in cases where the partition 
key column is not directly compared to a constant, by applying
+          the transitive property to other parts of the <code class="ph 
codeph">WHERE</code> clause. This technique is known as predicate propagation, 
and
+          is available in Impala 1.2.2 and later. In this example, the census 
table includes another column indicating when the data was
+          collected, which happens in 10-year intervals. Even though the query 
does not compare the partition key column
+          (<code class="ph codeph">YEAR</code>) to a constant value, Impala 
can deduce that only the partition <code class="ph codeph">YEAR=2010</code> is 
required, and
+          again only reads 1 out of 3 partitions.
+        </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; drop table census;
+[localhost:21000] &gt; create table census (name string, census_year int) 
partitioned by (year int);
+[localhost:21000] &gt; insert into census partition (year=2010) values 
('Smith',2010),('Jones',2010);
+[localhost:21000] &gt; insert into census partition (year=2011) values 
('Smith',2020),('Jones',2020),('Doe',2020);
+[localhost:21000] &gt; insert into census partition (year=2012) values 
('Smith',2020),('Doe',2020);
+[localhost:21000] &gt; select name from census where year = census_year and 
census_year=2010;
++-------+
+| name  |
++-------+
+| Smith |
+| Jones |
++-------+
+[localhost:21000] &gt; explain select name from census <strong class="ph 
b">where year = census_year and census_year=2010</strong>;
++------------------------------------------------------------------+
+| Explain String                                                   |
++------------------------------------------------------------------+
+| PLAN FRAGMENT 0                                                  |
+|   PARTITION: UNPARTITIONED                                       |
+|                                                                  |
+|   1:EXCHANGE                                                     |
+|                                                                  |
+| PLAN FRAGMENT 1                                                  |
+|   PARTITION: RANDOM                                              |
+|                                                                  |
+|   STREAM DATA SINK                                               |
+|     EXCHANGE ID: 1                                               |
+|     UNPARTITIONED                                                |
+|                                                                  |
+|   0:SCAN HDFS                                                    |
+|      table=predicate_propagation.census <strong class="ph 
b">#partitions=1/3</strong> size=22B |
+|      predicates: census_year = 2010, year = census_year          |
++------------------------------------------------------------------+
+</code></pre>
+
+        <p class="p">
+        If a view applies to a partitioned table, any partition pruning 
considers the clauses on both
+        the original query and any additional <code class="ph 
codeph">WHERE</code> predicates in the query that refers to the view.
+        Prior to Impala 1.4, only the <code class="ph codeph">WHERE</code> 
clauses on the original query from the
+        <code class="ph codeph">CREATE VIEW</code> statement were used for 
partition pruning.
+      </p>
+
+        <p class="p">
+        In queries involving both analytic functions and partitioned tables, 
partition pruning only occurs for columns named in the <code class="ph 
codeph">PARTITION BY</code>
+        clause of the analytic function call. For example, if an analytic 
function query has a clause such as <code class="ph codeph">WHERE 
year=2016</code>,
+        the way to make the query prune all other <code class="ph 
codeph">YEAR</code> partitions is to include <code class="ph codeph">PARTITION 
BY year</code>in the analytic function call;
+        for example, <code class="ph codeph">OVER (PARTITION BY year,<var 
class="keyword varname">other_columns</var> <var class="keyword 
varname">other_analytic_clauses</var>)</code>.
+
+      </p>
+
+      </div>
+
+    </article>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title10" 
id="partition_pruning__dynamic_partition_pruning">
+
+      <h3 class="title topictitle3" id="ariaid-title10">Dynamic Partition 
Pruning</h3>
+
+      <div class="body conbody">
+
+        <p class="p">
+          The original mechanism uses to prune partitions is <dfn 
class="term">static partition pruning</dfn>, in which the conditions in the
+          <code class="ph codeph">WHERE</code> clause are analyzed to 
determine in advance which partitions can be safely skipped. In <span 
class="keyword">Impala 2.5</span>
+          and higher, Impala can perform <dfn class="term">dynamic partition 
pruning</dfn>, where information about the partitions is collected during
+          the query, and Impala prunes unnecessary partitions in ways that 
were impractical to predict in advance.
+        </p>
+
+        <p class="p">
+          For example, if partition key columns are compared to literal values 
in a <code class="ph codeph">WHERE</code> clause, Impala can perform static
+          partition pruning during the planning phase to only read the 
relevant partitions:
+        </p>
+
+<pre class="pre codeblock"><code>
+-- The query only needs to read 3 partitions whose key values are known ahead 
of time.
+-- That's static partition pruning.
+SELECT COUNT(*) FROM sales_table WHERE year IN (2005, 2010, 2015);
+</code></pre>
+
+        <p class="p">
+          Dynamic partition pruning involves using information only available 
at run time, such as the result of a subquery:
+        </p>
+
+<pre class="pre codeblock"><code>
+create table yy (s string) partitioned by (year int) stored as parquet;
+insert into yy partition (year) values ('1999', 1999), ('2000', 2000),
+  ('2001', 2001), ('2010',2010);
+compute stats yy;
+
+create table yy2 (s string) partitioned by (year int) stored as parquet;
+insert into yy2 partition (year) values ('1999', 1999), ('2000', 2000),
+  ('2001', 2001);
+compute stats yy2;
+
+-- The query reads an unknown number of partitions, whose key values are only
+-- known at run time. The 'runtime filters' lines show how the information 
about
+-- the partitions is calculated in query fragment 02, and then used in query
+-- fragment 00 to decide which partitions to skip.
+explain select s from yy2 where year in (select year from yy where year 
between 2000 and 2005);
++----------------------------------------------------------+
+| Explain String                                           |
++----------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=16.00MB VCores=2 |
+|                                                          |
+| 04:EXCHANGE [UNPARTITIONED]                              |
+| |                                                        |
+| 02:HASH JOIN [LEFT SEMI JOIN, BROADCAST]                 |
+| |  hash predicates: year = year                          |
+| |  <strong class="ph b">runtime filters: RF000 &lt;- year</strong>           
             |
+| |                                                        |
+| |--03:EXCHANGE [BROADCAST]                               |
+| |  |                                                     |
+| |  01:SCAN HDFS [dpp.yy]                                 |
+| |     partitions=2/4 files=2 size=468B                   |
+| |                                                        |
+| 00:SCAN HDFS [dpp.yy2]                                   |
+|    partitions=2/3 files=2 size=468B                      |
+|    <strong class="ph b">runtime filters: RF000 -&gt; year</strong>           
             |
++----------------------------------------------------------+
+</code></pre>
+
+
+
+        <p class="p">
+          In this case, Impala evaluates the subquery, sends the subquery 
results to all Impala nodes participating in the query, and then
+          each <span class="keyword cmdname">impalad</span> daemon uses the 
dynamic partition pruning optimization to read only the partitions with the
+          relevant key values.
+        </p>
+
+        <p class="p">
+          Dynamic partition pruning is especially effective for queries 
involving joins of several large partitioned tables. Evaluating the
+          <code class="ph codeph">ON</code> clauses of the join predicates 
might normally require reading data from all partitions of certain tables. If
+          the <code class="ph codeph">WHERE</code> clauses of the query refer 
to the partition key columns, Impala can now often skip reading many of the
+          partitions while evaluating the <code class="ph codeph">ON</code> 
clauses. The dynamic partition pruning optimization reduces the amount of I/O
+          and the amount of intermediate data stored and transmitted across 
the network during the query.
+        </p>
+
+        <p class="p">
+        When the spill-to-disk feature is activated for a join node within a 
query, Impala does not
+        produce any runtime filters for that join operation on that host. 
Other join nodes within
+        the query are not affected.
+      </p>
+
+        <p class="p">
+          Dynamic partition pruning is part of the runtime filtering feature, 
which applies to other kinds of queries in addition to queries
+          against partitioned tables. See <a class="xref" 
href="impala_runtime_filtering.html#runtime_filtering">Runtime Filtering for 
Impala Queries (Impala 2.5 or higher only)</a> for full details about this 
feature.
+        </p>
+
+      </div>
+
+    </article>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title11" 
id="partitioning__partition_key_columns">
+
+    <h2 class="title topictitle2" id="ariaid-title11">Partition Key 
Columns</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        The columns you choose as the partition keys should be ones that are 
frequently used to filter query results in important,
+        large-scale queries. Popular examples are some combination of year, 
month, and day when the data has associated time values, and
+        geographic region when the data is associated with some place.
+      </p>
+
+      <ul class="ul">
+        <li class="li">
+          <p class="p">
+            For time-based data, split out the separate parts into their own 
columns, because Impala cannot partition based on a
+            <code class="ph codeph">TIMESTAMP</code> column.
+          </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+            The data type of the partition columns does not have a significant 
effect on the storage required, because the values from those
+            columns are not stored in the data files, rather they are 
represented as strings inside HDFS directory names.
+          </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+            In <span class="keyword">Impala 2.5</span> and higher, you can 
enable the <code class="ph codeph">OPTIMIZE_PARTITION_KEY_SCANS</code> query 
option to speed up
+            queries that only refer to partition key columns, such as <code 
class="ph codeph">SELECT MAX(year)</code>. This setting is not enabled by
+            default because the query behavior is slightly different if the 
table contains partition directories without actual data inside.
+            See <a class="xref" 
href="impala_optimize_partition_key_scans.html#optimize_partition_key_scans">OPTIMIZE_PARTITION_KEY_SCANS
 Query Option (Impala 2.5 or higher only)</a> for details.
+          </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+        Partitioned tables can contain complex type columns.
+        All the partition key columns must be scalar types.
+      </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+            Remember that when Impala queries data stored in HDFS, it is most 
efficient to use multi-megabyte files to take advantage of the
+            HDFS block size. For Parquet tables, the block size (and ideal 
size of the data files) is <span class="ph">256 MB in
+            Impala 2.0 and later</span>. Therefore, avoid specifying too many 
partition key columns, which could result in individual
+            partitions containing only small amounts of data. For example, if 
you receive 1 GB of data per day, you might partition by year,
+            month, and day; while if you receive 5 GB of data per minute, you 
might partition by year, month, day, hour, and minute. If you
+            have data with a geographic component, you might partition based 
on postal code if you have many megabytes of data for each
+            postal code, but if not, you might partition by some larger region 
such as city, state, or country. state
+          </p>
+        </li>
+      </ul>
+
+      <p class="p">
+        If you frequently run aggregate functions such as <code class="ph 
codeph">MIN()</code>, <code class="ph codeph">MAX()</code>, and
+        <code class="ph codeph">COUNT(DISTINCT)</code> on partition key 
columns, consider enabling the <code class="ph 
codeph">OPTIMIZE_PARTITION_KEY_SCANS</code>
+        query option, which optimizes such queries. This feature is available 
in <span class="keyword">Impala 2.5</span> and higher.
+        See <a class="xref" 
href="../shared/../topics/impala_optimize_partition_key_scans.html">OPTIMIZE_PARTITION_KEY_SCANS
 Query Option (Impala 2.5 or higher only)</a>
+        for the kinds of queries that this option applies to, and slight 
differences in how partitions are
+        evaluated when this query option is enabled.
+      </p>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title12" 
id="partitioning__mixed_format_partitions">
+
+    <h2 class="title topictitle2" id="ariaid-title12">Setting Different File 
Formats for Partitions</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        Partitioned tables have the flexibility to use different file formats 
for different partitions. (For background information about
+        the different file formats Impala supports, see <a class="xref" 
href="impala_file_formats.html#file_formats">How Impala Works with Hadoop File 
Formats</a>.) For example, if you originally
+        received data in text format, then received new data in RCFile format, 
and eventually began receiving data in Parquet format, all
+        that data could reside in the same table for queries. You just need to 
ensure that the table is structured so that the data files
+        that use different file formats reside in separate partitions.
+      </p>
+
+      <p class="p">
+        For example, here is how you might switch from text to Parquet data as 
you receive data for different years:
+      </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; create table census 
(name string) partitioned by (year smallint);
+[localhost:21000] &gt; alter table census add partition (year=2012); -- Text 
format;
+
+[localhost:21000] &gt; alter table census add partition (year=2013); -- Text 
format switches to Parquet before data loaded;
+[localhost:21000] &gt; alter table census partition (year=2013) set fileformat 
parquet;
+
+[localhost:21000] &gt; insert into census partition (year=2012) values 
('Smith'),('Jones'),('Lee'),('Singh');
+[localhost:21000] &gt; insert into census partition (year=2013) values 
('Flores'),('Bogomolov'),('Cooper'),('Appiah');</code></pre>
+
+      <p class="p">
+        At this point, the HDFS directory for <code class="ph 
codeph">year=2012</code> contains a text-format data file, while the HDFS 
directory for
+        <code class="ph codeph">year=2013</code> contains a Parquet data file. 
As always, when loading non-trivial data, you would use <code class="ph 
codeph">INSERT ...
+        SELECT</code> or <code class="ph codeph">LOAD DATA</code> to import 
data in large batches, rather than <code class="ph codeph">INSERT ... 
VALUES</code> which
+        produces small files that are inefficient for real-world queries.
+      </p>
+
+      <p class="p">
+        For other file types that Impala cannot create natively, you can 
switch into Hive and issue the <code class="ph codeph">ALTER TABLE ... SET
+        FILEFORMAT</code> statements and <code class="ph codeph">INSERT</code> 
or <code class="ph codeph">LOAD DATA</code> statements there. After switching 
back to
+        Impala, issue a <code class="ph codeph">REFRESH <var class="keyword 
varname">table_name</var></code> statement so that Impala recognizes any 
partitions or new
+        data added through Hive.
+      </p>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title13" 
id="partitioning__partition_management">
+
+    <h2 class="title topictitle2" id="ariaid-title13">Managing Partitions</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        You can add, drop, set the expected file format, or set the HDFS 
location of the data files for individual partitions within an
+        Impala table. See <a class="xref" 
href="impala_alter_table.html#alter_table">ALTER TABLE Statement</a> for syntax 
details, and
+        <a class="xref" 
href="impala_partitioning.html#mixed_format_partitions">Setting Different File 
Formats for Partitions</a> for tips on managing tables containing partitions 
with different file
+        formats.
+      </p>
+
+      <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+        If you are creating a partition for the first time and specifying its 
location, for maximum efficiency, use
+        a single <code class="ph codeph">ALTER TABLE</code> statement 
including both the <code class="ph codeph">ADD PARTITION</code> and
+        <code class="ph codeph">LOCATION</code> clauses, rather than separate 
statements with <code class="ph codeph">ADD PARTITION</code> and
+        <code class="ph codeph">SET LOCATION</code> clauses.
+      </div>
+
+      <p class="p">
+        What happens to the data files when a partition is dropped depends on 
whether the partitioned table is designated as internal or
+        external. For an internal (managed) table, the data files are deleted. 
For example, if data in the partitioned table is a copy of
+        raw data files stored elsewhere, you might save disk space by dropping 
older partitions that are no longer required for reporting,
+        knowing that the original data is still available if needed later. For 
an external table, the data files are left alone. For
+        example, dropping a partition without deleting the associated files 
lets Impala consider a smaller set of partitions, improving
+        query efficiency and reducing overhead for DDL operations on the 
table; if the data is needed again later, you can add the partition
+        again. See <a class="xref" href="impala_tables.html#tables">Overview 
of Impala Tables</a> for details and examples.
+      </p>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title14" 
id="partitioning__partition_kudu">
+
+    <h2 class="title topictitle2" id="ariaid-title14">Using Partitioning with 
Kudu Tables</h2>
+
+    
+
+    <div class="body conbody">
+
+      <p class="p">
+        Kudu tables use a more fine-grained partitioning scheme than tables 
containing HDFS data files. You specify a <code class="ph codeph">PARTITION
+        BY</code> clause with the <code class="ph codeph">CREATE TABLE</code> 
statement to identify how to divide the values from the partition key
+        columns.
+      </p>
+
+      <p class="p">
+        See <a class="xref" 
href="impala_kudu.html#kudu_partitioning">Partitioning for Kudu Tables</a> for
+        details and examples of the partitioning techniques
+        for Kudu tables.
+      </p>
+
+    </div>
+
+  </article>
+
+</article></main></body></html>
\ No newline at end of file


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_perf_benchmarking.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_perf_benchmarking.html 
b/docs/build/html/topics/impala_perf_benchmarking.html
new file mode 100644
index 0000000..ce9d995
--- /dev/null
+++ b/docs/build/html/topics/impala_perf_benchmarking.html
@@ -0,0 +1,27 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_performance.html"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="version" 
content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta 
name="DC.Format" content="XHTML"><meta name="DC.Identifier" 
content="perf_benchmarks"><link rel="stylesheet" type="text/css" 
href="../commonltr.css"><title>Benchmarking Impala Queries</title></head><body 
id="perf_benchmarks"><main role="main"><article role="article" 
aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Benchmarking Impala 
Queries</h1>
+  
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      Because Impala, like other Hadoop components, is designed to handle 
large data volumes in a distributed
+      environment, conduct any performance tests using realistic data and 
cluster configurations. Use a multi-node
+      cluster rather than a single node; run queries against tables containing 
terabytes of data rather than tens
+      of gigabytes. The parallel processing techniques used by Impala are most 
appropriate for workloads that are
+      beyond the capacity of a single server.
+    </p>
+
+    <p class="p">
+      When you run queries returning large numbers of rows, the CPU time to 
pretty-print the output can be
+      substantial, giving an inaccurate measurement of the actual query time. 
Consider using the
+      <code class="ph codeph">-B</code> option on the <code class="ph 
codeph">impala-shell</code> command to turn off the pretty-printing, and
+      optionally the <code class="ph codeph">-o</code> option to store query 
results in a file rather than printing to the
+      screen. See <a class="xref" 
href="impala_shell_options.html#shell_options">impala-shell Configuration 
Options</a> for details.
+    </p>
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div 
class="parentlink"><strong>Parent topic:</strong> <a class="link" 
href="../topics/impala_performance.html">Tuning Impala for 
Performance</a></div></div></nav></article></main></body></html>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_perf_cookbook.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_perf_cookbook.html 
b/docs/build/html/topics/impala_perf_cookbook.html
new file mode 100644
index 0000000..fc68b45
--- /dev/null
+++ b/docs/build/html/topics/impala_perf_cookbook.html
@@ -0,0 +1,256 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_performance.html"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="version" 
content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta 
name="DC.Format" content="XHTML"><meta name="DC.Identifier" 
content="perf_cookbook"><link rel="stylesheet" type="text/css" 
href="../commonltr.css"><title>Impala Performance Guidelines and Best 
Practices</title></head><body id="perf_cookbook"><main role="main"><article 
role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Impala Performance 
Guidelines and Best Practices</h1>
+  
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      Here are performance guidelines and best practices that you can use 
during planning, experimentation, and
+      performance tuning for an Impala-enabled <span class="keyword"></span> 
cluster. All of this information is also available in more
+      detail elsewhere in the Impala documentation; it is gathered together 
here to serve as a cookbook and
+      emphasize which performance techniques typically provide the highest 
return on investment
+    </p>
+
+    <p class="p toc inpage"></p>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_file_format"><h2 
class="title sectiontitle">Choose the appropriate file format for the data.</h2>
+
+      
+
+      <p class="p">
+        Typically, for large volumes of data (multiple gigabytes per table or 
partition), the Parquet file format
+        performs best because of its combination of columnar storage layout, 
large I/O request size, and
+        compression and encoding. See <a class="xref" 
href="impala_file_formats.html#file_formats">How Impala Works with Hadoop File 
Formats</a> for comparisons of all
+        file formats supported by Impala, and <a class="xref" 
href="impala_parquet.html#parquet">Using the Parquet File Format with Impala 
Tables</a> for details about the
+        Parquet file format.
+      </p>
+
+      <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+        For smaller volumes of data, a few gigabytes or less for each table or 
partition, you might not see
+        significant performance differences between file formats. At small 
data volumes, reduced I/O from an
+        efficient compressed file format can be counterbalanced by reduced 
opportunity for parallel execution. When
+        planning for a production deployment or conducting benchmarks, always 
use realistic data volumes to get a
+        true picture of performance and scalability.
+      </div>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_small_files"><h2 
class="title sectiontitle">Avoid data ingestion processes that produce many 
small files.</h2>
+
+      
+
+      <p class="p">
+        When producing data files outside of Impala, prefer either text format 
or Avro, where you can build up the
+        files row by row. Once the data is in Impala, you can convert it to 
the more efficient Parquet format and
+        split into multiple data files using a single <code class="ph 
codeph">INSERT ... SELECT</code> statement. Or, if you have
+        the infrastructure to produce multi-megabyte Parquet files as part of 
your data preparation process, do
+        that and skip the conversion step inside Impala.
+      </p>
+
+      <p class="p">
+        Always use <code class="ph codeph">INSERT ... SELECT</code> to copy 
significant volumes of data from table to table
+        within Impala. Avoid <code class="ph codeph">INSERT ... VALUES</code> 
for any substantial volume of data or
+        performance-critical tables, because each such statement produces a 
separate tiny data file. See
+        <a class="xref" href="impala_insert.html#insert">INSERT Statement</a> 
for examples of the <code class="ph codeph">INSERT ... SELECT</code> syntax.
+      </p>
+
+      <p class="p">
+        For example, if you have thousands of partitions in a Parquet table, 
each with less than
+        <span class="ph">256 MB</span> of data, consider partitioning in a 
less granular way, such as by
+        year / month rather than year / month / day. If an inefficient data 
ingestion process produces thousands of
+        data files in the same table or partition, consider compacting the 
data by performing an <code class="ph codeph">INSERT ...
+        SELECT</code> to copy all the data to a different table; the data will 
be reorganized into a smaller
+        number of larger files by this process.
+      </p>
+    </section>
+
+    <section class="section" 
id="perf_cookbook__perf_cookbook_partitioning"><h2 class="title 
sectiontitle">Choose partitioning granularity based on actual data volume.</h2>
+
+      
+
+      <p class="p">
+        Partitioning is a technique that physically divides the data based on 
values of one or more columns, such
+        as by year, month, day, region, city, section of a web site, and so 
on. When you issue queries that request
+        a specific value or range of values for the partition key columns, 
Impala can avoid reading the irrelevant
+        data, potentially yielding a huge savings in disk I/O.
+      </p>
+
+      <p class="p">
+        When deciding which column(s) to use for partitioning, choose the 
right level of granularity. For example,
+        should you partition by year, month, and day, or only by year and 
month? Choose a partitioning strategy
+        that puts at least <span class="ph">256 MB</span> of data in each 
partition, to take advantage of
+        HDFS bulk I/O and Impala distributed queries.
+      </p>
+
+      <p class="p">
+        Over-partitioning can also cause query planning to take longer than 
necessary, as Impala prunes the
+        unnecessary partitions. Ideally, keep the number of partitions in the 
table under 30 thousand.
+      </p>
+
+      <p class="p">
+        When preparing data files to go in a partition directory, create 
several large files rather than many small
+        ones. If you receive data in the form of many small files and have no 
control over the input format,
+        consider using the <code class="ph codeph">INSERT ... SELECT</code> 
syntax to copy data from one table or partition to
+        another, which compacts the files into a relatively small number 
(based on the number of nodes in the
+        cluster).
+      </p>
+
+      <p class="p">
+        If you need to reduce the overall number of partitions and increase 
the amount of data in each partition,
+        first look for partition key columns that are rarely referenced or are 
referenced in non-critical queries
+        (not subject to an SLA). For example, your web site log data might be 
partitioned by year, month, day, and
+        hour, but if most queries roll up the results by day, perhaps you only 
need to partition by year, month,
+        and day.
+      </p>
+
+      <p class="p">
+        If you need to reduce the granularity even more, consider creating 
<span class="q">"buckets"</span>, computed values
+        corresponding to different sets of partition key values. For example, 
you can use the
+        <code class="ph codeph">TRUNC()</code> function with a <code class="ph 
codeph">TIMESTAMP</code> column to group date and time values
+        based on intervals such as week or quarter. See
+        <a class="xref" 
href="impala_datetime_functions.html#datetime_functions">Impala Date and Time 
Functions</a> for details.
+      </p>
+
+      <p class="p">
+        See <a class="xref" 
href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a> 
for full details and performance considerations for
+        partitioning.
+      </p>
+    </section>
+
+    <section class="section" 
id="perf_cookbook__perf_cookbook_partition_keys"><h2 class="title 
sectiontitle">Use smallest appropriate integer types for partition key 
columns.</h2>
+
+      
+
+      <p class="p">
+        Although it is tempting to use strings for partition key columns, 
since those values are turned into HDFS
+        directory names anyway, you can minimize memory usage by using numeric 
values for common partition key
+        fields such as <code class="ph codeph">YEAR</code>, <code class="ph 
codeph">MONTH</code>, and <code class="ph codeph">DAY</code>. Use the smallest
+        integer type that holds the appropriate range of values, typically 
<code class="ph codeph">TINYINT</code> for
+        <code class="ph codeph">MONTH</code> and <code class="ph 
codeph">DAY</code>, and <code class="ph codeph">SMALLINT</code> for <code 
class="ph codeph">YEAR</code>.
+        Use the <code class="ph codeph">EXTRACT()</code> function to pull out 
individual date and time fields from a
+        <code class="ph codeph">TIMESTAMP</code> value, and <code class="ph 
codeph">CAST()</code> the return value to the appropriate integer
+        type.
+      </p>
+    </section>
+
+    <section class="section" 
id="perf_cookbook__perf_cookbook_parquet_block_size"><h2 class="title 
sectiontitle">Choose an appropriate Parquet block size.</h2>
+
+      
+
+      <p class="p">
+        By default, the Impala <code class="ph codeph">INSERT ... 
SELECT</code> statement creates Parquet files with a 256 MB
+        block size. (This default was changed in Impala 2.0. Formerly, the 
limit was 1 GB, but Impala made
+        conservative estimates about compression, resulting in files that were 
smaller than 1 GB.)
+      </p>
+
+      <p class="p">
+        Each Parquet file written by Impala is a single block, allowing the 
whole file to be processed as a unit by a single host.
+        As you copy Parquet files into HDFS or between HDFS filesystems, use 
<code class="ph codeph">hdfs dfs -pb</code> to preserve the original
+        block size.
+      </p>
+
+      <p class="p">
+        If there is only one or a few data block in your Parquet table, or in 
a partition that is the only one
+        accessed by a query, then you might experience a slowdown for a 
different reason: not enough data to take
+        advantage of Impala's parallel distributed queries. Each data block is 
processed by a single core on one of
+        the DataNodes. In a 100-node cluster of 16-core machines, you could 
potentially process thousands of data
+        files simultaneously. You want to find a sweet spot between <span 
class="q">"many tiny files"</span> and <span class="q">"single giant
+        file"</span> that balances bulk I/O and parallel processing. You can 
set the <code class="ph codeph">PARQUET_FILE_SIZE</code>
+        query option before doing an <code class="ph codeph">INSERT ... 
SELECT</code> statement to reduce the size of each
+        generated Parquet file. <span class="ph">(Specify the file size as an 
absolute number of bytes, or in Impala
+        2.0 and later, in units ending with <code class="ph codeph">m</code> 
for megabytes or <code class="ph codeph">g</code> for
+        gigabytes.)</span> Run benchmarks with different file sizes to find 
the right balance point for your
+        particular data volume.
+      </p>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_stats"><h2 
class="title sectiontitle">Gather statistics for all tables used in 
performance-critical or high-volume join queries.</h2>
+
+      
+
+      <p class="p">
+        Gather the statistics with the <code class="ph codeph">COMPUTE 
STATS</code> statement. See
+        <a class="xref" href="impala_perf_joins.html#perf_joins">Performance 
Considerations for Join Queries</a> for details.
+      </p>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_network"><h2 
class="title sectiontitle">Minimize the overhead of transmitting results back 
to the client.</h2>
+
+      
+
+      <p class="p">
+        Use techniques such as:
+      </p>
+
+      <ul class="ul">
+        <li class="li">
+          Aggregation. If you need to know how many rows match a condition, 
the total values of matching values
+          from some column, the lowest or highest matching value, and so on, 
call aggregate functions such as
+          <code class="ph codeph">COUNT()</code>, <code class="ph 
codeph">SUM()</code>, and <code class="ph codeph">MAX()</code> in the query 
rather than
+          sending the result set to an application and doing those 
computations there. Remember that the size of an
+          unaggregated result set could be huge, requiring substantial time to 
transmit across the network.
+        </li>
+
+        <li class="li">
+          Filtering. Use all applicable tests in the <code class="ph 
codeph">WHERE</code> clause of a query to eliminate rows
+          that are not relevant, rather than producing a big result set and 
filtering it using application logic.
+        </li>
+
+        <li class="li">
+          <code class="ph codeph">LIMIT</code> clause. If you only need to see 
a few sample values from a result set, or the top
+          or bottom values from a query using <code class="ph codeph">ORDER 
BY</code>, include the <code class="ph codeph">LIMIT</code> clause
+          to reduce the size of the result set rather than asking for the full 
result set and then throwing most of
+          the rows away.
+        </li>
+
+        <li class="li">
+          Avoid overhead from pretty-printing the result set and displaying it 
on the screen. When you retrieve the
+          results through <span class="keyword cmdname">impala-shell</span>, 
use <span class="keyword cmdname">impala-shell</span> options such as
+          <code class="ph codeph">-B</code> and <code class="ph 
codeph">--output_delimiter</code> to produce results without special
+          formatting, and redirect output to a file rather than printing to 
the screen. Consider using
+          <code class="ph codeph">INSERT ... SELECT</code> to write the 
results directly to new files in HDFS. See
+          <a class="xref" 
href="impala_shell_options.html#shell_options">impala-shell Configuration 
Options</a> for details about the
+          <span class="keyword cmdname">impala-shell</span> command-line 
options.
+        </li>
+      </ul>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_explain"><h2 
class="title sectiontitle">Verify that your queries are planned in an efficient 
logical manner.</h2>
+
+      
+
+      <p class="p">
+        Examine the <code class="ph codeph">EXPLAIN</code> plan for a query 
before actually running it. See
+        <a class="xref" href="impala_explain.html#explain">EXPLAIN 
Statement</a> and <a class="xref" 
href="impala_explain_plan.html#perf_explain">Using the EXPLAIN Plan for 
Performance Tuning</a> for
+        details.
+      </p>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_profile"><h2 
class="title sectiontitle">Verify performance characteristics of queries.</h2>
+
+      
+
+      <p class="p">
+        Verify that the low-level aspects of I/O, memory usage, network 
bandwidth, CPU utilization, and so on are
+        within expected ranges by examining the query profile for a query 
after running it. See
+        <a class="xref" href="impala_explain_plan.html#perf_profile">Using the 
Query Profile for Performance Tuning</a> for details.
+      </p>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_os"><h2 
class="title sectiontitle">Use appropriate operating system settings.</h2>
+
+      
+
+      <p class="p">
+        See <span class="xref">the documentation for your Apache Hadoop 
distribution</span> for recommendations about operating system
+        settings that you can change to influence Impala performance. In 
particular, you might find
+        that changing the <code class="ph codeph">vm.swappiness</code> Linux 
kernel setting to a non-zero value improves
+        overall performance.
+      </p>
+    </section>
+
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div 
class="parentlink"><strong>Parent topic:</strong> <a class="link" 
href="../topics/impala_performance.html">Tuning Impala for 
Performance</a></div></div></nav></article></main></body></html>
\ No newline at end of file

[16/51] [partial] incubator-impala git commit: IMPALA-4181 [DOCS] Publish rendered Impala documentation to ASF site

Reply via email to