http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_partitioning.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_partitioning.xml b/docs/topics/impala_partitioning.xml new file mode 100644 index 0000000..1aedfa1 --- /dev/null +++ b/docs/topics/impala_partitioning.xml @@ -0,0 +1,582 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="partitioning"> + + <title>Partitioning for Impala Tables</title> + + <titlealts audience="PDF"> + + <navtitle>Partitioning</navtitle> + + </titlealts> + + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">partitioning</indexterm> + By default, all the data files for a table are located in a single directory. Partitioning is a technique for physically dividing the + data during loading, based on values from one or more columns, to speed up queries that test those columns. For example, with a + <codeph>school_records</codeph> table partitioned on a <codeph>year</codeph> column, there is a separate data directory for each + different year value, and all the data for that year is stored in a data file in that directory. A query that includes a + <codeph>WHERE</codeph> condition such as <codeph>YEAR=1966</codeph>, <codeph>YEAR IN (1989,1999)</codeph>, or <codeph>YEAR BETWEEN + 1984 AND 1989</codeph> can examine only the data files from the appropriate directory or directories, greatly reducing the amount of + data to read and test. + </p> + + <p outputclass="toc inpage"/> + + <p> + See <xref href="impala_tutorial.xml#tut_external_partition_data"/> for an example that illustrates the syntax for creating partitioned + tables, the underlying directory structure in HDFS, and how to attach a partitioned Impala external table to data files stored + elsewhere in HDFS. + </p> + + <p> + Parquet is a popular format for partitioned Impala tables because it is well suited to handle huge data volumes. See + <xref href="impala_parquet.xml#parquet_performance"/> for performance considerations for partitioned Parquet tables. + </p> + + <p> + See <xref href="impala_literals.xml#null"/> for details about how <codeph>NULL</codeph> values are represented in partitioned tables. + </p> + + <p rev="2.2.0"> + See <xref href="impala_s3.xml#s3"/> for details about setting up tables where some or all partitions reside on the Amazon Simple + Storage Service (S3). + </p> + + </conbody> + + <concept id="partitioning_choosing"> + + <title>When to Use Partitioned Tables</title> + + <conbody> + + <p> + Partitioning is typically appropriate for: + </p> + + <ul> + <li> + Tables that are very large, where reading the entire data set takes an impractical amount of time. + </li> + + <li> + Tables that are always or almost always queried with conditions on the partitioning columns. In our example of a table partitioned + by year, <codeph>SELECT COUNT(*) FROM school_records WHERE year = 1985</codeph> is efficient, only examining a small fraction of + the data; but <codeph>SELECT COUNT(*) FROM school_records</codeph> has to process a separate data file for each year, resulting in + more overall work than in an unpartitioned table. 
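+        For example, here is a minimal sketch of the <codeph>school_records</codeph> layout described above (the
+        non-partition columns are hypothetical, shown only for illustration):
+<codeblock>-- Hypothetical columns; only the YEAR partition key comes from the example above.
+create table school_records (student_id bigint, last_name string, grade tinyint)
+  partitioned by (year smallint);
+
+-- Reads only the data directory for YEAR=1985.
+select count(*) from school_records where year = 1985;
+
+-- Must read the data files from every year's directory.
+select count(*) from school_records;
+</codeblock>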
You would probably not partition this way if you frequently queried the table + based on last name, student ID, and so on without testing the year. + </li> + + <li> + Columns that have reasonable cardinality (number of different values). If a column only has a small number of values, for example + <codeph>Male</codeph> or <codeph>Female</codeph>, you do not gain much efficiency by eliminating only about 50% of the data to + read for each query. If a column has only a few rows matching each value, the number of directories to process can become a + limiting factor, and the data file in each directory could be too small to take advantage of the Hadoop mechanism for transmitting + data in multi-megabyte blocks. For example, you might partition census data by year, store sales data by year and month, and web + traffic data by year, month, and day. (Some users with high volumes of incoming data might even partition down to the individual + hour and minute.) + </li> + + <li> + Data that already passes through an extract, transform, and load (ETL) pipeline. The values of the partitioning columns are + stripped from the original data files and represented by directory names, so loading data into a partitioned table involves some + sort of transformation or preprocessing. + </li> + </ul> + + </conbody> + + </concept> + + <concept id="partition_sql"> + + <title>SQL Statements for Partitioned Tables</title> + + <conbody> + + <p> + In terms of Impala SQL syntax, partitioning affects these statements: + </p> + + <ul> + <li> + <codeph><xref href="impala_create_table.xml#create_table">CREATE TABLE</xref></codeph>: you specify a <codeph>PARTITIONED + BY</codeph> clause when creating the table to identify names and data types of the partitioning columns. These columns are not + included in the main list of columns for the table. + </li> + + <li rev="2.5.0"> + In <keyword keyref="impala25_full"/> and higher, you can also use the <codeph>PARTITIONED BY</codeph> clause in a <codeph>CREATE TABLE AS + SELECT</codeph> statement. This syntax lets you use a single statement to create a partitioned table, copy data into it, and + create new partitions based on the values in the inserted data. + </li> + + <li> + <codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE</xref></codeph>: you can add or drop partitions, to work with + different portions of a huge data set. You can designate the HDFS directory that holds the data files for a specific partition. + With data partitioned by date values, you might <q>age out</q> data that is no longer relevant. + <note conref="../shared/impala_common.xml#common/add_partition_set_location"/> + </li> + + <li> + <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph>: When you insert data into a partitioned table, you identify + the partitioning columns. One or more values from each inserted row are not stored in data files, but instead determine the + directory where that row value is stored. You can also specify which partition to load a set of data into, with <codeph>INSERT + OVERWRITE</codeph> statements; you can replace the contents of a specific partition but you cannot append data to a specific + partition. 
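+        For example, a minimal sketch with hypothetical table and column names:
+<codeblock>-- Replace (rather than append to) whatever data is currently in the YEAR=2010 partition.
+insert overwrite sales partition (year=2010)
+  select id, amount from staging_sales where source_year = 2010;
+</codeblock>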
+ <p rev="1.3.1" conref="../shared/impala_common.xml#common/insert_inherit_permissions"/> + </li> + + <li> + Although the syntax of the <codeph><xref href="impala_select.xml#select">SELECT</xref></codeph> statement is the same whether or + not the table is partitioned, the way queries interact with partitioned tables can have a dramatic impact on performance and + scalability. The mechanism that lets queries skip certain partitions during a query is known as partition pruning; see + <xref href="impala_partitioning.xml#partition_pruning"/> for details. + </li> + + <li rev="1.4.0"> + In Impala 1.4 and later, there is a <codeph>SHOW PARTITIONS</codeph> statement that displays information about each partition in a + table. See <xref href="impala_show.xml#show"/> for details. + </li> + </ul> + + </conbody> + + </concept> + + <concept id="partition_static_dynamic"> + + <title>Static and Dynamic Partitioning Clauses</title> + + <conbody> + + <p> + Specifying all the partition columns in a SQL statement is called <term>static partitioning</term>, because the statement affects a + single predictable partition. For example, you use static partitioning with an <codeph>ALTER TABLE</codeph> statement that affects + only one partition, or with an <codeph>INSERT</codeph> statement that inserts all values into the same partition: + </p> + +<codeblock>insert into t1 <b>partition(x=10, y='a')</b> select c1 from some_other_table; +</codeblock> + + <p> + When you specify some partition key columns in an <codeph>INSERT</codeph> statement, but leave out the values, Impala determines + which partition to insert. This technique is called <term>dynamic partitioning</term>: + </p> + +<codeblock>insert into t1 <b>partition(x, y='b')</b> select c1, c2 from some_other_table; +-- Create new partition if necessary based on variable year, month, and day; insert a single value. +insert into weather <b>partition (year, month, day)</b> select 'cloudy',2014,4,21; +-- Create new partition if necessary for specified year and month but variable day; insert a single value. +insert into weather <b>partition (year=2014, month=04, day)</b> select 'sunny',22; +</codeblock> + + <p> + The more key columns you specify in the <codeph>PARTITION</codeph> clause, the fewer columns you need in the <codeph>SELECT</codeph> + list. The trailing columns in the <codeph>SELECT</codeph> list are substituted in order for the partition key columns with no + specified value. + </p> + + </conbody> + + </concept> + + <concept id="partition_refresh" rev="2.7.0 IMPALA-1683 CDH-43732"> + + <title>Refreshing a Single Partition</title> + + <conbody> + + <p> + The <codeph>REFRESH</codeph> statement is typically used with partitioned tables when new data files are loaded into a partition by + some non-Impala mechanism, such as a Hive or Spark job. The <codeph>REFRESH</codeph> statement makes Impala aware of the new data + files so that they can be used in Impala queries. Because partitioned tables typically contain a high volume of data, the + <codeph>REFRESH</codeph> operation for a full partitioned table can take significant time. + </p> + + <p> + In <keyword keyref="impala27_full"/> and higher, you can include a <codeph>PARTITION (<varname>partition_spec</varname>)</codeph> clause in the + <codeph>REFRESH</codeph> statement so that only a single partition is refreshed. For example, <codeph>REFRESH big_table PARTITION + (year=2017, month=9, day=30)</codeph>. The partition spec must include all the partition key columns. 
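+        For example, a minimal sketch that contrasts the two forms, reusing the <codeph>big_table</codeph> example above:
+<codeblock>-- Rescans the data files for every partition in the table.
+refresh big_table;
+
+-- Rescans the data files for a single partition; all three partition key columns must be listed.
+refresh big_table partition (year=2017, month=9, day=30);
+</codeblock>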
See + <xref href="impala_refresh.xml#refresh"/> for more details and examples of <codeph>REFRESH</codeph> syntax and usage. + </p> + + </conbody> + + </concept> + + <concept id="partition_permissions"> + + <title>Permissions for Partition Subdirectories</title> + + <conbody> + + <p rev="1.3.1" conref="../shared/impala_common.xml#common/insert_inherit_permissions"/> + + </conbody> + + </concept> + + <concept id="partition_pruning"> + + <title>Partition Pruning for Queries</title> + + <conbody> + + <p> + Partition pruning refers to the mechanism where a query can skip reading the data files corresponding to one or more partitions. If + you can arrange for queries to prune large numbers of unnecessary partitions from the query execution plan, the queries use fewer + resources and are thus proportionally faster and more scalable. + </p> + + <p> + For example, if a table is partitioned by columns <codeph>YEAR</codeph>, <codeph>MONTH</codeph>, and <codeph>DAY</codeph>, then + <codeph>WHERE</codeph> clauses such as <codeph>WHERE year = 2013</codeph>, <codeph>WHERE year < 2010</codeph>, or <codeph>WHERE + year BETWEEN 1995 AND 1998</codeph> allow Impala to skip the data files in all partitions outside the specified range. Likewise, + <codeph>WHERE year = 2013 AND month BETWEEN 1 AND 3</codeph> could prune even more partitions, reading the data files for only a + portion of one year. + </p> + + <p outputclass="toc inpage"/> + + </conbody> + + <concept id="partition_pruning_checking"> + + <title>Checking if Partition Pruning Happens for a Query</title> + + <conbody> + + <p> + To check the effectiveness of partition pruning for a query, check the <codeph>EXPLAIN</codeph> output for the query before + running it. For example, this example shows a table with 3 partitions, where the query only reads 1 of them. The notation + <codeph>#partitions=1/3</codeph> in the <codeph>EXPLAIN</codeph> plan confirms that Impala can do the appropriate partition + pruning. + </p> + +<codeblock>[localhost:21000] > insert into census partition (year=2010) values ('Smith'),('Jones'); +[localhost:21000] > insert into census partition (year=2011) values ('Smith'),('Jones'),('Doe'); +[localhost:21000] > insert into census partition (year=2012) values ('Smith'),('Doe'); +[localhost:21000] > select name from census where year=2010; ++-------+ +| name | ++-------+ +| Smith | +| Jones | ++-------+ +[localhost:21000] > explain select name from census <b>where year=2010</b>; ++------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| PARTITION: UNPARTITIONED | +| | +| 1:EXCHANGE | +| | +| PLAN FRAGMENT 1 | +| PARTITION: RANDOM | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 1 | +| UNPARTITIONED | +| | +| 0:SCAN HDFS | +| table=predicate_propagation.census <b>#partitions=1/3</b> size=12B | ++------------------------------------------------------------------+</codeblock> + + <p rev="1.4.0"> + For a report of the volume of data that was actually read and processed at each stage of the query, check the output of the + <codeph>SUMMARY</codeph> command immediately after running the query. For a more detailed analysis, look at the output of the + <codeph>PROFILE</codeph> command; it includes this same summary report near the start of the profile output. 
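+        For example, a minimal sketch of such a session in <cmdname>impala-shell</cmdname> (output omitted):
+<codeblock>[localhost:21000] > select name from census where year=2010;
+[localhost:21000] > summary;
+[localhost:21000] > profile;
+</codeblock>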
+ </p> + + </conbody> + + </concept> + + <concept id="partition_pruning_sql"> + + <title>What SQL Constructs Work with Partition Pruning</title> + + <conbody> + + <p rev="1.2.2"> + <indexterm audience="Cloudera">predicate propagation</indexterm> + Impala can even do partition pruning in cases where the partition key column is not directly compared to a constant, by applying + the transitive property to other parts of the <codeph>WHERE</codeph> clause. This technique is known as predicate propagation, and + is available in Impala 1.2.2 and later. In this example, the census table includes another column indicating when the data was + collected, which happens in 10-year intervals. Even though the query does not compare the partition key column + (<codeph>YEAR</codeph>) to a constant value, Impala can deduce that only the partition <codeph>YEAR=2010</codeph> is required, and + again only reads 1 out of 3 partitions. + </p> + +<codeblock rev="1.2.2">[localhost:21000] > drop table census; +[localhost:21000] > create table census (name string, census_year int) partitioned by (year int); +[localhost:21000] > insert into census partition (year=2010) values ('Smith',2010),('Jones',2010); +[localhost:21000] > insert into census partition (year=2011) values ('Smith',2020),('Jones',2020),('Doe',2020); +[localhost:21000] > insert into census partition (year=2012) values ('Smith',2020),('Doe',2020); +[localhost:21000] > select name from census where year = census_year and census_year=2010; ++-------+ +| name | ++-------+ +| Smith | +| Jones | ++-------+ +[localhost:21000] > explain select name from census <b>where year = census_year and census_year=2010</b>; ++------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| PARTITION: UNPARTITIONED | +| | +| 1:EXCHANGE | +| | +| PLAN FRAGMENT 1 | +| PARTITION: RANDOM | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 1 | +| UNPARTITIONED | +| | +| 0:SCAN HDFS | +| table=predicate_propagation.census <b>#partitions=1/3</b> size=22B | +| predicates: census_year = 2010, year = census_year | ++------------------------------------------------------------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/partitions_and_views"/> + + <p conref="../shared/impala_common.xml#common/analytic_partition_pruning_caveat"/> + + </conbody> + + </concept> + + <concept id="dynamic_partition_pruning"> + + <title>Dynamic Partition Pruning</title> + + <conbody> + + <p> + The original mechanism uses to prune partitions is <term>static partition pruning</term>, in which the conditions in the + <codeph>WHERE</codeph> clause are analyzed to determine in advance which partitions can be safely skipped. In Impala 2.5 / CDH 5.7 + and higher, Impala can perform <term>dynamic partition pruning</term>, where information about the partitions is collected during + the query, and Impala prunes unnecessary partitions in ways that were impractical to predict in advance. + </p> + + <p> + For example, if partition key columns are compared to literal values in a <codeph>WHERE</codeph> clause, Impala can perform static + partition pruning during the planning phase to only read the relevant partitions: + </p> + +<codeblock> +-- The query only needs to read 3 partitions whose key values are known ahead of time. +-- That's static partition pruning. 
+SELECT COUNT(*) FROM sales_table WHERE year IN (2005, 2010, 2015); +</codeblock> + + <p> + Dynamic partition pruning involves using information only available at run time, such as the result of a subquery: + </p> + +<codeblock conref="../shared/impala_common.xml#common/simple_dpp_example"/> + +<!-- Former example. Not sure it really would trigger DPP. SELECT COUNT(*) FROM sales_table WHERE year = (SELECT MAX(year) FROM some_other_table); --> + + <p> + In this case, Impala evaluates the subquery, sends the subquery results to all Impala nodes participating in the query, and then + each <cmdname>impalad</cmdname> daemon uses the dynamic partition pruning optimization to read only the partitions with the + relevant key values. + </p> + + <p> + Dynamic partition pruning is especially effective for queries involving joins of several large partitioned tables. Evaluating the + <codeph>ON</codeph> clauses of the join predicates might normally require reading data from all partitions of certain tables. If + the <codeph>WHERE</codeph> clauses of the query refer to the partition key columns, Impala can now often skip reading many of the + partitions while evaluating the <codeph>ON</codeph> clauses. The dynamic partition pruning optimization reduces the amount of I/O + and the amount of intermediate data stored and transmitted across the network during the query. + </p> + + <p conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/> + + <p> + Dynamic partition pruning is part of the runtime filtering feature, which applies to other kinds of queries in addition to queries + against partitioned tables. See <xref href="impala_runtime_filtering.xml#runtime_filtering"/> for full details about this feature. + </p> + + </conbody> + + </concept> + + </concept> + + <concept id="partition_key_columns"> + + <title>Partition Key Columns</title> + + <conbody> + + <p> + The columns you choose as the partition keys should be ones that are frequently used to filter query results in important, + large-scale queries. Popular examples are some combination of year, month, and day when the data has associated time values, and + geographic region when the data is associated with some place. + </p> + + <ul> + <li> + <p> + For time-based data, split out the separate parts into their own columns, because Impala cannot partition based on a + <codeph>TIMESTAMP</codeph> column. + </p> + </li> + + <li> + <p> + The data type of the partition columns does not have a significant effect on the storage required, because the values from those + columns are not stored in the data files, rather they are represented as strings inside HDFS directory names. + </p> + </li> + + <li rev="IMPALA-2499"> + <p> + In <keyword keyref="impala25_full"/> and higher, you can enable the <codeph>OPTIMIZE_PARTITION_KEY_SCANS</codeph> query option to speed up + queries that only refer to partition key columns, such as <codeph>SELECT MAX(year)</codeph>. This setting is not enabled by + default because the query behavior is slightly different if the table contains partition directories without actual data inside. + See <xref href="impala_optimize_partition_key_scans.xml#optimize_partition_key_scans"/> for details. + </p> + </li> + + <li> + <p conref="../shared/impala_common.xml#common/complex_types_partitioning"/> + </li> + + <li> + <p> + Remember that when Impala queries data stored in HDFS, it is most efficient to use multi-megabyte files to take advantage of the + HDFS block size. 
For Parquet tables, the block size (and ideal size of the data files) is <ph rev="parquet_block_size">256 MB in
+          Impala 2.0 and later</ph>. Therefore, avoid specifying too many partition key columns, which could result in individual
+          partitions containing only small amounts of data. For example, if you receive 1 GB of data per day, you might partition by year,
+          month, and day; while if you receive 5 GB of data per minute, you might partition by year, month, day, hour, and minute. If you
+          have data with a geographic component, you might partition based on postal code if you have many megabytes of data for each
+          postal code, but if not, you might partition by some larger region such as city, state, or country.
+        </p>
+      </li>
+    </ul>
+
+    <p conref="../shared/impala_common.xml#common/partition_key_optimization"/>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="mixed_format_partitions">
+
+    <title>Setting Different File Formats for Partitions</title>
+
+    <conbody>
+
+      <p>
+        Partitioned tables have the flexibility to use different file formats for different partitions. (For background information about
+        the different file formats Impala supports, see <xref href="impala_file_formats.xml#file_formats"/>.) For example, if you originally
+        received data in text format, then received new data in RCFile format, and eventually began receiving data in Parquet format, all
+        that data could reside in the same table for queries. You just need to ensure that the table is structured so that the data files
+        that use different file formats reside in separate partitions.
+      </p>
+
+      <p>
+        For example, here is how you might switch from text to Parquet data as you receive data for different years:
+      </p>
+
+<codeblock>[localhost:21000] > create table census (name string) partitioned by (year smallint);
+[localhost:21000] > alter table census add partition (year=2012); -- Text format;
+
+[localhost:21000] > alter table census add partition (year=2013); -- Text format switches to Parquet before data loaded;
+[localhost:21000] > alter table census partition (year=2013) set fileformat parquet;
+
+[localhost:21000] > insert into census partition (year=2012) values ('Smith'),('Jones'),('Lee'),('Singh');
+[localhost:21000] > insert into census partition (year=2013) values ('Flores'),('Bogomolov'),('Cooper'),('Appiah');</codeblock>
+
+      <p>
+        At this point, the HDFS directory for <codeph>year=2012</codeph> contains a text-format data file, while the HDFS directory for
+        <codeph>year=2013</codeph> contains a Parquet data file. As always, when loading non-trivial data, you would use <codeph>INSERT ...
+        SELECT</codeph> or <codeph>LOAD DATA</codeph> to import data in large batches, rather than <codeph>INSERT ... VALUES</codeph> which
+        produces small files that are inefficient for real-world queries.
+      </p>
+
+      <p>
+        For other file types that Impala cannot create natively, you can switch into Hive and issue the <codeph>ALTER TABLE ... SET
+        FILEFORMAT</codeph> statements and <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> statements there. After switching back to
+        Impala, issue a <codeph>REFRESH <varname>table_name</varname></codeph> statement so that Impala recognizes any partitions or new
+        data added through Hive.
+      </p>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="partition_management">
+
+    <title>Managing Partitions</title>
+
+    <conbody>
+
+      <p>
+        You can add, drop, set the expected file format, or set the HDFS location of the data files for individual partitions within an
+        Impala table.
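+        For example, a minimal sketch of these operations against the <codeph>census</codeph> table from the previous
+        section (the HDFS path is hypothetical):
+<codeblock>alter table census add partition (year=2014);
+alter table census partition (year=2014) set fileformat parquet;
+alter table census partition (year=2014) set location '/user/impala/census/year=2014';
+alter table census drop partition (year=2012);
+</codeblock>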
See <xref href="impala_alter_table.xml#alter_table"/> for syntax details, and + <xref href="impala_partitioning.xml#mixed_format_partitions"/> for tips on managing tables containing partitions with different file + formats. + </p> + + <note conref="../shared/impala_common.xml#common/add_partition_set_location"/> + + <p> + What happens to the data files when a partition is dropped depends on whether the partitioned table is designated as internal or + external. For an internal (managed) table, the data files are deleted. For example, if data in the partitioned table is a copy of + raw data files stored elsewhere, you might save disk space by dropping older partitions that are no longer required for reporting, + knowing that the original data is still available if needed later. For an external table, the data files are left alone. For + example, dropping a partition without deleting the associated files lets Impala consider a smaller set of partitions, improving + query efficiency and reducing overhead for DDL operations on the table; if the data is needed again later, you can add the partition + again. See <xref href="impala_tables.xml#tables" /> for details and examples. + </p> + + </conbody> + + </concept> + + <concept rev="kudu" id="partition_kudu" audience="Cloudera"> + + <title>Using Partitioning with Kudu Tables</title> + + <prolog> + <metadata> + <data name="Category" value="Kudu"/> + </metadata> + </prolog> + + <conbody> + + <p> + Kudu tables use a more fine-grained partitioning scheme than tables containing HDFS data files. You specify a <codeph>DISTRIBUTE + BY</codeph> clause with the <codeph>CREATE TABLE</codeph> statement to identify how to divide the values from the partition key + columns. + </p> + + </conbody> + + </concept> + +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_perf_benchmarking.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_perf_benchmarking.xml b/docs/topics/impala_perf_benchmarking.xml new file mode 100644 index 0000000..b2e058d --- /dev/null +++ b/docs/topics/impala_perf_benchmarking.xml @@ -0,0 +1,36 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="perf_benchmarks"> + + <title>Benchmarking Impala Queries</title> + <titlealts audience="PDF"><navtitle>Benchmarking</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Proof of Concept"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + Because Impala, like other Hadoop components, is designed to handle large data volumes in a distributed + environment, conduct any performance tests using realistic data and cluster configurations. Use a multi-node + cluster rather than a single node; run queries against tables containing terabytes of data rather than tens + of gigabytes. The parallel processing techniques used by Impala are most appropriate for workloads that are + beyond the capacity of a single server. + </p> + + <p> + When you run queries returning large numbers of rows, the CPU time to pretty-print the output can be + substantial, giving an inaccurate measurement of the actual query time. Consider using the + <codeph>-B</codeph> option on the <codeph>impala-shell</codeph> command to turn off the pretty-printing, and + optionally the <codeph>-o</codeph> option to store query results in a file rather than printing to the + screen. See <xref href="impala_shell_options.xml#shell_options"/> for details. + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_perf_cookbook.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_perf_cookbook.xml b/docs/topics/impala_perf_cookbook.xml new file mode 100644 index 0000000..a42f7c9 --- /dev/null +++ b/docs/topics/impala_perf_cookbook.xml @@ -0,0 +1,269 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="perf_cookbook"> + + <title>Impala Performance Guidelines and Best Practices</title> + <titlealts audience="PDF"><navtitle>Performance Best Practices</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Planning"/> + <data name="Category" value="Proof of Concept"/> + <data name="Category" value="Guidelines"/> + <data name="Category" value="Best Practices"/> + <data name="Category" value="Proof of Concept"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + Here are performance guidelines and best practices that you can use during planning, experimentation, and + performance tuning for an Impala-enabled CDH cluster. 
All of this information is also available in more + detail elsewhere in the Impala documentation; it is gathered together here to serve as a cookbook and + emphasize which performance techniques typically provide the highest return on investment + </p> + + <p outputclass="toc inpage"/> + + <section id="perf_cookbook_file_format"> + + <title>Choose the appropriate file format for the data.</title> + + <p> + Typically, for large volumes of data (multiple gigabytes per table or partition), the Parquet file format + performs best because of its combination of columnar storage layout, large I/O request size, and + compression and encoding. See <xref href="impala_file_formats.xml#file_formats"/> for comparisons of all + file formats supported by Impala, and <xref href="impala_parquet.xml#parquet"/> for details about the + Parquet file format. + </p> + + <note> + For smaller volumes of data, a few gigabytes or less for each table or partition, you might not see + significant performance differences between file formats. At small data volumes, reduced I/O from an + efficient compressed file format can be counterbalanced by reduced opportunity for parallel execution. When + planning for a production deployment or conducting benchmarks, always use realistic data volumes to get a + true picture of performance and scalability. + </note> + </section> + + <section id="perf_cookbook_small_files"> + + <title>Avoid data ingestion processes that produce many small files.</title> + + <p> + When producing data files outside of Impala, prefer either text format or Avro, where you can build up the + files row by row. Once the data is in Impala, you can convert it to the more efficient Parquet format and + split into multiple data files using a single <codeph>INSERT ... SELECT</codeph> statement. Or, if you have + the infrastructure to produce multi-megabyte Parquet files as part of your data preparation process, do + that and skip the conversion step inside Impala. + </p> + + <p> + Always use <codeph>INSERT ... SELECT</codeph> to copy significant volumes of data from table to table + within Impala. Avoid <codeph>INSERT ... VALUES</codeph> for any substantial volume of data or + performance-critical tables, because each such statement produces a separate tiny data file. See + <xref href="impala_insert.xml#insert"/> for examples of the <codeph>INSERT ... SELECT</codeph> syntax. + </p> + + <p> + For example, if you have thousands of partitions in a Parquet table, each with less than + <ph rev="parquet_block_size">256 MB</ph> of data, consider partitioning in a less granular way, such as by + year / month rather than year / month / day. If an inefficient data ingestion process produces thousands of + data files in the same table or partition, consider compacting the data by performing an <codeph>INSERT ... + SELECT</codeph> to copy all the data to a different table; the data will be reorganized into a smaller + number of larger files by this process. + </p> + </section> + + <section id="perf_cookbook_partitioning"> + + <title>Choose partitioning granularity based on actual data volume.</title> + + <p> + Partitioning is a technique that physically divides the data based on values of one or more columns, such + as by year, month, day, region, city, section of a web site, and so on. When you issue queries that request + a specific value or range of values for the partition key columns, Impala can avoid reading the irrelevant + data, potentially yielding a huge savings in disk I/O. 
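+        For example, a minimal sketch with a hypothetical web-traffic table:
+<codeblock>-- Partitioned by year and month, so each partition holds roughly a month of data.
+create table web_traffic (ip string, url string, ts timestamp)
+  partitioned by (year smallint, month tinyint) stored as parquet;
+
+-- Reads only the three partitions for the first quarter of 2016.
+select count(*) from web_traffic where year = 2016 and month between 1 and 3;
+</codeblock>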
+ </p> + + <p> + When deciding which column(s) to use for partitioning, choose the right level of granularity. For example, + should you partition by year, month, and day, or only by year and month? Choose a partitioning strategy + that puts at least <ph rev="parquet_block_size">256 MB</ph> of data in each partition, to take advantage of + HDFS bulk I/O and Impala distributed queries. + </p> + + <p> + Over-partitioning can also cause query planning to take longer than necessary, as Impala prunes the + unnecessary partitions. Ideally, keep the number of partitions in the table under 30 thousand. + </p> + + <p> + When preparing data files to go in a partition directory, create several large files rather than many small + ones. If you receive data in the form of many small files and have no control over the input format, + consider using the <codeph>INSERT ... SELECT</codeph> syntax to copy data from one table or partition to + another, which compacts the files into a relatively small number (based on the number of nodes in the + cluster). + </p> + + <p> + If you need to reduce the overall number of partitions and increase the amount of data in each partition, + first look for partition key columns that are rarely referenced or are referenced in non-critical queries + (not subject to an SLA). For example, your web site log data might be partitioned by year, month, day, and + hour, but if most queries roll up the results by day, perhaps you only need to partition by year, month, + and day. + </p> + + <p> + If you need to reduce the granularity even more, consider creating <q>buckets</q>, computed values + corresponding to different sets of partition key values. For example, you can use the + <codeph>TRUNC()</codeph> function with a <codeph>TIMESTAMP</codeph> column to group date and time values + based on intervals such as week or quarter. See + <xref href="impala_datetime_functions.xml#datetime_functions"/> for details. + </p> + + <p> + See <xref href="impala_partitioning.xml#partitioning"/> for full details and performance considerations for + partitioning. + </p> + </section> + + <section id="perf_cookbook_partition_keys"> + + <title>Use smallest appropriate integer types for partition key columns.</title> + + <p> + Although it is tempting to use strings for partition key columns, since those values are turned into HDFS + directory names anyway, you can minimize memory usage by using numeric values for common partition key + fields such as <codeph>YEAR</codeph>, <codeph>MONTH</codeph>, and <codeph>DAY</codeph>. Use the smallest + integer type that holds the appropriate range of values, typically <codeph>TINYINT</codeph> for + <codeph>MONTH</codeph> and <codeph>DAY</codeph>, and <codeph>SMALLINT</codeph> for <codeph>YEAR</codeph>. + Use the <codeph>EXTRACT()</codeph> function to pull out individual date and time fields from a + <codeph>TIMESTAMP</codeph> value, and <codeph>CAST()</codeph> the return value to the appropriate integer + type. + </p> + </section> + + <section id="perf_cookbook_parquet_block_size"> + + <title>Choose an appropriate Parquet block size.</title> + + <p rev="parquet_block_size"> + By default, the Impala <codeph>INSERT ... SELECT</codeph> statement creates Parquet files with a 256 MB + block size. (This default was changed in Impala 2.0. Formerly, the limit was 1 GB, but Impala made + conservative estimates about compression, resulting in files that were smaller than 1 GB.) 
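+        For example, a minimal sketch of overriding that default for one session; the
+        <codeph>PARQUET_FILE_SIZE</codeph> query option is discussed later in this section, and the table names are
+        hypothetical:
+<codeblock>-- Default behavior: Parquet files up to 256 MB each.
+insert into parquet_sales select * from text_sales;
+
+-- Produce smaller files, for example so a modest data volume spreads across more hosts.
+set parquet_file_size=128m;
+insert into parquet_sales select * from text_sales_2016;
+</codeblock>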
+ </p> + + <p> + Each Parquet file written by Impala is a single block, allowing the whole file to be processed as a unit by a single host. + As you copy Parquet files into HDFS or between HDFS filesystems, use <codeph>hdfs dfs -pb</codeph> to preserve the original + block size. + </p> + + <p> + If there is only one or a few data block in your Parquet table, or in a partition that is the only one + accessed by a query, then you might experience a slowdown for a different reason: not enough data to take + advantage of Impala's parallel distributed queries. Each data block is processed by a single core on one of + the DataNodes. In a 100-node cluster of 16-core machines, you could potentially process thousands of data + files simultaneously. You want to find a sweet spot between <q>many tiny files</q> and <q>single giant + file</q> that balances bulk I/O and parallel processing. You can set the <codeph>PARQUET_FILE_SIZE</codeph> + query option before doing an <codeph>INSERT ... SELECT</codeph> statement to reduce the size of each + generated Parquet file. <ph rev="2.0.0">(Specify the file size as an absolute number of bytes, or in Impala + 2.0 and later, in units ending with <codeph>m</codeph> for megabytes or <codeph>g</codeph> for + gigabytes.)</ph> Run benchmarks with different file sizes to find the right balance point for your + particular data volume. + </p> + </section> + + <section id="perf_cookbook_stats"> + + <title>Gather statistics for all tables used in performance-critical or high-volume join queries.</title> + + <p> + Gather the statistics with the <codeph>COMPUTE STATS</codeph> statement. See + <xref href="impala_perf_joins.xml#perf_joins"/> for details. + </p> + </section> + + <section id="perf_cookbook_network"> + + <title>Minimize the overhead of transmitting results back to the client.</title> + + <p> + Use techniques such as: + </p> + + <ul> + <li> + Aggregation. If you need to know how many rows match a condition, the total values of matching values + from some column, the lowest or highest matching value, and so on, call aggregate functions such as + <codeph>COUNT()</codeph>, <codeph>SUM()</codeph>, and <codeph>MAX()</codeph> in the query rather than + sending the result set to an application and doing those computations there. Remember that the size of an + unaggregated result set could be huge, requiring substantial time to transmit across the network. + </li> + + <li> + Filtering. Use all applicable tests in the <codeph>WHERE</codeph> clause of a query to eliminate rows + that are not relevant, rather than producing a big result set and filtering it using application logic. + </li> + + <li> + <codeph>LIMIT</codeph> clause. If you only need to see a few sample values from a result set, or the top + or bottom values from a query using <codeph>ORDER BY</codeph>, include the <codeph>LIMIT</codeph> clause + to reduce the size of the result set rather than asking for the full result set and then throwing most of + the rows away. + </li> + + <li> + Avoid overhead from pretty-printing the result set and displaying it on the screen. When you retrieve the + results through <cmdname>impala-shell</cmdname>, use <cmdname>impala-shell</cmdname> options such as + <codeph>-B</codeph> and <codeph>--output_delimiter</codeph> to produce results without special + formatting, and redirect output to a file rather than printing to the screen. Consider using + <codeph>INSERT ... SELECT</codeph> to write the results directly to new files in HDFS. 
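+        For example, a minimal sketch of such an invocation (the query and output file are hypothetical):
+<codeblock>impala-shell -B --output_delimiter=',' -o /tmp/top_customers.csv \
+  -q 'select customer_id, sum(amount) from sales group by customer_id order by 2 desc limit 100'
+</codeblock>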
See + <xref href="impala_shell_options.xml#shell_options"/> for details about the + <cmdname>impala-shell</cmdname> command-line options. + </li> + </ul> + </section> + + <section id="perf_cookbook_explain"> + + <title>Verify that your queries are planned in an efficient logical manner.</title> + + <p> + Examine the <codeph>EXPLAIN</codeph> plan for a query before actually running it. See + <xref href="impala_explain.xml#explain"/> and <xref href="impala_explain_plan.xml#perf_explain"/> for + details. + </p> + </section> + + <section id="perf_cookbook_profile"> + + <title>Verify performance characteristics of queries.</title> + + <p> + Verify that the low-level aspects of I/O, memory usage, network bandwidth, CPU utilization, and so on are + within expected ranges by examining the query profile for a query after running it. See + <xref href="impala_explain_plan.xml#perf_profile"/> for details. + </p> + </section> + + <section id="perf_cookbook_os"> + + <title>Use appropriate operating system settings.</title> + + <p> + See <xref href="http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_admin_performance.html" scope="external" format="html">Optimizing Performance in CDH</xref> + for recommendations about operating system + settings that you can change to influence Impala performance. In particular, you might find + that changing the <codeph>vm.swappiness</codeph> Linux kernel setting to a non-zero value improves + overall performance. + </p> + </section> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_perf_ddl.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_perf_ddl.xml b/docs/topics/impala_perf_ddl.xml new file mode 100644 index 0000000..d075cd2 --- /dev/null +++ b/docs/topics/impala_perf_ddl.xml @@ -0,0 +1,42 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="perf_ddl"> + + <title>Performance Considerations for DDL Statements</title> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + <data name="Category" value="Impala"/> + <data name="Category" value="DDL"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + These tips and guidelines apply to the Impala DDL statements, which are listed in + <xref href="impala_ddl.xml#ddl"/>. + </p> + + <p> + Because Impala DDL statements operate on the metastore database, the performance considerations for those + statements are totally different than for distributed queries that operate on HDFS + <ph rev="2.2.0">or S3</ph> data files, or on HBase tables. + </p> + + <p> + Each DDL statement makes a relatively small update to the metastore database. The overhead for each statement + is proportional to the overall number of Impala and Hive tables, and (for a partitioned table) to the overall + number of partitions in that table. Issuing large numbers of DDL statements (such as one for each table or + one for each partition) also has the potential to encounter a bottleneck with access to the metastore + database. Therefore, for efficient DDL, try to design your application logic and ETL pipeline to avoid a huge + number of tables and a huge number of partitions within each table. In this context, <q>huge</q> is in the + range of tens of thousands or hundreds of thousands. 
+ </p> + + <note conref="../shared/impala_common.xml#common/add_partition_set_location"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_perf_hdfs_caching.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_perf_hdfs_caching.xml b/docs/topics/impala_perf_hdfs_caching.xml new file mode 100644 index 0000000..883bc8d --- /dev/null +++ b/docs/topics/impala_perf_hdfs_caching.xml @@ -0,0 +1,609 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.4" id="hdfs_caching"> + + <title>Using HDFS Caching with Impala (<keyword keyref="impala21"/> or higher only)</title> + <titlealts audience="PDF"><navtitle>HDFS Caching</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Scalability"/> + <data name="Category" value="HDFS"/> + <data name="Category" value="HDFS Caching"/> + <data name="Category" value="Memory"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + HDFS caching provides performance and scalability benefits in production environments where Impala queries + and other Hadoop jobs operate on quantities of data much larger than the physical RAM on the DataNodes, + making it impractical to rely on the Linux OS cache, which only keeps the most recently used data in memory. + Data read from the HDFS cache avoids the overhead of checksumming and memory-to-memory copying involved when + using data from the Linux OS cache. + </p> + + <note> + <p> + On a small or lightly loaded cluster, HDFS caching might not produce any speedup. It might even lead to + slower queries, if I/O read operations that were performed in parallel across the entire cluster are replaced by in-memory + operations operating on a smaller number of hosts. The hosts where the HDFS blocks are cached can become + bottlenecks because they experience high CPU load while processing the cached data blocks, while other hosts remain idle. + Therefore, always compare performance with and without this feature enabled, using a realistic workload. + </p> + <p rev="2.2.0"> + In <keyword keyref="impala22_full"/> and higher, you can spread the CPU load more evenly by specifying the <codeph>WITH REPLICATION</codeph> + clause of the <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. + This clause lets you control the replication factor for + HDFS caching for a specific table or partition. By default, each cached block is + only present on a single host, which can lead to CPU contention if the same host + processes each cached block. Increasing the replication factor lets Impala choose + different hosts to process different cached blocks, to better distribute the CPU load. + Always use a <codeph>WITH REPLICATION</codeph> setting of at least 3, and adjust upward + if necessary to match the replication factor for the underlying HDFS data files. + </p> + <p rev="2.5.0"> + In <keyword keyref="impala25_full"/> and higher, Impala automatically randomizes which host processes + a cached HDFS block, to avoid CPU hotspots. For tables where HDFS caching is not applied, + Impala designates which host to process a data block using an algorithm that estimates + the load on each host. 
If CPU hotspots still arise during queries, + you can enable additional randomization for the scheduling algorithm for non-HDFS cached data + by setting the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option. + </p> + </note> + + <p outputclass="toc inpage"/> + +<!-- Could conref this background link; haven't decided yet the best place or if it's needed twice. --> + + <p> + For background information about how to set up and manage HDFS caching for a CDH cluster, see +<!-- Original URL: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh_ig_hdfs_caching.html --> + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hdfs_caching.html" scope="external" format="html">the + CDH documentation</xref>. + </p> + </conbody> + + <concept id="hdfs_caching_overview"> + + <title>Overview of HDFS Caching for Impala</title> + <prolog> + <metadata> + <data name="Category" value="Concepts"/> + </metadata> + </prolog> + + <conbody> + + <p> + On <ph rev="upstream">CDH 5.1</ph> and higher, Impala can use the HDFS caching feature to make more effective use of RAM, so that + repeated queries can take advantage of data <q>pinned</q> in memory regardless of how much data is + processed overall. The HDFS caching feature lets you designate a subset of frequently accessed data to be + pinned permanently in memory, remaining in the cache across multiple queries and never being evicted. This + technique is suitable for tables or partitions that are frequently accessed and are small enough to fit + entirely within the HDFS memory cache. For example, you might designate several dimension tables to be + pinned in the cache, to speed up many different join queries that reference them. Or in a partitioned + table, you might pin a partition holding data from the most recent time period because that data will be + queried intensively; then when the next set of data arrives, you could unpin the previous partition and pin + the partition holding the new data. + </p> + + <p> + Because this Impala performance feature relies on HDFS infrastructure, it only applies to Impala tables + that use HDFS data files. HDFS caching for Impala does not apply to HBase tables, S3 tables, + Kudu tables, + or Isilon tables. + </p> + + </conbody> + </concept> + + <concept id="hdfs_caching_prereqs"> + + <title>Setting Up HDFS Caching for Impala</title> + + <conbody> + + <p> + To use HDFS caching with Impala, first set up that feature for your CDH cluster: + </p> + + <ul> + <li> + <p> + Decide how much memory to devote to the HDFS cache on each host. Remember that the total memory available + for cached data is the sum of the cache sizes on all the hosts. By default, any data block is only cached on one + host, although you can cache a block across multiple hosts by increasing the replication factor. + <!-- Obsoleted in Impala 2.2 and higher by IMPALA-1587. + Once a data block is cached on one host, all requests to process that block are routed to that same host.) + --> + </p> + </li> + + <li> + <p> + Issue <cmdname>hdfs cacheadmin</cmdname> commands to set up one or more cache pools, owned by the same + user as the <cmdname>impalad</cmdname> daemon (typically <codeph>impala</codeph>). 
For example: +<codeblock>hdfs cacheadmin -addPool four_gig_pool -owner impala -limit 4000000000 +</codeblock> + For details about the <cmdname>hdfs cacheadmin</cmdname> command, see +<!-- Original URL: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh_ig_hdfs_caching.html --> + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hdfs_caching.html" scope="external" format="html">the + CDH documentation</xref>. + </p> + </li> + </ul> + + <p> + Once HDFS caching is enabled and one or more pools are available, see + <xref href="impala_perf_hdfs_caching.xml#hdfs_caching_ddl"/> for how to choose which Impala data to load + into the HDFS cache. On the Impala side, you specify the cache pool name defined by the <codeph>hdfs + cacheadmin</codeph> command in the Impala DDL statements that enable HDFS caching for a table or partition, + such as <codeph>CREATE TABLE ... CACHED IN <varname>pool</varname></codeph> or <codeph>ALTER TABLE ... SET + CACHED IN <varname>pool</varname></codeph>. + </p> + </conbody> + </concept> + + <concept id="hdfs_caching_ddl"> + + <title>Enabling HDFS Caching for Impala Tables and Partitions</title> + + <conbody> + + <p> + Begin by choosing which tables or partitions to cache. For example, these might be lookup tables that are + accessed by many different join queries, or partitions corresponding to the most recent time period that + are analyzed by different reports or ad hoc queries. + </p> + + <p> + In your SQL statements, you specify logical divisions such as tables and partitions to be cached. Impala + translates these requests into HDFS-level directives that apply to particular directories and files. For + example, given a partitioned table <codeph>CENSUS</codeph> with a partition key column + <codeph>YEAR</codeph>, you could choose to cache all or part of the data as follows: + </p> + + <p conref="../shared/impala_common.xml#common/impala_cache_replication_factor"/> + +<codeblock>-- Cache the entire table (all partitions). +alter table census set cached in '<varname>pool_name</varname>'; + +-- Remove the entire table from the cache. +alter table census set uncached; + +-- Cache a portion of the table (a single partition). +-- If the table is partitioned by multiple columns (such as year, month, day), +-- the ALTER TABLE command must specify values for all those columns. +alter table census partition (year=1960) set cached in '<varname>pool_name</varname>'; + +<ph rev="2.2.0">-- Cache the data from one partition on up to 4 hosts, to minimize CPU load on any +-- single host when the same data block is processed multiple times. +alter table census partition (year=1970) + set cached in '<varname>pool_name</varname>' with replication = 4;</ph> + +-- At each stage, check the volume of cached data. +-- For large tables or partitions, the background loading might take some time, +-- so you might have to wait and reissue the statement until all the data +-- has finished being loaded into the cache. 
+show table stats census; ++-------+-------+--------+------+--------------+--------+ +| year | #Rows | #Files | Size | Bytes Cached | Format | ++-------+-------+--------+------+--------------+--------+ +| 1900 | -1 | 1 | 11B | NOT CACHED | TEXT | +| 1940 | -1 | 1 | 11B | NOT CACHED | TEXT | +| 1960 | -1 | 1 | 11B | 11B | TEXT | +| 1970 | -1 | 1 | 11B | NOT CACHED | TEXT | +| Total | -1 | 4 | 44B | 11B | | ++-------+-------+--------+------+--------------+--------+ +</codeblock> + + <p> + <b>CREATE TABLE considerations:</b> + </p> + + <p> + The HDFS caching feature affects the Impala <codeph>CREATE TABLE</codeph> statement as follows: + </p> + + <ul> + <li> + <p> + You can put a <codeph>CACHED IN '<varname>pool_name</varname>'</codeph> clause + <ph rev="2.2.0">and optionally a <codeph>WITH REPLICATION = <varname>number_of_hosts</varname></codeph> clause</ph> + at the end of a + <codeph>CREATE TABLE</codeph> statement to automatically cache the entire contents of the table, + including any partitions added later. The <varname>pool_name</varname> is a pool that you previously set + up with the <cmdname>hdfs cacheadmin</cmdname> command. + </p> + </li> + + <li> + <p> + Once a table is designated for HDFS caching through the <codeph>CREATE TABLE</codeph> statement, if new + partitions are added later through <codeph>ALTER TABLE ... ADD PARTITION</codeph> statements, the data in + those new partitions is automatically cached in the same pool. + </p> + </li> + + <li> + <p> + If you want to perform repetitive queries on a subset of data from a large table, and it is not practical + to designate the entire table or specific partitions for HDFS caching, you can create a new cached table + with just a subset of the data by using <codeph>CREATE TABLE ... CACHED IN '<varname>pool_name</varname>' + AS SELECT ... WHERE ...</codeph>. When you are finished with generating reports from this subset of data, + drop the table and both the data files and the data cached in RAM are automatically deleted. + </p> + </li> + </ul> + + <p> + See <xref href="impala_create_table.xml#create_table"/> for the full syntax. + </p> + + <p> + <b>Other memory considerations:</b> + </p> + + <p> + Certain DDL operations, such as <codeph>ALTER TABLE ... SET LOCATION</codeph>, are blocked while the + underlying HDFS directories contain cached files. You must uncache the files first, before changing the + location, dropping the table, and so on. + </p> + + <p> + When data is requested to be pinned in memory, that process happens in the background without blocking + access to the data while the caching is in progress. Loading the data from disk could take some time. + Impala reads each HDFS data block from memory if it has been pinned already, or from disk if it has not + been pinned yet. When files are added to a table or partition whose contents are cached, Impala + automatically detects those changes and performs a <codeph>REFRESH</codeph> automatically once the relevant + data is cached. + </p> + + <p> + The amount of data that you can pin on each node through the HDFS caching mechanism is subject to a quota + that is enforced by the underlying HDFS service. Before requesting to pin an Impala table or partition in + memory, check that its size does not exceed this quota. + </p> + + <note> + Because the HDFS cache consists of combined memory from all the DataNodes in the cluster, cached tables or + partitions can be bigger than the amount of HDFS cache memory on any single host. 
+ </note> + </conbody> + </concept> + + <concept id="hdfs_caching_etl"> + + <title>Loading and Removing Data with HDFS Caching Enabled</title> + <prolog> + <metadata> + <data name="Category" value="ETL"/> + </metadata> + </prolog> + + <conbody> + + <p> + When HDFS caching is enabled, extra processing happens in the background when you add or remove data + through statements such as <codeph>INSERT</codeph> and <codeph>DROP TABLE</codeph>. + </p> + + <p> + <b>Inserting or loading data:</b> + </p> + + <ul> + <li> + When Impala performs an <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph> or + <codeph><xref href="impala_load_data.xml#load_data">LOAD DATA</xref></codeph> statement for a table or + partition that is cached, the new data files are automatically cached and Impala recognizes that fact + automatically. + </li> + + <li> + If you perform an <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> through Hive, as always, Impala + only recognizes the new data files after a <codeph>REFRESH <varname>table_name</varname></codeph> + statement in Impala. + </li> + + <li> + If the cache pool is entirely full, or becomes full before all the requested data can be cached, the + Impala DDL statement returns an error. This is to avoid situations where only some of the requested data + could be cached. + </li> + + <li> + When HDFS caching is enabled for a table or partition, new data files are cached automatically when they + are added to the appropriate directory in HDFS, without the need for a <codeph>REFRESH</codeph> statement + in Impala. Impala automatically performs a <codeph>REFRESH</codeph> once the new data is loaded into the + HDFS cache. + </li> + </ul> + + <p> + <b>Dropping tables, partitions, or cache pools:</b> + </p> + + <p> + The HDFS caching feature interacts with the Impala + <codeph><xref href="impala_drop_table.xml#drop_table">DROP TABLE</xref></codeph> and + <codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE ... DROP PARTITION</xref></codeph> + statements as follows: + </p> + + <ul> + <li> + When you issue a <codeph>DROP TABLE</codeph> for a table that is entirely cached, or has some partitions + cached, the <codeph>DROP TABLE</codeph> succeeds and all the cache directives Impala submitted for that + table are removed from the HDFS cache system. + </li> + + <li> + The same applies to <codeph>ALTER TABLE ... DROP PARTITION</codeph>. The operation succeeds and any cache + directives are removed. + </li> + + <li> + As always, the underlying data files are removed if the dropped table is an internal table, or the + dropped partition is in its default location underneath an internal table. The data files are left alone + if the dropped table is an external table, or if the dropped partition is in a non-default location. + </li> + + <li> + If you designated the data files as cached through the <cmdname>hdfs cacheadmin</cmdname> command, and + the data files are left behind as described in the previous item, the data files remain cached. Impala + only removes the cache directives submitted by Impala through the <codeph>CREATE TABLE</codeph> or + <codeph>ALTER TABLE</codeph> statements. It is OK to have multiple redundant cache directives pertaining + to the same files; the directives all have unique IDs and owners so that the system can tell them apart. + </li> + + <li> + If you drop an HDFS cache pool through the <cmdname>hdfs cacheadmin</cmdname> command, all the Impala + data files are preserved, just no longer cached. 
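+        For example, a minimal sketch reusing the pool name from the setup example earlier in this topic:
+<codeblock>hdfs cacheadmin -removePool four_gig_pool
+</codeblock>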
After a subsequent <codeph>REFRESH</codeph>, + <codeph>SHOW TABLE STATS</codeph> reports 0 bytes cached for each associated Impala table or partition. + </li> + </ul> + + <p> + <b>Relocating a table or partition:</b> + </p> + + <p> + The HDFS caching feature interacts with the Impala + <codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE ... SET LOCATION</xref></codeph> + statement as follows: + </p> + + <ul> + <li> + If you have designated a table or partition as cached through the <codeph>CREATE TABLE</codeph> or + <codeph>ALTER TABLE</codeph> statements, subsequent attempts to relocate the table or partition through + an <codeph>ALTER TABLE ... SET LOCATION</codeph> statement will fail. You must issue an <codeph>ALTER + TABLE ... SET UNCACHED</codeph> statement for the table or partition first. Otherwise, Impala would lose + track of some cached data files and have no way to uncache them later. + </li> + </ul> + </conbody> + </concept> + + <concept id="hdfs_caching_admin"> + + <title>Administration for HDFS Caching with Impala</title> + + <conbody> + + <p> + Here are the guidelines and steps to check or change the status of HDFS caching for Impala data: + </p> + + <p> + <b>hdfs cacheadmin command:</b> + </p> + + <ul> + <li> + If you drop a cache pool with the <cmdname>hdfs cacheadmin</cmdname> command, Impala queries against the + associated data files will still work, by falling back to reading the files from disk. After performing a + <codeph>REFRESH</codeph> on the table, Impala reports the number of bytes cached as 0 for all associated + tables and partitions. + </li> + + <li> + You might use <cmdname>hdfs cacheadmin</cmdname> to get a list of existing cache pools, or detailed + information about the pools, as follows: +<codeblock scale="60">hdfs cacheadmin -listDirectives # Basic info +Found 122 entries + ID POOL REPL EXPIRY PATH + 123 testPool 1 never /user/hive/warehouse/tpcds.store_sales + 124 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-01-15 + 125 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-02-01 +... + +hdfs cacheadmin -listDirectives -stats # More details +Found 122 entries + ID POOL REPL EXPIRY PATH BYTES_NEEDED BYTES_CACHED FILES_NEEDED FILES_CACHED + 123 testPool 1 never /user/hive/warehouse/tpcds.store_sales 0 0 0 0 + 124 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-01-15 143169 143169 1 1 + 125 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-02-01 112447 112447 1 1 +... +</codeblock> + </li> + </ul> + + <p> + <b>Impala SHOW statement:</b> + </p> + + <ul> + <li> + For each table or partition, the <codeph>SHOW TABLE STATS</codeph> or <codeph>SHOW PARTITIONS</codeph> + statement displays the number of bytes currently cached by the HDFS caching feature. If there are no + cache directives in place for that table or partition, the result set displays <codeph>NOT + CACHED</codeph>. A value of 0, or a smaller number than the overall size of the table or partition, + indicates that the cache request has been submitted but the data has not been entirely loaded into memory + yet. See <xref href="impala_show.xml#show"/> for details. + </li> + </ul> + + <p> + <b>Cloudera Manager:</b> + </p> + + <ul> + <li> + You can enable or disable HDFS caching through Cloudera Manager, using the configuration setting + <uicontrol>Maximum Memory Used for Caching</uicontrol> for the HDFS service. 
This control sets the HDFS + configuration parameter <codeph>dfs_datanode_max_locked_memory</codeph>, which specifies the upper limit + of HDFS cache size on each node. + </li> + + <li> + All the other manipulation of the HDFS caching settings, such as what files are cached, is done through + the command line, either Impala DDL statements or the Linux <cmdname>hdfs cacheadmin</cmdname> command. + </li> + </ul> + + <p> + <b>Impala memory limits:</b> + </p> + + <p> + The Impala HDFS caching feature interacts with the Impala memory limits as follows: + </p> + + <ul> + <li> + The maximum size of each HDFS cache pool is specified externally to Impala, through the <cmdname>hdfs + cacheadmin</cmdname> command. + </li> + + <li> + All the memory used for HDFS caching is separate from the <cmdname>impalad</cmdname> daemon address space + and does not count towards the limits of the <codeph>--mem_limit</codeph> startup option, + <codeph>MEM_LIMIT</codeph> query option, or further limits imposed through YARN resource management or + the Linux <codeph>cgroups</codeph> mechanism. + </li> + + <li> + Because accessing HDFS cached data avoids a memory-to-memory copy operation, queries involving cached + data require less memory on the Impala side than the equivalent queries on uncached data. In addition to + any performance benefits in a single-user environment, the reduced memory helps to improve scalability + under high-concurrency workloads. + </li> + </ul> + </conbody> + </concept> + + <concept id="hdfs_caching_performance"> + + <title>Performance Considerations for HDFS Caching with Impala</title> + + <conbody> + + <p> + In Impala 1.4.0 and higher, Impala supports efficient reads from data that is pinned in memory through HDFS + caching. Impala takes advantage of the HDFS API and reads the data from memory rather than from disk + whether the data files are pinned using Impala DDL statements, or using the command-line mechanism where + you specify HDFS paths. + </p> + + <p> + When you examine the output of the <cmdname>impala-shell</cmdname> <cmdname>SUMMARY</cmdname> command, or + look in the metrics report for the <cmdname>impalad</cmdname> daemon, you see how many bytes are read from + the HDFS cache. For example, this excerpt from a query profile illustrates that all the data read during a + particular phase of the query came from the HDFS cache, because the <codeph>BytesRead</codeph> and + <codeph>BytesReadDataNodeCache</codeph> values are identical. + </p> + +<codeblock>HDFS_SCAN_NODE (id=0):(Total: 11s114ms, non-child: 11s114ms, % non-child: 100.00%) + - AverageHdfsReadThreadConcurrency: 0.00 + - AverageScannerThreadConcurrency: 32.75 +<b> - BytesRead: 10.47 GB (11240756479) + - BytesReadDataNodeCache: 10.47 GB (11240756479)</b> + - BytesReadLocal: 10.47 GB (11240756479) + - BytesReadShortCircuit: 10.47 GB (11240756479) + - DecompressionTime: 27s572ms +</codeblock> + + <p> + For queries involving smaller amounts of data, or in single-user workloads, you might not notice a + significant difference in query response time with or without HDFS caching. Even with HDFS caching turned + off, the data for the query might still be in the Linux OS buffer cache. The benefits become clearer as + data volume increases, and especially as the system processes more concurrent queries. HDFS caching + improves the scalability of the overall system. That is, it prevents query performance from declining when + the workload outstrips the capacity of the Linux OS cache. 
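+        Because response time alone might not show whether a query actually read from the HDFS cache, checking
+        the profile counters is often a more reliable indicator; for example, from <cmdname>impala-shell</cmdname>
+        (a sketch that reuses the <codeph>census</codeph> table from the earlier examples):
+<codeblock>-- Run the query, then inspect the profile of the most recent query.
+SELECT COUNT(*) FROM census;
+PROFILE;
+-- Look for BytesRead and BytesReadDataNodeCache under HDFS_SCAN_NODE.
+</codeblock>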
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/hdfs_caching_encryption_caveat"/>
+
+      <p>
+        <b>SELECT considerations:</b>
+      </p>
+
+      <p>
+        The Impala HDFS caching feature interacts with the
+        <codeph><xref href="impala_select.xml#select">SELECT</xref></codeph> statement and query performance as
+        follows:
+      </p>
+
+      <ul>
+        <li>
+          Impala automatically reads from memory any data that has been designated as cached and actually loaded
+          into the HDFS cache. (It could take some time after the initial request to fully populate the cache for a
+          table with large size or many partitions.) The speedup comes from two aspects: reading from RAM instead
+          of disk, and accessing the data straight from the cache area instead of copying from one RAM area to
+          another. This second aspect yields further performance improvement over the standard OS caching
+          mechanism, which still results in memory-to-memory copying of cached data.
+        </li>
+
+        <li>
+          For small amounts of data, the query speedup might not be noticeable in terms of wall clock time. The
+          performance might be roughly the same with HDFS caching turned on or off, due to recently used data being
+          held in the Linux OS cache. The difference is more pronounced with:
+          <ul>
+            <li>
+              Data volumes (for all queries running concurrently) that exceed the size of the Linux OS cache.
+            </li>
+
+            <li>
+              A busy cluster running many concurrent queries, where the reduction in memory-to-memory copying and
+              overall memory usage during queries results in greater scalability and throughput.
+            </li>
+          </ul>
+          <p>
+            Thus, to really exercise and benchmark this feature in a development environment, you might need to
+            simulate realistic workloads and concurrent queries that match your production environment.
+          </p>
+          <p>
+            One way to simulate a heavy workload on a lightly loaded system is to flush the OS buffer cache (on
+            each DataNode) between iterations of queries against the same tables or partitions:
+          </p>
+<codeblock>$ sync
+$ echo 1 > /proc/sys/vm/drop_caches
+</codeblock>
+        </li>
+
+        <li>
+          Impala queries take advantage of HDFS cached data regardless of whether the cache directive was issued by
+          Impala or externally through the <cmdname>hdfs cacheadmin</cmdname> command, for example for an external
+          table where the cached data files might be accessed by several different Hadoop components.
+        </li>
+
+        <li>
+          If your query returns a large result set, the time reported for the query could be dominated by the time
+          needed to print the results on the screen. To measure the time for the underlying query processing, query
+          the <codeph>COUNT()</codeph> of the big result set, which does all the same processing but only prints a
+          single line to the screen.
+        </li>
+      </ul>
+    </conbody>
+  </concept>
+</concept>
