http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_hbase.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_hbase.xml b/docs/topics/impala_hbase.xml new file mode 100644 index 0000000..8b8d013 --- /dev/null +++ b/docs/topics/impala_hbase.xml @@ -0,0 +1,904 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="impala_hbase"> + + <title id="hbase">Using Impala to Query HBase Tables</title> + <titlealts audience="PDF"><navtitle>HBase Tables</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="HBase"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Tables"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">HBase</indexterm> + You can use Impala to query HBase tables. This capability allows convenient access to a storage system that + is tuned for different kinds of workloads than the default with Impala. The default Impala tables use data + files stored on HDFS, which are ideal for bulk loads and queries using full-table scans. In contrast, HBase + can do efficient queries for data organized for OLTP-style workloads, with lookups of individual rows or + ranges of values. + </p> + + <p> + From the perspective of an Impala user, coming from an RDBMS background, HBase is a kind of key-value store + where the value consists of multiple fields. The key is mapped to one column in the Impala table, and the + various fields of the value are mapped to the other columns in the Impala table. 
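+    </p>
+
+    <p>
+      As an illustrative sketch (the table and column names here are hypothetical, not part of any
+      example in this topic), a single HBase row such as the following:
+    </p>
+
+<codeblock>row key 'alice' -> { info:age => '35', info:city => 'Sydney' }</codeblock>
+
+    <p>
+      would appear through a mapped Impala table as a single row, for example
+      <codeph>id='alice', age='35', city='Sydney'</codeph>: the row key becomes the
+      <codeph>id</codeph> column, and each field of the value becomes another column of the same row.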
+ </p> + + <p> + For background information on HBase, see the snapshot of the Apache HBase site (including documentation) for + the level of HBase that comes with + <xref href="https://archive.cloudera.com/cdh5/cdh/5/hbase/" scope="external" format="html">CDH 5</xref>. To + install HBase on a CDH cluster, see the installation instructions for + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hbase_installation.html" scope="external" format="html">CDH 5</xref>. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="hbase_using"> + + <title>Overview of Using HBase with Impala</title> + <prolog> + <metadata> + <data name="Category" value="Concepts"/> + </metadata> + </prolog> + + <conbody> + + <p> + When you use Impala with HBase: + </p> + + <ul> + <li> + You create the tables on the Impala side using the Hive shell, because the Impala <codeph>CREATE + TABLE</codeph> statement currently does not support custom SerDes and some other syntax needed for these + tables: + <ul> + <li> + You designate it as an HBase table using the <codeph>STORED BY + 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'</codeph> clause on the Hive <codeph>CREATE + TABLE</codeph> statement. + </li> + + <li> + You map these specially created tables to corresponding tables that exist in HBase, with the clause + <codeph>TBLPROPERTIES("hbase.table.name" = "<varname>table_name_in_hbase</varname>")</codeph> on the + Hive <codeph>CREATE TABLE</codeph> statement. + </li> + + <li> + See <xref href="#hbase_queries"/> for a full example. + </li> + </ul> + </li> + + <li> + You define the column corresponding to the HBase row key as a string with the <codeph>#string</codeph> + keyword, or map it to a <codeph>STRING</codeph> column. + </li> + + <li> + Because Impala and Hive share the same metastore database, once you create the table in Hive, you can + query or insert into it through Impala. 
(After creating a new table through Hive, issue the + <codeph>INVALIDATE METADATA</codeph> statement in <cmdname>impala-shell</cmdname> to make Impala aware of + the new table.) + </li> + + <li> + You issue queries against the Impala tables. For efficient queries, use <codeph>WHERE</codeph> clauses to + find a single key value or a range of key values wherever practical, by testing the Impala column + corresponding to the HBase row key. Avoid queries that do full-table scans, which are efficient for + regular Impala tables but inefficient in HBase. + </li> + </ul> + + <p> + To work with an HBase table from Impala, ensure that the <codeph>impala</codeph> user has read/write + privileges for the HBase table, using the <codeph>GRANT</codeph> command in the HBase shell. For details + about HBase security, see the + <xref href="http://hbase.apache.org/book/ch08s04.html" format="html" scope="external">Security chapter in + the HBase Reference Guide</xref>. + </p> + </conbody> + </concept> + + <concept id="hbase_config"> + + <title>Configuring HBase for Use with Impala</title> + <prolog> + <metadata> + <data name="Category" value="Configuring"/> + </metadata> + </prolog> + + <conbody> + + <p> + HBase works out of the box with Impala. There is no mandatory configuration needed to use these two + components together. 
+ </p> + + <p> + To avoid delays if HBase is unavailable during Impala startup or after an <codeph>INVALIDATE + METADATA</codeph> statement, <ph rev="upstream">Cloudera</ph> recommends setting timeout values as follows in + <filepath>/etc/impala/conf/hbase-site.xml</filepath> (for environments not managed by Cloudera Manager): + </p> + +<codeblock><property> + <name>hbase.client.retries.number</name> + <value>3</value> +</property> +<property> + <name>hbase.rpc.timeout</name> + <value>3000</value> +</property> +</codeblock> + + <p> + Currently, Cloudera Manager does not have an Impala-only override for HBase settings, so any HBase + configuration change you make through Cloudera Manager would take effect for all HBase applications. + Therefore, this change is not recommended on systems managed by Cloudera Manager. + </p> + </conbody> + </concept> + + <concept id="hbase_types"> + + <title>Supported Data Types for HBase Columns</title> + + <conbody> + + <p> + To understand how Impala column data types are mapped to fields in HBase, you should have some background + knowledge about HBase first. You set up the mapping by running the <codeph>CREATE TABLE</codeph> statement + in the Hive shell. See + <xref href="https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration" scope="external" format="html">the + Hive wiki</xref> for a starting point, and <xref href="#hbase_queries"/> for examples. + </p> + + <p> + HBase works as a kind of <q>bit bucket</q>, in the sense that HBase does not enforce any typing for the + key or value fields. All the type enforcement is done on the Impala side. + </p> + + <p> + For best performance of Impala queries against HBase tables, most queries will perform comparisons in the + <codeph>WHERE</codeph> clause against the column that corresponds to the HBase row key. When creating the table + through the Hive shell, use the <codeph>STRING</codeph> data type for the column that corresponds to the + HBase row key.
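+    </p>
+
+    <p>
+      For example, a Hive table definition along the following lines (a hypothetical sketch; the
+      table, column, and column family names are illustrative only) maps the HBase row key to a
+      <codeph>STRING</codeph> column named <codeph>id</codeph>:
+    </p>
+
+<codeblock>CREATE EXTERNAL TABLE hbase_metrics (
+  id string,
+  metric_value int)
+STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+WITH SERDEPROPERTIES (
+  "hbase.columns.mapping" = ":key,metricsCF:metric_value"
+)
+TBLPROPERTIES("hbase.table.name" = "hbase_metrics");
+</codeblock>
+
+    <p>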
Impala can translate conditional tests (through operators such as <codeph>=</codeph>, + <codeph><</codeph>, <codeph>BETWEEN</codeph>, and <codeph>IN</codeph>) against this column into fast + lookups in HBase, but this optimization (<q>predicate pushdown</q>) only works when that column is + defined as <codeph>STRING</codeph>. + </p> + + <p> + Starting in Impala 1.1, Impala also supports reading and writing to columns that are defined in the Hive + <codeph>CREATE TABLE</codeph> statement using binary data types, represented in the Hive table definition + using the <codeph>#binary</codeph> keyword, often abbreviated as <codeph>#b</codeph>. Defining numeric + columns as binary can reduce the overall data volume in the HBase tables. You should still define the + column that corresponds to the HBase row key as a <codeph>STRING</codeph>, to allow fast lookups using + that column. + </p> + </conbody> + </concept> + + <concept id="hbase_performance"> + + <title>Performance Considerations for the Impala-HBase Integration</title> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + </metadata> + </prolog> + + <conbody> + + <p> + To understand the performance characteristics of SQL queries against data stored in HBase, you should have + some background knowledge about how HBase interacts with SQL-oriented systems first. See + <xref href="https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration" scope="external" format="html">the + Hive wiki</xref> for a starting point; because Impala shares the same metastore database as Hive, the + information about mapping columns from Hive tables to HBase tables is generally applicable to Impala too. + </p> + + <p> + Impala uses the HBase client API via Java Native Interface (JNI) to query data stored in HBase. This + querying does not read HFiles directly.
The extra communication overhead makes it important to choose what + data to store in HBase or in HDFS, and construct efficient queries that can retrieve the HBase data + efficiently: + </p> + + <ul> + <li> + Use HBase tables for queries that return a single row or a range of rows, not queries that scan the entire + table. (If a query has no <codeph>WHERE</codeph> clause, that is a strong indicator that it is an + inefficient query for an HBase table.) + </li> + + <li> + If you have join queries that do aggregation operations on large fact tables and join the results against + small dimension tables, consider using Impala for the fact tables and HBase for the dimension tables. + (Because Impala does a full scan on the HBase table in this case, rather than doing single-row HBase + lookups based on the join column, only use this technique where the HBase table is small enough that + doing a full table scan does not cause a performance bottleneck for the query.) + </li> + </ul> + + <p> + Query predicates are applied to row keys as start and stop keys, thereby limiting the scope of a particular + lookup. If row keys are not mapped to string columns, then ordering is typically incorrect and comparison + operations do not work. For example, evaluating for greater + than (>) or less than (<) cannot be completed. + </p> + + <p> + Predicates on non-key columns can be sent to HBase to scan as <codeph>SingleColumnValueFilters</codeph>, + providing some performance gains. In such a case, HBase returns fewer rows than if those same predicates + were applied using Impala. While there is some improvement, it is not as great as when start and stop rows are + used. This is because the number of rows that HBase must examine is not limited as it is when start and + stop rows are used. As long as the row key predicate only applies to a single row, HBase will locate and + return that row.
Conversely, if a non-key predicate is used, even if it only applies to a single row, HBase + must still scan the entire table to find the correct result. + </p> + + <example> + + <title>Interpreting EXPLAIN Output for HBase Queries</title> + + <p> + For example, here are some queries against the following Impala table, which is mapped to an HBase table. + The examples show excerpts from the output of the <codeph>EXPLAIN</codeph> statement, demonstrating what + things to look for to indicate an efficient or inefficient query against an HBase table. + </p> + + <p> + The first column (<codeph>cust_id</codeph>) was specified as the key column in the <codeph>CREATE + EXTERNAL TABLE</codeph> statement; for performance, it is important to declare this column as + <codeph>STRING</codeph>. Other columns, such as <codeph>BIRTH_YEAR</codeph> and + <codeph>NEVER_LOGGED_ON</codeph>, are also declared as <codeph>STRING</codeph>, rather than their + <q>natural</q> types of <codeph>INT</codeph> or <codeph>BOOLEAN</codeph>, because Impala can optimize + those types more effectively in HBase tables. For comparison, we leave one column, + <codeph>YEAR_REGISTERED</codeph>, as <codeph>INT</codeph> to show that filtering on this column is + inefficient. 
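+      </p>
+
+      <p>
+        For reference, the Hive table definition behind this example might look something like the
+        following (a hypothetical sketch; the actual statement and HBase column family names are not
+        shown in this topic):
+      </p>
+
+<codeblock>CREATE EXTERNAL TABLE hbase_table (
+  cust_id string,
+  birth_year string,
+  never_logged_on string,
+  private_email_address string,
+  year_registered int)
+STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+WITH SERDEPROPERTIES (
+  "hbase.columns.mapping" =
+    ":key,infoCF:birth_year,infoCF:never_logged_on,infoCF:private_email_address,infoCF:year_registered"
+)
+TBLPROPERTIES("hbase.table.name" = "hbase_table");
+</codeblock>
+
+      <p>
+        The <codeph>DESCRIBE</codeph> output below shows the resulting column types.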
+ </p> + +<codeblock>describe hbase_table; +Query: describe hbase_table ++-----------------------+--------+---------+ +| name | type | comment | ++-----------------------+--------+---------+ +| cust_id | <b>string</b> | | +| birth_year | <b>string</b> | | +| never_logged_on | <b>string</b> | | +| private_email_address | string | | +| year_registered | <b>int</b> | | ++-----------------------+--------+---------+ +</codeblock> + + <p> + The best case for performance involves a single row lookup using an equality comparison on the column + defined as the row key: + </p> + +<codeblock>explain select count(*) from hbase_table where cust_id = '[email protected]'; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. | +| hbase.hbase_table | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | | +| 01:AGGREGATE | +| | output: count(*) | +| | | +<b>| 00:SCAN HBASE [hbase.hbase_table] |</b> +<b>| start key: [email protected] |</b> +<b>| stop key: [email protected]\0 |</b> ++------------------------------------------------------------------------------------+ +</codeblock> + + <p> + Another type of efficient query involves a range lookup on the row key column, using SQL operators such + as greater than (or equal), less than (or equal), or <codeph>BETWEEN</codeph>. This example also includes + an equality test on a non-key column; because that column is a <codeph>STRING</codeph>, Impala can let + HBase perform that test, indicated by the <codeph>hbase filters:</codeph> line in the + <codeph>EXPLAIN</codeph> output. 
Doing the filtering within HBase is more efficient than transmitting all + the data to Impala and doing the filtering on the Impala side. + </p> + +<codeblock>explain select count(*) from hbase_table where cust_id between 'a' and 'b' + and never_logged_on = 'true'; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... +<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. | +| hbase.hbase_table | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | |--> +| 01:AGGREGATE | +| | output: count(*) | +| | | +<b>| 00:SCAN HBASE [hbase.hbase_table] |</b> +<b>| start key: a |</b> +<b>| stop key: b\0 |</b> +<b>| hbase filters: cols:never_logged_on EQUAL 'true' |</b> ++------------------------------------------------------------------------------------+ +</codeblock> + + <p> + The query is less efficient if Impala has to evaluate any of the predicates, because Impala must scan the + entire HBase table. Impala can only push down predicates to HBase for columns declared as + <codeph>STRING</codeph>. This example tests a column declared as <codeph>INT</codeph>, and the + <codeph>predicates:</codeph> line in the <codeph>EXPLAIN</codeph> output indicates that the test is + performed after the data is transmitted to Impala. + </p> + +<codeblock>explain select count(*) from hbase_table where year_registered = 2010; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... +<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. 
| +| hbase.hbase_table | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | |--> +| 01:AGGREGATE | +| | output: count(*) | +| | | +<b>| 00:SCAN HBASE [hbase.hbase_table] |</b> +<b>| predicates: year_registered = 2010 |</b> ++------------------------------------------------------------------------------------+ +</codeblock> + + <p> + The same inefficiency applies if the key column is compared to any non-constant value. Here, even though + the key column is a <codeph>STRING</codeph>, and is tested using an equality operator, Impala must scan + the entire HBase table because the key column is compared to another column value rather than a constant. + </p> + +<codeblock>explain select count(*) from hbase_table where cust_id = private_email_address; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... +<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. | +| hbase.hbase_table | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | |--> +| 01:AGGREGATE | +| | output: count(*) | +| | | +<b>| 00:SCAN HBASE [hbase.hbase_table] |</b> +<b>| predicates: cust_id = private_email_address |</b> ++------------------------------------------------------------------------------------+ +</codeblock> + + <p> + Currently, tests on the row key using <codeph>OR</codeph> or <codeph>IN</codeph> clauses are not + optimized into direct lookups either. Such limitations might be lifted in the future, so always check the + <codeph>EXPLAIN</codeph> output to be sure whether a particular SQL construct results in an efficient + query or not for HBase tables. 
+ </p> + +<codeblock>explain select count(*) from hbase_table where + cust_id = '[email protected]' or cust_id = '[email protected]'; ++----------------------------------------------------------------------------------------+ +| Explain String | ++----------------------------------------------------------------------------------------+ +... +<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. | +| hbase.hbase_table | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | |--> +| 01:AGGREGATE | +| | output: count(*) | +| | | +<b>| 00:SCAN HBASE [hbase.hbase_table] |</b> +<b>| predicates: cust_id = '[email protected]' OR cust_id = '[email protected]' |</b> ++----------------------------------------------------------------------------------------+ + +explain select count(*) from hbase_table where + cust_id in ('[email protected]', '[email protected]'); ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... +<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. 
| +| hbase.hbase_table | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | |--> +| 01:AGGREGATE | +| | output: count(*) | +| | | +<b>| 00:SCAN HBASE [hbase.hbase_table] |</b> +<b>| predicates: cust_id IN ('[email protected]', '[email protected]') |</b> ++------------------------------------------------------------------------------------+ +</codeblock> + + <p> + Either rewrite into separate queries for each value and combine the results in the application, or + combine the single-row queries using <codeph>UNION ALL</codeph>: + </p> + +<codeblock>select count(*) from hbase_table where cust_id = '[email protected]'; +select count(*) from hbase_table where cust_id = '[email protected]'; + +explain + select count(*) from hbase_table where cust_id = '[email protected]' + union all + select count(*) from hbase_table where cust_id = '[email protected]'; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... +<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. | +| hbase.hbase_table | +| | +| 09:EXCHANGE [PARTITION=UNPARTITIONED] | +| | | +| |--11:MERGE | +| | | | +| | 08:AGGREGATE [MERGE FINALIZE] | +| | | output: sum(count(*)) | +| | | | +| | 07:EXCHANGE [PARTITION=UNPARTITIONED] | +| | | |--> +| | 04:AGGREGATE | +| | | output: count(*) | +| | | | +<b>| | 03:SCAN HBASE [hbase.hbase_table] |</b> +<b>| | start key: [email protected] |</b> +<b>| | stop key: [email protected]\0 |</b> +| | | +| 10:MERGE | +...
+<!--| | | +| 06:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 05:EXCHANGE [PARTITION=UNPARTITIONED] | +| | |--> +| 02:AGGREGATE | +| | output: count(*) | +| | | +<b>| 01:SCAN HBASE [hbase.hbase_table] |</b> +<b>| start key: [email protected] |</b> +<b>| stop key: [email protected]\0 |</b> ++------------------------------------------------------------------------------------+ +</codeblock> + + </example> + + <example> + + <title>Configuration Options for Java HBase Applications</title> + + <p> If you have an HBase Java application that calls the + <codeph>setCacheBlocks</codeph> or <codeph>setCaching</codeph> + methods of the class <xref + href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" + scope="external" format="html" + >org.apache.hadoop.hbase.client.Scan</xref>, you can set these same + caching behaviors through Impala query options, to control the memory + pressure on the HBase RegionServer. For example, when doing queries in + HBase that result in full-table scans (which by default are + inefficient for HBase), you can reduce memory usage and speed up the + queries by turning off the <codeph>HBASE_CACHE_BLOCKS</codeph> setting + and specifying a large number for the <codeph>HBASE_CACHING</codeph> + setting. + <!-- That recommendation w.r.t. full-table scans comes from the Cloudera HBase forum: http://community.cloudera.com/t5/Realtime-Random-Access-Apache/How-to-optimise-Full-Table-Scan-FTS-in-HBase/td-p/97 --> + </p> + + <p> + To set these options, issue commands like the following in <cmdname>impala-shell</cmdname>: + </p> + +<codeblock>-- Same as calling setCacheBlocks(true) or setCacheBlocks(false). +set hbase_cache_blocks=true; +set hbase_cache_blocks=false; + +-- Same as calling setCaching(rows). 
+set hbase_caching=1000; +</codeblock> + + <p> + Or update the <cmdname>impalad</cmdname> defaults file <filepath>/etc/default/impala</filepath> and + include settings for <codeph>HBASE_CACHE_BLOCKS</codeph> and/or <codeph>HBASE_CACHING</codeph> in the + <codeph>-default_query_options</codeph> setting for <codeph>IMPALA_SERVER_ARGS</codeph>. See + <xref href="impala_config_options.xml#config_options"/> for details. + </p> + + <note> + In Impala 2.0 and later, these options are settable through the JDBC or ODBC interfaces using the + <codeph>SET</codeph> statement. + </note> + + </example> + </conbody> + </concept> + + <concept id="hbase_scenarios"> + + <title>Use Cases for Querying HBase through Impala</title> + <prolog> + <metadata> + <data name="Category" value="Use Cases"/> + </metadata> + </prolog> + + <conbody> + + <p> + The following are popular use cases for using Impala to query HBase tables: + </p> + + <ul> + <li> + Keeping large fact tables in Impala, and smaller dimension tables in HBase. The fact tables use Parquet + or other binary file format optimized for scan operations. Join queries scan through the large Impala + fact tables, and cross-reference the dimension tables using efficient single-row lookups in HBase. + </li> + + <li> + Using HBase to store rapidly incrementing counters, such as how many times a web page has been viewed, or + on a social network, how many connections a user has or how many votes a post received. HBase is + efficient for capturing such changeable data: the append-only storage mechanism is efficient for writing + each change to disk, and a query always returns the latest value. An application could query specific + totals like these from HBase, and combine the results with a broader set of data queried from Impala. + </li> + + <li> + <p> + Storing very wide tables in HBase. Wide tables have many columns, possibly thousands, typically + recording many attributes for an important subject such as a user of an online service. 
These tables + are also often sparse, that is, most of the column values are <codeph>NULL</codeph>, 0, + <codeph>false</codeph>, empty string, or some other blank or placeholder value. (For example, any particular + web site user might never have used some site feature, filled in a certain field in their profile, + visited a particular part of the site, and so on.) A typical query against this kind of table is to + look up a single row to retrieve all the information about a specific subject, rather than summing, + averaging, or filtering millions of rows as in typical Impala-managed tables. + </p> + <p> + Or the HBase table could be joined with a larger Impala-managed table. For example, analyze the large + Impala table representing web traffic for a site and pick out 50 users who view the most pages. Join + that result with the wide user table in HBase to look up attributes of those users. The HBase side of + the join would result in 50 efficient single-row lookups in HBase, rather than scanning the entire user + table. + </p> + </li> + </ul> + </conbody> + </concept> + + <concept audience="Cloudera" id="hbase_create_new"> + + <title>Creating a New HBase Table for Impala to Use</title> + + <conbody> + + <p> + You can create an HBase-backed table through a <codeph>CREATE TABLE</codeph> statement in the Hive shell, + without going into the HBase shell at all: + </p> + + <!-- To do: + Add example. (Not critical because this subtopic is currently hidden.) + --> + </conbody> + </concept> + + <concept audience="Cloudera" id="hbase_reuse_existing"> + + <title>Associate Impala with an Existing HBase Table</title> + + <conbody> + + <p> + If you already have some HBase tables created through the HBase shell, you can make them accessible to + Impala through a <codeph>CREATE TABLE</codeph> statement in the Hive shell: + </p> + + <!-- To do: + Add example. (Not critical because this subtopic is currently hidden.)
+ --> + </conbody> + </concept> + + <concept audience="Cloudera" id="hbase_column_families"> + + <title>Map HBase Columns and Column Families to Impala Columns</title> + + <conbody> + + <p/> + </conbody> + </concept> + + <concept id="hbase_loading"> + + <title>Loading Data into an HBase Table</title> + <prolog> + <metadata> + <data name="Category" value="ETL"/> + <data name="Category" value="Ingest"/> + </metadata> + </prolog> + + <conbody> + + <p> + The Impala <codeph>INSERT</codeph> statement works for HBase tables. The <codeph>INSERT ... VALUES</codeph> + syntax is ideally suited to HBase tables, because inserting a single row is an efficient operation for an + HBase table. (For regular Impala tables, with data files in HDFS, the tiny data files produced by + <codeph>INSERT ... VALUES</codeph> are extremely inefficient, so you would not use that technique with + tables containing any significant data volume.) + </p> + + <!-- To do: + Add examples throughout this section. + --> + + <p> + When you use the <codeph>INSERT ... SELECT</codeph> syntax, the result in the HBase table could be fewer + rows than you expect. HBase only stores the most recent version of each unique row key, so if an + <codeph>INSERT ... 
SELECT</codeph> statement copies over multiple rows containing the same value for the + key column, subsequent queries will only return one row with each key column value. + </p> + + <p> + Although Impala does not have an <codeph>UPDATE</codeph> statement, you can achieve the same effect by + doing successive <codeph>INSERT</codeph> statements using the same value for the key column each time. + </p> + + </conbody> + </concept> + + <concept id="hbase_limitations"> + + <title>Limitations and Restrictions of the Impala and HBase Integration</title> + + <conbody> + + <p> + The Impala integration with HBase has the following limitations and restrictions, some inherited from the + integration between HBase and Hive, and some unique to Impala: + </p> + + <ul> + <li> + <p> + If you issue a <codeph>DROP TABLE</codeph> for an internal (Impala-managed) table that is mapped to an + HBase table, the underlying table is not removed in HBase. In contrast, the Hive <codeph>DROP TABLE</codeph> + statement does remove the underlying HBase table in this case. + </p> + </li> + + <li> + <p> + The <codeph>INSERT OVERWRITE</codeph> statement is not available for HBase tables. You can insert new + data, or modify an existing row by inserting a new row with the same key value, but not replace the + entire contents of the table. You can do an <codeph>INSERT OVERWRITE</codeph> in Hive if you need this + capability. + </p> + </li> + + <li> + <p> + If you issue a <codeph>CREATE TABLE LIKE</codeph> statement for a table mapped to an HBase table, the + new table is also an HBase table, but inherits the same underlying HBase table name as the original. + The new table is effectively an alias for the old one, not a new table with identical column structure. + Avoid using <codeph>CREATE TABLE LIKE</codeph> for HBase tables, to avoid any confusion. + </p> + </li> + + <li> + <p> + Copying data into an HBase table using the Impala <codeph>INSERT ... 
SELECT</codeph> syntax might + produce fewer new rows than are in the query result set. If the result set contains multiple rows with + the same value for the key column, each row supersedes any previous rows with the same key value. + Because the order of the inserted rows is unpredictable, you cannot rely on this technique to preserve + the <q>latest</q> version of a particular key value. + </p> + </li> + <li rev="2.3.0"> + <p> + Because the complex data types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>) + available in <keyword keyref="impala23_full"/> and higher are currently only supported in Parquet tables, you cannot + use these types in HBase tables that are queried through Impala. + </p> + </li> + <li> + <p conref="../shared/impala_common.xml#common/hbase_no_load_data"/> + </li> + </ul> + </conbody> + </concept> + + <concept id="hbase_queries"> + + <title>Examples of Querying HBase Tables from Impala</title> + + <conbody> + + <p> + The following examples create an HBase table with four column families, + create a corresponding table through Hive, + then insert and query the table through Impala. + </p> + <p> + In the HBase shell, the table + name is quoted in <codeph>CREATE</codeph> and <codeph>DROP</codeph> statements. Tables created in HBase + begin in <q>enabled</q> state; before dropping them through the HBase shell, you must issue a + <codeph>disable '<varname>table_name</varname>'</codeph> statement. + </p> + +<codeblock>$ hbase shell +15/02/10 16:07:45 +HBase Shell; enter 'help<RETURN>' for list of supported commands. +Type "exit<RETURN>" to leave the HBase Shell +Version 0.94.2-cdh4.2.0, rUnknown, Fri Feb 15 11:51:18 PST 2013 + +hbase(main):001:0> create 'hbasealltypessmall', 'boolsCF', 'intsCF', 'floatsCF', 'stringsCF' +0 row(s) in 4.6520 seconds + +=> Hbase::Table - hbasealltypessmall +hbase(main):006:0> quit +</codeblock> + + <p> + Issue the following <codeph>CREATE TABLE</codeph> statement in the Hive shell.
(The Impala <codeph>CREATE + TABLE</codeph> statement currently does not support the <codeph>STORED BY</codeph> clause, so you switch into Hive to + create the table, then back to Impala and the <cmdname>impala-shell</cmdname> interpreter to issue the + queries.) + </p> + + <p> + This example creates an external table mapped to the HBase table, usable by both Impala and Hive. It is + defined as an external table so that when dropped by Impala or Hive, the original HBase table is not touched at all. + </p> + + <p> + The <codeph>WITH SERDEPROPERTIES</codeph> clause + specifies that the first column (<codeph>ID</codeph>) represents the row key, and maps the remaining + columns of the SQL table to HBase column families. The mapping relies on the ordinal order of the + columns in the table, not the column names in the <codeph>CREATE TABLE</codeph> statement. + The first column is defined to be the lookup key; the + <codeph>STRING</codeph> data type produces the fastest key-based lookups for HBase tables. + </p> + + <note> + For Impala with HBase tables, the most important aspect to ensure good performance is to use a + <codeph>STRING</codeph> column as the row key, as shown in this example. 
+ </note> + +<codeblock>$ hive +Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties +Hive history file=/tmp/cloudera/hive_job_log_cloudera_201502101610_1980712808.txt +hive> use hbase; +OK +Time taken: 4.095 seconds +hive> CREATE EXTERNAL TABLE hbasestringids ( + > id string, + > bool_col boolean, + > tinyint_col tinyint, + > smallint_col smallint, + > int_col int, + > bigint_col bigint, + > float_col float, + > double_col double, + > date_string_col string, + > string_col string, + > timestamp_col timestamp) + > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' + > WITH SERDEPROPERTIES ( + > "hbase.columns.mapping" = + > ":key,boolsCF:bool_col,intsCF:tinyint_col,intsCF:smallint_col,intsCF:int_col,intsCF:\ + > bigint_col,floatsCF:float_col,floatsCF:double_col,stringsCF:date_string_col,\ + > stringsCF:string_col,stringsCF:timestamp_col" + > ) + > TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall"); +OK +Time taken: 2.879 seconds +hive> quit; +</codeblock> + + <p> + Once you have established the mapping to an HBase table, you can issue DML statements and queries + from Impala. The following example shows a series of <codeph>INSERT</codeph> + statements followed by a query. + The ideal kind of query from a performance standpoint + retrieves a row from the table based on a row key + mapped to a string column. + An initial <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph> + statement makes the table created through Hive visible to Impala. + </p> + +<codeblock>$ impala-shell -i localhost -d hbase +Starting Impala Shell without Kerberos authentication +Connected to localhost:21000 +Server version: impalad version 2.1.0-cdh4 RELEASE (build d520a9cdea2fc97e8d5da9fbb0244e60ee416bfa) +Welcome to the Impala shell. Press TAB twice to see a list of available commands. + +Copyright (c) 2012 Cloudera, Inc. All rights reserved. 
+ +(Shell build version: Impala Shell v2.1.0-cdh4 (d520a9c) built on Mon Dec 8 21:41:17 PST 2014) +Query: use `hbase` +[localhost:21000] > invalidate metadata hbasestringids; +Fetched 0 row(s) in 0.09s +[localhost:21000] > desc hbasestringids; ++-----------------+-----------+---------+ +| name | type | comment | ++-----------------+-----------+---------+ +| id | string | | +| bool_col | boolean | | +| double_col | double | | +| float_col | float | | +| bigint_col | bigint | | +| int_col | int | | +| smallint_col | smallint | | +| tinyint_col | tinyint | | +| date_string_col | string | | +| string_col | string | | +| timestamp_col | timestamp | | ++-----------------+-----------+---------+ +Fetched 11 row(s) in 0.02s +[localhost:21000] > insert into hbasestringids values ('0001',true,3.141,9.94,1234567,32768,4000,76,'2014-12-31','Hello world',now()); +Inserted 1 row(s) in 0.26s +[localhost:21000] > insert into hbasestringids values ('0002',false,2.004,6.196,1500,8000,129,127,'2014-01-01','Foo bar',now()); +Inserted 1 row(s) in 0.12s +[localhost:21000] > select * from hbasestringids where id = '0001'; ++------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+ +| id | bool_col | double_col | float_col | bigint_col | int_col | smallint_col | tinyint_col | date_string_col | string_col | timestamp_col | ++------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+ +| 0001 | true | 3.141 | 9.939999580383301 | 1234567 | 32768 | 4000 | 76 | 2014-12-31 | Hello world | 2015-02-10 16:36:59.764838000 | ++------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+ +Fetched 1 row(s) in 0.54s +</codeblock> + + <note 
conref="../shared/impala_common.xml#common/invalidate_metadata_hbase"/> +<!-- </section> --> + </conbody> + </concept> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_hbase_cache_blocks.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_hbase_cache_blocks.xml b/docs/topics/impala_hbase_cache_blocks.xml new file mode 100644 index 0000000..480b31c --- /dev/null +++ b/docs/topics/impala_hbase_cache_blocks.xml @@ -0,0 +1,40 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="hbase_cache_blocks"> + + <title>HBASE_CACHE_BLOCKS Query Option</title> + <titlealts audience="PDF"><navtitle>HBASE_CACHE_BLOCKS</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="HBase"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">HBASE_CACHE_BLOCKS query option</indexterm> + Setting this option is equivalent to calling the + <codeph>setCacheBlocks</codeph> method of the class <xref + href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" + scope="external" format="html" + >org.apache.hadoop.hbase.client.Scan</xref>, in an HBase Java + application. Helps to control the memory pressure on the HBase + RegionServer, in conjunction with the <codeph>HBASE_CACHING</codeph> query + option. 
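</p>
+
+ <p>
+ For example, you can turn on HBase block caching for the scans issued by subsequent queries in an
+ <cmdname>impala-shell</cmdname> session. (This is only a usage sketch; whether caching the scanned
+ blocks actually helps depends on your workload.)
+ </p>
+
+<codeblock>[localhost:21000] > set hbase_cache_blocks=true;
+</codeblock>
+
+ <p>
+ Like other query options, the setting stays in effect for the remainder of the session.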
</p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false_0"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_hbase.xml#impala_hbase"/>, + <xref href="impala_hbase_caching.xml#hbase_caching"/> + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_hbase_caching.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_hbase_caching.xml b/docs/topics/impala_hbase_caching.xml new file mode 100644 index 0000000..03755fd --- /dev/null +++ b/docs/topics/impala_hbase_caching.xml @@ -0,0 +1,45 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="hbase_caching"> + + <title>HBASE_CACHING Query Option</title> + <titlealts audience="PDF"><navtitle>HBASE_CACHING</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="HBase"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">HBASE_CACHING query option</indexterm> + Setting this option is equivalent to calling the + <codeph>setCaching</codeph> method of the class <xref + href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" + scope="external" format="html" + >org.apache.hadoop.hbase.client.Scan</xref>, in an HBase Java + application. Helps to control the memory pressure on the HBase + RegionServer, in conjunction with the <codeph>HBASE_CACHE_BLOCKS</codeph> + query option. 
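</p>
+
+ <p>
+ For example, you can request that HBase transfer rows to Impala in larger batches for subsequent
+ scans. (The value <codeph>1000</codeph> here is an illustration, not a recommendation for every
+ workload.)
+ </p>
+
+<codeblock>[localhost:21000] > set hbase_caching=1000;
+</codeblock>
+
+ <p>
+ Larger values reduce the number of round trips to the HBase RegionServer, at the cost of higher
+ memory usage per scan.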
</p>
+
+ <p>
+ <b>Type:</b> <codeph>INT</codeph>
+ </p>
+
+ <p>
+ <b>Default:</b> 0
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+ <p>
+ <xref href="impala_hbase.xml#impala_hbase"/>,
+ <xref href="impala_hbase_cache_blocks.xml#hbase_cache_blocks"/>
+ </p>
+
+ </conbody>
+</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_hints.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_hints.xml b/docs/topics/impala_hints.xml new file mode 100644 index 0000000..4c5145f --- /dev/null +++ b/docs/topics/impala_hints.xml @@ -0,0 +1,249 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="hints">
+
+ <title>Query Hints in Impala SELECT Statements</title>
+ <titlealts audience="PDF"><navtitle>Hints</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Querying"/>
+ <data name="Category" value="Performance"/>
+ <data name="Category" value="Troubleshooting"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Data Analysts"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">hints</indexterm>
+ The Impala SQL dialect supports query hints, for fine-tuning the inner workings of queries. Specify hints as
+ a temporary workaround for expensive queries, where missing statistics or other factors cause inefficient
+ performance.
+ </p>
+
+ <p>
+ Hints are most often used for the most resource-intensive kinds of Impala queries:
+ </p>
+
+ <ul>
+ <li>
+ Join queries involving large tables, where intermediate result sets are transmitted across the network to
+ evaluate the join conditions.
+ </li> + + <li> + Inserting into partitioned Parquet tables, where many memory buffers could be allocated on each host to + hold intermediate results for each partition. + </li> + </ul> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p> + You can represent the hints as keywords surrounded by <codeph>[]</codeph> square brackets; include the + brackets in the text of the SQL statement. + </p> + +<codeblock>SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM +<varname>join_left_hand_table</varname> + JOIN [{BROADCAST|SHUFFLE}] +<varname>join_right_hand_table</varname> +<varname>remainder_of_query</varname>; + +INSERT <varname>insert_clauses</varname> + [{SHUFFLE|NOSHUFFLE}] + SELECT <varname>remainder_of_query</varname>; +</codeblock> + + <p rev="2.0.0"> + In <keyword keyref="impala20_full"/> and higher, you can also specify the hints inside comments that use + either the <codeph>/* */</codeph> or <codeph>--</codeph> notation. Specify a <codeph>+</codeph> symbol + immediately before the hint name. 
+ </p> + +<codeblock rev="2.0.0">SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM +<varname>join_left_hand_table</varname> + JOIN /* +BROADCAST|SHUFFLE */ +<varname>join_right_hand_table</varname> +<varname>remainder_of_query</varname>; + +SELECT <varname>select_list</varname> FROM +<varname>join_left_hand_table</varname> + JOIN -- +BROADCAST|SHUFFLE +<varname>join_right_hand_table</varname> +<varname>remainder_of_query</varname>; + +INSERT <varname>insert_clauses</varname> + /* +SHUFFLE|NOSHUFFLE */ + SELECT <varname>remainder_of_query</varname>; + +INSERT <varname>insert_clauses</varname> + -- +SHUFFLE|NOSHUFFLE + SELECT <varname>remainder_of_query</varname>; +</codeblock> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + With both forms of hint syntax, include the <codeph>STRAIGHT_JOIN</codeph> + keyword immediately after the <codeph>SELECT</codeph> keyword to prevent Impala from + reordering the tables in a way that makes the hint ineffective. + </p> + + <p> + To reduce the need to use hints, run the <codeph>COMPUTE STATS</codeph> statement against all tables involved + in joins, or used as the source tables for <codeph>INSERT ... SELECT</codeph> operations where the + destination is a partitioned Parquet table. Do this operation after loading data or making substantial + changes to the data within each table. Having up-to-date statistics helps Impala choose more efficient query + plans without the need for hinting. See <xref href="impala_perf_stats.xml#perf_stats"/> for details and + examples. + </p> + + <p> + To see which join strategy is used for a particular query, examine the <codeph>EXPLAIN</codeph> output for + that query. See <xref href="impala_explain_plan.xml#perf_explain"/> for details and examples. + </p> + + <p> + <b>Hints for join queries:</b> + </p> + + <p> + The <codeph>[BROADCAST]</codeph> and <codeph>[SHUFFLE]</codeph> hints control the execution strategy for join + queries. 
Specify one of the following constructs immediately after the <codeph>JOIN</codeph> keyword in a + query: + </p> + + <ul> + <li> + <codeph>[SHUFFLE]</codeph> - Makes that join operation use the <q>partitioned</q> technique, which divides + up corresponding rows from both tables using a hashing algorithm, sending subsets of the rows to other + nodes for processing. (The keyword <codeph>SHUFFLE</codeph> is used to indicate a <q>partitioned join</q>, + because that type of join is not related to <q>partitioned tables</q>.) Since the alternative + <q>broadcast</q> join mechanism is the default when table and index statistics are unavailable, you might + use this hint for queries where broadcast joins are unsuitable; typically, partitioned joins are more + efficient for joins between large tables of similar size. + </li> + + <li> + <codeph>[BROADCAST]</codeph> - Makes that join operation use the <q>broadcast</q> technique that sends the + entire contents of the right-hand table to all nodes involved in processing the join. This is the default + mode of operation when table and index statistics are unavailable, so you would typically only need it if + stale metadata caused Impala to mistakenly choose a partitioned join operation. Typically, broadcast joins + are more efficient in cases where one table is much smaller than the other. (Put the smaller table on the + right side of the <codeph>JOIN</codeph> operator.) + </li> + </ul> + + <p> + <b>Hints for INSERT ... SELECT queries:</b> + </p> + + <p conref="../shared/impala_common.xml#common/insert_hints"/> + + <p> + <b>Suggestions versus directives:</b> + </p> + + <p> + In early Impala releases, hints were always obeyed and so acted more like directives. Once Impala gained join + order optimizations, sometimes join queries were automatically reordered in a way that made a hint + irrelevant. Therefore, the hints act more like suggestions in Impala 1.2.2 and higher. 
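</p>
+
+ <p>
+ Because a hint might be overridden, verify the actual join strategy by examining the
+ <codeph>EXPLAIN</codeph> output, as described earlier. For example (the table and column names
+ here are hypothetical):
+ </p>
+
+<codeblock>explain select straight_join c.name, o.total
+ from customer c join <b>[shuffle]</b> orders o
+ on c.id = o.cust_id;</codeblock>
+
+ <p>
+ The join node in the resulting plan shows whether the broadcast or partitioned mechanism was actually chosen.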
</p>
+
+ <p>
+ To force Impala to follow the hinted execution mechanism for a join query, include the
+ <codeph>STRAIGHT_JOIN</codeph> keyword in the <codeph>SELECT</codeph> statement. See
+ <xref href="impala_perf_joins.xml#straight_join"/> for details. When you use this technique, Impala does not
+ reorder the joined tables at all, so you must be careful to arrange the join order to put the largest table
+ (or subquery result set) first, then the smallest, second smallest, third smallest, and so on. This ordering lets Impala do the
+ most I/O-intensive parts of the query using local reads on the DataNodes, and then reduce the size of the
+ intermediate result set as much as possible as each subsequent table or subquery result set is joined.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
+
+ <p>
+ Queries that include subqueries in the <codeph>WHERE</codeph> clause can be rewritten internally as join
+ queries. Currently, you cannot apply hints to the joins produced by these types of queries.
+ </p>
+
+ <p>
+ Because hints can prevent queries from taking advantage of new metadata or improvements in query planning,
+ use them only when required to work around performance issues, and be prepared to remove them when they are
+ no longer required, such as after a new Impala release or bug fix.
+ </p>
+
+ <p>
+ In particular, the <codeph>[BROADCAST]</codeph> and <codeph>[SHUFFLE]</codeph> hints are expected to be
+ needed much less frequently in Impala 1.2.2 and higher, because the join order optimization feature in
+ combination with the <codeph>COMPUTE STATS</codeph> statement now automatically chooses the join order and
+ join mechanism without the need to rewrite the query and add hints. See
+ <xref href="impala_perf_joins.xml#perf_joins"/> for details.
+ </p> + + <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> + + <p rev="2.0.0"> + The hints embedded within <codeph>--</codeph> comments are compatible with Hive queries. The hints embedded + within <codeph>/* */</codeph> comments or <codeph>[ ]</codeph> square brackets are not recognized by or not + compatible with Hive. For example, Hive raises an error for Impala hints within <codeph>/* */</codeph> + comments because it does not recognize the Impala hint names. + </p> + + <p conref="../shared/impala_common.xml#common/view_blurb"/> + + <p rev="2.0.0"> + If you use a hint in the query that defines a view, the hint is preserved when you query the view. Impala + internally rewrites all hints in views to use the <codeph>--</codeph> comment notation, so that Hive can + query such views without errors due to unrecognized hint names. + </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + For example, this query joins a large customer table with a small lookup table of less than 100 rows. The + right-hand table can be broadcast efficiently to all nodes involved in the join. Thus, you would use the + <codeph>[broadcast]</codeph> hint to force a broadcast join strategy: + </p> + +<codeblock>select straight_join customer.address, state_lookup.state_name + from customer join <b>[broadcast]</b> state_lookup + on customer.state_id = state_lookup.state_id;</codeblock> + + <p> + This query joins two large tables of unpredictable size. You might benchmark the query with both kinds of + hints and find that it is more efficient to transmit portions of each table to other nodes for processing. 
+ Thus, you would use the <codeph>[shuffle]</codeph> hint to force a partitioned join strategy: + </p> + +<codeblock>select straight_join weather.wind_velocity, geospatial.altitude + from weather join <b>[shuffle]</b> geospatial + on weather.lat = geospatial.lat and weather.long = geospatial.long;</codeblock> + + <p> + For joins involving three or more tables, the hint applies to the tables on either side of that specific + <codeph>JOIN</codeph> keyword. The <codeph>STRAIGHT_JOIN</codeph> keyword ensures that joins are processed + in a predictable order from left to right. For example, this query joins + <codeph>t1</codeph> and <codeph>t2</codeph> using a partitioned join, then joins that result set to + <codeph>t3</codeph> using a broadcast join: + </p> + +<codeblock>select straight_join t1.name, t2.id, t3.price + from t1 join <b>[shuffle]</b> t2 join <b>[broadcast]</b> t3 + on t1.id = t2.id and t2.id = t3.id;</codeblock> + + <!-- To do: This is a good place to add more sample output showing before and after EXPLAIN plans. --> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + For more background information about join queries, see <xref href="impala_joins.xml#joins"/>. For + performance considerations, see <xref href="impala_perf_joins.xml#perf_joins"/>. 
+ </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_howto_rm.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_howto_rm.xml b/docs/topics/impala_howto_rm.xml new file mode 100644 index 0000000..2a7494b --- /dev/null +++ b/docs/topics/impala_howto_rm.xml @@ -0,0 +1,420 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="howto_impala_rm"> + + <title>How to Configure Resource Management for Impala</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Admission Control"/> + <data name="Category" value="Resource Management"/> + </metadata> + </prolog> + <conbody> + <p>Impala includes features that balance and maximize resources in your CDH cluster. This topic describes + how you can enhance a CDH cluster using Impala to improve efficiency.</p> + <p outputclass="toc inpage">A typical deployment uses the following.</p> + <ul audience="HTML"> + <li audience="HTML">Creating Static Service Pools</li> + <li audience="HTML">Using Admission Control <ul> + <li audience="HTML">Setting Per-query Memory Limits</li> + <li audience="HTML">Creating Dynamic Resource Pools</li> + </ul></li> + + </ul> + </conbody> + <concept id="static_service_pools"> + <title>Creating Static Service Pools</title> + <conbody> + <p>Use Static Service Pools to allocate dedicated resources for Impala and other services to allow + for predictable resource availability. </p> + <p>Static service pools isolate services from one another, so that high load on one service has + bounded impact on other services. 
You can use Cloudera Manager to configure static service pools + that control memory, CPU and Disk I/O.</p> + <p >The following screenshot shows a sample configuration for Static Service Pools in + Cloudera Manager:</p> + <draft-comment author="ddawson"> + <p>Need accurate numbers - or can we remove services other than HDFS, Impala, and YARN? Matt + Jacobs is going to run these numbers by someone in the field.</p> + </draft-comment> + <p> + <image href="../images/howto_static_server_pools_config.png" placement="break" + id="PDF_33" align="center" scale="33" audience="HTML"/> + <image href="../images/howto_static_server_pools_config.png" placement="break" + id="HTML_SCALEFIT" align="center" scalefit="yes" audience="PDF"/> + </p> + <p> + <ul id="ul_tkw_4rs_pw"> + <li > + <p >HDFS always needs to have a minimum of 5-10% of the resources.</p> + </li> + <li > + Generally, YARN and Impala split the rest of the resources. + + <ul id="ul_ukw_4rs_pw"> + <li > + <p >For mostly batch workloads, you might allocate YARN 60%, Impala 30%, and HDFS + 10%. </p> + </li> + <li > + <p >For mostly ad hoc query workloads, you might allocate Impala 60%, YARN 30%, and + HDFS 10%.</p> + </li> + </ul></li> + </ul> + </p> + </conbody> + </concept> + <concept id="enable_admission_control"> + <title>Using Admission Control</title> + <conbody> + <p>Within the constraints of the static service pool, you can further subdivide Impala's + resources using Admission Control. You configure Impala Admission Control pools in the Cloudera + Manager Dynamic Resource Pools page.</p> + <p>You use Admission Control to divide usage between Dynamic Resource Pools in multitenant use + cases. Allocating resources judiciously allows your most important queries to run faster and + more reliably.</p> + <p> + <note>In this context, Impala Dynamic Resource Pools are different than the default YARN Dynamic + Resource Pools. 
You can turn on Dynamic Resource Pools that are exclusively for use by
+ Impala.</note>
+ </p>
+ <p>Admission Control is enabled by default.</p>
+ <p>A Dynamic Resource Pool has the following properties:<ul id="ul_blk_jjg_sw">
+ <li><b>Max Running Queries</b>: Maximum number of concurrently executing queries in the pool
+ before incoming queries are queued.</li>
+ <li><b>Max Memory Resources</b>: Maximum memory used by queries in the pool before incoming
+ queries are queued. This value is used at the time of admission and is not enforced at query
+ runtime.</li>
+ <li><b>Default Query Memory Limit</b>: Defines the maximum amount of memory a query can
+ allocate on each node. This is enforced at runtime. If the query attempts to use more memory,
+ it is forced to spill, if possible. Otherwise, it is cancelled. The total memory that can be
+ used by a query is the <codeph>MEM_LIMIT</codeph> times the number of nodes.</li>
+ <li><b>Max Queued Queries</b>: Maximum number of queries that can be queued in the pool before
+ additional queries are rejected.</li>
+ <li><b>Queue Timeout</b>: Specifies how long queries can wait in the queue before they are
+ cancelled with a timeout error.</li>
+ </ul></p>
+ </conbody>
+ </concept>
+ <concept id="set_per_query_memory_limits">
+ <title>Setting Per-query Memory Limits</title>
+ <conbody>
+ <p>Use per-query memory limits to prevent queries from consuming excessive memory resources that
+ impact other queries. <ph rev="upstream">Cloudera</ph> recommends that you set the query memory limits whenever
+ possible.</p>
+ <p>If you set the <b>Pool Max Mem Resources</b> for a resource pool, Impala attempts to throttle
+ queries if there is not enough memory to run them within the specified resources.</p>
+ <p>Only use admission control with maximum memory resources if you can ensure there are query
+ memory limits. Set the pool <b>Default Query Memory Limit</b> to be certain.
You can override
+ this setting with the <codeph>MEM_LIMIT</codeph> query option, if necessary.</p>
+ <p>Typically, you set query memory limits using the <codeph>set MEM_LIMIT=Xg;</codeph> query
+ option. When you find the right value for your business case, memory-based admission control
+ works well. The potential downside is that queries that attempt to use more memory might perform
+ poorly or even be cancelled.</p>
+ <p>To find a reasonable default query memory limit:<ol id="ol_ydt_xhy_pw">
+ <li>Run the workload.</li>
+ <li>In Cloudera Manager, go to <menucascade>
+ <uicontrol>Impala</uicontrol>
+ <uicontrol>Queries</uicontrol>
+ </menucascade>.</li>
+ <li>Click <uicontrol>Select Attributes</uicontrol>.</li>
+ <li>Select <uicontrol>Per Node Peak Memory Usage</uicontrol> and click
+ <uicontrol>Update</uicontrol>.</li>
+ <li>Allow the system time to gather information, then click the <uicontrol>Show
+ Histogram</uicontrol> icon to see the results.<image placement="break"
+ href="../images/howto_show_histogram.png" align="center" id="image_hmv_xky_pw"/></li>
+ <li>Use the histogram to find a value that accounts for most queries. Queries that require more
+ resources than this limit should explicitly set the memory limit to ensure they can run to
+ completion.<draft-comment author="ddawson">This chart uses bad sample data - we will change
+ the chart when we have real numbers from the sample use case.</draft-comment><image
+ placement="break" href="../images/howto_per_node_peak_memory_usage.png" align="center"
+ id="image_ehn_hly_pw" scalefit="yes"/></li>
+ </ol></p>
+ </conbody>
+ </concept>
+ <concept id="concept_en4_3sy_pw">
+ <title>Creating Dynamic Resource Pools</title>
+ <conbody>
+ <p>A dynamic resource pool is a named configuration of resources and a policy for scheduling the
+ resources among Impala queries running in the pool.
Dynamic resource pools allow you to schedule + and allocate resources to Impala queries based on a user's access to specific pools and the + resources available to those pools. </p> + <p>This example creates both production and development resource pools or queues. It assumes you + have 3 worker nodes with 24GiB of RAM each for an aggregate memory of 72000MiB. This pool + configuration allocates the Production queue twice the memory resources of the Development + queue, and a higher number of concurrent queries.</p> + <p>To create a Production dynamic resource pool for Impala:</p> + <ol> + <li>In Cloudera Manager, select <menucascade> + <uicontrol>Clusters</uicontrol> + <uicontrol>Dynamic Resource Pool Configuration</uicontrol> + </menucascade>.</li> + <li>Click the <uicontrol>Impala Admission Control</uicontrol> tab.</li> + <li>Click <b>Create Resource Pool</b>.</li> + <li>Specify a name and resource limits for the Production pool:<ul id="ul_rjt_wqv_2v"> + <li>In the <b>Resource Pool Name</b> field, enter <userinput>Production</userinput>.</li> + <li>In the <uicontrol>Max Memory</uicontrol> field, enter <userinput>48000</userinput>.</li> + <li>In the <uicontrol>Default Query Memory Limit</uicontrol> field, enter + <userinput>1600</userinput>.</li> + <li>In the <uicontrol>Max Running Queries</uicontrol> field, enter + <userinput>10</userinput>.</li> + <li>In the <uicontrol>Max Queued Queries</uicontrol> field, enter + <userinput>200</userinput>.</li> + </ul></li> + <li>Click <uicontrol>Create</uicontrol>.</li> + <li>Click <uicontrol>Refresh Dynamic Resource Pools</uicontrol>.</li> + </ol> + <p>The Production queue runs up to 10 queries at once. If the total memory requested + by these queries exceeds 48000 MiB, it holds the next query in the queue until the memory is + released. It also prevents a query from running if it needs more memory than is currently + available. 
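</p>
+ <p>Assuming the 3 worker nodes from this example, you can derive the effective concurrency of
+ the Production queue from the memory settings alone:</p>
+<codeblock>per-query reservation = Default Query Memory Limit x number of nodes
+                      = 1600 MiB x 3 = 4800 MiB
+effective concurrency = Max Memory / per-query reservation
+                      = 48000 MiB / 4800 MiB = 10 queries</codeblock>
+ <p>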
Admission Control holds the next query if either Max Running Queries is reached, or + the pool Max Memory limit is reached.</p> + <p>Here, Max Memory resources and Default Query Memory Limit throttle throughput to 10 queries, + so setting Max Running Queries might not be necessary, though it does not hurt to do so. Most + users set Max Running Queries when they cannot pick good numbers for memory. Since users can + override the query option <varname>mem_limit</varname>, setting the Max Running Queries property + might make sense.</p> + <p>To create a Development dynamic resource pool for Impala:</p> + + <ol> + <li>In Cloudera Manager, select <menucascade> + <uicontrol>Clusters</uicontrol> + <uicontrol>Dynamic Resource Pool Configuration</uicontrol> + </menucascade>.</li> + <li>Click the <uicontrol>Impala Admission Control</uicontrol> tab.</li> + <li>Click <b>Create Resource Pool</b>.</li> + <li>Specify a name and resource limits for the Development pool:<ul id="ul_j42_q3z_pw"> + <li>In the <b>Resource Pool Name</b> field, enter <userinput>Development</userinput>.</li> + <li>In the <uicontrol>Max Memory</uicontrol> field, enter <userinput>24000</userinput>.</li> + <li>In the <uicontrol>Default Query Memory Limit</uicontrol> field, enter + <userinput>8000</userinput>.</li> + <li>In the <uicontrol>Max Running Queries</uicontrol> field, enter 1.</li> + <li>In the <uicontrol>Max Queued Queries</uicontrol> field, enter 100.</li> + </ul></li> + <li>Click <uicontrol>Create</uicontrol>.</li> + <li>Click <uicontrol>Refresh Dynamic Resource Pools</uicontrol>.<p>The Development queue runs + one query at a time. If the total memory required by the query exceeds 24000 MiB, it holds the + query until memory is released.</p></li> + </ol> + </conbody> + <concept id="setting_placement_rules"> + <title>Understanding Placement Rules</title> + <conbody> + <p>Placement rules determine how queries are mapped to resource pools. 
The standard setting is
+ to run a query in the pool it specifies, if any; otherwise, the query runs in the default pool.</p>
+ <p>For example, you can use the SET statement to select the pool in which to run a
+ query.<codeblock>SET REQUEST_POOL=Production;</codeblock></p>
+ <p>If you do not use a <codeph>SET</codeph> statement, queries are run in the default pool.</p>
+ </conbody>
+ </concept>
+ <concept id="setting_access_control_on_pools">
+ <title>Setting Access Control on Pools</title>
+ <conbody>
+ <p>You can specify that only certain users and groups are allowed to use the pools you define.</p>
+ <p>To set access control on the Production pool:</p>
+ <ol>
+ <li>In Cloudera Manager, select <menucascade>
+ <uicontrol>Clusters</uicontrol>
+ <uicontrol>Dynamic Resource Pool Configuration</uicontrol>
+ </menucascade>.</li>
+ <li>Click the <uicontrol>Impala Admission Control</uicontrol> tab.</li>
+ <li>Click the <uicontrol>Edit</uicontrol> button for the Production pool.</li>
+ <li>Click the <uicontrol>Submission Access Control</uicontrol> tab.</li>
+ <li>Select <uicontrol>Allow these users and groups to submit to this pool</uicontrol>.</li>
+ <li>Enter a comma-separated list of users who can use the pool.
+ <image placement="break"
+ href="../images/howto_access_control.png" align="center" scalefit="yes"/>
+ </li>
+
+ <li>Click <uicontrol>Save</uicontrol>.</li>
+ </ol>
+ </conbody>
+ </concept>
+ </concept>
+<concept id="impala_resource_management_example">
+ <title>Impala Resource Management Example</title>
+ <conbody>
+ <p>Anne Chang is the administrator for an enterprise data hub that runs a number of workloads,
+ including Impala. </p>
+ <p>Anne has a 20-node cluster that uses Cloudera Manager static partitioning. Because of the
+ heavy Impala workload, Anne needs to make sure Impala gets enough resources. While the best
+ configuration values might not be known in advance, she decides to start by allocating 50% of
+ resources to Impala. Each node has 128 GiB dedicated to its impalad daemon.
Impala has 2560 GiB in
+      aggregate that can be shared across the resource pools she creates.</p>
+    <p>Next, Anne studies the workload in more detail. After some research, she might choose to
+      revisit these initial values for static partitioning.</p>
+    <p>To figure out how to further allocate Impala's resources, Anne needs to consider the workloads
+      and users, and determine their requirements. There are a few main sources of Impala queries: <ul id="ul_ml3_sf2_5w">
+        <li>Large reporting queries executed by an external process/tool. These are critical business
+          intelligence queries that are important for business decisions. It is important that they get
+          the resources they need to run. There typically are not many of these queries at a given
+          time.</li>
+        <li>Frequent, small queries generated by a web UI. These queries scan a limited amount of data
+          and do not require expensive joins or aggregations. These queries are important, but not as
+          critical; if one fails, the client can resend the query or the end user can refresh the
+          page.</li>
+        <li>Occasionally, expert users might run ad hoc queries. The queries can vary significantly in
+          their resource requirements. While Anne wants a good experience for these users, it is hard to
+          control what they do (for example, submitting inefficient or incorrect queries by mistake).
+          Anne restricts these queries by default and tells users to reach out to her if they need more
+          resources.</li>
+      </ul></p>
+    <p>To set up admission control for this workload, Anne first runs the workloads independently, so
+      that she can observe each workload's resource usage in Cloudera Manager. If the workloads cannot
+      easily be run manually but have run in the past, Anne uses the history information from Cloudera
+      Manager. It can be helpful to use other search criteria (for example, <i>user</i>) to isolate
+      queries by workload. 
Anne uses the Cloudera Manager chart for Per-Node Peak Memory usage to
+      identify the maximum memory requirements for the queries.</p>
+    <p>From this data, Anne observes the following about the queries in the groups above:<ul id="ul_amq_ng2_5w">
+        <li>Large reporting queries use up to 32 GiB per node. There are typically 1 or 2 queries
+          running at a time. On one occasion, she observed that 3 of these queries were running
+          concurrently. Queries can take 3 minutes to complete.</li>
+        <li>Web UI-generated queries use from 100 MiB to usually less than 4 GiB of memory per node,
+          but occasionally as much as 10 GiB per node. Queries take, on average, 5 seconds,
+          and there can be as many as 140 incoming queries per minute.</li>
+        <li>Anne has little data on ad hoc queries, but some are trivial (approximately 100 MiB per
+          node), others join several tables (requiring a few GiB per node), and one user submitted a
+          huge cross join of all tables that used all system resources (that was likely a mistake).</li>
+      </ul></p>
+    <p>Based on these observations, Anne creates the admission control configuration with the
+      following pools: </p>
+    <section id="section_yjc_h32_5w">
+      <title>XL_Reporting</title>
+      <p>
+        <table frame="all" rowsep="1" colsep="1" id="XL_Reporting_Table">
+          <tgroup cols="2">
+            <colspec colname="c1" colnum="1" colwidth="1.0*"/>
+            <colspec colname="c2" colnum="2" colwidth="1.0*"/>
+            <thead>
+              <row>
+                <entry>Property</entry>
+                <entry>Value</entry>
+              </row>
+            </thead>
+            <tbody>
+              <row>
+                <entry>Max Memory</entry>
+                <entry>1280 GiB</entry>
+              </row>
+              <row>
+                <entry>Default Query Memory Limit</entry>
+                <entry>32 GiB</entry>
+              </row>
+              <row>
+                <entry>Max Running Queries</entry>
+                <entry>2</entry>
+              </row>
+              <row>
+                <entry>Queue Timeout</entry>
+                <entry>5 minutes</entry>
+              </row>
+            </tbody>
+          </tgroup>
+        </table>
+      </p>
+      <p>This pool is for large reporting queries. 
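The Max Memory value in the table above follows from the observed workload; a rough sketch of the arithmetic, assuming the 2 concurrent queries, 32 GiB per-node limit, and 20 nodes described in this example:

```python
# Sketch of how the XL_Reporting pool's Max Memory is sized:
# concurrency * per-node query memory limit * node count.
# Assumptions: 2 concurrent queries, a 32 GiB per-node query memory
# limit, and a 20-node cluster (all taken from this example).
max_running_queries = 2
query_mem_limit_gib = 32
nodes = 20

pool_max_memory_gib = max_running_queries * query_mem_limit_gib * nodes
print(pool_max_memory_gib)  # 1280
```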
To support running 2 queries at a time,
+        the pool memory resources are set to 1280 GiB (aggregate cluster memory). This is for 2
+        queries, each with 32 GiB per node, across 20 nodes. Anne sets the pool's Default Query Memory
+        Limit to 32 GiB so that no query uses more than 32 GiB on any given node. She sets Max Running
+        Queries to 2 (though it is not strictly necessary to do so). She increases the pool's queue timeout to
+        5 minutes in case a third query comes in and has to wait. She does not expect more than 3
+        concurrent queries, and she does not want them to wait that long anyway, so she does not increase
+        the queue timeout any further. If the workload increases in the future, she might choose to adjust
+        the configuration or buy more hardware.</p>
+    </section>
+    <section id="section_xm3_j32_5w"><title>HighThroughput_UI</title>
+      <p>
+        <table frame="all" rowsep="1" colsep="1" id="High_Throughput_UI_Table">
+          <tgroup cols="2">
+            <colspec colname="c1" colnum="1" colwidth="1.0*"/>
+            <colspec colname="c2" colnum="2" colwidth="1.0*"/>
+            <thead>
+              <row>
+                <entry>Property</entry>
+                <entry>Value</entry>
+              </row>
+            </thead>
+            <tbody>
+              <row>
+                <entry>Max Memory</entry>
+                <entry>960 GiB (inferred)</entry>
+              </row>
+              <row>
+                <entry>Default Query Memory Limit</entry>
+                <entry>4 GiB</entry>
+              </row>
+              <row>
+                <entry>Max Running Queries</entry>
+                <entry>12</entry>
+              </row>
+              <row>
+                <entry>Queue Timeout</entry>
+                <entry>5 minutes</entry>
+              </row>
+            </tbody>
+          </tgroup>
+        </table>
+      </p>
+      <p>This pool is used for the small, high-throughput queries generated by the web tool. Anne sets
+        the Default Query Memory Limit to 4 GiB per node, and sets Max Running Queries to 12. This
+        implies a maximum amount of memory per node used by the queries in this pool: 48 GiB per node
+        (12 queries * 4 GiB per node memory limit).</p><p>Notice that Anne does not set the pool memory resources, but does set the pool's Default Query
+        Memory Limit. 
This is intentional: admission control processes queries faster when a pool uses
+        the Max Running Queries limit instead of the peak memory resources.</p><p>This should be enough memory for most queries, since only a few go over 4 GiB per node. Queries
+        that do require more can probably still complete with less memory by spilling to disk if
+        necessary. If, on occasion, a query cannot run with this much memory and it fails, Anne might
+        reconsider this configuration later, or perhaps she does not need to worry about a few rare
+        failures from this web UI.</p><p>With regard to throughput, since these queries take around 5 seconds and she is allowing 12
+        concurrent queries, the pool should be able to handle approximately 144 queries per minute,
+        which covers the expected peak of 140 queries per minute. In case there is a
+        large burst of queries, Anne wants them to queue. The default maximum size of the queue is
+        already 200, which should be more than large enough. Anne does not need to change it.</p></section>
+    <section id="section_asm_yj2_5w"><title>Default</title>
+      <p>
+        <table frame="all" rowsep="1" colsep="1" id="default_table">
+          <tgroup cols="2">
+            <colspec colname="c1" colnum="1" colwidth="1.0*"/>
+            <colspec colname="c2" colnum="2" colwidth="1.0*"/>
+            <thead>
+              <row>
+                <entry>Property</entry>
+                <entry>Value</entry>
+              </row>
+            </thead>
+            <tbody>
+              <row>
+                <entry>Max Memory</entry>
+                <entry>320 GiB</entry>
+              </row>
+              <row>
+                <entry>Default Query Memory Limit</entry>
+                <entry>4 GiB</entry>
+              </row>
+              <row>
+                <entry>Max Running Queries</entry>
+                <entry>Unlimited</entry>
+              </row>
+              <row>
+                <entry>Queue Timeout</entry>
+                <entry>60 seconds</entry>
+              </row>
+            </tbody>
+          </tgroup>
+        </table>
+      </p><p>The default pool (which already exists) is a catch-all for ad hoc queries. Anne wants to use the
+        remaining memory not used by the first two pools, 16 GiB per node (XL_Reporting uses 64 GiB per
+        node, HighThroughput_UI uses 48 GiB per node). 
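The per-node budget behind these figures can be sketched as follows; a rough check, assuming each pool's peak per-node use is approximated as its Max Running Queries times its Default Query Memory Limit:

```python
# Sketch of the per-node memory budget across the three pools.
# Assumptions: 128 GiB of Impala memory per node; a pool's peak
# per-node use approximated as max_running_queries * per-query limit.
mem_per_node_gib = 128
xl_reporting_gib = 2 * 32        # 64 GiB per node
high_throughput_ui_gib = 12 * 4  # 48 GiB per node

default_pool_gib = mem_per_node_gib - xl_reporting_gib - high_throughput_ui_gib
print(default_pool_gib)       # 16 GiB per node left for the default pool
print(default_pool_gib * 20)  # 320 GiB aggregate, the pool's Max Memory
```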
For the other pools to get the resources they
+        expect, she must still set the Max Memory resources and the Default Query Memory Limit. She
+        sets the Max Memory resources to 320 GiB (16 GiB per node * 20 nodes). She sets the Default Query Memory Limit to
+        4 GiB per node for now. That is somewhat arbitrary, but satisfies some of the ad hoc queries
+        she observed. If someone writes a bad query by mistake, she does not actually want it using all
+        the system resources. If an expert user has a large query to submit, that user can override the
+        Default Query Memory Limit (up to 16 GiB per node, since that is bound by the pool Max Memory
+        resources). If that is still insufficient for this user's workload, the user should work with
+        Anne to adjust the settings and perhaps create a dedicated pool for the workload.</p></section>
+  </conbody>
+</concept>
+</concept>
