http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_scalability.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_scalability.xml b/docs/topics/impala_scalability.xml index 9abb1c5..ab4ef36 100644 --- a/docs/topics/impala_scalability.xml +++ b/docs/topics/impala_scalability.xml @@ -3,7 +3,19 @@ <title>Scalability Considerations for Impala</title> <titlealts audience="PDF"><navtitle>Scalability Considerations</navtitle></titlealts> - + <prolog> + <metadata> + <data name="Category" value="Performance"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Planning"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Memory"/> + <data name="Category" value="Scalability"/> + <!-- Using domain knowledge about Impala, sizing, etc. to decide what to mark as 'Proof of Concept'. --> + <data name="Category" value="Proof of Concept"/> + </metadata> + </prolog> <conbody> @@ -14,8 +26,808 @@ scalability issues, such as a single slow node that causes performance problems for queries. </p> - + <p outputclass="toc inpage"/> + + <p conref="../shared/impala_common.xml#common/cookbook_blurb"/> + + </conbody> + + <concept audience="Cloudera" id="scalability_memory"> + + <title>Overview and Guidelines for Impala Memory Usage</title> + <prolog> + <metadata> + <data name="Category" value="Memory"/> + <data name="Category" value="Concepts"/> + <data name="Category" value="Best Practices"/> + <data name="Category" value="Guidelines"/> + </metadata> + </prolog> + + <conbody> + +<!-- +Outline adapted from Alan Choi's "best practices" and/or "performance cookbook" papers. 
+--> + +<codeblock>Memory Usage – the Basics +* Memory is used by: +* Hash join – RHS tables after decompression, filtering and projection +* Group by – proportional to the #groups +* Parquet writer buffer – 1GB per partition +* IO buffer (shared across queries) +* Metadata cache (no more than 1GB typically) +* Memory held and reused by later query +* Impala releases memory from time to time starting in 1.4. + +Memory Usage – Estimating Memory Usage +* Use Explain Plan +* Requires statistics! Mem estimate without stats is meaningless. +* Reports per-host memory requirement for this cluster size. +* Re-run if you've re-sized the cluster! +[image of explain plan] + +Memory Usage – Estimating Memory Usage +* EXPLAIN's memory estimate issues +* Can be way off – much higher or much lower. +* group by's estimate can be particularly off – when there's a large number of group by columns. +* Mem estimate = NDV of group by column 1 * NDV of group by column 2 * ... NDV of group by column n +* Ignore EXPLAIN's estimate if it's too high! • Do your own estimate for group by +* GROUP BY mem usage = (total number of groups * size of each row) + (total number of groups * size of each row) / num nodes + +Memory Usage – Finding Actual Memory Usage +* Search for "Per Node Peak Memory Usage" in the profile. +This is accurate. Use it for production capacity planning. + +Memory Usage – Actual Memory Usage +* For complex queries, how do I know which part of my query is using too much memory? +* Use the ExecSummary from the query profile! +- But is that "Peak Mem" number aggregate or per-node? +[image of executive summary] + +Memory Usage – Hitting Mem-limit +* Top causes (in order) of hitting mem-limit even when running a single query: +1. Lack of statistics +2. Lots of joins within a single query +3. Big-table joining big-table +4.
Gigantic group by + +Memory Usage – Hitting Mem-limit +Lack of stats +* Wrong join order, wrong join strategy, wrong insert strategy +* Explain Plan tells you that! +[image of explain plan] +* Fix: Compute Stats table + +Memory Usage – Hitting Mem-limit +Lots of joins within a single query +* select...from fact, dim1, dim2,dim3,...dimN where ... +* Each dim tbl can fit in memory, but not all of them together +* As of Impala 1.4, Impala might choose the wrong plan – BROADCAST +FIX 1: use shuffle hint +select ... from fact join [shuffle] dim1 on ... join [shuffle] dim2 ... +FIX 2: pre-join the dim tables (if possible), for example: +create table dims as select ... from dim1 join dim2 on ...; +select ... from fact join [shuffle] dims on ... +* fewer joins => better perf! + +Memory Usage: Hitting Mem-limit +Big-table joining big-table +* Big-table (after decompression, filtering, and projection) is a table that is bigger than total cluster memory size. +* Impala 2.0 will do this (via disk-based join). Consider using Hive for now. +* (Advanced) For a simple query, you can try this advanced workaround – per-partition join +* Requires the partition key to be part of the join key +select ... from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (1,2,3) + union all +select ... from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (4,5,6) + +Memory Usage: Hitting Mem-limit +Gigantic group by +* The total number of distinct groups is huge, such as group by userid. +* Impala 2.0 will do this (via disk-based agg). Consider using Hive for now. +- Is this one of the cases where people were unhappy we recommended Hive? +* (Advanced) For a simple query, you can try this advanced workaround – per-partition agg +* Requires the partition key to be part of the group by +select part_key, col1, col2, ... agg(..) from tbl where + part_key in (1,2,3) + group by part_key, col1, col2, ... +union all +select part_key, col1, col2, ... agg(..) from tbl where + part_key in (4,5,6) + group by part_key, col1, col2, ...
+ +Memory Usage: Additional Notes +* Use explain plan for estimate; use profile for accurate measure +* Data skew can cause uneven memory usage +* Review previous common issues on out-of-memory +* Note: Even with disk-based joins, you'll want to review these steps to speed up queries and use memory more efficiently +</codeblock> + </conbody> + </concept> + + <concept id="scalability_catalog"> + + <title>Impact of Many Tables or Partitions on Impala Catalog Performance and Memory Usage</title> + + <conbody> + + <p audience="Cloudera"> + Details to fill in in future: Impact of <q>load catalog in background</q> option. + Changing timeouts. Related Cloudera Manager settings. + </p> + + <p> + Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized for tables + containing relatively few, large data files. Schemas containing thousands of tables, or tables containing + thousands of partitions, can encounter performance issues during startup or during DDL operations such as + <codeph>ALTER TABLE</codeph> statements. + </p> + + <note type="important" rev="TSB-168"> + <p> + Because of a change in the default heap size for the <cmdname>catalogd</cmdname> daemon in + CDH 5.7 / Impala 2.5 and higher, the following procedure to increase the <cmdname>catalogd</cmdname> + memory limit might be required following an upgrade to CDH 5.7 / Impala 2.5, even if not + needed previously. + </p> + </note> + + <p conref="../shared/impala_common.xml#common/increase_catalogd_heap_size"/> + + </conbody> + </concept> + + <concept rev="2.1.0" id="statestore_scalability"> + + <title>Scalability Considerations for the Impala Statestore</title> + + <conbody> + + <p> + Before CDH 5.3, the statestore sent only one kind of message to its subscribers. This message contained all + updates for any topics that a subscriber had subscribed to.
It also served to let subscribers know that the + statestore had not failed, and conversely the statestore used the success of sending a heartbeat to a + subscriber to decide whether or not the subscriber had failed. + </p> + + <p> + Combining topic updates and failure detection in a single message led to bottlenecks in clusters with large + numbers of tables, partitions, and HDFS data blocks. When the statestore was overloaded with metadata + updates to transmit, heartbeat messages were sent less frequently, sometimes causing subscribers to time + out their connection with the statestore. Increasing the subscriber timeout and decreasing the frequency of + statestore heartbeats worked around the problem, but reduced responsiveness when the statestore failed or + restarted. + </p> + + <p> + As of CDH 5.3, the statestore now sends topic updates and heartbeats in separate messages. This allows the + statestore to send and receive a steady stream of lightweight heartbeats, and removes the requirement to + send topic updates according to a fixed schedule, reducing statestore network overhead. + </p> + + <p> + The statestore now has the following relevant configuration flags for the <cmdname>statestored</cmdname> + daemon: + </p> + + <dl> + <dlentry id="statestore_num_update_threads"> + + <dt> + <codeph>-statestore_num_update_threads</codeph> + </dt> + + <dd> + The number of threads inside the statestore dedicated to sending topic updates. You should not + typically need to change this value. + <p> + <b>Default:</b> 10 + </p> + </dd> + + </dlentry> + + <dlentry id="statestore_update_frequency_ms"> + + <dt> + <codeph>-statestore_update_frequency_ms</codeph> + </dt> + + <dd> + The frequency, in milliseconds, with which the statestore tries to send topic updates to each + subscriber. This is a best-effort value; if the statestore is unable to meet this frequency, it sends + topic updates as fast as it can. You should not typically need to change this value. 
+ <p> + <b>Default:</b> 2000 + </p> + </dd> + + </dlentry> + + <dlentry id="statestore_num_heartbeat_threads"> + + <dt> + <codeph>-statestore_num_heartbeat_threads</codeph> + </dt> + + <dd> + The number of threads inside the statestore dedicated to sending heartbeats. You should not typically + need to change this value. + <p> + <b>Default:</b> 10 + </p> + </dd> + + </dlentry> + + <dlentry id="statestore_heartbeat_frequency_ms"> + + <dt> + <codeph>-statestore_heartbeat_frequency_ms</codeph> + </dt> + + <dd> + The frequency, in milliseconds, with which the statestore tries to send heartbeats to each subscriber. + This value should be good for large catalogs and clusters up to approximately 150 nodes. Beyond that, + you might need to increase this value to lengthen the interval between heartbeat messages. + <p> + <b>Default:</b> 1000 (one heartbeat message every second) + </p> + </dd> + + </dlentry> + </dl> + + <p> + As of CDH 5.3, not all of these flags are present in the Cloudera Manager user interface. Some must be set + using the <uicontrol>Advanced Configuration Snippet</uicontrol> fields for the statestore component. + </p> + + <p> + If it takes a very long time for a cluster to start up, and <cmdname>impala-shell</cmdname> consistently + displays <codeph>This Impala daemon is not ready to accept user requests</codeph>, the statestore might be + taking too long to send the entire catalog topic to the cluster. In this case, consider adding + <codeph>--load_catalog_in_background=false</codeph> to your catalog service configuration. This setting + stops the catalog service from loading the entire catalog into memory at cluster startup. Instead, metadata for + each table is loaded when the table is accessed for the first time.
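As a rough illustration of the on-first-access loading that <codeph>--load_catalog_in_background=false</codeph> selects, here is a toy sketch (the <codeph>LazyCatalog</codeph> class and <codeph>loader</codeph> callable are invented for illustration only; they are not Impala APIs):

```python
class LazyCatalog:
    """Toy model of load_catalog_in_background=false: metadata is
    fetched the first time a table is accessed, then cached."""

    def __init__(self, loader):
        self._loader = loader      # callable that fetches metadata for one table
        self._cache = {}           # table name -> metadata
        self.loads = 0             # count of (slow) metastore round trips

    def table_metadata(self, name):
        if name not in self._cache:    # first access triggers the load
            self._cache[name] = self._loader(name)
            self.loads += 1
        return self._cache[name]

# With background loading disabled, startup does no metadata work; only
# the two distinct tables actually touched below are ever loaded.
catalog = LazyCatalog(loader=lambda t: {"name": t, "partitions": []})
catalog.table_metadata("sales")
catalog.table_metadata("sales")      # cache hit, no extra load
catalog.table_metadata("customers")
```

The tradeoff this models: startup is fast, but the first query against each table pays the loading cost.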
+ </p> + </conbody> + </concept> + + <concept audience="Cloudera" id="scalability_cluster_size"> + + <title>Scalability Considerations for Impala Cluster Size and Topology</title> + + <conbody> + + <p> + </p> + </conbody> + </concept> + + <concept audience="Cloudera" id="concurrent_connections"> + + <title>Scaling the Number of Concurrent Connections</title> + + <conbody> + + <p></p> + </conbody> + </concept> + + <concept rev="2.0.0" id="spill_to_disk"> + + <title>SQL Operations that Spill to Disk</title> + + <conbody> + + <p> + Certain memory-intensive operations write temporary data to disk (known as <term>spilling</term> to disk) + when Impala is close to exceeding its memory limit on a particular host. + </p> + + <p> + The result is a query that completes successfully, rather than failing with an out-of-memory error. The + tradeoff is decreased performance due to the extra disk I/O to write the temporary data and read it back + in. The slowdown could potentially be significant. Thus, while this feature improves reliability, + you should optimize your queries, system parameters, and hardware configuration to make this spilling a rare occurrence. + </p> + + <p> + <b>What kinds of queries might spill to disk:</b> + </p> + + <p> + Several SQL clauses and constructs require memory allocations that could activate the spilling mechanism: + </p> + <ul> + <li> + <p> + When a query uses a <codeph>GROUP BY</codeph> clause for columns + with millions or billions of distinct values, Impala keeps a + similar number of temporary results in memory, to accumulate the + aggregate results for each value in the group. + </p> + </li> + <li> + <p> + When large tables are joined together, Impala keeps the values of + the join columns from one table in memory, to compare them to + incoming values from the other table. + </p> + </li> + <li> + <p> + When a large result set is sorted by the <codeph>ORDER BY</codeph> + clause, each node sorts its portion of the result set in memory.
+ </p> + </li> + <li> + <p> + The <codeph>DISTINCT</codeph> and <codeph>UNION</codeph> operators + build in-memory data structures to represent all values found so + far, to eliminate duplicates as the query progresses. + </p> + </li> + <!-- JIRA still in open state as of 5.8 / 2.6, commenting out. + <li> + <p rev="IMPALA-3471"> + In CDH 5.8 / Impala 2.6 and higher, <term>top-N</term> queries (those with + <codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph> clauses) can also spill. + Impala allocates enough memory to hold as many rows as specified by the <codeph>LIMIT</codeph> + clause, plus enough memory to hold as many rows as specified by any <codeph>OFFSET</codeph> clause. + </p> + </li> + --> + </ul> + + <p conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/> + + <p> + <b>How Impala handles scratch disk space for spilling:</b> + </p> + + <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_scratch_dir"/> + + <p> + <b>Memory usage for SQL operators:</b> + </p> + + <p> + The infrastructure of the spilling feature affects the way the affected SQL operators, such as + <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory. + On each host that participates in the query, each such operator in a query accumulates memory + while building the data structure to process the aggregation or join operation. The amount + of memory used depends on the portion of the data being handled by that host, and thus might + be different from one host to another. When the amount of memory being used for the operator + on a particular host reaches a threshold amount, Impala reserves an additional memory buffer + to use as a work area in case that operator causes the query to exceed the memory limit for + that host. 
After allocating the memory buffer, the memory used by that operator remains + essentially stable or grows only slowly, until the point where the memory limit is reached + and the query begins writing temporary data to disk. + </p> + + <p rev="2.2.0"> + Prior to Impala 2.2 (CDH 5.4), the extra memory buffer for an operator that might spill to disk + was allocated when the data structure used by the applicable SQL operator reached 16 MB in size, + and the memory buffer itself was 512 MB. In Impala 2.2, these values are halved: the threshold value + is 8 MB and the memory buffer is 256 MB. <ph rev="2.3.0">In Impala 2.3 / CDH 5.5 and higher, the memory for the buffer + is allocated in pieces, only as needed, to avoid sudden large jumps in memory usage.</ph> A query that uses + multiple such operators might allocate multiple such memory buffers, as the size of the data structure + for each operator crosses the threshold on a particular host. + </p> + + <p> + Therefore, a query that processes a relatively small amount of data on each host would likely + never reach the threshold for any operator, and would never allocate any extra memory buffers. A query + that did process millions of groups, distinct values, join keys, and so on might cross the threshold, + causing its memory requirement to rise suddenly and then flatten out. The larger the cluster, the less data is processed + on any particular host, thus reducing the chance of requiring the extra memory allocation. + </p> + + <p> + <b>Added in:</b> This feature was added to the <codeph>ORDER BY</codeph> clause in Impala 1.4 for CDH 4, + and in CDH 5.1. This feature was extended to cover join queries, aggregation functions, and analytic + functions in Impala 2.0 for CDH 4, and in CDH 5.2. The size of the memory work area required by + each operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2 (CDH 5.4).
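The threshold-and-buffer behavior described in this section can be modeled with a short sketch (illustrative Python; the function name and version tuple are our own conventions, not Impala code, and the model covers the pre-2.3 all-at-once reservation):

```python
MB = 1 << 20

def reserved_spill_buffer(operator_state_bytes, impala_version=(2, 2)):
    """Model of the pre-Impala-2.3 rule: once an operator's in-memory
    data structure crosses a threshold, a fixed work-area buffer is
    reserved (16 MB / 512 MB before 2.2; 8 MB / 256 MB from 2.2)."""
    if impala_version < (2, 2):
        threshold, work_area = 16 * MB, 512 * MB
    else:
        threshold, work_area = 8 * MB, 256 * MB
    return work_area if operator_state_bytes >= threshold else 0

# One host running a query with three spilling-capable operators: two
# have crossed the 8 MB threshold, so two 256 MB work areas are reserved.
states = [10 * MB, 9 * MB, 1 * MB]
total = sum(reserved_spill_buffer(s) for s in states)
```

This is why memory requirements can "rise suddenly and then flatten out": each operator contributes a step-function jump, not gradual growth.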
+ </p> + + <p> + <b>Avoiding queries that spill to disk:</b> + </p> + + <p> + Because the extra I/O can impose significant performance overhead on these types of queries, try to avoid + this situation by using the following steps: + </p> + + <ol> + <li> + Detect how often queries spill to disk, and how much temporary data is written. Refer to the following + sources: + <ul> + <li> + The output of the <codeph>PROFILE</codeph> command in the <cmdname>impala-shell</cmdname> + interpreter. This data shows the memory usage for each host and in total across the cluster. The + <codeph>BlockMgr.BytesWritten</codeph> counter reports how much data was written to disk during the + query. + </li> + + <li> + The <uicontrol>Impala Queries</uicontrol> dialog in Cloudera Manager. You can see the peak memory + usage for a query, combined across all nodes in the cluster. + </li> + + <li> + The <uicontrol>Queries</uicontrol> tab in the Impala debug web user interface. Select the query to + examine and click the corresponding <uicontrol>Profile</uicontrol> link. This data breaks down the + memory usage for a single host within the cluster, the host whose web interface you are connected to. + </li> + </ul> + </li> + + <li> + Use one or more techniques to reduce the possibility of the queries spilling to disk: + <ul> + <li> + Increase the Impala memory limit if practical, for example, if you can increase the available memory + by more than the amount of temporary data written to disk on a particular node. Remember that in + Impala 2.0 and later, you can issue <codeph>SET MEM_LIMIT</codeph> as a SQL statement, which lets you + fine-tune the memory usage for queries from JDBC and ODBC applications. + </li> + + <li> + Increase the number of nodes in the cluster, to increase the aggregate memory available to Impala and + reduce the amount of memory required on each node. + </li> + + <li> + Increase the overall memory capacity of each DataNode at the hardware level. 
+ </li> + + <li> + On a cluster with resources shared between Impala and other Hadoop components, use resource + management features to allocate more memory for Impala. See + <xref href="impala_resource_management.xml#resource_management"/> for details. + </li> + + <li> + If the memory pressure is due to running many concurrent queries rather than a few memory-intensive + ones, consider using the Impala admission control feature to lower the limit on the number of + concurrent queries. By spacing out the most resource-intensive queries, you can avoid spikes in + memory usage and improve overall response times. See + <xref href="impala_admission.xml#admission_control"/> for details. + </li> + + <li> + Tune the queries with the highest memory requirements, using one or more of the following techniques: + <ul> + <li> + Run the <codeph>COMPUTE STATS</codeph> statement for all tables involved in large-scale joins and + aggregation queries. + </li> + + <li> + Minimize your use of <codeph>STRING</codeph> columns in join columns. Prefer numeric values + instead. + </li> + + <li> + Examine the <codeph>EXPLAIN</codeph> plan to understand the execution strategy being used for the + most resource-intensive queries. See <xref href="impala_explain_plan.xml#perf_explain"/> for + details. + </li> + + <li> + If Impala still chooses a suboptimal execution strategy even with statistics available, or if it + is impractical to keep the statistics up to date for huge or rapidly changing tables, add hints + to the most resource-intensive queries to select the right execution strategy. See + <xref href="impala_hints.xml#hints"/> for details. + </li> + </ul> + </li> + + <li> + If your queries experience substantial performance overhead due to spilling, enable the + <codeph>DISABLE_UNSAFE_SPILLS</codeph> query option. This option prevents queries whose memory usage + is likely to be exorbitant from spilling to disk. 
See + <xref href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/> for details. As you tune + problematic queries using the preceding steps, fewer and fewer will be cancelled by this option + setting. + </li> + </ul> + </li> + </ol> + + <p> + <b>Testing performance implications of spilling to disk:</b> + </p> + + <p> + To artificially provoke spilling, to test this feature and understand the performance implications, use a + test environment with a memory limit of at least 2 GB. Issue the <codeph>SET</codeph> command with no + arguments to check the current setting for the <codeph>MEM_LIMIT</codeph> query option. Set the query + option <codeph>DISABLE_UNSAFE_SPILLS=true</codeph>. This option limits the spill-to-disk feature to prevent + runaway disk usage from queries that are known in advance to be suboptimal. Within + <cmdname>impala-shell</cmdname>, run a query that you expect to be memory-intensive, based on the criteria + explained earlier. A self-join of a large table is a good candidate: + </p> + +<codeblock>select count(*) from big_table a join big_table b using (column_with_many_values); +</codeblock> + + <p> + Issue the <codeph>PROFILE</codeph> command to get a detailed breakdown of the memory usage on each node + during the query. The crucial part of the profile output concerning memory is the <codeph>BlockMgr</codeph> + portion. For example, this profile shows that the query did not quite exceed the memory limit. 
+ </p> + +<codeblock>BlockMgr: + - BlockWritesIssued: 1 + - BlockWritesOutstanding: 0 + - BlocksCreated: 24 + - BlocksRecycled: 1 + - BufferedPins: 0 + - MaxBlockSize: 8.00 MB (8388608) + <b>- MemoryLimit: 200.00 MB (209715200)</b> + <b>- PeakMemoryUsage: 192.22 MB (201555968)</b> + - TotalBufferWaitTime: 0ns + - TotalEncryptionTime: 0ns + - TotalIntegrityCheckTime: 0ns + - TotalReadBlockTime: 0ns +</codeblock> + + <p> + In this case, because the memory limit was already below any recommended value, we increased the volume of + data for the query rather than reducing the memory limit any further. + </p> + + <p> + Set the <codeph>MEM_LIMIT</codeph> query option to a value that is smaller than the peak memory usage + reported in the profile output. Do not specify a memory limit lower than about 300 MB, because with such a + low limit, queries could fail to start for other reasons. Now try the memory-intensive query again. + </p> + + <p> + Check if the query fails with a message like the following: + </p> + +<codeblock>WARNINGS: Spilling has been disabled for plans that do not have stats and are not hinted +to prevent potentially bad plans from using too many cluster resources. Compute stats on +these tables, hint the plan or disable this behavior via query options to enable spilling. +</codeblock> + + <p> + If so, the query could have consumed substantial temporary disk space, slowing down so much that it would + not complete in any reasonable time. Rather than rely on the spill-to-disk feature in this case, issue the + <codeph>COMPUTE STATS</codeph> statement for the table or tables in your sample query. Then run the query + again, check the peak memory usage again in the <codeph>PROFILE</codeph> output, and adjust the memory + limit again if necessary to be lower than the peak memory usage. + </p> + + <p> + At this point, you have a query that is memory-intensive, but Impala can optimize it efficiently so that + the memory usage is not exorbitant.
You have set an artificial constraint through the + <codeph>MEM_LIMIT</codeph> option so that the query would normally fail with an out-of-memory error. But + the automatic spill-to-disk feature means that the query should actually succeed, at the expense of some + extra disk I/O to read and write temporary work data. + </p> + + <p> + Try the query again, and confirm that it succeeds. Examine the <codeph>PROFILE</codeph> output again. This + time, look for lines of this form: + </p> + +<codeblock>- SpilledPartitions: <varname>N</varname> +</codeblock> + + <p> + If you see any such lines with <varname>N</varname> greater than 0, that indicates the query would have + failed in Impala releases prior to 2.0, but now it succeeded because of the spill-to-disk feature. Examine + the total time taken by the <codeph>AGGREGATION_NODE</codeph> or other query fragments containing non-zero + <codeph>SpilledPartitions</codeph> values. Compare the times to similar fragments that did not spill, for + example in the <codeph>PROFILE</codeph> output when the same query is run with a higher memory limit. This + gives you an idea of the performance penalty of the spill operation for a particular query with a + particular memory limit. If you make the memory limit just a little lower than the peak memory usage, the + query only needs to write a small amount of temporary data to disk. The lower you set the memory limit, the + more temporary data is written and the slower the query becomes. + </p> + + <p> + Now repeat this procedure for actual queries used in your environment. Use the + <codeph>DISABLE_UNSAFE_SPILLS</codeph> setting to identify cases where queries used more memory than + necessary due to lack of statistics on the relevant tables and columns, and issue <codeph>COMPUTE + STATS</codeph> where necessary. + </p> + + <p> + <b>When to use DISABLE_UNSAFE_SPILLS:</b> + </p> + + <p> + You might wonder, why not leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned on all the time. 
Whether and + how frequently to use this option depends on your system environment and workload. + </p> + + <p> + <codeph>DISABLE_UNSAFE_SPILLS</codeph> is suitable for an environment with ad hoc queries whose performance + characteristics and memory usage are not known in advance. It prevents <q>worst-case scenario</q> queries + that use large amounts of memory unnecessarily. Thus, you might turn this option on within a session while + developing new SQL code, even though it is turned off for existing applications. + </p> + + <p> + Organizations where table and column statistics are generally up-to-date might leave this option turned on + all the time, again to avoid worst-case scenarios for untested queries or if a problem in the ETL pipeline + results in a table with no statistics. Turning on <codeph>DISABLE_UNSAFE_SPILLS</codeph> lets you <q>fail + fast</q> in this case and immediately gather statistics or tune the problematic queries. + </p> + + <p> + Some organizations might leave this option turned off. For example, you might have tables large enough that + the <codeph>COMPUTE STATS</codeph> takes substantial time to run, making it impractical to re-run after + loading new data. If you have examined the <codeph>EXPLAIN</codeph> plans of your queries and know that + they are operating efficiently, you might leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned off. In that + case, you know that any queries that spill will not go overboard with their memory consumption. + </p> + +<!-- + <p> + <b>Turning off the spill-to-disk feature: (CDH 5.6 / Impala 2.4 and lower only)</b> + </p> + + <p> + Prior to CDH 5.7 / Impala 2.5, certain conditions... + </p> + + <p> + You might turn off the spill-to-disk feature if you are in an environment with constraints on disk space, + or if you prefer for queries that exceed the memory capacity in your cluster to <q>fail fast</q> so that + you can tune and retry them. 
+ </p> + + <p> + To turn off this feature, set the following configuration options for each <cmdname>impalad</cmdname> + daemon, either through the <cmdname>impalad</cmdname> advanced configuration snippet in Cloudera Manager, + or during <cmdname>impalad</cmdname> startup on each DataNode on systems not managed by Cloudera Manager: + </p> + +<codeblock>--enable_partitioned_aggregation=false +--enable_partitioned_hash_join=false +</codeblock> +--> + </conbody> + </concept> + +<concept id="complex_query"> +<title>Limits on Query Size and Complexity</title> +<conbody> +<p> +There are hardcoded limits on the maximum size and complexity of queries. +Currently, the maximum number of expressions in a query is 2000. +You might exceed the limits with large or deeply nested queries +produced by business intelligence tools or other query generators. +</p> +<p> +If you have the ability to customize such queries or the query generation +logic that produces them, replace sequences of repetitive expressions +with single operators such as <codeph>IN</codeph> or <codeph>BETWEEN</codeph> +that can represent multiple values or ranges. +For example, instead of a large number of <codeph>OR</codeph> clauses: +</p> +<codeblock>WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ... +</codeblock> +<p> +use a single <codeph>IN</codeph> clause: +</p> +<codeblock>WHERE val IN (1,2,6,100,...)</codeblock> +</conbody> +</concept> + +<concept id="scalability_io"> +<title>Scalability Considerations for Impala I/O</title> +<conbody> +<p> +Impala parallelizes its I/O operations aggressively, +so the more disks you can attach to each host, the better. +Impala retrieves data from disk so quickly, using +bulk read operations on large blocks, that most queries +are CPU-bound rather than I/O-bound.
+</p> +<p> +Because the kind of sequential scanning typically done by +Impala queries does not benefit much from the random-access +capabilities of SSDs, spinning disks typically provide +the most cost-effective kind of storage for Impala data, +with little or no performance penalty as compared to SSDs. +</p> +<p> +Resource management features such as YARN, Llama, and admission control +typically constrain the amount of memory, CPU, or overall number of +queries in a high-concurrency environment. +Currently, there is no throttling mechanism for Impala I/O. +</p> +</conbody> +</concept> + +<concept id="big_tables"> +<title>Scalability Considerations for Table Layout</title> +<conbody> +<p> +Due to the overhead of retrieving and updating table metadata +in the metastore database, try to limit the number of columns +in a table to a maximum of approximately 2000. +Although Impala can handle wider tables than this, the metastore overhead +can become significant, leading to query performance that is slower +than expected based on the actual data volume. +</p> +<p> +To minimize overhead related to the metastore database and Impala query planning, +try to limit the number of partitions for any partitioned table to a few tens of thousands. +</p> +</conbody> +</concept> + +<concept rev="CDH-38321" id="kerberos_overhead_cluster_size"> +<title>Kerberos-Related Network Overhead for Large Clusters</title> +<conbody> +<p> +When Impala starts up, or after each <codeph>kinit</codeph> refresh, Impala sends a number of +simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might be able to process +all the requests within roughly 5 seconds. For a cluster with 1000 hosts, the time to process +the requests would be roughly 500 seconds. Impala also makes a number of DNS requests at the same +time as these Kerberos-related requests. +</p> +<p> +While these authentication requests are being processed, any submitted Impala queries will fail. 
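Taken together, the quoted figures (about 5 seconds at 100 hosts, about 500 seconds at 1000 hosts) suggest that KDC processing time grows roughly with the square of the cluster size. A toy model of that scaling assumption (the function and its parameters are illustrative only, not a published formula):

```python
def kdc_startup_seconds(num_hosts, base_hosts=100, base_seconds=5.0):
    """Rough model only: extrapolates the quoted 100-host figure
    quadratically, which is what the 100-host vs 1000-host numbers
    in the text imply (10x the hosts -> roughly 100x the time)."""
    return base_seconds * (num_hosts / base_hosts) ** 2

# Matches the figures quoted in the text.
estimate_1000 = kdc_startup_seconds(1000)   # roughly 500 seconds
```

The practical takeaway is that the burst of authentication traffic is a much bigger concern on large clusters than a linear extrapolation would suggest.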
+During this period, the KDC and DNS may be slow to respond to requests from components other than Impala, +so other secure services might be affected temporarily. +</p> +<p> +To reduce the frequency of the <codeph>kinit</codeph> renewal that initiates a new set of +authentication requests, increase the <codeph>kerberos_reinit_interval</codeph> configuration setting +for the <cmdname>impalad</cmdname> daemons. Currently, the default for a cluster not managed by +Cloudera Manager is 60 minutes, while the default under Cloudera Manager is 10 minutes. +Consider using a higher value such as 360 minutes (6 hours). +</p> +</conbody> +</concept> + <concept id="scalability_hotspots" rev="2.5.0 IMPALA-2696"> + <title>Avoiding CPU Hotspots for HDFS Cached Data</title> + <conbody> + <p> + You can use the HDFS caching feature, described in <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>, + with Impala to reduce I/O and memory-to-memory copying for frequently accessed tables or partitions. + </p> + <p> + In the early days of this feature, you might have found that enabling HDFS caching + resulted in little or no performance improvement, because it could result in + <q>hotspots</q>: instead of the I/O to read the table data being parallelized across + the cluster, the I/O was reduced but the CPU load to process the data blocks + might be concentrated on a single host. + </p> + <p> + To avoid hotspots, include the <codeph>WITH REPLICATION</codeph> clause with the + <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements for tables that use HDFS caching. + This clause allows more than one host to cache the relevant data blocks, so the CPU load + can be shared, reducing the load on any one host. + See <xref href="impala_create_table.xml#create_table"/> and <xref href="impala_alter_table.xml#alter_table"/> + for details.
+ </p> + <p> + Hotspots with high CPU load for HDFS cached data could still arise in some cases, due to + the way that Impala schedules the work of processing data blocks on different hosts. + In CDH 5.7 / Impala 2.5 and higher, scheduling improvements mean that the work for + HDFS cached data is divided better among all the hosts that have cached replicas + for a particular data block. When more than one host has a cached replica for a data block, + Impala assigns the work of processing that block to whichever host has done the least work + (in terms of number of bytes read) for the current query. If hotspots persist even with this + load-based scheduling algorithm, you can enable the query option <codeph>SCHEDULE_RANDOM_REPLICA=TRUE</codeph> + to further distribute the CPU load. This setting causes Impala to randomly pick a host to process a cached + data block if the scheduling algorithm encounters a tie when deciding which host has done the + least work. + </p> + </conbody> + </concept> </concept>
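The load-based scheduling and tie-breaking behavior described above can be sketched as follows. This is an illustrative model only, not Impala's actual scheduler; the host names and byte counts are hypothetical.

```python
import random

def pick_host(cached_hosts, bytes_read, schedule_random_replica=False):
    """Choose which host scans a cached block: the replica host that has done
    the least work (fewest bytes read) so far in the current query.
    schedule_random_replica=True models SCHEDULE_RANDOM_REPLICA=TRUE:
    ties are broken randomly instead of always the same way."""
    least = min(bytes_read[h] for h in cached_hosts)
    tied = [h for h in cached_hosts if bytes_read[h] == least]
    if schedule_random_replica:
        return random.choice(tied)
    return tied[0]  # a fixed tie-break can concentrate load on one host

# Hypothetical query state: host_a has already read more bytes than host_b,
# so the next cached block is assigned to host_b.
state = {"host_a": 4096, "host_b": 1024}
next_host = pick_host(["host_a", "host_b"], state)
```

With `schedule_random_replica=False` and all hosts tied at zero, the same host is always picked first, which is the kind of hotspot the `SCHEDULE_RANDOM_REPLICA` query option is meant to avoid.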
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_schema_objects.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_schema_objects.xml b/docs/topics/impala_schema_objects.xml index d8abe12..a9ddfcc 100644 --- a/docs/topics/impala_schema_objects.xml +++ b/docs/topics/impala_schema_objects.xml @@ -3,7 +3,7 @@ <concept id="schema_objects"> <title>Impala Schema Objects and Object Names</title> - <titlealts><navtitle>Schema Objects and Object Names</navtitle></titlealts> + <titlealts audience="PDF"><navtitle>Schema Objects and Object Names</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_security.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_security.xml b/docs/topics/impala_security.xml index 5c98fb1..bf4606d 100644 --- a/docs/topics/impala_security.xml +++ b/docs/topics/impala_security.xml @@ -3,7 +3,19 @@ <concept id="security"> <title><ph audience="standalone">Impala Security</ph><ph audience="integrated">Overview of Impala Security</ph></title> - + <prolog> + <metadata> + <data name="Category" value="Security"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Concepts"/> + <data name="Category" value="Auditing"/> + <data name="Category" value="Governance"/> + <data name="Category" value="Authentication"/> + <data name="Category" value="Authorization"/> + <data name="Category" value="Administrators"/> + </metadata> + </prolog> + <conbody> <p> @@ -16,6 +28,96 @@ reports. The auditing feature was added in Impala 1.1.1. </p> - + <p> + The Impala security features have several objectives. At the most basic level, security prevents + accidents or mistakes that could disrupt application processing, delete or corrupt data, or reveal data to + unauthorized users. 
More advanced security features and practices can harden the system against malicious + users trying to gain unauthorized access or perform other disallowed operations. The auditing feature + provides a way to confirm that no unauthorized access occurred, and detect whether any such attempts were + made. This is a critical set of features for production deployments in large organizations that handle + important or sensitive data. It sets the stage for multi-tenancy, where multiple applications run + concurrently and are prevented from interfering with each other. + </p> + + <p> + The material in this section presumes that you are already familiar with administering secure Linux systems. + That is, you should know the general security practices for Linux and Hadoop, and their associated commands + and configuration files. For example, you should know how to create Linux users and groups, manage Linux + group membership, set Linux and HDFS file permissions and ownership, and designate the default permissions + and ownership for new files. You should be familiar with the configuration of the nodes in your Hadoop + cluster, and know how to apply configuration changes or run a set of commands across all the nodes. + </p> + + <p> + The security features are divided into these broad categories: + </p> + + <dl> + <dlentry> + + <dt> + authorization + </dt> + + <dd> + Which users are allowed to access which resources, and what operations are they allowed to perform? + Impala relies on the open source Sentry project for authorization. By default (when authorization is not + enabled), Impala does all read and write operations with the privileges of the <codeph>impala</codeph> + user, which is suitable for a development/test environment but not for a secure production environment. + When authorization is enabled, Impala uses the OS user ID of the user who runs + <cmdname>impala-shell</cmdname> or other client program, and associates various privileges with each + user. 
See <xref href="impala_authorization.xml#authorization"/> for details about setting up and managing + authorization. + </dd> + + </dlentry> + + <dlentry> + + <dt> + authentication + </dt> + + <dd> + How does Impala verify the identity of the user to confirm that they really are allowed to exercise the + privileges assigned to that user? Impala relies on the Kerberos subsystem for authentication. See + <xref href="impala_kerberos.xml#kerberos"/> for details about setting up and managing authentication. + </dd> + + </dlentry> + + <dlentry> + + <dt> + auditing + </dt> + + <dd> + What operations were attempted, and did they succeed or not? This feature provides a way to look back and + diagnose whether attempts were made to perform unauthorized operations. You use this information to track + down suspicious activity, and to see where changes are needed in authorization policies. The audit data + produced by this feature is collected by the Cloudera Manager product and then presented in a + user-friendly form. See <xref href="impala_auditing.xml#auditing"/> for + details about setting up and managing auditing. 
+ </dd> + + </dlentry> + </dl> + + <p outputclass="toc"/> + + <p audience="integrated"> + These other topics in the <cite>Security Guide</cite> cover how Impala integrates with security frameworks + such as Kerberos, LDAP, and Sentry: + <ul> + <li> + <xref href="impala_authentication.xml#authentication"/> + </li> + + <li> + <xref href="impala_authorization.xml#authorization"/> + </li> + </ul> + </p> </conbody> </concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_select.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_select.xml b/docs/topics/impala_select.xml index db63f71..24ed896 100644 --- a/docs/topics/impala_select.xml +++ b/docs/topics/impala_select.xml @@ -3,7 +3,7 @@ <concept id="select"> <title>SELECT Statement</title> - <titlealts><navtitle>SELECT</navtitle></titlealts> + <titlealts audience="PDF"><navtitle>SELECT</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> @@ -155,6 +155,9 @@ LIMIT <i>expression</i> [OFFSET <i>expression</i>] <p conref="../shared/impala_common.xml#common/security_blurb"/> <p conref="../shared/impala_common.xml#common/redaction_yes"/> + <p conref="../shared/impala_common.xml#common/s3_blurb"/> + <p conref="../shared/impala_common.xml#common/s3_block_splitting"/> + <p conref="../shared/impala_common.xml#common/cancel_blurb_yes"/> <p conref="../shared/impala_common.xml#common/permissions_blurb"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_set.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_set.xml b/docs/topics/impala_set.xml index afa6777..2594af9 100644 --- a/docs/topics/impala_set.xml +++ b/docs/topics/impala_set.xml @@ -3,24 +3,34 @@ <concept rev="2.0.0" id="set"> <title>SET Statement</title> - <titlealts><navtitle>SET</navtitle></titlealts> + <titlealts 
audience="PDF"><navtitle>SET</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> <data name="Category" value="SQL"/> <data name="Category" value="Querying"/> <data name="Category" value="Configuring"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog> <conbody> - <p> + <p rev="2.0.0"> <indexterm audience="Cloudera">SET statement</indexterm> Specifies values for query options that control the runtime behavior of other statements within the same session. </p> + <p rev="2.5.0 IMPALA-2180"> + In CDH 5.7 / Impala 2.5 and higher, <codeph>SET</codeph> also defines user-specified substitution variables for + the <cmdname>impala-shell</cmdname> interpreter. This feature uses the <codeph>SET</codeph> command + built into <cmdname>impala-shell</cmdname> instead of the SQL <codeph>SET</codeph> statement. + Therefore the substitution mechanism only works with queries processed by <cmdname>impala-shell</cmdname>, + not with queries submitted through JDBC or ODBC. + </p> + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> <codeblock>SET [<varname>query_option</varname>=<varname>option_value</varname>] @@ -43,6 +53,98 @@ <xref href="impala_query_options.xml#query_options"/> for the details of each query option. </p> + <p> + <b>User-specified substitution variables:</b> + </p> + + <p rev="2.5.0 IMPALA-2180"> + In CDH 5.7 / Impala 2.5 and higher, you can specify your own names and string substitution values + within the <cmdname>impala-shell</cmdname> interpreter. Once a substitution variable is set up, + its value is inserted into any SQL statement in that same <cmdname>impala-shell</cmdname> session + that contains the notation <codeph>${var:<varname>varname</varname>}</codeph>. 
+ Using <codeph>SET</codeph> in an interactive <cmdname>impala-shell</cmdname> session overrides + any value for that same variable passed in through the <codeph>--var=<varname>varname</varname>=<varname>value</varname></codeph> + command-line option. + </p> + + <p rev="2.5.0 IMPALA-2180"> + For example, to set up some default parameters for report queries, but then override those defaults + within an <cmdname>impala-shell</cmdname> session, you might issue commands and statements such as + the following: + </p> + +<codeblock rev="2.5.0 IMPALA-2180"> +-- Initial setup for this example. +create table staging_table (s string); +insert into staging_table values ('foo'), ('bar'), ('bletch'); + +create table production_table (s string); +insert into production_table values ('North America'), ('EMEA'), ('Asia'); +quit; + +-- Start impala-shell with user-specified substitution variables, +-- run a query, then override the variables with SET and run the query again. +$ impala-shell --var=table_name=staging_table --var=cutoff=2 +... <varname>banner message</varname> ... 
+[localhost:21000] > select s from ${var:table_name} order by s limit ${var:cutoff}; +Query: select s from staging_table order by s limit 2 ++--------+ +| s | ++--------+ +| bar | +| bletch | ++--------+ +Fetched 2 row(s) in 1.06s + +[localhost:21000] > set var:table_name=production_table; +Variable TABLE_NAME set to production_table +[localhost:21000] > set var:cutoff=3; +Variable CUTOFF set to 3 + +[localhost:21000] > select s from ${var:table_name} order by s limit ${var:cutoff}; +Query: select s from production_table order by s limit 3 ++---------------+ +| s | ++---------------+ +| Asia | +| EMEA | +| North America | ++---------------+ +</codeblock> + + <p rev="2.5.0 IMPALA-2180"> + The following example shows how <codeph>SET</codeph> with no parameters displays + all user-specified substitution variables, and how <codeph>UNSET</codeph> removes + the substitution variable entirely: + </p> + +<codeblock rev="2.5.0 IMPALA-2180"> +[localhost:21000] > set; +Query options (defaults shown in []): + ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0] + ... + V_CPU_CORES: [0] + +Shell Options + LIVE_PROGRESS: False + LIVE_SUMMARY: False + +Variables: + CUTOFF: 3 + TABLE_NAME: staging_table + +[localhost:21000] > unset var:cutoff; +Unsetting variable CUTOFF +[localhost:21000] > select s from ${var:table_name} order by s limit ${var:cutoff}; +Error: Unknown variable CUTOFF +</codeblock> + + <p rev="2.5.0 IMPALA-2180"> + See <xref href="impala_shell_running_commands.xml"/> for more examples of using the + <codeph>--var</codeph>, <codeph>SET</codeph>, and <codeph>${var:<varname>varname</varname>}</codeph> + substitution technique in <cmdname>impala-shell</cmdname>. + </p> + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> <p> @@ -74,7 +176,7 @@ insert overwrite parquet_table select c1, c2, count(c3) from text_table group by a SQL statement lets you use this feature in client applications through the JDBC and ODBC APIs. 
</p> -<!-- <p conref="/Content/impala_common_xi44078.xml#common/jdbc_blurb"/> --> +<!-- <p conref="../shared/impala_common.xml#common/jdbc_blurb"/> --> <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_shell_options.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_shell_options.xml b/docs/topics/impala_shell_options.xml index 9f0bf60..26f7364 100644 --- a/docs/topics/impala_shell_options.xml +++ b/docs/topics/impala_shell_options.xml @@ -4,7 +4,15 @@ <title>impala-shell Configuration Options</title> <titlealts audience="PDF"><navtitle>Configuration Options</navtitle></titlealts> - + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Configuring"/> + <data name="Category" value="impala-shell"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> <conbody> @@ -14,7 +22,559 @@ line, or through the <filepath>$HOME/.impalarc</filepath> configuration file. </p> - + <note> + <p> + These options are different than the configuration options for the <codeph>impalad</codeph> daemon itself. + For the <codeph>impalad</codeph> options, see <xref href="impala_config_options.xml#config_options"/>. + </p> + </note> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="shell_option_summary"> + + <title>Summary of impala-shell Configuration Options</title> + + <conbody> + + <p> + The following table shows the names and allowed arguments for the <cmdname>impala-shell</cmdname> + configuration options. You can specify options on the command line, or in a configuration file as described + in <xref href="impala_shell_options.xml#shell_config_file"/>. 
+ </p> + + <table> + <tgroup cols="3"> + <colspec colname="1" colwidth="10*"/> + <colspec colname="2" colwidth="10*"/> + <colspec colname="3" colwidth="20*"/> + <thead> + <row> + <entry> + Command-Line Option + </entry> + <entry rev="2.0.0"> + Configuration File Setting + </entry> + <entry> + Explanation + </entry> + </row> + </thead> + <tbody> + <row> + <entry> + <p> + -B or --delimited + </p> + </entry> + <entry rev="2.0.0"> + <p> + write_delimited=true + </p> + </entry> + <entry> + <p> + Causes all query results to be printed in plain format as a delimited text file. Useful for + producing data files to be used with other Hadoop components. Also useful for avoiding the + performance overhead of pretty-printing all output, especially when running benchmark tests using + queries returning large result sets. Specify the delimiter character with the + <codeph>--output_delimiter</codeph> option. To store all query results in a file rather than + printing them to the screen, combine this option with the <codeph>-o</codeph> option. Added in Impala 1.0.1. + </p> + </entry> + </row> + <row> + <entry> + <p> + --print_header + </p> + </entry> + <entry rev="2.0.0"> + <p> + print_header=true + </p> + </entry> + <entry> + <p> + Prints the column names as the first row of query output. Primarily useful with the + <codeph>-B</codeph> option, whose plain-format output does not otherwise include a header row. + </p> + </entry> + </row> + <row> + <entry> + <p> + -o <varname>filename</varname> or --output_file <varname>filename</varname> + </p> + </entry> + <entry rev="2.0.0"> + <p> + output_file=<varname>filename</varname> + </p> + </entry> + <entry> + <p> + Stores all query results in the specified file. Typically used to store the results of a single + query issued from the command line with the <codeph>-q</codeph> option. Also works for + interactive sessions; you see messages such as the number of rows fetched, but not the actual + result set. To suppress these incidental messages when combining the <codeph>-q</codeph> and + <codeph>-o</codeph> options, redirect <codeph>stderr</codeph> to <codeph>/dev/null</codeph>. + Added in Impala 1.0.1. 
+ </p> + </entry> + </row> + <row> + <entry> + <p> + --output_delimiter=<varname>character</varname> + </p> + </entry> + <entry rev="2.0.0"> + <p> + output_delimiter=<varname>character</varname> + </p> + </entry> + <entry> + <p> + Specifies the character to use as a delimiter between fields when query results are printed in + plain format by the <codeph>-B</codeph> option. Defaults to tab (<codeph>'\t'</codeph>). If an + output value contains the delimiter character, that field is quoted, escaped by doubling quotation marks, or both. Added in + Impala 1.0.1. + </p> + </entry> + </row> + <row> + <entry> + <p> + -p or --show_profiles + </p> + </entry> + <entry rev="2.0.0"> + <p> + show_profiles=true + </p> + </entry> + <entry> + <p> + Displays the query execution plan (same output as the <codeph>EXPLAIN</codeph> statement) and a + more detailed low-level breakdown of execution steps, for every query executed by the shell. + </p> + </entry> + </row> + <row> + <entry> + <p> + -h or --help + </p> + </entry> + <entry rev="2.0.0"> + <p> + N/A + </p> + </entry> + <entry> + <p> + Displays help information. + </p> + </entry> + </row> + <row> + <entry> + <p> + -i <varname>hostname</varname> or + --impalad=<varname>hostname</varname>[:<varname>portnum</varname>] + </p> + </entry> + <entry rev="2.0.0"> + <p> + impalad=<varname>hostname</varname>[:<varname>portnum</varname>] + </p> + </entry> + <entry> + <p> + Connects to the <codeph>impalad</codeph> daemon on the specified host. The default port of 21000 + is assumed unless you provide another value. You can connect to any host in your cluster that is + running <codeph>impalad</codeph>. If you connect to an instance of <codeph>impalad</codeph> that + was started with an alternate port specified by the <codeph>--fe_port</codeph> flag, provide that + alternative port. 
+ </p> + </entry> + </row> + <row> + <entry> + <p> + -q <varname>query</varname> or --query=<varname>query</varname> + </p> + </entry> + <entry rev="2.0.0"> + <p> + query=<varname>query</varname> + </p> + </entry> + <entry> + <p> + Passes a query or other <cmdname>impala-shell</cmdname> command from the command line. The + <cmdname>impala-shell</cmdname> interpreter immediately exits after processing the statement. It + is limited to a single statement, which could be a <codeph>SELECT</codeph>, <codeph>CREATE + TABLE</codeph>, <codeph>SHOW TABLES</codeph>, or any other statement recognized in + <codeph>impala-shell</codeph>. Because you cannot pass a <codeph>USE</codeph> statement and + another query, fully qualify the names for any tables outside the <codeph>default</codeph> + database. (Or use the <codeph>-f</codeph> option to pass a file with a <codeph>USE</codeph> + statement followed by other queries.) + </p> + </entry> + </row> + <row> + <entry> + <p> + -f <varname>query_file</varname> or --query_file=<varname>query_file</varname> + </p> + </entry> + <entry rev="2.0.0"> + <p> + query_file=<varname>path_to_query_file</varname> + </p> + </entry> + <entry> + <p> + Passes a SQL query from a file. Multiple statements must be semicolon (;) delimited. + <ph rev="2.3.0">In CDH 5.5 / Impala 2.3 and higher, you can specify a filename of <codeph>-</codeph> + to represent standard input. This feature makes it convenient to use <cmdname>impala-shell</cmdname> + as part of a Unix pipeline where SQL statements are generated dynamically by other tools.</ph> + </p> + </entry> + </row> + <row> + <entry> + <p> + -k or --kerberos + </p> + </entry> + <entry rev="2.0.0"> + <p> + use_kerberos=true + </p> + </entry> + <entry> + <p> + Kerberos authentication is used when the shell connects to <codeph>impalad</codeph>. If Kerberos + is not enabled on the instance of <codeph>impalad</codeph> to which you are connecting, errors + are displayed. 
+ </p> + </entry> + </row> + <row> + <entry> + <p> + -s <varname>kerberos_service_name</varname> or --kerberos_service_name=<varname>name</varname> + </p> + </entry> + <entry rev="2.0.0"> + <p> + kerberos_service_name=<varname>name</varname> + </p> + </entry> + <entry> + <p> + Instructs <codeph>impala-shell</codeph> to authenticate to a particular <codeph>impalad</codeph> + service principal. If a <varname>kerberos_service_name</varname> is not specified, + <codeph>impala</codeph> is used by default. If this option is used in conjunction with a + connection in which Kerberos is not supported, errors are returned. + </p> + </entry> + </row> + <row> + <entry> + <p> + -V or --verbose + </p> + </entry> + <entry rev="2.0.0"> + <p> + verbose=true + </p> + </entry> + <entry> + <p> + Enables verbose output. + </p> + </entry> + </row> + <row> +<!-- Confirm verbose=true/false really is the same as verbose vs. quiet. --> + <entry> + <p> + --quiet + </p> + </entry> + <entry rev="2.0.0"> + <p> + verbose=false + </p> + </entry> + <entry> + <p> + Disables verbose output. + </p> + </entry> + </row> + <row> + <entry> + <p> + -v or --version + </p> + </entry> + <entry rev="2.0.0"> + <p> + version=true + </p> + </entry> + <entry> + <p> + Displays version information. + </p> + </entry> + </row> + <row> + <entry> + <p> + -c + </p> + </entry> + <entry rev="2.0.0"> + <p> + ignore_query_failure=true + </p> + </entry> + <entry> + <p> + Continues on query failure. + </p> + </entry> + </row> + <row> + <entry> + <p> + -r or --refresh_after_connect + </p> + </entry> + <entry rev="2.0.0"> + <p> + refresh_after_connect=true + </p> + </entry> + <entry> + <p> + Updates Impala metadata upon connection. Same as running the + <codeph><xref href="impala_invalidate_metadata.xml#invalidate_metadata">INVALIDATE + METADATA</xref></codeph> statement after connecting. 
(This option was originally named when the + <codeph>REFRESH</codeph> statement did the extensive metadata updates now performed by + <codeph>INVALIDATE METADATA</codeph>.) + </p> + </entry> + </row> + <row> + <entry> + <p> + -d <varname>default_db</varname> or --database=<varname>default_db</varname> + </p> + </entry> + <entry rev="2.0.0"> + <p> + default_db=<varname>default_db</varname> + </p> + </entry> + <entry> + <p> + Specifies the database to be used on startup. Same as running the + <codeph><xref href="impala_use.xml#use">USE</xref></codeph> statement after connecting. If not + specified, a database named <codeph>DEFAULT</codeph> is used. + </p> + </entry> + </row> + <row> + <entry> + -ssl + </entry> + <entry rev="2.0.0"> + ssl=true + </entry> + <entry> + Enables TLS/SSL for <cmdname>impala-shell</cmdname>. + </entry> + </row> + <row> + <entry> + --ca_cert=<varname>path_to_certificate</varname> + </entry> + <entry rev="2.0.0"> + ca_cert=<varname>path_to_certificate</varname> + </entry> + <entry> + The local pathname pointing to the third-party CA certificate, or to a copy of the server + certificate for self-signed server certificates. If <codeph>--ca_cert</codeph> is not set, + <cmdname>impala-shell</cmdname> enables TLS/SSL, but does not validate the server certificate. This is + useful for connecting to a known-good Impala that is only running over TLS/SSL, when a copy of the + certificate is not available (such as when debugging customer installations). + </entry> + </row> + <row rev="1.2.2"> + <entry> + -l + </entry> + <entry rev="2.0.0"> + use_ldap=true + </entry> + <entry> + Enables LDAP authentication. + </entry> + </row> + <row rev="1.2.2"> + <entry> + -u + </entry> + <entry rev="2.0.0"> + user=<varname>user_name</varname> + </entry> + <entry> + Supplies the username, when LDAP authentication is enabled by the <codeph>-l</codeph> option. + (Specify the short username, not the full LDAP distinguished name.) 
The shell then prompts + interactively for the password. + </entry> + </row> + <row rev="2.5.0 IMPALA-1934"> + <entry> + --ldap_password_cmd=<varname>command</varname> + </entry> + <entry> + N/A + </entry> + <entry> + Specifies a command to run to retrieve the LDAP password, + when LDAP authentication is enabled by the <codeph>-l</codeph> option. + If the command includes space-separated arguments, enclose the command and + its arguments in quotation marks. + </entry> + </row> + <row rev="2.0.0"> + <entry> + --config_file=<varname>path_to_config_file</varname> + </entry> + <entry> + N/A + </entry> + <entry> + Specifies the path of the file containing <cmdname>impala-shell</cmdname> configuration settings. + The default is <filepath>$HOME/.impalarc</filepath>. This setting can only be specified on the + command line. + </entry> + </row> + <row rev="2.3.0"> + <entry>--live_progress</entry> + <entry>N/A</entry> + <entry>Prints a progress bar showing roughly the percentage complete for each query. + The information is updated interactively as the query progresses. + See <xref href="impala_live_progress.xml#live_progress"/>.</entry> + </row> + <row rev="2.3.0"> + <entry>--live_summary</entry> + <entry>N/A</entry> + <entry>Prints a detailed report, similar to the <codeph>SUMMARY</codeph> command, showing progress details for each phase of query execution. + The information is updated interactively as the query progresses. + See <xref href="impala_live_summary.xml#live_summary"/>.</entry> + </row> + <row rev="2.5.0 IMPALA-1079"> + <entry>--var=<varname>variable_name</varname>=<varname>value</varname></entry> + <entry>N/A</entry> + <entry> + Defines a substitution variable that can be used within the <cmdname>impala-shell</cmdname> session. + The variable can be substituted into statements processed by the <codeph>-q</codeph> or <codeph>-f</codeph> options, + or in an interactive shell session. 
+ Within a SQL statement, you substitute the value by using the notation <codeph>${var:<varname>variable_name</varname>}</codeph>. + This feature is available in CDH 5.7 / Impala 2.5 and higher. + </entry> + </row> + </tbody> + </tgroup> + </table> </conbody> </concept> + <concept id="shell_config_file"> + + <title>impala-shell Configuration File</title> + + <conbody> + + <p> + You can define a set of default options for your <cmdname>impala-shell</cmdname> environment, stored in the + file <filepath>$HOME/.impalarc</filepath>. This file consists of key-value pairs, one option per line. + Everything after a <codeph>#</codeph> character on a line is treated as a comment and ignored. + </p> + + <p> + The configuration file must contain a header label <codeph>[impala]</codeph>, followed by the options + specific to <cmdname>impala-shell</cmdname>. (This standard convention for configuration files lets you + use a single file to hold configuration options for multiple applications.) + </p> + + <p> + To specify a different filename or path for the configuration file, specify the argument + <codeph>--config_file=<varname>path_to_config_file</varname></codeph> on the + <cmdname>impala-shell</cmdname> command line. + </p> + + <p> + The names of the options in the configuration file are similar (although not necessarily identical) to the + long-form command-line arguments to the <cmdname>impala-shell</cmdname> command. For the names to use, see + <xref href="impala_shell_options.xml#shell_option_summary"/>. + </p> + + <p> + Any options you specify on the <cmdname>impala-shell</cmdname> command line override any corresponding + options within the configuration file. + </p> + + <p> + The following example shows a configuration file that you might use during benchmarking tests. It sets + verbose mode, so that the output from each SQL query is followed by timing information. 
+ <cmdname>impala-shell</cmdname> starts inside the database containing the tables with the benchmark data, + avoiding the need to issue a <codeph>USE</codeph> statement or use fully qualified table names. + </p> + + <p> + In this example, the query output is formatted as delimited text rather than enclosed in ASCII art boxes, + and is stored in a file rather than printed to the screen. Those options are appropriate for benchmark + situations, so that the overhead of <cmdname>impala-shell</cmdname> formatting and printing the result set + does not factor into the timing measurements. It also enables the <codeph>show_profiles</codeph> option. + That option prints detailed performance information after each query, which might be valuable in + understanding the performance of benchmark queries. + </p> + +<codeblock>[impala] +verbose=true +default_db=tpc_benchmarking +write_delimited=true +output_delimiter=, +output_file=/home/tester1/benchmark_results.csv +show_profiles=true +</codeblock> + + <p> + The following example shows a configuration file that connects to a specific remote Impala node, runs a + single query within a particular database, then exits. You would typically use this kind of single-purpose + configuration setting with the <cmdname>impala-shell</cmdname> command-line option + <codeph>--config_file=<varname>path_to_config_file</varname></codeph>, to easily select between many + predefined queries that could be run against different databases, hosts, or even different clusters. To run + a sequence of statements instead of a single query, specify the configuration option + <codeph>query_file=<varname>path_to_query_file</varname></codeph> instead. + </p> + +<codeblock>[impala] +impalad=impala-test-node1.example.com +default_db=site_stats +# Issue a predefined query and immediately exit. +query=select count(*) from web_traffic where event_date = trunc(now(),'dd') +</codeblock> + </conbody> + </concept> +</concept>
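Because the configuration file described above follows the common INI convention (an `[impala]` section header, key-value pairs, `#` comments), standard tools can read it. Below is a minimal sketch using Python's `configparser`; the settings are taken from the benchmarking example above, but note that this is only an illustration of the file format, not how <cmdname>impala-shell</cmdname> itself parses the file.

```python
import configparser

# An .impalarc-style fragment following the convention described above:
# an [impala] section header followed by one key=value pair per line.
impalarc = """
# Benchmark defaults (comment lines start with '#').
[impala]
verbose=true
default_db=tpc_benchmarking
write_delimited=true
output_delimiter=,
"""

cfg = configparser.ConfigParser()
cfg.read_string(impalarc)

# All impala-shell settings live under the [impala] section.
settings = dict(cfg["impala"])
```

Command-line arguments override any corresponding values found in this file, as noted above.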
