This is an automated email from the ASF dual-hosted git repository. arodoni pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/impala.git
commit 022ba2bbdb6feec0ba944847e92d5a051cedc9a1 Author: Alex Rodoni <arod...@cloudera.com> AuthorDate: Mon Feb 4 17:41:23 2019 -0800 IMPALA-8105: [DOCS] Document cache_remote_file_handles flag Change-Id: Id649e733324f55a80a0199302dfa3b627ad183cf Reviewed-on: http://gerrit.cloudera.org:8080/12362 Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Reviewed-by: Joe McDonnell <joemcdonn...@cloudera.com> --- docs/topics/impala_scalability.xml | 880 +++++++++++++++++++++---------------- 1 file changed, 498 insertions(+), 382 deletions(-) diff --git a/docs/topics/impala_scalability.xml b/docs/topics/impala_scalability.xml index 94c2fdb..71b8425 100644 --- a/docs/topics/impala_scalability.xml +++ b/docs/topics/impala_scalability.xml @@ -1,4 +1,5 @@ -<?xml version="1.0" encoding="UTF-8"?><!-- +<?xml version="1.0" encoding="UTF-8"?> +<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information @@ -20,7 +21,13 @@ under the License. <concept id="scalability"> <title>Scalability Considerations for Impala</title> - <titlealts audience="PDF"><navtitle>Scalability Considerations</navtitle></titlealts> + + <titlealts audience="PDF"> + + <navtitle>Scalability Considerations</navtitle> + + </titlealts> + <prolog> <metadata> <data name="Category" value="Performance"/> @@ -30,7 +37,7 @@ under the License. <data name="Category" value="Developers"/> <data name="Category" value="Memory"/> <data name="Category" value="Scalability"/> - <!-- Using domain knowledge about Impala, sizing, etc. to decide what to mark as 'Proof of Concept'. --> +<!-- Using domain knowledge about Impala, sizing, etc. to decide what to mark as 'Proof of Concept'. --> <data name="Category" value="Proof of Concept"/> </metadata> </prolog> @@ -38,10 +45,11 @@ under the License. <conbody> <p> - This section explains how the size of your cluster and the volume of data influences SQL performance and - schema design for Impala tables. Typically, adding more cluster capacity reduces problems due to memory - limits or disk throughput. On the other hand, larger clusters are more likely to have other kinds of - scalability issues, such as a single slow node that causes performance problems for queries. + This section explains how the size of your cluster and the volume of data influence SQL + performance and schema design for Impala tables. Typically, adding more cluster capacity + reduces problems due to memory limits or disk throughput. On the other hand, larger + clusters are more likely to have other kinds of scalability issues, such as a single slow + node that causes performance problems for queries. </p> <p outputclass="toc inpage"/> @@ -53,14 +61,15 @@ under the License.
<concept audience="hidden" id="scalability_memory"> <title>Overview and Guidelines for Impala Memory Usage</title> - <prolog> - <metadata> - <data name="Category" value="Memory"/> - <data name="Category" value="Concepts"/> - <data name="Category" value="Best Practices"/> - <data name="Category" value="Guidelines"/> - </metadata> - </prolog> + + <prolog> + <metadata> + <data name="Category" value="Memory"/> + <data name="Category" value="Concepts"/> + <data name="Category" value="Best Practices"/> + <data name="Category" value="Guidelines"/> + </metadata> + </prolog> <conbody> @@ -158,7 +167,9 @@ Memory Usage: Additional Notes * Review previous common issues on out-of-memory * Note: Even with disk-based joins, you'll want to review these steps to speed up queries and use memory more efficiently </codeblock> + </conbody> + </concept> <concept id="scalability_catalog"> @@ -173,24 +184,25 @@ Memory Usage: Additional Notes </p> <p> - Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized for tables - containing relatively few, large data files. Schemas containing thousands of tables, or tables containing - thousands of partitions, can encounter performance issues during startup or during DDL operations such as - <codeph>ALTER TABLE</codeph> statements. + Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized + for tables containing relatively few, large data files. Schemas containing thousands of + tables, or tables containing thousands of partitions, can encounter performance issues + during startup or during DDL operations such as <codeph>ALTER TABLE</codeph> statements. </p> <note type="important" rev="TSB-168"> - <p> - Because of a change in the default heap size for the <cmdname>catalogd</cmdname> daemon in - <keyword keyref="impala25_full"/> and higher, the following procedure to increase the <cmdname>catalogd</cmdname> - memory limit might be required following an upgrade to <keyword keyref="impala25_full"/> even if not - needed previously. - </p> + <p> + Because of a change in the default heap size for the <cmdname>catalogd</cmdname> + daemon in <keyword keyref="impala25_full"/> and higher, the following procedure to + increase the <cmdname>catalogd</cmdname> memory limit might be required following an + upgrade to <keyword keyref="impala25_full"/> even if not needed previously. + </p> </note> <p conref="../shared/impala_common.xml#common/increase_catalogd_heap_size"/> </conbody> + </concept> <concept rev="2.1.0" id="statestore_scalability"> @@ -200,30 +212,33 @@ Memory Usage: Additional Notes <conbody> <p> - Before <keyword keyref="impala21_full"/>, the statestore sent only one kind of message to its subscribers. This message contained all - updates for any topics that a subscriber had subscribed to. It also served to let subscribers know that the - statestore had not failed, and conversely the statestore used the success of sending a heartbeat to a + Before <keyword keyref="impala21_full"/>, the statestore sent only one kind of message + to its subscribers. This message contained all updates for any topics that a subscriber + had subscribed to. It also served to let subscribers know that the statestore had not + failed, and conversely the statestore used the success of sending a heartbeat to a subscriber to decide whether or not the subscriber had failed. 
</p> <p> - Combining topic updates and failure detection in a single message led to bottlenecks in clusters with large - numbers of tables, partitions, and HDFS data blocks. When the statestore was overloaded with metadata - updates to transmit, heartbeat messages were sent less frequently, sometimes causing subscribers to time - out their connection with the statestore. Increasing the subscriber timeout and decreasing the frequency of - statestore heartbeats worked around the problem, but reduced responsiveness when the statestore failed or - restarted. + Combining topic updates and failure detection in a single message led to bottlenecks in + clusters with large numbers of tables, partitions, and HDFS data blocks. When the + statestore was overloaded with metadata updates to transmit, heartbeat messages were + sent less frequently, sometimes causing subscribers to time out their connection with + the statestore. Increasing the subscriber timeout and decreasing the frequency of + statestore heartbeats worked around the problem, but reduced responsiveness when the + statestore failed or restarted. </p> <p> - As of <keyword keyref="impala21_full"/>, the statestore now sends topic updates and heartbeats in separate messages. This allows the - statestore to send and receive a steady stream of lightweight heartbeats, and removes the requirement to - send topic updates according to a fixed schedule, reducing statestore network overhead. + As of <keyword keyref="impala21_full"/>, the statestore now sends topic updates and + heartbeats in separate messages. This allows the statestore to send and receive a steady + stream of lightweight heartbeats, and removes the requirement to send topic updates + according to a fixed schedule, reducing statestore network overhead. </p> <p> - The statestore now has the following relevant configuration flags for the <cmdname>statestored</cmdname> - daemon: + The statestore now has the following relevant configuration flags for the + <cmdname>statestored</cmdname> daemon: </p> <dl> @@ -234,8 +249,8 @@ Memory Usage: Additional Notes </dt> <dd> - The number of threads inside the statestore dedicated to sending topic updates. You should not - typically need to change this value. + The number of threads inside the statestore dedicated to sending topic updates. You + should not typically need to change this value. <p> <b>Default:</b> 10 </p> @@ -250,9 +265,10 @@ Memory Usage: Additional Notes </dt> <dd> - The frequency, in milliseconds, with which the statestore tries to send topic updates to each - subscriber. This is a best-effort value; if the statestore is unable to meet this frequency, it sends - topic updates as fast as it can. You should not typically need to change this value. + The frequency, in milliseconds, with which the statestore tries to send topic + updates to each subscriber. This is a best-effort value; if the statestore is unable + to meet this frequency, it sends topic updates as fast as it can. You should not + typically need to change this value. <p> <b>Default:</b> 2000 </p> @@ -267,8 +283,8 @@ Memory Usage: Additional Notes </dt> <dd> - The number of threads inside the statestore dedicated to sending heartbeats. You should not typically - need to change this value. + The number of threads inside the statestore dedicated to sending heartbeats. You + should not typically need to change this value. 
<p> <b>Default:</b> 10 </p> @@ -283,9 +299,10 @@ Memory Usage: Additional Notes </dt> <dd> - The frequency, in milliseconds, with which the statestore tries to send heartbeats to each subscriber. - This value should be good for large catalogs and clusters up to approximately 150 nodes. Beyond that, - you might need to increase this value to make the interval longer between heartbeat messages. + The frequency, in milliseconds, with which the statestore tries to send heartbeats + to each subscriber. This value should be good for large catalogs and clusters up to + approximately 150 nodes. Beyond that, you might need to increase this value to make + the interval longer between heartbeat messages. <p> <b>Default:</b> 1000 (one heartbeat message every second) </p> @@ -295,46 +312,54 @@ Memory Usage: Additional Notes </dl> <p> - If it takes a very long time for a cluster to start up, and <cmdname>impala-shell</cmdname> consistently - displays <codeph>This Impala daemon is not ready to accept user requests</codeph>, the statestore might be - taking too long to send the entire catalog topic to the cluster. In this case, consider adding - <codeph>--load_catalog_in_background=false</codeph> to your catalog service configuration. This setting - stops the statestore from loading the entire catalog into memory at cluster startup. Instead, metadata for - each table is loaded when the table is accessed for the first time. + If it takes a very long time for a cluster to start up, and + <cmdname>impala-shell</cmdname> consistently displays <codeph>This Impala daemon is not + ready to accept user requests</codeph>, the statestore might be taking too long to send + the entire catalog topic to the cluster. In this case, consider adding + <codeph>--load_catalog_in_background=false</codeph> to your catalog service + configuration. This setting stops the statestore from loading the entire catalog into + memory at cluster startup. Instead, metadata for each table is loaded when the table is + accessed for the first time. </p> + </conbody> + </concept> <concept id="scalability_buffer_pool" rev="2.10.0 IMPALA-3200"> + <title>Effect of Buffer Pool on Memory Usage (<keyword keyref="impala210"/> and higher)</title> + <conbody> + <p> - The buffer pool feature, available in <keyword keyref="impala210"/> and higher, changes the - way Impala allocates memory during a query. Most of the memory needed is reserved at the - beginning of the query, avoiding cases where a query might run for a long time before failing - with an out-of-memory error. The actual memory estimates and memory buffers are typically - smaller than before, so that more queries can run concurrently or process larger volumes - of data than previously. + The buffer pool feature, available in <keyword keyref="impala210"/> and higher, changes + the way Impala allocates memory during a query. Most of the memory needed is reserved at + the beginning of the query, avoiding cases where a query might run for a long time + before failing with an out-of-memory error. The actual memory estimates and memory + buffers are typically smaller than before, so that more queries can run concurrently or + process larger volumes of data than previously. </p> + <p> The buffer pool feature includes some query options that you can fine-tune: - <xref keyref="buffer_pool_limit"/>, - <xref keyref="default_spillable_buffer_size"/>, - <xref keyref="max_row_size"/>, and - <xref keyref="min_spillable_buffer_size"/>. 
- </p> - <p> - Most of the effects of the buffer pool are transparent to you as an Impala user. - Memory use during spilling is now steadier and more predictable, instead of - increasing rapidly as more data is spilled to disk. The main change from a user - perspective is the need to increase the <codeph>MAX_ROW_SIZE</codeph> query option - setting when querying tables with columns containing long strings, many columns, - or other combinations of factors that produce very large rows. If Impala encounters - rows that are too large to process with the default query option settings, the query - fails with an error message suggesting to increase the <codeph>MAX_ROW_SIZE</codeph> - setting. + <xref keyref="buffer_pool_limit"/>, <xref keyref="default_spillable_buffer_size"/>, + <xref keyref="max_row_size"/>, and <xref keyref="min_spillable_buffer_size"/>. </p> + + <p> + Most of the effects of the buffer pool are transparent to you as an Impala user. Memory + use during spilling is now steadier and more predictable, instead of increasing rapidly + as more data is spilled to disk. The main change from a user perspective is the need to + increase the <codeph>MAX_ROW_SIZE</codeph> query option setting when querying tables + with columns containing long strings, many columns, or other combinations of factors + that produce very large rows. If Impala encounters rows that are too large to process + with the default query option settings, the query fails with an error message suggesting + to increase the <codeph>MAX_ROW_SIZE</codeph> setting. + </p> + + </conbody> + + </concept> <concept audience="hidden" id="scalability_cluster_size"> @@ -343,9 +368,10 @@ Memory Usage: Additional Notes <conbody> - <p> - </p> + <p></p> + </conbody> + </concept> <concept audience="hidden" id="concurrent_connections"> @@ -355,7 +381,9 @@ Memory Usage: Additional Notes <conbody> <p></p> + </conbody> + </concept> <concept rev="2.0.0" id="spill_to_disk"> @@ -365,22 +393,26 @@ Memory Usage: Additional Notes <conbody> <p> - Certain memory-intensive operations write temporary data to disk (known as <term>spilling</term> to disk) - when Impala is close to exceeding its memory limit on a particular host. + Certain memory-intensive operations write temporary data to disk (known as + <term>spilling</term> to disk) when Impala is close to exceeding its memory limit on a + particular host. </p> <p> - The result is a query that completes successfully, rather than failing with an out-of-memory error. The - tradeoff is decreased performance due to the extra disk I/O to write the temporary data and read it back - in. The slowdown could be potentially be significant. Thus, while this feature improves reliability, - you should optimize your queries, system parameters, and hardware configuration to make this spilling a rare occurrence. + The result is a query that completes successfully, rather than failing with an + out-of-memory error. The tradeoff is decreased performance due to the extra disk I/O to + write the temporary data and read it back in. The slowdown could potentially be + significant. Thus, while this feature improves reliability, you should optimize your + queries, system parameters, and hardware configuration to make this spilling a rare + occurrence.
</p> <note rev="2.10.0 IMPALA-3200"> <p> - In <keyword keyref="impala210"/> and higher, also see <xref keyref="scalability_buffer_pool"/> for - changes to Impala memory allocation that might change the details of which queries spill to disk, - and how much memory and disk space is involved in the spilling operation. + In <keyword keyref="impala210"/> and higher, also see + <xref keyref="scalability_buffer_pool"/> for changes to Impala memory allocation that + might change the details of which queries spill to disk, and how much memory and disk + space is involved in the spilling operation. </p> </note> @@ -389,38 +421,42 @@ Memory Usage: Additional Notes </p> <p> - Several SQL clauses and constructs require memory allocations that could activat the spilling mechanism: + Several SQL clauses and constructs require memory allocations that could activate the + spilling mechanism: </p> + <ul> <li> <p> - when a query uses a <codeph>GROUP BY</codeph> clause for columns - with millions or billions of distinct values, Impala keeps a - similar number of temporary results in memory, to accumulate the - aggregate results for each value in the group. + When a query uses a <codeph>GROUP BY</codeph> clause for columns with millions or + billions of distinct values, Impala keeps a similar number of temporary results in + memory, to accumulate the aggregate results for each value in the group. </p> </li> + <li> <p> - When large tables are joined together, Impala keeps the values of - the join columns from one table in memory, to compare them to - incoming values from the other table. + When large tables are joined together, Impala keeps the values of the join columns + from one table in memory, to compare them to incoming values from the other table. </p> </li> + <li> <p> - When a large result set is sorted by the <codeph>ORDER BY</codeph> - clause, each node sorts its portion of the result set in memory. + When a large result set is sorted by the <codeph>ORDER BY</codeph> clause, each node + sorts its portion of the result set in memory. </p> </li> + <li> <p> - The <codeph>DISTINCT</codeph> and <codeph>UNION</codeph> operators - build in-memory data structures to represent all values found so - far, to eliminate duplicates as the query progresses. + The <codeph>DISTINCT</codeph> and <codeph>UNION</codeph> operators build in-memory + data structures to represent all values found so far, to eliminate duplicates as the + query progresses. </p> </li> - <!-- JIRA still in open state as of 5.8 / 2.6, commenting out. +<!-- JIRA still in open state as of 5.8 / 2.6, commenting out. <li> <p rev="IMPALA-3471"> In <keyword keyref="impala26_full"/> and higher, <term>top-N</term> queries (those with @@ -445,30 +481,32 @@ Memory Usage: Additional Notes </p> <p rev="2.10.0 IMPALA-3200"> - In <keyword keyref="impala210_full"/> and higher, the way SQL operators such as <codeph>GROUP BY</codeph>, - <codeph>DISTINCT</codeph>, and joins, transition between using additional memory or activating the - spill-to-disk feature is changed. The memory required to spill to disk is reserved up front, and you can - examine it in the <codeph>EXPLAIN</codeph> plan when the <codeph>EXPLAIN_LEVEL</codeph> query option is + In <keyword keyref="impala210_full"/> and higher, the way SQL operators such as + <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins transition between + using additional memory or activating the spill-to-disk feature is changed.
The memory + required to spill to disk is reserved up front, and you can examine it in the + <codeph>EXPLAIN</codeph> plan when the <codeph>EXPLAIN_LEVEL</codeph> query option is set to 2 or higher. </p> - <p> - The infrastructure of the spilling feature affects the way the affected SQL operators, such as - <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory. - On each host that participates in the query, each such operator in a query requires memory - to store rows of data and other data structures. Impala reserves a certain amount of memory - up front for each operator that supports spill-to-disk that is sufficient to execute the - operator. If an operator accumulates more data than can fit in the reserved memory, it - can either reserve more memory to continue processing data in memory or start spilling - data to temporary scratch files on disk. Thus, operators with spill-to-disk support - can adapt to different memory constraints by using however much memory is available - to speed up execution, yet tolerate low memory conditions by spilling data to disk. - </p> - - <p> - The amount data depends on the portion of the data being handled by that host, and thus - the operator may end up consuming different amounts of memory on different hosts. - </p> + <p> + The infrastructure of the spilling feature affects the way the affected SQL operators, + such as <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory. On + each host that participates in the query, each such operator in a query requires memory + to store rows of data and other data structures. For each operator that supports + spill-to-disk, Impala reserves an amount of memory up front that is sufficient to + execute the operator. If an operator accumulates more data than can fit in the reserved + memory, it can either reserve more memory to continue processing data in memory or start + spilling data to temporary scratch files on disk. Thus, operators with spill-to-disk + support can adapt to different memory constraints by using however much memory is + available to speed up execution, yet tolerate low memory conditions by spilling data to + disk. + </p> + + <p> + The amount of data processed by each operator depends on the portion of the data being + handled by that host, and thus the operator may end up consuming different amounts of + memory on different hosts. + </p> <!-- <p> @@ -505,12 +543,13 @@ Memory Usage: Additional Notes --> <p> - <b>Added in:</b> This feature was added to the <codeph>ORDER BY</codeph> clause in Impala 1.4. - This feature was extended to cover join queries, aggregation functions, and analytic - functions in Impala 2.0. The size of the memory work area required by - each operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2. + <b>Added in:</b> This feature was added to the <codeph>ORDER BY</codeph> clause in + Impala 1.4. This feature was extended to cover join queries, aggregation functions, and + analytic functions in Impala 2.0. The size of the memory work area required by each + operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2.
+ <ph rev="2.10.0 IMPALA-3200">The spilling mechanism was reworked to take advantage of + the Impala buffer pool feature and be more predictable and stable in + <keyword keyref="impala210_full"/>.</ph> </p> <p> @@ -518,28 +557,30 @@ Memory Usage: Additional Notes </p> <p> - Because the extra I/O can impose significant performance overhead on these types of queries, try to avoid - this situation by using the following steps: + Because the extra I/O can impose significant performance overhead on these types of + queries, try to avoid this situation by using the following steps: </p> <ol> <li> - Detect how often queries spill to disk, and how much temporary data is written. Refer to the following - sources: + Detect how often queries spill to disk, and how much temporary data is written. Refer + to the following sources: <ul> <li> - The output of the <codeph>PROFILE</codeph> command in the <cmdname>impala-shell</cmdname> - interpreter. This data shows the memory usage for each host and in total across the cluster. The - <codeph>WriteIoBytes</codeph> counter reports how much data was written to disk for each operator - during the query. (In <keyword keyref="impala29_full"/>, the counter was named - <codeph>ScratchBytesWritten</codeph>; in <keyword keyref="impala28_full"/> and earlier, it was named - <codeph>BytesWritten</codeph>.) + The output of the <codeph>PROFILE</codeph> command in the + <cmdname>impala-shell</cmdname> interpreter. This data shows the memory usage for + each host and in total across the cluster. The <codeph>WriteIoBytes</codeph> + counter reports how much data was written to disk for each operator during the + query. (In <keyword keyref="impala29_full"/>, the counter was named + <codeph>ScratchBytesWritten</codeph>; in <keyword keyref="impala28_full"/> and + earlier, it was named <codeph>BytesWritten</codeph>.) </li> <li> - The <uicontrol>Queries</uicontrol> tab in the Impala debug web user interface. Select the query to - examine and click the corresponding <uicontrol>Profile</uicontrol> link. This data breaks down the - memory usage for a single host within the cluster, the host whose web interface you are connected to. + The <uicontrol>Queries</uicontrol> tab in the Impala debug web user interface. + Select the query to examine and click the corresponding + <uicontrol>Profile</uicontrol> link. This data breaks down the memory usage for a + single host within the cluster, the host whose web interface you are connected to. </li> </ul> </li> @@ -548,15 +589,16 @@ Memory Usage: Additional Notes Use one or more techniques to reduce the possibility of the queries spilling to disk: <ul> <li> - Increase the Impala memory limit if practical, for example, if you can increase the available memory - by more than the amount of temporary data written to disk on a particular node. Remember that in - Impala 2.0 and later, you can issue <codeph>SET MEM_LIMIT</codeph> as a SQL statement, which lets you - fine-tune the memory usage for queries from JDBC and ODBC applications. + Increase the Impala memory limit if practical, for example, if you can increase + the available memory by more than the amount of temporary data written to disk on + a particular node. Remember that in Impala 2.0 and later, you can issue + <codeph>SET MEM_LIMIT</codeph> as a SQL statement, which lets you fine-tune the + memory usage for queries from JDBC and ODBC applications. 
</li> <li> - Increase the number of nodes in the cluster, to increase the aggregate memory available to Impala and - reduce the amount of memory required on each node. + Increase the number of nodes in the cluster, to increase the aggregate memory + available to Impala and reduce the amount of memory required on each node. </li> <li> @@ -564,54 +606,57 @@ Memory Usage: Additional Notes </li> <li> - On a cluster with resources shared between Impala and other Hadoop components, use resource - management features to allocate more memory for Impala. See + On a cluster with resources shared between Impala and other Hadoop components, use + resource management features to allocate more memory for Impala. See <xref href="impala_resource_management.xml#resource_management"/> for details. </li> <li> - If the memory pressure is due to running many concurrent queries rather than a few memory-intensive - ones, consider using the Impala admission control feature to lower the limit on the number of - concurrent queries. By spacing out the most resource-intensive queries, you can avoid spikes in - memory usage and improve overall response times. See - <xref href="impala_admission.xml#admission_control"/> for details. + If the memory pressure is due to running many concurrent queries rather than a few + memory-intensive ones, consider using the Impala admission control feature to + lower the limit on the number of concurrent queries. By spacing out the most + resource-intensive queries, you can avoid spikes in memory usage and improve + overall response times. See <xref href="impala_admission.xml#admission_control"/> + for details. </li> <li> - Tune the queries with the highest memory requirements, using one or more of the following techniques: + Tune the queries with the highest memory requirements, using one or more of the + following techniques: <ul> <li> - Run the <codeph>COMPUTE STATS</codeph> statement for all tables involved in large-scale joins and - aggregation queries. + Run the <codeph>COMPUTE STATS</codeph> statement for all tables involved in + large-scale joins and aggregation queries. </li> <li> - Minimize your use of <codeph>STRING</codeph> columns in join columns. Prefer numeric values - instead. + Minimize your use of <codeph>STRING</codeph> columns in join columns. Prefer + numeric values instead. </li> <li> - Examine the <codeph>EXPLAIN</codeph> plan to understand the execution strategy being used for the - most resource-intensive queries. See <xref href="impala_explain_plan.xml#perf_explain"/> for - details. + Examine the <codeph>EXPLAIN</codeph> plan to understand the execution strategy + being used for the most resource-intensive queries. See + <xref href="impala_explain_plan.xml#perf_explain"/> for details. </li> <li> - If Impala still chooses a suboptimal execution strategy even with statistics available, or if it - is impractical to keep the statistics up to date for huge or rapidly changing tables, add hints - to the most resource-intensive queries to select the right execution strategy. See + If Impala still chooses a suboptimal execution strategy even with statistics + available, or if it is impractical to keep the statistics up to date for huge + or rapidly changing tables, add hints to the most resource-intensive queries + to select the right execution strategy. See <xref href="impala_hints.xml#hints"/> for details. 
</li> </ul> </li> <li> - If your queries experience substantial performance overhead due to spilling, enable the - <codeph>DISABLE_UNSAFE_SPILLS</codeph> query option. This option prevents queries whose memory usage - is likely to be exorbitant from spilling to disk. See - <xref href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/> for details. As you tune - problematic queries using the preceding steps, fewer and fewer will be cancelled by this option - setting. + If your queries experience substantial performance overhead due to spilling, + enable the <codeph>DISABLE_UNSAFE_SPILLS</codeph> query option. This option + prevents queries whose memory usage is likely to be exorbitant from spilling to + disk. See <xref href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/> + for details. As you tune problematic queries using the preceding steps, fewer and + fewer will be cancelled by this option setting. </li> </ul> </li> @@ -622,22 +667,24 @@ Memory Usage: Additional Notes </p> <p> - To artificially provoke spilling, to test this feature and understand the performance implications, use a - test environment with a memory limit of at least 2 GB. Issue the <codeph>SET</codeph> command with no - arguments to check the current setting for the <codeph>MEM_LIMIT</codeph> query option. Set the query - option <codeph>DISABLE_UNSAFE_SPILLS=true</codeph>. This option limits the spill-to-disk feature to prevent - runaway disk usage from queries that are known in advance to be suboptimal. Within - <cmdname>impala-shell</cmdname>, run a query that you expect to be memory-intensive, based on the criteria - explained earlier. A self-join of a large table is a good candidate: + To artificially provoke spilling, to test this feature and understand the performance + implications, use a test environment with a memory limit of at least 2 GB. Issue the + <codeph>SET</codeph> command with no arguments to check the current setting for the + <codeph>MEM_LIMIT</codeph> query option. Set the query option + <codeph>DISABLE_UNSAFE_SPILLS=true</codeph>. This option limits the spill-to-disk + feature to prevent runaway disk usage from queries that are known in advance to be + suboptimal. Within <cmdname>impala-shell</cmdname>, run a query that you expect to be + memory-intensive, based on the criteria explained earlier. A self-join of a large table + is a good candidate: </p> <codeblock>select count(*) from big_table a join big_table b using (column_with_many_values); </codeblock> <p> - Issue the <codeph>PROFILE</codeph> command to get a detailed breakdown of the memory usage on each node - during the query. - <!-- + Issue the <codeph>PROFILE</codeph> command to get a detailed breakdown of the memory + usage on each node during the query. +<!-- The crucial part of the profile output concerning memory is the <codeph>BlockMgr</codeph> portion. For example, this profile shows that the query did not quite exceed the memory limit. --> @@ -668,8 +715,8 @@ Memory Usage: Additional Notes --> <p> - Set the <codeph>MEM_LIMIT</codeph> query option to a value that is smaller than the peak memory usage - reported in the profile output. Now try the memory-intensive query again. + Set the <codeph>MEM_LIMIT</codeph> query option to a value that is smaller than the peak + memory usage reported in the profile output. Now try the memory-intensive query again. 
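+ For example, the sequence in <cmdname>impala-shell</cmdname> might look like the + following sketch, reusing the earlier self-join; the <codeph>1gb</codeph> value is only + a placeholder, so substitute a limit just below the peak memory usage reported in your + own profile output: + <codeblock>set mem_limit=1gb; + select count(*) from big_table a join big_table b using (column_with_many_values); + profile;</codeblock>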
</p> <p> @@ -682,46 +729,50 @@ these tables, hint the plan or disable this behavior via query options to enable </codeblock> <p> - If so, the query could have consumed substantial temporary disk space, slowing down so much that it would - not complete in any reasonable time. Rather than rely on the spill-to-disk feature in this case, issue the - <codeph>COMPUTE STATS</codeph> statement for the table or tables in your sample query. Then run the query - again, check the peak memory usage again in the <codeph>PROFILE</codeph> output, and adjust the memory - limit again if necessary to be lower than the peak memory usage. + If so, the query could have consumed substantial temporary disk space, slowing down so + much that it would not complete in any reasonable time. Rather than rely on the + spill-to-disk feature in this case, issue the <codeph>COMPUTE STATS</codeph> statement + for the table or tables in your sample query. Then run the query again, check the peak + memory usage again in the <codeph>PROFILE</codeph> output, and adjust the memory limit + again if necessary to be lower than the peak memory usage. </p> <p> - At this point, you have a query that is memory-intensive, but Impala can optimize it efficiently so that - the memory usage is not exorbitant. You have set an artificial constraint through the - <codeph>MEM_LIMIT</codeph> option so that the query would normally fail with an out-of-memory error. But - the automatic spill-to-disk feature means that the query should actually succeed, at the expense of some - extra disk I/O to read and write temporary work data. + At this point, you have a query that is memory-intensive, but Impala can optimize it + efficiently so that the memory usage is not exorbitant. You have set an artificial + constraint through the <codeph>MEM_LIMIT</codeph> option so that the query would + normally fail with an out-of-memory error. But the automatic spill-to-disk feature means + that the query should actually succeed, at the expense of some extra disk I/O to read + and write temporary work data. </p> <p> - Try the query again, and confirm that it succeeds. Examine the <codeph>PROFILE</codeph> output again. This - time, look for lines of this form: + Try the query again, and confirm that it succeeds. Examine the <codeph>PROFILE</codeph> + output again. This time, look for lines of this form: </p> <codeblock>- SpilledPartitions: <varname>N</varname> </codeblock> <p> - If you see any such lines with <varname>N</varname> greater than 0, that indicates the query would have - failed in Impala releases prior to 2.0, but now it succeeded because of the spill-to-disk feature. Examine - the total time taken by the <codeph>AGGREGATION_NODE</codeph> or other query fragments containing non-zero - <codeph>SpilledPartitions</codeph> values. Compare the times to similar fragments that did not spill, for - example in the <codeph>PROFILE</codeph> output when the same query is run with a higher memory limit. This - gives you an idea of the performance penalty of the spill operation for a particular query with a - particular memory limit. If you make the memory limit just a little lower than the peak memory usage, the - query only needs to write a small amount of temporary data to disk. The lower you set the memory limit, the + If you see any such lines with <varname>N</varname> greater than 0, that indicates the + query would have failed in Impala releases prior to 2.0, but now it succeeded because of + the spill-to-disk feature. 
Examine the total time taken by the + <codeph>AGGREGATION_NODE</codeph> or other query fragments containing non-zero + <codeph>SpilledPartitions</codeph> values. Compare the times to similar fragments that + did not spill, for example in the <codeph>PROFILE</codeph> output when the same query is + run with a higher memory limit. This gives you an idea of the performance penalty of the + spill operation for a particular query with a particular memory limit. If you make the + memory limit just a little lower than the peak memory usage, the query only needs to + write a small amount of temporary data to disk. The lower you set the memory limit, the more temporary data is written and the slower the query becomes. </p> <p> Now repeat this procedure for actual queries used in your environment. Use the - <codeph>DISABLE_UNSAFE_SPILLS</codeph> setting to identify cases where queries used more memory than - necessary due to lack of statistics on the relevant tables and columns, and issue <codeph>COMPUTE - STATS</codeph> where necessary. + <codeph>DISABLE_UNSAFE_SPILLS</codeph> setting to identify cases where queries used more + memory than necessary due to lack of statistics on the relevant tables and columns, and + issue <codeph>COMPUTE STATS</codeph> where necessary. </p> <p> @@ -729,242 +780,307 @@ these tables, hint the plan or disable this behavior via query options to enable </p> <p> - You might wonder, why not leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned on all the time. Whether and - how frequently to use this option depends on your system environment and workload. + You might wonder why you would not leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned + on all the time. Whether and how frequently to use this option depends on your system + environment and workload. </p> <p> - <codeph>DISABLE_UNSAFE_SPILLS</codeph> is suitable for an environment with ad hoc queries whose performance - characteristics and memory usage are not known in advance. It prevents <q>worst-case scenario</q> queries - that use large amounts of memory unnecessarily. Thus, you might turn this option on within a session while - developing new SQL code, even though it is turned off for existing applications. + <codeph>DISABLE_UNSAFE_SPILLS</codeph> is suitable for an environment with ad hoc + queries whose performance characteristics and memory usage are not known in advance. It + prevents <q>worst-case scenario</q> queries that use large amounts of memory + unnecessarily. Thus, you might turn this option on within a session while developing new + SQL code, even though it is turned off for existing applications. </p> <p> - Organizations where table and column statistics are generally up-to-date might leave this option turned on - all the time, again to avoid worst-case scenarios for untested queries or if a problem in the ETL pipeline - results in a table with no statistics. Turning on <codeph>DISABLE_UNSAFE_SPILLS</codeph> lets you <q>fail - fast</q> in this case and immediately gather statistics or tune the problematic queries. + Organizations where table and column statistics are generally up-to-date might leave + this option turned on all the time, again to avoid worst-case scenarios for untested + queries or if a problem in the ETL pipeline results in a table with no statistics. + Turning on <codeph>DISABLE_UNSAFE_SPILLS</codeph> lets you <q>fail fast</q> in this case + and immediately gather statistics or tune the problematic queries. </p> <p> - Some organizations might leave this option turned off.
For example, you might have tables large enough that - the <codeph>COMPUTE STATS</codeph> takes substantial time to run, making it impractical to re-run after - loading new data. If you have examined the <codeph>EXPLAIN</codeph> plans of your queries and know that - they are operating efficiently, you might leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned off. In that - case, you know that any queries that spill will not go overboard with their memory consumption. + Some organizations might leave this option turned off. For example, you might have + tables large enough that the <codeph>COMPUTE STATS</codeph> takes substantial time to + run, making it impractical to re-run after loading new data. If you have examined the + <codeph>EXPLAIN</codeph> plans of your queries and know that they are operating + efficiently, you might leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned off. In that + case, you know that any queries that spill will not go overboard with their memory + consumption. </p> </conbody> + </concept> -<concept id="complex_query"> -<title>Limits on Query Size and Complexity</title> -<conbody> -<p> -There are hardcoded limits on the maximum size and complexity of queries. -Currently, the maximum number of expressions in a query is 2000. -You might exceed the limits with large or deeply nested queries -produced by business intelligence tools or other query generators. -</p> -<p> -If you have the ability to customize such queries or the query generation -logic that produces them, replace sequences of repetitive expressions -with single operators such as <codeph>IN</codeph> or <codeph>BETWEEN</codeph> -that can represent multiple values or ranges. -For example, instead of a large number of <codeph>OR</codeph> clauses: -</p> + <concept id="complex_query"> + + <title>Limits on Query Size and Complexity</title> + + <conbody> + + <p> + There are hardcoded limits on the maximum size and complexity of queries. Currently, the + maximum number of expressions in a query is 2000. You might exceed the limits with large + or deeply nested queries produced by business intelligence tools or other query + generators. + </p> + + <p> + If you have the ability to customize such queries or the query generation logic that + produces them, replace sequences of repetitive expressions with single operators such as + <codeph>IN</codeph> or <codeph>BETWEEN</codeph> that can represent multiple values or + ranges. For example, instead of a large number of <codeph>OR</codeph> clauses: + </p> + <codeblock>WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ... </codeblock> -<p> -use a single <codeph>IN</codeph> clause: -</p> + + <p> + use a single <codeph>IN</codeph> clause: + </p> + <codeblock>WHERE val IN (1,2,6,100,...)</codeblock> -</conbody> -</concept> -<concept id="scalability_io"> -<title>Scalability Considerations for Impala I/O</title> -<conbody> -<p> -Impala parallelizes its I/O operations aggressively, -therefore the more disks you can attach to each host, the better. -Impala retrieves data from disk so quickly using -bulk read operations on large blocks, that most queries -are CPU-bound rather than I/O-bound. -</p> -<p> -Because the kind of sequential scanning typically done by -Impala queries does not benefit much from the random-access -capabilities of SSDs, spinning disks typically provide -the most cost-effective kind of storage for Impala data, -with little or no performance penalty as compared to SSDs. 
-</p> -<p> -Resource management features such as YARN, Llama, and admission control -typically constrain the amount of memory, CPU, or overall number of -queries in a high-concurrency environment. -Currently, there is no throttling mechanism for Impala I/O. -</p> -</conbody> -</concept> + </conbody> + + </concept> + + <concept id="scalability_io"> + + <title>Scalability Considerations for Impala I/O</title> + + <conbody> + + <p> + Impala parallelizes its I/O operations aggressively; therefore, the more disks you can + attach to each host, the better. Impala retrieves data from disk so quickly, using bulk + read operations on large blocks, that most queries are CPU-bound rather than I/O-bound. + </p> + + <p> + Because the kind of sequential scanning typically done by Impala queries does not + benefit much from the random-access capabilities of SSDs, spinning disks typically + provide the most cost-effective kind of storage for Impala data, with little or no + performance penalty as compared to SSDs. + </p> + + <p> + Resource management features such as YARN, Llama, and admission control typically + constrain the amount of memory, CPU, or overall number of queries in a high-concurrency + environment. Currently, there is no throttling mechanism for Impala I/O. + </p> + + </conbody> + + </concept> <concept id="big_tables"> + <title>Scalability Considerations for Table Layout</title> + <conbody> + <p> - Due to the overhead of retrieving and updating table metadata - in the metastore database, try to limit the number of columns - in a table to a maximum of approximately 2000. - Although Impala can handle wider tables than this, the metastore overhead - can become significant, leading to query performance that is slower - than expected based on the actual data volume. + Due to the overhead of retrieving and updating table metadata in the metastore database, + try to limit the number of columns in a table to a maximum of approximately 2000. + Although Impala can handle wider tables than this, the metastore overhead can become + significant, leading to query performance that is slower than expected based on the + actual data volume. </p> + <p> - To minimize overhead related to the metastore database and Impala query planning, - try to limit the number of partitions for any partitioned table to a few tens of thousands. + To minimize overhead related to the metastore database and Impala query planning, try to + limit the number of partitions for any partitioned table to a few tens of thousands. </p> + <p rev="IMPALA-5309"> - If the volume of data within a table makes it impractical to run exploratory - queries, consider using the <codeph>TABLESAMPLE</codeph> clause to limit query processing - to only a percentage of data within the table. This technique reduces the overhead - for query startup, I/O to read the data, and the amount of network, CPU, and memory - needed to process intermediate results during the query. See <xref keyref="tablesample"/> - for details. + If the volume of data within a table makes it impractical to run exploratory queries, + consider using the <codeph>TABLESAMPLE</codeph> clause to limit query processing to only + a percentage of data within the table. This technique reduces the overhead for query + startup, I/O to read the data, and the amount of network, CPU, and memory needed to + process intermediate results during the query. See <xref keyref="tablesample"/> for + details.
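+ For example, the following sketch, with a hypothetical table and column, limits an + exploratory query to roughly 10 percent of the data files in the table: + <codeblock>select count(distinct col1) from huge_table tablesample system(10);</codeblock>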
</p> + </conbody> + </concept> -<concept rev="" id="kerberos_overhead_cluster_size"> -<title>Kerberos-Related Network Overhead for Large Clusters</title> -<conbody> -<p> -When Impala starts up, or after each <codeph>kinit</codeph> refresh, Impala sends a number of -simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might be able to process -all the requests within roughly 5 seconds. For a cluster with 1000 hosts, the time to process -the requests would be roughly 500 seconds. Impala also makes a number of DNS requests at the same -time as these Kerberos-related requests. -</p> -<p> -While these authentication requests are being processed, any submitted Impala queries will fail. -During this period, the KDC and DNS may be slow to respond to requests from components other than Impala, -so other secure services might be affected temporarily. -</p> - <p> - In <keyword keyref="impala212_full"/> or earlier, to reduce the - frequency of the <codeph>kinit</codeph> renewal that initiates a new set - of authentication requests, increase the <codeph>kerberos_reinit_interval</codeph> - configuration setting for the <codeph>impalad</codeph> daemons. Currently, - the default is 60 minutes. Consider using a higher value such as 360 (6 hours). - </p> - <p> - The <codeph>kerberos_reinit_interval</codeph> configuration setting is removed - in <keyword keyref="impala30_full"/>, and the above step is no longer needed. - </p> - -</conbody> -</concept> + <concept rev="" id="kerberos_overhead_cluster_size"> + + <title>Kerberos-Related Network Overhead for Large Clusters</title> + + <conbody> + + <p> + When Impala starts up, or after each <codeph>kinit</codeph> refresh, Impala sends a + number of simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might + be able to process all the requests within roughly 5 seconds. For a cluster with 1000 + hosts, the time to process the requests would be roughly 500 seconds. Impala also makes + a number of DNS requests at the same time as these Kerberos-related requests. + </p> + + <p> + While these authentication requests are being processed, any submitted Impala queries + will fail. During this period, the KDC and DNS may be slow to respond to requests from + components other than Impala, so other secure services might be affected temporarily. + </p> + + <p> + In <keyword keyref="impala212_full"/> or earlier, to reduce the frequency of the + <codeph>kinit</codeph> renewal that initiates a new set of authentication requests, + increase the <codeph>kerberos_reinit_interval</codeph> configuration setting for the + <codeph>impalad</codeph> daemons. Currently, the default is 60 minutes. Consider using a + higher value such as 360 (6 hours). + </p> + + <p> + The <codeph>kerberos_reinit_interval</codeph> configuration setting is removed in + <keyword keyref="impala30_full"/>, and the above step is no longer needed. + </p> + + </conbody> + + </concept> <concept id="scalability_hotspots" rev="2.5.0 IMPALA-2696"> + <title>Avoiding CPU Hotspots for HDFS Cached Data</title> + <conbody> + <p> - You can use the HDFS caching feature, described in <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>, - with Impala to reduce I/O and memory-to-memory copying for frequently accessed tables or partitions. + You can use the HDFS caching feature, described in + <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>, with Impala to reduce I/O and + memory-to-memory copying for frequently accessed tables or partitions. 
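+ For example, a table can be cached with a statement of the following form; the table + name and pool name here are hypothetical, and the cache pool must already exist in HDFS: + <codeblock>alter table census set cached in 'pool_name';</codeblock>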
</p> + <p> In the early days of this feature, you might have found that enabling HDFS caching resulted in little or no performance improvement, because it could result in - <q>hotspots</q>: instead of the I/O to read the table data being parallelized across - the cluster, the I/O was reduced but the CPU load to process the data blocks - might be concentrated on a single host. + <q>hotspots</q>: instead of the I/O to read the table data being parallelized across the + cluster, the I/O was reduced but the CPU load to process the data blocks might be + concentrated on a single host. </p> + <p> To avoid hotspots, include the <codeph>WITH REPLICATION</codeph> clause with the - <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements for tables that use HDFS caching. - This clause allows more than one host to cache the relevant data blocks, so the CPU load - can be shared, reducing the load on any one host. - See <xref href="impala_create_table.xml#create_table"/> and <xref href="impala_alter_table.xml#alter_table"/> - for details. + <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements for tables that + use HDFS caching. This clause allows more than one host to cache the relevant data + blocks, so the CPU load can be shared, reducing the load on any one host. See + <xref href="impala_create_table.xml#create_table"/> and + <xref href="impala_alter_table.xml#alter_table"/> for details. </p> + <p> Hotspots with high CPU load for HDFS cached data could still arise in some cases, due to - the way that Impala schedules the work of processing data blocks on different hosts. - In <keyword keyref="impala25_full"/> and higher, scheduling improvements mean that the work for - HDFS cached data is divided better among all the hosts that have cached replicas - for a particular data block. When more than one host has a cached replica for a data block, - Impala assigns the work of processing that block to whichever host has done the least work - (in terms of number of bytes read) for the current query. If hotspots persist even with this - load-based scheduling algorithm, you can enable the query option <codeph>SCHEDULE_RANDOM_REPLICA=TRUE</codeph> - to further distribute the CPU load. This setting causes Impala to randomly pick a host to process a cached - data block if the scheduling algorithm encounters a tie when deciding which host has done the - least work. + the way that Impala schedules the work of processing data blocks on different hosts. In + <keyword keyref="impala25_full"/> and higher, scheduling improvements mean that the work + for HDFS cached data is divided better among all the hosts that have cached replicas for + a particular data block. When more than one host has a cached replica for a data block, + Impala assigns the work of processing that block to whichever host has done the least + work (in terms of number of bytes read) for the current query. If hotspots persist even + with this load-based scheduling algorithm, you can enable the query option + <codeph>SCHEDULE_RANDOM_REPLICA=TRUE</codeph> to further distribute the CPU load. This + setting causes Impala to randomly pick a host to process a cached data block if the + scheduling algorithm encounters a tie when deciding which host has done the least work. 
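+ For example, continuing the hypothetical table and cache pool from above, the following + sketch caches the table's data blocks on up to 3 hosts and enables the random + tie-breaking behavior for the current session: + <codeblock>alter table census set cached in 'pool_name' with replication = 3; + set schedule_random_replica=true;</codeblock>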
</p> + </conbody> + </concept> <concept id="scalability_file_handle_cache" rev="2.10.0 IMPALA-4623"> + <title>Scalability Considerations for NameNode Traffic with File Handle Caching</title> + <conbody> + <p> One scalability aspect that affects heavily loaded clusters is the load on the HDFS - NameNode, from looking up the details as each HDFS file is opened. Impala queries - often access many different HDFS files, for example if a query does a full table scan - on a table with thousands of partitions, each partition containing multiple data files. - Accessing each column of a Parquet file also involves a separate <q>open</q> call, - further increasing the load on the NameNode. High NameNode overhead can add startup time - (that is, increase latency) to Impala queries, and reduce overall throughput for non-Impala - workloads that also require accessing HDFS files. - </p> - <p> In <keyword keyref="impala210_full"/> and higher, you can reduce - NameNode overhead by enabling a caching feature for HDFS file handles. - Data files that are accessed by different queries, or even multiple - times within the same query, can be accessed without a new <q>open</q> - call and without fetching the file details again from the NameNode. </p> - <p> - Because this feature only involves HDFS data files, it does not apply to non-HDFS tables, - such as Kudu or HBase tables, or tables that store their data on cloud services such as - S3 or ADLS. Any read operations that perform remote reads also skip the cached file handles. - </p> - <p> The feature is enabled by default with 20,000 file handles to be - cached. To change the value, set the configuration option - <codeph>max_cached_file_handles</codeph> to a non-zero value for each - <cmdname>impalad</cmdname> daemon. From the initial default value of - 20000, adjust upward if NameNode request load is still significant, or - downward if it is more important to reduce the extra memory usage on - each host. Each cache entry consumes 6 KB, meaning that caching 20,000 - file handles requires up to 120 MB on each Impala executor. The exact - memory usage varies depending on how many file handles have actually - been cached; memory is freed as file handles are evicted from the cache. </p> - <p> - If a manual HDFS operation moves a file to the HDFS Trashcan while the file handle is cached, - Impala still accesses the contents of that file. This is a change from prior behavior. Previously, - accessing a file that was in the trashcan would cause an error. This behavior only applies to - non-Impala methods of removing HDFS files, not the Impala mechanisms such as <codeph>TRUNCATE TABLE</codeph> - or <codeph>DROP TABLE</codeph>. - </p> - <p> - If files are removed, replaced, or appended by HDFS operations outside of Impala, the way to bring the - file information up to date is to run the <codeph>REFRESH</codeph> statement on the table. - </p> - <p> - File handle cache entries are evicted as the cache fills up, or based on a timeout period - when they have not been accessed for some time. - </p> - <p> - To evaluate the effectiveness of file handle caching for a particular workload, issue the - <codeph>PROFILE</codeph> statement in <cmdname>impala-shell</cmdname> or examine query - profiles in the Impala web UI. Look for the ratio of <codeph>CachedFileHandlesHitCount</codeph> - (ideally, should be high) to <codeph>CachedFileHandlesMissCount</codeph> (ideally, should be low). 
- Before starting any evaluation, run some representative queries to <q>warm up</q> the cache, - because the first time each data file is accessed is always recorded as a cache miss. + NameNode from looking up the details as each HDFS file is opened. Impala queries often + access many different HDFS files. For example, a query that does a full table scan on a + partitioned table may need to read thousands of partitions, each partition containing + multiple data files. Accessing each column of a Parquet file also involves a separate + <q>open</q> call, further increasing the load on the NameNode. High NameNode overhead + can add startup time (that is, increase latency) to Impala queries, and reduce overall + throughput for non-Impala workloads that also require accessing HDFS files. + </p> + + <p> + In <keyword keyref="impala210_full"/> and higher, you can reduce NameNode overhead by + enabling a caching feature for HDFS file handles. Data files that are accessed by + different queries, or even multiple times within the same query, can be accessed without + a new <q>open</q> call and without fetching the file details again from the NameNode. + </p> + + <p> + In Impala 3.2 and higher, file handle caching also applies to remote HDFS file handles. + This is controlled by the <codeph>cache_remote_file_handles</codeph> flag for an + <codeph>impalad</codeph>. It is recommended that you use the default value of + <codeph>true</codeph> as this caching prevents your NameNode from overloading when your + cluster has many remote HDFS reads. + </p> + + <p> + Because this feature only involves HDFS data files, it does not apply to non-HDFS + tables, such as Kudu or HBase tables, or tables that store their data on cloud services + such as S3 or ADLS. + </p> + + <p> + The feature is enabled by default with 20,000 file handles to be cached. To change the + value, set the configuration option <codeph>max_cached_file_handles</codeph> to a + non-zero value for each <cmdname>impalad</cmdname> daemon. From the initial default + value of 20000, adjust upward if NameNode request load is still significant, or downward + if it is more important to reduce the extra memory usage on each host. Each cache entry + consumes 6 KB, meaning that caching 20,000 file handles requires up to 120 MB on each + Impala executor. The exact memory usage varies depending on how many file handles have + actually been cached; memory is freed as file handles are evicted from the cache. + </p> + + <p> + If a manual HDFS operation moves a file to the HDFS Trashcan while the file handle is + cached, Impala still accesses the contents of that file. This is a change from prior + behavior. Previously, accessing a file that was in the trashcan would cause an error. + This behavior only applies to non-Impala methods of removing HDFS files, not the Impala + mechanisms such as <codeph>TRUNCATE TABLE</codeph> or <codeph>DROP TABLE</codeph>. + </p> + + <p> + If files are removed, replaced, or appended by HDFS operations outside of Impala, the + way to bring the file information up to date is to run the <codeph>REFRESH</codeph> + statement on the table. + </p> + + <p> + File handle cache entries are evicted as the cache fills up, or based on a timeout + period when they have not been accessed for some time. + </p> + + <p> + To evaluate the effectiveness of file handle caching for a particular workload, issue + the <codeph>PROFILE</codeph> statement in <cmdname>impala-shell</cmdname> or examine + query profiles in the Impala Web UI. 
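+ The relevant counters appear in the profile output in lines similar to the following + form: + <codeblock>- CachedFileHandlesHitCount: <varname>N</varname> + - CachedFileHandlesMissCount: <varname>N</varname></codeblock>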
Look for the ratio of + <codeph>CachedFileHandlesHitCount</codeph> (ideally, should be high) to + <codeph>CachedFileHandlesMissCount</codeph> (ideally, should be low). Before starting + any evaluation, run several representative queries to <q>warm up</q> the cache because + the first time each data file is accessed is always recorded as a cache miss. + </p> + + <p> To see metrics about file handle caching for each <cmdname>impalad</cmdname> instance, - examine the <uicontrol>/metrics</uicontrol> page in the Impala web UI, in particular the fields - <uicontrol>impala-server.io.mgr.cached-file-handles-miss-count</uicontrol>, + examine the <uicontrol>/metrics</uicontrol> page in the Impala Web UI, in particular the + fields <uicontrol>impala-server.io.mgr.cached-file-handles-miss-count</uicontrol>, <uicontrol>impala-server.io.mgr.cached-file-handles-hit-count</uicontrol>, and <uicontrol>impala-server.io.mgr.num-cached-file-handles</uicontrol>. </p> + </conbody> + </concept> </concept>