http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_scalability.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_scalability.html b/docs/build/html/topics/impala_scalability.html
new file mode 100644
index 0000000..a850d35
--- /dev/null
+++ b/docs/build/html/topics/impala_scalability.html
@@ -0,0 +1,711 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="scalability"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Scalability Considerations for Impala</title></head><body id="scalability"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Scalability Considerations for Impala</h1>
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+      This section explains how the size of your cluster and the volume of data influence SQL performance and
+      schema design for Impala tables. Typically, adding more cluster capacity reduces problems due to memory
+      limits or disk throughput. On the other hand, larger clusters are more likely to have other kinds of
+      scalability issues, such as a single slow node that causes performance problems for queries.
+    </p>
+
+    <p class="p toc inpage"></p>
+
+    <p class="p">
+      A good source of tips related to scalability and performance tuning is the
+      <a class="xref" href="http://www.slideshare.net/cloudera/the-impala-cookbook-42530186" target="_blank">Impala Cookbook</a>
+      presentation. These slides are updated periodically as new features come out and new benchmarks are performed.
+    </p>
+
+  </div>
+
+
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title2" id="scalability__scalability_catalog">
+
+    <h2 class="title topictitle2" id="ariaid-title2">Impact of Many Tables or Partitions on Impala Catalog Performance and Memory Usage</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized for tables
+        containing relatively few, large data files. Schemas containing thousands of tables, or tables containing
+        thousands of partitions, can encounter performance issues during startup or during DDL operations such as
+        <code class="ph codeph">ALTER TABLE</code> statements.
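+      </p>
+
+      <p class="p">
+        As a quick first check, you can inspect the JVM heap of the <span class="keyword cmdname">catalogd</span>
+        process. The following sketch assumes a <code class="ph codeph">pgrep</code>-style process lookup and an
+        illustrative process ID; the <span class="keyword cmdname">jcmd</span> and <span class="keyword cmdname">jmap</span>
+        commands are the same ones used in the procedure below.
+      </p>
+
+<pre class="pre codeblock"><code># Find the catalogd process ID (example output: 12345).
+pgrep -f catalogd
+# Show the JVM flags, including the maximum heap size, then summarize heap usage.
+jcmd 12345 VM.flags
+jmap -heap 12345
+</code></pre>
+
+      <p class="p">
+        If the heap is consistently close to its maximum, consider the procedure below to raise the limit.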
+ </p> + + <div class="note important note_important"><span class="note__title importanttitle">Important:</span> + <p class="p"> + Because of a change in the default heap size for the <span class="keyword cmdname">catalogd</span> daemon in + <span class="keyword">Impala 2.5</span> and higher, the following procedure to increase the <span class="keyword cmdname">catalogd</span> + memory limit might be required following an upgrade to <span class="keyword">Impala 2.5</span> even if not + needed previously. + </p> + </div> + + <div class="p"> + For schemas with large numbers of tables, partitions, and data files, the <span class="keyword cmdname">catalogd</span> + daemon might encounter an out-of-memory error. To increase the memory limit for the + <span class="keyword cmdname">catalogd</span> daemon: + + <ol class="ol"> + <li class="li"> + <p class="p"> + Check current memory usage for the <span class="keyword cmdname">catalogd</span> daemon by running the + following commands on the host where that daemon runs on your cluster: + </p> + <pre class="pre codeblock"><code> + jcmd <var class="keyword varname">catalogd_pid</var> VM.flags + jmap -heap <var class="keyword varname">catalogd_pid</var> + </code></pre> + </li> + <li class="li"> + <p class="p"> + Decide on a large enough value for the <span class="keyword cmdname">catalogd</span> heap. + You express it as an environment variable value as follows: + </p> + <pre class="pre codeblock"><code> + JAVA_TOOL_OPTIONS="-Xmx8g" + </code></pre> + </li> + <li class="li"> + <p class="p"> + On systems not using cluster management software, put this environment variable setting into the + startup script for the <span class="keyword cmdname">catalogd</span> daemon, then restart the <span class="keyword cmdname">catalogd</span> + daemon. + </p> + </li> + <li class="li"> + <p class="p"> + Use the same <span class="keyword cmdname">jcmd</span> and <span class="keyword cmdname">jmap</span> commands as earlier to + verify that the new settings are in effect. + </p> + </li> + </ol> + </div> + + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="scalability__statestore_scalability"> + + <h2 class="title topictitle2" id="ariaid-title3">Scalability Considerations for the Impala Statestore</h2> + + <div class="body conbody"> + + <p class="p"> + Before <span class="keyword">Impala 2.1</span>, the statestore sent only one kind of message to its subscribers. This message contained all + updates for any topics that a subscriber had subscribed to. It also served to let subscribers know that the + statestore had not failed, and conversely the statestore used the success of sending a heartbeat to a + subscriber to decide whether or not the subscriber had failed. + </p> + + <p class="p"> + Combining topic updates and failure detection in a single message led to bottlenecks in clusters with large + numbers of tables, partitions, and HDFS data blocks. When the statestore was overloaded with metadata + updates to transmit, heartbeat messages were sent less frequently, sometimes causing subscribers to time + out their connection with the statestore. Increasing the subscriber timeout and decreasing the frequency of + statestore heartbeats worked around the problem, but reduced responsiveness when the statestore failed or + restarted. + </p> + + <p class="p"> + As of <span class="keyword">Impala 2.1</span>, the statestore now sends topic updates and heartbeats in separate messages. 
This allows the + statestore to send and receive a steady stream of lightweight heartbeats, and removes the requirement to + send topic updates according to a fixed schedule, reducing statestore network overhead. + </p> + + <p class="p"> + The statestore now has the following relevant configuration flags for the <span class="keyword cmdname">statestored</span> + daemon: + </p> + + <dl class="dl"> + + + <dt class="dt dlterm" id="statestore_scalability__statestore_num_update_threads"> + <code class="ph codeph">-statestore_num_update_threads</code> + </dt> + + <dd class="dd"> + The number of threads inside the statestore dedicated to sending topic updates. You should not + typically need to change this value. + <p class="p"> + <strong class="ph b">Default:</strong> 10 + </p> + </dd> + + + + + + <dt class="dt dlterm" id="statestore_scalability__statestore_update_frequency_ms"> + <code class="ph codeph">-statestore_update_frequency_ms</code> + </dt> + + <dd class="dd"> + The frequency, in milliseconds, with which the statestore tries to send topic updates to each + subscriber. This is a best-effort value; if the statestore is unable to meet this frequency, it sends + topic updates as fast as it can. You should not typically need to change this value. + <p class="p"> + <strong class="ph b">Default:</strong> 2000 + </p> + </dd> + + + + + + <dt class="dt dlterm" id="statestore_scalability__statestore_num_heartbeat_threads"> + <code class="ph codeph">-statestore_num_heartbeat_threads</code> + </dt> + + <dd class="dd"> + The number of threads inside the statestore dedicated to sending heartbeats. You should not typically + need to change this value. + <p class="p"> + <strong class="ph b">Default:</strong> 10 + </p> + </dd> + + + + + + <dt class="dt dlterm" id="statestore_scalability__statestore_heartbeat_frequency_ms"> + <code class="ph codeph">-statestore_heartbeat_frequency_ms</code> + </dt> + + <dd class="dd"> + The frequency, in milliseconds, with which the statestore tries to send heartbeats to each subscriber. + This value should be good for large catalogs and clusters up to approximately 150 nodes. Beyond that, + you might need to increase this value to make the interval longer between heartbeat messages. + <p class="p"> + <strong class="ph b">Default:</strong> 1000 (one heartbeat message every second) + </p> + </dd> + + + </dl> + + <p class="p"> + If it takes a very long time for a cluster to start up, and <span class="keyword cmdname">impala-shell</span> consistently + displays <code class="ph codeph">This Impala daemon is not ready to accept user requests</code>, the statestore might be + taking too long to send the entire catalog topic to the cluster. In this case, consider adding + <code class="ph codeph">--load_catalog_in_background=false</code> to your catalog service configuration. This setting + stops the statestore from loading the entire catalog into memory at cluster startup. Instead, metadata for + each table is loaded when the table is accessed for the first time. + </p> + </div> + </article> + + + + + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="scalability__spill_to_disk"> + + <h2 class="title topictitle2" id="ariaid-title4">SQL Operations that Spill to Disk</h2> + + <div class="body conbody"> + + <p class="p"> + Certain memory-intensive operations write temporary data to disk (known as <dfn class="term">spilling</dfn> to disk) + when Impala is close to exceeding its memory limit on a particular host. 
+    </p>
+
+    <p class="p">
+      The result is a query that completes successfully, rather than failing with an out-of-memory error. The
+      tradeoff is decreased performance due to the extra disk I/O to write the temporary data and read it back
+      in. The slowdown could potentially be significant. Thus, while this feature improves reliability,
+      you should optimize your queries, system parameters, and hardware configuration to make this spilling a rare occurrence.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">What kinds of queries might spill to disk:</strong>
+    </p>
+
+    <p class="p">
+      Several SQL clauses and constructs require memory allocations that could activate the spilling mechanism:
+    </p>
+    <ul class="ul">
+      <li class="li">
+        <p class="p">
+          When a query uses a <code class="ph codeph">GROUP BY</code> clause for columns
+          with millions or billions of distinct values, Impala keeps a
+          similar number of temporary results in memory, to accumulate the
+          aggregate results for each value in the group.
+        </p>
+      </li>
+      <li class="li">
+        <p class="p">
+          When large tables are joined together, Impala keeps the values of
+          the join columns from one table in memory, to compare them to
+          incoming values from the other table.
+        </p>
+      </li>
+      <li class="li">
+        <p class="p">
+          When a large result set is sorted by the <code class="ph codeph">ORDER BY</code>
+          clause, each node sorts its portion of the result set in memory.
+        </p>
+      </li>
+      <li class="li">
+        <p class="p">
+          The <code class="ph codeph">DISTINCT</code> and <code class="ph codeph">UNION</code> operators
+          build in-memory data structures to represent all values found so
+          far, to eliminate duplicates as the query progresses.
+        </p>
+      </li>
+
+    </ul>
+
+    <p class="p">
+      When the spill-to-disk feature is activated for a join node within a query, Impala does not
+      produce any runtime filters for that join operation on that host. Other join nodes within
+      the query are not affected.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">How Impala handles scratch disk space for spilling:</strong>
+    </p>
+
+    <p class="p">
+      By default, intermediate files used during large sort, join, aggregation, or analytic function operations
+      are stored in the directory <span class="ph filepath">/tmp/impala-scratch</span>. These files are removed when the
+      operation finishes. (Multiple concurrent queries can perform operations that use the <span class="q">"spill to disk"</span>
+      technique, without any name conflicts for these temporary files.) You can specify a different location by
+      starting the <span class="keyword cmdname">impalad</span> daemon with the
+      <code class="ph codeph">--scratch_dirs="<var class="keyword varname">path_to_directory</var>"</code> configuration option.
+      You can specify a single directory, or a comma-separated list of directories. The scratch directories must
+      be on the local filesystem, not in HDFS. You might specify different directory paths for different hosts,
+      depending on the capacity and speed of the available storage devices. In <span class="keyword">Impala 2.3</span>
+      or higher, Impala successfully starts (with a warning written to the log) if it cannot create or read and write files
+      in one of the scratch directories. If there is less than 1 GB free on the filesystem where that directory resides,
+      Impala still runs, but writes a warning message to its log.
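+    </p>
+
+    <p class="p">
+      For example, the relevant part of an <span class="keyword cmdname">impalad</span> startup command might look like
+      the following sketch (the directory paths are illustrative; choose locations on fast local storage):
+    </p>
+
+<pre class="pre codeblock"><code>impalad --scratch_dirs="/data1/impala-scratch,/data2/impala-scratch" ...
+</code></pre>
+
+    <p class="p">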
If Impala encounters an error reading or writing
+      files in a scratch directory during a query, Impala logs the error and the query fails.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Memory usage for SQL operators:</strong>
+    </p>
+
+    <p class="p">
+      The infrastructure of the spilling feature affects the way the affected SQL operators, such as
+      <code class="ph codeph">GROUP BY</code>, <code class="ph codeph">DISTINCT</code>, and joins, use memory.
+      On each host that participates in the query, each such operator in a query accumulates memory
+      while building the data structure to process the aggregation or join operation. The amount
+      of memory used depends on the portion of the data being handled by that host, and thus might
+      be different from one host to another. When the amount of memory being used for the operator
+      on a particular host reaches a threshold amount, Impala reserves an additional memory buffer
+      to use as a work area in case that operator causes the query to exceed the memory limit for
+      that host. After allocating the memory buffer, the memory used by that operator remains
+      essentially stable or grows only slowly, until the point where the memory limit is reached
+      and the query begins writing temporary data to disk.
+    </p>
+
+    <p class="p">
+      Prior to Impala 2.2, the extra memory buffer for an operator that might spill to disk
+      was allocated when the data structure used by the applicable SQL operator reached 16 MB in size,
+      and the memory buffer itself was 512 MB. In Impala 2.2, these values are halved: the threshold value
+      is 8 MB and the memory buffer is 256 MB. <span class="ph">In <span class="keyword">Impala 2.3</span> and higher, the memory for the buffer
+      is allocated in pieces, only as needed, to avoid sudden large jumps in memory usage.</span> A query that uses
+      multiple such operators might allocate multiple such memory buffers, as the size of the data structure
+      for each operator crosses the threshold on a particular host.
+    </p>
+
+    <p class="p">
+      Therefore, a query that processes a relatively small amount of data on each host would likely
+      never reach the threshold for any operator, and would never allocate any extra memory buffers. A query
+      that processes millions of groups, distinct values, join keys, and so on might cross the threshold,
+      causing its memory requirement to rise suddenly and then flatten out. The larger the cluster, the less data is processed
+      on any particular host, thus reducing the chance of requiring the extra memory allocation.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Added in:</strong> This feature was added to the <code class="ph codeph">ORDER BY</code> clause in Impala 1.4.
+      This feature was extended to cover join queries, aggregation functions, and analytic
+      functions in Impala 2.0. The size of the memory work area required by
+      each operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Avoiding queries that spill to disk:</strong>
+    </p>
+
+    <p class="p">
+      Because the extra I/O can impose significant performance overhead on these types of queries, try to avoid
+      this situation by using the following steps:
+    </p>
+
+    <ol class="ol">
+      <li class="li">
+        Detect how often queries spill to disk, and how much temporary data is written.
Refer to the following + sources: + <ul class="ul"> + <li class="li"> + The output of the <code class="ph codeph">PROFILE</code> command in the <span class="keyword cmdname">impala-shell</span> + interpreter. This data shows the memory usage for each host and in total across the cluster. The + <code class="ph codeph">BlockMgr.BytesWritten</code> counter reports how much data was written to disk during the + query. + </li> + + <li class="li"> + The <span class="ph uicontrol">Queries</span> tab in the Impala debug web user interface. Select the query to + examine and click the corresponding <span class="ph uicontrol">Profile</span> link. This data breaks down the + memory usage for a single host within the cluster, the host whose web interface you are connected to. + </li> + </ul> + </li> + + <li class="li"> + Use one or more techniques to reduce the possibility of the queries spilling to disk: + <ul class="ul"> + <li class="li"> + Increase the Impala memory limit if practical, for example, if you can increase the available memory + by more than the amount of temporary data written to disk on a particular node. Remember that in + Impala 2.0 and later, you can issue <code class="ph codeph">SET MEM_LIMIT</code> as a SQL statement, which lets you + fine-tune the memory usage for queries from JDBC and ODBC applications. + </li> + + <li class="li"> + Increase the number of nodes in the cluster, to increase the aggregate memory available to Impala and + reduce the amount of memory required on each node. + </li> + + <li class="li"> + Increase the overall memory capacity of each DataNode at the hardware level. + </li> + + <li class="li"> + On a cluster with resources shared between Impala and other Hadoop components, use resource + management features to allocate more memory for Impala. See + <a class="xref" href="impala_resource_management.html#resource_management">Resource Management for Impala</a> for details. + </li> + + <li class="li"> + If the memory pressure is due to running many concurrent queries rather than a few memory-intensive + ones, consider using the Impala admission control feature to lower the limit on the number of + concurrent queries. By spacing out the most resource-intensive queries, you can avoid spikes in + memory usage and improve overall response times. See + <a class="xref" href="impala_admission.html#admission_control">Admission Control and Query Queuing</a> for details. + </li> + + <li class="li"> + Tune the queries with the highest memory requirements, using one or more of the following techniques: + <ul class="ul"> + <li class="li"> + Run the <code class="ph codeph">COMPUTE STATS</code> statement for all tables involved in large-scale joins and + aggregation queries. + </li> + + <li class="li"> + Minimize your use of <code class="ph codeph">STRING</code> columns in join columns. Prefer numeric values + instead. + </li> + + <li class="li"> + Examine the <code class="ph codeph">EXPLAIN</code> plan to understand the execution strategy being used for the + most resource-intensive queries. See <a class="xref" href="impala_explain_plan.html#perf_explain">Using the EXPLAIN Plan for Performance Tuning</a> for + details. + </li> + + <li class="li"> + If Impala still chooses a suboptimal execution strategy even with statistics available, or if it + is impractical to keep the statistics up to date for huge or rapidly changing tables, add hints + to the most resource-intensive queries to select the right execution strategy. 
See
+            <a class="xref" href="impala_hints.html#hints">Query Hints in Impala SELECT Statements</a> for details.
+          </li>
+        </ul>
+      </li>
+
+      <li class="li">
+        If your queries experience substantial performance overhead due to spilling, enable the
+        <code class="ph codeph">DISABLE_UNSAFE_SPILLS</code> query option. This option prevents queries whose memory usage
+        is likely to be exorbitant from spilling to disk. See
+        <a class="xref" href="impala_disable_unsafe_spills.html#disable_unsafe_spills">DISABLE_UNSAFE_SPILLS Query Option (Impala 2.0 or higher only)</a> for details. As you tune
+        problematic queries using the preceding steps, fewer and fewer will be cancelled by this option
+        setting.
+      </li>
+    </ul>
+  </li>
+</ol>
+
+    <p class="p">
+      <strong class="ph b">Testing performance implications of spilling to disk:</strong>
+    </p>
+
+    <p class="p">
+      To artificially provoke spilling, so that you can test this feature and understand the performance implications, use a
+      test environment with a memory limit of at least 2 GB. Issue the <code class="ph codeph">SET</code> command with no
+      arguments to check the current setting for the <code class="ph codeph">MEM_LIMIT</code> query option. Set the query
+      option <code class="ph codeph">DISABLE_UNSAFE_SPILLS=true</code>. This option limits the spill-to-disk feature to prevent
+      runaway disk usage from queries that are known in advance to be suboptimal. Within
+      <span class="keyword cmdname">impala-shell</span>, run a query that you expect to be memory-intensive, based on the criteria
+      explained earlier. A self-join of a large table is a good candidate:
+    </p>
+
+<pre class="pre codeblock"><code>select count(*) from big_table a join big_table b using (column_with_many_values);
+</code></pre>
+
+    <p class="p">
+      Issue the <code class="ph codeph">PROFILE</code> command to get a detailed breakdown of the memory usage on each node
+      during the query. The crucial part of the profile output concerning memory is the <code class="ph codeph">BlockMgr</code>
+      portion. For example, this profile shows that the query did not quite exceed the memory limit.
+    </p>
+
+<pre class="pre codeblock"><code>BlockMgr:
+   - BlockWritesIssued: 1
+   - BlockWritesOutstanding: 0
+   - BlocksCreated: 24
+   - BlocksRecycled: 1
+   - BufferedPins: 0
+   - MaxBlockSize: 8.00 MB (8388608)
+   <strong class="ph b">- MemoryLimit: 200.00 MB (209715200)</strong>
+   <strong class="ph b">- PeakMemoryUsage: 192.22 MB (201555968)</strong>
+   - TotalBufferWaitTime: 0ns
+   - TotalEncryptionTime: 0ns
+   - TotalIntegrityCheckTime: 0ns
+   - TotalReadBlockTime: 0ns
+</code></pre>
+
+    <p class="p">
+      In this case, because the memory limit was already below any recommended value, the next step is to increase
+      the volume of data for the query rather than to reduce the memory limit any further.
+    </p>
+
+    <p class="p">
+      Set the <code class="ph codeph">MEM_LIMIT</code> query option to a value that is smaller than the peak memory usage
+      reported in the profile output. Do not specify a memory limit lower than about 300 MB, because with such a
+      low limit, queries could fail to start for other reasons. Now try the memory-intensive query again.
+    </p>
+
+    <p class="p">
+      Check if the query fails with a message like the following:
+    </p>
+
+<pre class="pre codeblock"><code>WARNINGS: Spilling has been disabled for plans that do not have stats and are not hinted
+to prevent potentially bad plans from using too many cluster resources. Compute stats on
+these tables, hint the plan or disable this behavior via query options to enable spilling.
+</code></pre> + + <p class="p"> + If so, the query could have consumed substantial temporary disk space, slowing down so much that it would + not complete in any reasonable time. Rather than rely on the spill-to-disk feature in this case, issue the + <code class="ph codeph">COMPUTE STATS</code> statement for the table or tables in your sample query. Then run the query + again, check the peak memory usage again in the <code class="ph codeph">PROFILE</code> output, and adjust the memory + limit again if necessary to be lower than the peak memory usage. + </p> + + <p class="p"> + At this point, you have a query that is memory-intensive, but Impala can optimize it efficiently so that + the memory usage is not exorbitant. You have set an artificial constraint through the + <code class="ph codeph">MEM_LIMIT</code> option so that the query would normally fail with an out-of-memory error. But + the automatic spill-to-disk feature means that the query should actually succeed, at the expense of some + extra disk I/O to read and write temporary work data. + </p> + + <p class="p"> + Try the query again, and confirm that it succeeds. Examine the <code class="ph codeph">PROFILE</code> output again. This + time, look for lines of this form: + </p> + +<pre class="pre codeblock"><code>- SpilledPartitions: <var class="keyword varname">N</var> +</code></pre> + + <p class="p"> + If you see any such lines with <var class="keyword varname">N</var> greater than 0, that indicates the query would have + failed in Impala releases prior to 2.0, but now it succeeded because of the spill-to-disk feature. Examine + the total time taken by the <code class="ph codeph">AGGREGATION_NODE</code> or other query fragments containing non-zero + <code class="ph codeph">SpilledPartitions</code> values. Compare the times to similar fragments that did not spill, for + example in the <code class="ph codeph">PROFILE</code> output when the same query is run with a higher memory limit. This + gives you an idea of the performance penalty of the spill operation for a particular query with a + particular memory limit. If you make the memory limit just a little lower than the peak memory usage, the + query only needs to write a small amount of temporary data to disk. The lower you set the memory limit, the + more temporary data is written and the slower the query becomes. + </p> + + <p class="p"> + Now repeat this procedure for actual queries used in your environment. Use the + <code class="ph codeph">DISABLE_UNSAFE_SPILLS</code> setting to identify cases where queries used more memory than + necessary due to lack of statistics on the relevant tables and columns, and issue <code class="ph codeph">COMPUTE + STATS</code> where necessary. + </p> + + <p class="p"> + <strong class="ph b">When to use DISABLE_UNSAFE_SPILLS:</strong> + </p> + + <p class="p"> + You might wonder, why not leave <code class="ph codeph">DISABLE_UNSAFE_SPILLS</code> turned on all the time. Whether and + how frequently to use this option depends on your system environment and workload. + </p> + + <p class="p"> + <code class="ph codeph">DISABLE_UNSAFE_SPILLS</code> is suitable for an environment with ad hoc queries whose performance + characteristics and memory usage are not known in advance. It prevents <span class="q">"worst-case scenario"</span> queries + that use large amounts of memory unnecessarily. Thus, you might turn this option on within a session while + developing new SQL code, even though it is turned off for existing applications. 
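+  </p>
+
+  <p class="p">
+    For example, during interactive development you might enable the option for just the current
+    session (a sketch; the values shown are the standard Boolean settings for query options):
+  </p>
+
+<pre class="pre codeblock"><code>set DISABLE_UNSAFE_SPILLS=true;
+-- Develop and test the new queries...
+set DISABLE_UNSAFE_SPILLS=false;
+</code></pre>
+
+  <p class="p">
+    Any query blocked by the option is a signal to gather statistics or tune that query before it ships.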
+  </p>
+
+  <p class="p">
+    Organizations where table and column statistics are generally up-to-date might leave this option turned on
+    all the time, again to avoid worst-case scenarios for untested queries or if a problem in the ETL pipeline
+    results in a table with no statistics. Turning on <code class="ph codeph">DISABLE_UNSAFE_SPILLS</code> lets you <span class="q">"fail
+    fast"</span> in this case and immediately gather statistics or tune the problematic queries.
+  </p>
+
+  <p class="p">
+    Some organizations might leave this option turned off. For example, you might have tables large enough that
+    the <code class="ph codeph">COMPUTE STATS</code> statement takes substantial time to run, making it impractical to re-run after
+    loading new data. If you have examined the <code class="ph codeph">EXPLAIN</code> plans of your queries and know that
+    they are operating efficiently, you might leave <code class="ph codeph">DISABLE_UNSAFE_SPILLS</code> turned off. In that
+    case, you know that any queries that spill will not go overboard with their memory consumption.
+  </p>
+
+  </div>
+  </article>
+
+<article class="topic concept nested1" aria-labelledby="ariaid-title5" id="scalability__complex_query">
+<h2 class="title topictitle2" id="ariaid-title5">Limits on Query Size and Complexity</h2>
+<div class="body conbody">
+<p class="p">
+There are hardcoded limits on the maximum size and complexity of queries.
+Currently, the maximum number of expressions in a query is 2000.
+You might exceed the limits with large or deeply nested queries
+produced by business intelligence tools or other query generators.
+</p>
+<p class="p">
+If you have the ability to customize such queries or the query generation
+logic that produces them, replace sequences of repetitive expressions
+with single operators such as <code class="ph codeph">IN</code> or <code class="ph codeph">BETWEEN</code>
+that can represent multiple values or ranges.
+For example, instead of a large number of <code class="ph codeph">OR</code> clauses:
+</p>
+<pre class="pre codeblock"><code>WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ...
+</code></pre>
+<p class="p">
+use a single <code class="ph codeph">IN</code> clause:
+</p>
+<pre class="pre codeblock"><code>WHERE val IN (1,2,6,100,...)</code></pre>
+</div>
+</article>
+
+<article class="topic concept nested1" aria-labelledby="ariaid-title6" id="scalability__scalability_io">
+<h2 class="title topictitle2" id="ariaid-title6">Scalability Considerations for Impala I/O</h2>
+<div class="body conbody">
+<p class="p">
+Impala parallelizes its I/O operations aggressively; therefore, the more
+disks you can attach to each host, the better.
+Impala retrieves data from disk so quickly,
+using bulk read operations on large blocks, that most queries
+are CPU-bound rather than I/O-bound.
+</p>
+<p class="p">
+Because the kind of sequential scanning typically done by
+Impala queries does not benefit much from the random-access
+capabilities of SSDs, spinning disks typically provide
+the most cost-effective kind of storage for Impala data,
+with little or no performance penalty as compared to SSDs.
+</p>
+<p class="p">
+Resource management features such as YARN, Llama, and admission control
+typically constrain the amount of memory, CPU, or overall number of
+queries in a high-concurrency environment.
+Currently, there is no throttling mechanism for Impala I/O.
+</p> +</div> +</article> + +<article class="topic concept nested1" aria-labelledby="ariaid-title7" id="scalability__big_tables"> +<h2 class="title topictitle2" id="ariaid-title7">Scalability Considerations for Table Layout</h2> +<div class="body conbody"> +<p class="p"> +Due to the overhead of retrieving and updating table metadata +in the metastore database, try to limit the number of columns +in a table to a maximum of approximately 2000. +Although Impala can handle wider tables than this, the metastore overhead +can become significant, leading to query performance that is slower +than expected based on the actual data volume. +</p> +<p class="p"> +To minimize overhead related to the metastore database and Impala query planning, +try to limit the number of partitions for any partitioned table to a few tens of thousands. +</p> +</div> +</article> + +<article class="topic concept nested1" aria-labelledby="ariaid-title8" id="scalability__kerberos_overhead_cluster_size"> +<h2 class="title topictitle2" id="ariaid-title8">Kerberos-Related Network Overhead for Large Clusters</h2> +<div class="body conbody"> +<p class="p"> +When Impala starts up, or after each <code class="ph codeph">kinit</code> refresh, Impala sends a number of +simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might be able to process +all the requests within roughly 5 seconds. For a cluster with 1000 hosts, the time to process +the requests would be roughly 500 seconds. Impala also makes a number of DNS requests at the same +time as these Kerberos-related requests. +</p> +<p class="p"> +While these authentication requests are being processed, any submitted Impala queries will fail. +During this period, the KDC and DNS may be slow to respond to requests from components other than Impala, +so other secure services might be affected temporarily. +</p> + +<p class="p"> + To reduce the frequency of the <code class="ph codeph">kinit</code> renewal that initiates + a new set of authentication requests, increase the <code class="ph codeph">kerberos_reinit_interval</code> + configuration setting for the <span class="keyword cmdname">impalad</span> daemons. Currently, the default is 60 minutes. + Consider using a higher value such as 360 (6 hours). +</p> + +</div> +</article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title9" id="scalability__scalability_hotspots"> + <h2 class="title topictitle2" id="ariaid-title9">Avoiding CPU Hotspots for HDFS Cached Data</h2> + <div class="body conbody"> + <p class="p"> + You can use the HDFS caching feature, described in <a class="xref" href="impala_perf_hdfs_caching.html#hdfs_caching">Using HDFS Caching with Impala (Impala 2.1 or higher only)</a>, + with Impala to reduce I/O and memory-to-memory copying for frequently accessed tables or partitions. + </p> + <p class="p"> + In the early days of this feature, you might have found that enabling HDFS caching + resulted in little or no performance improvement, because it could result in + <span class="q">"hotspots"</span>: instead of the I/O to read the table data being parallelized across + the cluster, the I/O was reduced but the CPU load to process the data blocks + might be concentrated on a single host. + </p> + <p class="p"> + To avoid hotspots, include the <code class="ph codeph">WITH REPLICATION</code> clause with the + <code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">ALTER TABLE</code> statements for tables that use HDFS caching. 
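+      </p>
+
+      <p class="p">
+        For example, the following sketch caches an existing table with multiple cached replicas
+        (the table name, pool name, and replication factor are illustrative):
+      </p>
+
+<pre class="pre codeblock"><code>alter table census set cached in 'pool_name' with replication = 3;
+</code></pre>
+
+      <p class="p">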
+ This clause allows more than one host to cache the relevant data blocks, so the CPU load + can be shared, reducing the load on any one host. + See <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a> and <a class="xref" href="impala_alter_table.html#alter_table">ALTER TABLE Statement</a> + for details. + </p> + <p class="p"> + Hotspots with high CPU load for HDFS cached data could still arise in some cases, due to + the way that Impala schedules the work of processing data blocks on different hosts. + In <span class="keyword">Impala 2.5</span> and higher, scheduling improvements mean that the work for + HDFS cached data is divided better among all the hosts that have cached replicas + for a particular data block. When more than one host has a cached replica for a data block, + Impala assigns the work of processing that block to whichever host has done the least work + (in terms of number of bytes read) for the current query. If hotspots persist even with this + load-based scheduling algorithm, you can enable the query option <code class="ph codeph">SCHEDULE_RANDOM_REPLICA=TRUE</code> + to further distribute the CPU load. This setting causes Impala to randomly pick a host to process a cached + data block if the scheduling algorithm encounters a tie when deciding which host has done the + least work. + </p> + </div> + </article> + +</article></main></body></html> \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_scan_node_codegen_threshold.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_scan_node_codegen_threshold.html b/docs/build/html/topics/impala_scan_node_codegen_threshold.html
new file mode 100644
index 0000000..2e71e50
--- /dev/null
+++ b/docs/build/html/topics/impala_scan_node_codegen_threshold.html
@@ -0,0 +1,69 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="scan_node_codegen_threshold"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>SCAN_NODE_CODEGEN_THRESHOLD Query Option (Impala 2.5 or higher only)</title></head><body id="scan_node_codegen_threshold"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">SCAN_NODE_CODEGEN_THRESHOLD Query Option (<span class="keyword">Impala 2.5</span> or higher only)</h1>
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+
+      The <code class="ph codeph">SCAN_NODE_CODEGEN_THRESHOLD</code> query option
+      adjusts the aggressiveness of the code generation optimization process
+      when performing I/O read operations. It can help to work around performance problems
+      for queries where the table is small and the <code class="ph codeph">WHERE</code> clause is complicated.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Type:</strong> integer
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Default:</strong> 1800000 (1.8 million)
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.5.0</span>
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Usage notes:</strong>
+    </p>
+
+    <p class="p">
+      This query option is intended mainly for the case where a query with a very complicated
+      <code class="ph codeph">WHERE</code> clause, such as an <code class="ph codeph">IN</code> operator with thousands
+      of entries, is run against a small table, especially a small table using Parquet format.
+      The code generation phase can become the dominant factor in the query response time,
+      making the query take several seconds even though there is relatively little work to do.
+      In this case, increase the value of this option to a much larger amount, anything up to
+      the maximum for a 32-bit integer.
+    </p>
+
+    <p class="p">
+      Because this option only affects the code generation phase for the portion of the
+      query that performs I/O (the <dfn class="term">scan nodes</dfn> within the query plan), it
+      lets you continue to keep code generation enabled for other queries, and other parts
+      of the same query, that can benefit from it. In contrast, the
+      <code class="ph codeph">DISABLE_CODEGEN</code> query option turns off code generation entirely.
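+    </p>
+
+    <p class="p">
+      For example, the following sketch raises the threshold within a session so that code generation
+      is very unlikely to trigger for a scan of a small table (the table, column, and value are illustrative):
+    </p>
+
+<pre class="pre codeblock"><code>set SCAN_NODE_CODEGEN_THRESHOLD=2000000000;
+-- A small table with a long IN list; code generation for the scan
+-- could otherwise dominate the total query time.
+select count(*) from small_table where id in (0, 1, 2, 3 /* ... thousands of entries ... */);
+</code></pre>
+
+    <p class="p">
+      Because this is a query option, the change affects only the current session.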
+ </p> + + <p class="p"> + Because of the way the work for queries is divided internally, this option might not + affect code generation for all kinds of queries. If a plan fragment contains a scan + node and some other kind of plan node, code generation still occurs regardless of + this option setting. + </p> + + <p class="p"> + To use this option effectively, you should be familiar with reading query profile output + to determine the proportion of time spent in the code generation phase, and whether + code generation is enabled or not for specific plan fragments. + </p> + + + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_schedule_random_replica.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_schedule_random_replica.html b/docs/build/html/topics/impala_schedule_random_replica.html new file mode 100644 index 0000000..9826960 --- /dev/null +++ b/docs/build/html/topics/impala_schedule_random_replica.html @@ -0,0 +1,83 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="schedule_random_replica"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>SCHEDULE_RANDOM_REPLICA Query Option (Impala 2.5 or higher only)</title></head><body id="schedule_random_replica"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">SCHEDULE_RANDOM_REPLICA Query Option (<span class="keyword">Impala 2.5</span> or higher only)</h1> + + + + <div class="body conbody"> + + <p class="p"> + + </p> + + <p class="p"> + The <code class="ph codeph">SCHEDULE_RANDOM_REPLICA</code> query option fine-tunes the algorithm for deciding which host + processes each HDFS data block. It only applies to tables and partitions that are not enabled + for the HDFS caching feature. + </p> + + <p class="p"> + <strong class="ph b">Type:</strong> Boolean; recognized values are 1 and 0, or <code class="ph codeph">true</code> and <code class="ph codeph">false</code>; + any other value interpreted as <code class="ph codeph">false</code> + </p> + + <p class="p"> + <strong class="ph b">Default:</strong> <code class="ph codeph">false</code> + </p> + + <p class="p"> + <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.5.0</span> + </p> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + + <p class="p"> + In the presence of HDFS cached replicas, Impala randomizes + which host processes each cached data block. 
To ensure that HDFS data blocks are cached on more
+      than one host, use the <code class="ph codeph">WITH REPLICATION</code> clause along with
+      the <code class="ph codeph">CACHED IN</code> clause in a
+      <code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">ALTER TABLE</code> statement.
+      Specify a replication value greater than or equal to the HDFS block replication factor.
+    </p>
+
+    <p class="p">
+      The <code class="ph codeph">SCHEDULE_RANDOM_REPLICA</code> query option applies to tables and partitions
+      that <em class="ph i">do not</em> use HDFS caching.
+      By default, Impala estimates how much work each host has done for
+      the query, and selects the host that has the lowest workload.
+      This algorithm is intended to reduce CPU hotspots arising when the
+      same host is selected to process multiple data blocks, but hotspots
+      might still arise for some combinations of queries and data layout.
+      When the <code class="ph codeph">SCHEDULE_RANDOM_REPLICA</code> option is enabled,
+      Impala also randomizes the scheduling algorithm for non-HDFS cached blocks,
+      which can further reduce the chance of CPU hotspots.
+    </p>
+
+    <p class="p">
+      This query option works in conjunction with the work scheduling improvements
+      in <span class="keyword">Impala 2.5</span> and higher. The scheduling improvements
+      distribute the processing for cached HDFS data blocks to minimize hotspots:
+      if a data block is cached on more than one host, Impala chooses which host
+      to process each block based on which host has read the fewest bytes during
+      the current query. Enable the <code class="ph codeph">SCHEDULE_RANDOM_REPLICA</code> setting if CPU hotspots
+      still persist because of cases where hosts are <span class="q">"tied"</span> in terms of
+      the amount of work done; by default, Impala picks the first eligible host
+      in this case.
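+    </p>
+
+    <p class="p">
+      For example, to break such ties randomly for subsequent queries in the current session
+      (a sketch; enable it only after confirming that hotspots persist):
+    </p>
+
+<pre class="pre codeblock"><code>set SCHEDULE_RANDOM_REPLICA=true;
+</code></pre>
+
+    <p class="p">
+      Setting the option back to <code class="ph codeph">false</code> restores the default behavior.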
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Related information:</strong>
+    </p>
+    <p class="p">
+      <a class="xref" href="impala_perf_hdfs_caching.html#hdfs_caching">Using HDFS Caching with Impala (Impala 2.1 or higher only)</a>,
+      <a class="xref" href="impala_scalability.html#scalability_hotspots">Avoiding CPU Hotspots for HDFS Cached Data</a>,
+      <a class="xref" href="impala_replica_preference.html#replica_preference">REPLICA_PREFERENCE Query Option (Impala 2.7 or higher only)</a>
+    </p>
+
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_schema_design.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_schema_design.html b/docs/build/html/topics/impala_schema_design.html
new file mode 100644
index 0000000..6825c5d
--- /dev/null
+++ b/docs/build/html/topics/impala_schema_design.html
@@ -0,0 +1,184 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_planning.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="schema_design"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Guidelines for Designing Impala Schemas</title></head><body id="schema_design"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Guidelines for Designing Impala Schemas</h1>
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+      The guidelines in this topic help you to construct an optimized and scalable schema, one that integrates well
+      with your existing data management processes. Use these guidelines as a checklist when doing any
+      proof-of-concept work, porting exercise, or before deploying to production.
+    </p>
+
+    <p class="p">
+      If you are adapting an existing database or Hive schema for use with Impala, read the guidelines in this
+      section and then see <a class="xref" href="impala_porting.html#porting">Porting SQL from Other Database Systems to Impala</a> for specific porting and compatibility tips.
+    </p>
+
+    <p class="p toc inpage"></p>
+
+    <section class="section" id="schema_design__schema_design_text_vs_binary"><h2 class="title sectiontitle">Prefer binary file formats over text-based formats.</h2>
+
+      <p class="p">
+        To save space and improve memory usage and query performance, use binary file formats for any large or
+        intensively queried tables. Parquet file format is the most efficient for data warehouse-style analytic
+        queries. Avro is the other binary file format that Impala supports, which you might already have as part of
+        a Hadoop ETL pipeline.
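+      </p>
+
+      <p class="p">
+        For example, a minimal sketch of an analytic table stored in Parquet format
+        (the table and column names are illustrative):
+      </p>
+
+<pre class="pre codeblock"><code>create table sales_fact (id bigint, amount decimal(12,2), sale_date timestamp)
+  stored as parquet;
+</code></pre>
+
+      <p class="p">
+        A raw text staging table can be converted into such a table with an
+        <code class="ph codeph">INSERT ... SELECT</code> statement, as the guidelines below suggest.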
+ </p> + + <p class="p"> + Although Impala can create and query tables with the RCFile and SequenceFile file formats, such tables are + relatively bulky due to the text-based nature of those formats, and are not optimized for data + warehouse-style queries due to their row-oriented layout. Impala does not support <code class="ph codeph">INSERT</code> + operations for tables with these file formats. + </p> + + <p class="p"> + Guidelines: + </p> + + <ul class="ul"> + <li class="li"> + For an efficient and scalable format for large, performance-critical tables, use the Parquet file format. + </li> + + <li class="li"> + To deliver intermediate data during the ETL process, in a format that can also be used by other Hadoop + components, Avro is a reasonable choice. + </li> + + <li class="li"> + For convenient import of raw data, use a text table instead of RCFile or SequenceFile, and convert to + Parquet in a later stage of the ETL process. + </li> + </ul> + </section> + + <section class="section" id="schema_design__schema_design_compression"><h2 class="title sectiontitle">Use Snappy compression where practical.</h2> + + + + <p class="p"> + Snappy compression involves low CPU overhead to decompress, while still providing substantial space + savings. In cases where you have a choice of compression codecs, such as with the Parquet and Avro file + formats, use Snappy compression unless you find a compelling reason to use a different codec. + </p> + </section> + + <section class="section" id="schema_design__schema_design_numeric_types"><h2 class="title sectiontitle">Prefer numeric types over strings.</h2> + + + + <p class="p"> + If you have numeric values that you could treat as either strings or numbers (such as + <code class="ph codeph">YEAR</code>, <code class="ph codeph">MONTH</code>, and <code class="ph codeph">DAY</code> for partition key columns), define + them as the smallest applicable integer types. For example, <code class="ph codeph">YEAR</code> can be + <code class="ph codeph">SMALLINT</code>, <code class="ph codeph">MONTH</code> and <code class="ph codeph">DAY</code> can be <code class="ph codeph">TINYINT</code>. + Although you might not see any difference in the way partitioned tables or text files are laid out on disk, + using numeric types will save space in binary formats such as Parquet, and in memory when doing queries, + particularly resource-intensive queries such as joins. + </p> + </section> + + + + <section class="section" id="schema_design__schema_design_partitioning"><h2 class="title sectiontitle">Partition, but do not over-partition.</h2> + + + + <p class="p"> + Partitioning is an important aspect of performance tuning for Impala. Follow the procedures in + <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a> to set up partitioning for your biggest, most + intensively queried tables. + </p> + + <p class="p"> + If you are moving to Impala from a traditional database system, or just getting started in the Big Data + field, you might not have enough data volume to take advantage of Impala parallel queries with your + existing partitioning scheme. For example, if you have only a few tens of megabytes of data per day, + partitioning by <code class="ph codeph">YEAR</code>, <code class="ph codeph">MONTH</code>, and <code class="ph codeph">DAY</code> columns might be + too granular. Most of your cluster might be sitting idle during queries that target a single day, or each + node might have very little work to do. 
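+      </p>
+
+      <p class="p">
+        For example, the following hypothetical table is partitioned by year and month rather than by
+        individual day, so that each partition holds a month's worth of data (the schema is illustrative):
+      </p>
+
+<pre class="pre codeblock"><code>create table web_logs (event_ts timestamp, details string)
+  partitioned by (year smallint, month tinyint)
+  stored as parquet;
+</code></pre>
+
+      <p class="p">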
Consider reducing the number of partition key columns so that each
+        partition directory contains several gigabytes worth of data.
+      </p>
+
+      <p class="p">
+        For example, consider a Parquet table where each data file is 1 HDFS block, with a maximum block size of 1
+        GB. (In Impala 2.0 and later, the default Parquet block size is reduced to 256 MB. For this exercise, let's
+        assume you have bumped the size back up to 1 GB by setting the query option
+        <code class="ph codeph">PARQUET_FILE_SIZE=1g</code>.) If you have a 10-node cluster, you need 10 data files (up to 10 GB)
+        to give each node some work to do for a query. But each core on each machine can process a separate data
+        block in parallel. With 16-core machines on a 10-node cluster, a query could process up to 160 GB fully in
+        parallel. If there are only a few data files per partition, not only are most cluster nodes sitting idle
+        during queries, so are most cores on those machines.
+      </p>
+
+      <p class="p">
+        You can reduce the Parquet block size to as low as 128 MB or 64 MB to increase the number of files per
+        partition and improve parallelism. But also consider reducing the level of partitioning so that analytic
+        queries have enough data to work with.
+      </p>
+    </section>
+
+    <section class="section" id="schema_design__schema_design_compute_stats"><h2 class="title sectiontitle">Always compute stats after loading data.</h2>
+
+      <p class="p">
+        Impala makes extensive use of statistics about data in the overall table and in each column, to help plan
+        resource-intensive operations such as join queries and inserting into partitioned Parquet tables. Because
+        this information is only available after data is loaded, run the <code class="ph codeph">COMPUTE STATS</code> statement
+        on a table after loading or replacing data in a table or partition.
+      </p>
+
+      <p class="p">
+        Having accurate statistics can make the difference between a successful operation and one that fails due to
+        an out-of-memory error or a timeout. When you encounter performance or capacity issues, always use the
+        <code class="ph codeph">SHOW TABLE STATS</code> and <code class="ph codeph">SHOW COLUMN STATS</code> statements to check whether
+        the statistics are present and up-to-date for all tables in the query.
+      </p>
+
+      <p class="p">
+        When doing a join query, Impala consults the statistics for each joined table to determine their relative
+        sizes and to estimate the number of rows produced in each join stage. When doing an <code class="ph codeph">INSERT</code>
+        into a Parquet table, Impala consults the statistics for the source table to determine how to distribute
+        the work of constructing the data files for each partition.
+      </p>
+
+      <p class="p">
+        See <a class="xref" href="impala_compute_stats.html#compute_stats">COMPUTE STATS Statement</a> for the syntax of the <code class="ph codeph">COMPUTE
+        STATS</code> statement, and <a class="xref" href="impala_perf_stats.html#perf_stats">Table and Column Statistics</a> for all the performance
+        considerations for table and column statistics.
+      </p>
+    </section>
+
+    <section class="section" id="schema_design__schema_design_explain"><h2 class="title sectiontitle">Verify sensible execution plans with EXPLAIN and SUMMARY.</h2>
+
+      <p class="p">
+        Before executing a resource-intensive query, use the <code class="ph codeph">EXPLAIN</code> statement to get an overview
+        of how Impala intends to parallelize the query and distribute the work.
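+      </p>
+
+      <p class="p">
+        For example (the table and column names are illustrative):
+      </p>
+
+<pre class="pre codeblock"><code>explain select count(*) from sales_fact join store_dim using (store_id);
+</code></pre>
+
+      <p class="p">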
If you see that the query plan is
+        inefficient, you can take tuning steps such as changing file formats, using partitioned tables, running the
+        <code class="ph codeph">COMPUTE STATS</code> statement, or adding query hints. For information about all of these
+        techniques, see <a class="xref" href="impala_performance.html#performance">Tuning Impala for Performance</a>.
+      </p>
+
+      <p class="p">
+        After you run a query, you can see performance-related information about how it actually ran by issuing the
+        <code class="ph codeph">SUMMARY</code> command in <span class="keyword cmdname">impala-shell</span>. Prior to Impala 1.4, you would use
+        the <code class="ph codeph">PROFILE</code> command, but its highly technical output was only useful for the most
+        experienced users. <code class="ph codeph">SUMMARY</code>, new in Impala 1.4, summarizes the most useful information for
+        all stages of execution, for all nodes rather than splitting out figures for each node.
+      </p>
+    </section>
+
+
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_planning.html">Planning for Impala Deployment</a></div></div></nav></article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_schema_objects.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_schema_objects.html b/docs/build/html/topics/impala_schema_objects.html
new file mode 100644
index 0000000..b8ea7cd
--- /dev/null
+++ b/docs/build/html/topics/impala_schema_objects.html
@@ -0,0 +1,48 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_langref.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_aliases.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_databases.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_functions_overview.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_identifiers.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_tables.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_views.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="schema_objects"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Impala Schema Objects and Object Names</title></head><body id="schema_objects"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Impala Schema Objects and Object Names</h1>
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+
+      With Impala, you work with schema objects that are familiar to database users: primarily databases, tables, views,
+      and functions. The SQL syntax to work with these objects is explained in
+      <a class="xref" href="impala_langref_sql.html#langref_sql">Impala SQL Statements</a>.
This section explains the conceptual knowledge you need to + work with these objects and the various ways to specify their names. + </p> + + <p class="p"> + Within a table, partitions can also be considered a kind of object. Partitioning is an important subject for + Impala, with its own documentation section covering use cases and performance considerations. See + <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a> for details. + </p> + + <p class="p"> + Impala does not have a counterpart of the <span class="q">"tablespace"</span> notion from some database systems. By default, + all the data files for a database, table, or partition are located in nested directories within the HDFS file + system. You can also specify a particular HDFS location for a given Impala table or partition. The raw data + for these objects is represented as a collection of data files, providing the flexibility to load data by + simply moving files into the expected HDFS location. + </p> + + <p class="p"> + Information about the schema objects is held in the + <a class="xref" href="impala_hadoop.html#intro_metastore">metastore</a> database. This database is shared between + Impala and Hive, allowing each to create, drop, and query the other's databases, tables, and so on. When + Impala makes a change to schema objects through a <code class="ph codeph">CREATE</code>, <code class="ph codeph">ALTER</code>, + <code class="ph codeph">DROP</code>, <code class="ph codeph">INSERT</code>, or <code class="ph codeph">LOAD DATA</code> statement, it broadcasts those + changes to all nodes in the cluster through the <a class="xref" href="impala_components.html#intro_catalogd">catalog + service</a>. When you make such changes through Hive or by manipulating HDFS files directly, you issue + the <a class="xref" href="impala_refresh.html#refresh">REFRESH</a> or + <a class="xref" href="impala_invalidate_metadata.html#invalidate_metadata">INVALIDATE METADATA</a> statements on the + Impala side so that Impala recognizes the newly loaded data, new tables, and so on.
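+ </p> + + <p class="p"> + For example, the following statements are a minimal sketch (the table name <code class="ph codeph">t1</code> is hypothetical) + of making Impala aware of changes performed outside of Impala: + </p> + +<pre class="pre codeblock"><code>-- After adding new data files to an existing table through Hive or HDFS commands. +refresh t1; + +-- After creating a new table through Hive, or making other sweeping metadata changes. +invalidate metadata t1; +</code></pre>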
+ + <p class="p toc"></p> + </div> +<nav role="navigation" class="related-links"><ul class="ullinks"><li class="link ulchildlink"><strong><a href="../topics/impala_aliases.html">Overview of Impala Aliases</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_databases.html">Overview of Impala Databases</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_functions_overview.html">Overview of Impala Functions</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_identifiers.html">Overview of Impala Identifiers</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_tables.html">Overview of Impala Tables</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_views.html">Overview of Impala Views</a></strong><br></li></ul><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_langref.html">Impala SQL Language Reference</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_scratch_limit.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_scratch_limit.html b/docs/build/html/topics/impala_scratch_limit.html new file mode 100644 index 0000000..98bac93 --- /dev/null +++ b/docs/build/html/topics/impala_scratch_limit.html @@ -0,0 +1,77 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="scratch_limit"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>SCRATCH_LIMIT Query Option</title></head><body id="scratch_limit"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">SCRATCH_LIMIT Query Option</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Specifies the maximum amount of disk storage, in bytes, that any Impala query can consume + on any host using the <span class="q">"spill to disk"</span> mechanism for queries that exceed + the memory limit. + </p> + + <p class="p"> + <strong class="ph b">Syntax:</strong> + </p> + + <p class="p"> + Specify the size in bytes, or with a trailing <code class="ph codeph">m</code> or <code class="ph codeph">g</code> character to indicate + megabytes or gigabytes. For example: + </p> + + +<pre class="pre codeblock"><code>-- 128 megabytes. +set SCRATCH_LIMIT=134217728; + +-- 512 megabytes. +set SCRATCH_LIMIT=512m; + +-- 1 gigabyte. +set SCRATCH_LIMIT=1g; +</code></pre> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + + <p class="p"> + A value of zero turns off the spill to disk feature for queries + in the current session, causing them to fail immediately if they + exceed the memory limit.
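+ </p> + + <p class="p"> + For example, a minimal sketch of turning off spilling for the current session: + </p> + +<pre class="pre codeblock"><code>-- With no scratch space allowed, a query that exceeds the memory limit +-- now fails immediately instead of spilling to disk. +set SCRATCH_LIMIT=0; +</code></pre>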
+ + <p class="p"> + The amount of memory used per host for a query is limited by the + <code class="ph codeph">MEM_LIMIT</code> query option. + </p> + + <p class="p"> + The more Impala daemon hosts in the cluster, the less memory each host uses for a given query, + and therefore the less scratch space each host requires for queries that + exceed the memory limit. + </p> + + <p class="p"> + <strong class="ph b">Type:</strong> numeric, with optional unit specifier + </p> + + <p class="p"> + <strong class="ph b">Default:</strong> -1 (amount of spill space is unlimited) + </p> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + + <p class="p"> + <a class="xref" href="impala_scalability.html#spill_to_disk">SQL Operations that Spill to Disk</a>, + <a class="xref" href="impala_mem_limit.html#mem_limit">MEM_LIMIT Query Option</a> + </p> + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_security.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_security.html b/docs/build/html/topics/impala_security.html new file mode 100644 index 0000000..45d9923 --- /dev/null +++ b/docs/build/html/topics/impala_security.html @@ -0,0 +1,99 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_security_guidelines.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_security_files.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_security_install.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_security_metastore.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_security_webui.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_ssl.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_authorization.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_authentication.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_auditing.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_lineage.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="security"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Impala Security</title></head><body id="security"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1"><span class="ph">Impala Security</span></h1> + + + <div class="body conbody"> + + <p class="p"> + Impala includes a fine-grained authorization framework for Hadoop, based on Apache Sentry. + Sentry authorization was added in Impala 1.1.0.
Together with the Kerberos + authentication framework, Sentry takes Hadoop security to the level required by + highly regulated industries such as healthcare, financial services, and government. Impala also includes + an auditing capability, added in Impala 1.1.1: Impala generates audit data that can be + consumed, filtered, and visualized by cluster-management components focused on governance. + </p> + + <p class="p"> + The Impala security features have several objectives. At the most basic level, security prevents + accidents or mistakes that could disrupt application processing, delete or corrupt data, or reveal data to + unauthorized users. More advanced security features and practices can harden the system against malicious + users trying to gain unauthorized access or perform other disallowed operations. The auditing feature + provides a way to confirm that no unauthorized access occurred, and to detect whether any such attempts were + made. This is a critical set of features for production deployments in large organizations that handle + important or sensitive data. It sets the stage for multi-tenancy, where multiple applications run + concurrently and are prevented from interfering with each other. + </p> + + <p class="p"> + The material in this section presumes that you are already familiar with administering secure Linux systems. + That is, you should know the general security practices for Linux and Hadoop, and their associated commands + and configuration files. For example, you should know how to create Linux users and groups, manage Linux + group membership, set Linux and HDFS file permissions and ownership, and designate the default permissions + and ownership for new files. You should be familiar with the configuration of the nodes in your Hadoop + cluster, and know how to apply configuration changes or run a set of commands across all the nodes. + </p> + + <p class="p"> + The security features are divided into these broad categories: + </p> + + <dl class="dl"> + + + <dt class="dt dlterm"> + authorization + </dt> + + <dd class="dd"> + Which users are allowed to access which resources, and what operations are they allowed to perform? + Impala relies on the open source Sentry project for authorization. By default (when authorization is not + enabled), Impala does all read and write operations with the privileges of the <code class="ph codeph">impala</code> + user, which is suitable for a development/test environment but not for a secure production environment. + When authorization is enabled, Impala uses the OS user ID of the user who runs + <span class="keyword cmdname">impala-shell</span> or another client program, and associates various privileges with each + user. See <a class="xref" href="impala_authorization.html#authorization">Enabling Sentry Authorization for Impala</a> for details about setting up and managing + authorization. + </dd> + + + + + + <dt class="dt dlterm"> + authentication + </dt> + + <dd class="dd"> + How does Impala verify the identity of the user to confirm that they really are allowed to exercise the + privileges assigned to that user? Impala relies on the Kerberos subsystem for authentication. See + <a class="xref" href="impala_kerberos.html#kerberos">Enabling Kerberos Authentication for Impala</a> for details about setting up and managing authentication. + </dd> + + + + + + <dt class="dt dlterm"> + auditing + </dt> + + <dd class="dd"> + What operations were attempted, and did they succeed or not?
This feature provides a way to look back and + diagnose whether attempts were made to perform unauthorized operations. You use this information to track + down suspicious activity, and to see where changes are needed in authorization policies. The audit data + produced by this feature can be collected and presented in a user-friendly form by cluster-management + software. See <a class="xref" href="impala_auditing.html#auditing">Auditing Impala Operations</a> for details about setting up and managing + auditing. + </dd> + + + </dl> + + <p class="p toc"></p> + + + </div> +<nav role="navigation" class="related-links"><ul class="ullinks"><li class="link ulchildlink"><strong><a href="../topics/impala_security_guidelines.html">Security Guidelines for Impala</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_security_files.html">Securing Impala Data and Log Files</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_security_install.html">Installation Considerations for Impala Security</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_security_metastore.html">Securing the Hive Metastore Database</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_security_webui.html">Securing the Impala Web User Interface</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_ssl.html">Configuring TLS/SSL for Impala</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_authorization.html">Enabling Sentry Authorization for Impala</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_authentication.html">Impala Authentication</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_auditing.html">Auditing Impala Operations</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_lineage.html">Viewing Lineage Information for Impala Data</a></strong><br></li></ul></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_security_files.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_security_files.html b/docs/build/html/topics/impala_security_files.html new file mode 100644 index 0000000..e980e60 --- /dev/null +++ b/docs/build/html/topics/impala_security_files.html @@ -0,0 +1,58 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_security.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="secure_files"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Securing Impala Data and Log Files</title></head><body id="secure_files"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Securing Impala Data and Log Files</h1> + + + <div class="body conbody"> + + <p class="p"> + One
aspect of security is to protect files from unauthorized access at the filesystem level. For example, if + you store sensitive data in HDFS, you set permissions on the associated files and directories in HDFS to + restrict read and write access to the appropriate users and groups. + </p> + + <p class="p"> + If you issue queries containing sensitive values in the <code class="ph codeph">WHERE</code> clause, such as financial + account numbers, those values are stored in Impala log files in the Linux filesystem, and you must also secure + those files. For the locations of Impala log files, see <a class="xref" href="impala_logging.html#logging">Using Impala Logging</a>. + </p> + + <p class="p"> + All Impala read and write operations are performed under the filesystem privileges of the + <code class="ph codeph">impala</code> user. The <code class="ph codeph">impala</code> user must be able to read all directories and data + files that you query, and write into all the directories and data files for <code class="ph codeph">INSERT</code> and + <code class="ph codeph">LOAD DATA</code> statements. At a minimum, make sure the <code class="ph codeph">impala</code> user is in the + <code class="ph codeph">hive</code> group so that it can access files and directories shared between Impala and Hive. See + <a class="xref" href="impala_prereqs.html#prereqs_account">User Account Requirements</a> for more details. + </p> + + <p class="p"> + Setting file permissions is necessary for Impala to function correctly, but is not an effective security + practice by itself: + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + The way to ensure that users can query only the databases and tables they are allowed + to access is to set up Sentry authorization, as explained in + <a class="xref" href="impala_authorization.html#authorization">Enabling Sentry Authorization for Impala</a>. With authorization enabled, the checking of the user + ID and group is done by Impala, and unauthorized access is blocked by Impala itself. The actual low-level + read and write requests are still done by the <code class="ph codeph">impala</code> user, so you must have appropriate + file and directory permissions for that user ID. + </p> + </li> + + <li class="li"> + <p class="p"> + You must also set up Kerberos authentication, as described in <a class="xref" href="impala_kerberos.html#kerberos">Enabling Kerberos Authentication for Impala</a>, + so that users can connect only from trusted hosts. With Kerberos enabled, if someone connects a new host to + the network and creates user IDs that match your privileged IDs, they will be blocked from connecting to + Impala at all from that host. + </p> + </li> + </ul> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_security.html">Impala Security</a></div></div></nav></article></main></body></html> \ No newline at end of file