http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_scalability.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_scalability.html b/docs/build/html/topics/impala_scalability.html
new file mode 100644
index 0000000..a850d35
--- /dev/null
+++ b/docs/build/html/topics/impala_scalability.html
@@ -0,0 +1,711 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="scalability"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Scalability Considerations for Impala</title></head><body id="scalability"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Scalability Considerations for Impala</h1>
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+      This section explains how the size of your cluster and the volume of data influence SQL performance and
+      schema design for Impala tables. Typically, adding more cluster capacity reduces problems due to memory
+      limits or disk throughput. On the other hand, larger clusters are more likely to have other kinds of
+      scalability issues, such as a single slow node that causes performance problems for queries.
+    </p>
+
+    <p class="p toc inpage"></p>
+
+    <p class="p">
+      A good source of tips related to scalability and performance tuning is the
+      <a class="xref" href="http://www.slideshare.net/cloudera/the-impala-cookbook-42530186" target="_blank">Impala Cookbook</a>
+      presentation. These slides are updated periodically as new features come out and new benchmarks are performed.
+    </p>
+
+  </div>
+
+
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title2" id="scalability__scalability_catalog">
+
+    <h2 class="title topictitle2" id="ariaid-title2">Impact of Many Tables or Partitions on Impala Catalog Performance and Memory Usage</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized for tables
+        containing relatively few, large data files. Schemas containing thousands of tables, or tables containing
+        thousands of partitions, can encounter performance issues during startup or during DDL operations such as
+        <code class="ph codeph">ALTER TABLE</code> statements.
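+      </p>
+
+      <p class="p">
+        As a quick first check, you can inspect the JVM heap of the <span class="keyword cmdname">catalogd</span>
+        process. The following sketch assumes a <code class="ph codeph">pgrep</code>-style process lookup and an
+        illustrative process ID; the <span class="keyword cmdname">jcmd</span> and <span class="keyword cmdname">jmap</span>
+        commands are the same ones used in the procedure below.
+      </p>
+
+<pre class="pre codeblock"><code># Find the catalogd process ID (example output: 12345).
+pgrep -f catalogd
+# Show the JVM flags, including the maximum heap size, then summarize heap usage.
+jcmd 12345 VM.flags
+jmap -heap 12345
+</code></pre>
+
+      <p class="p">
+        If the heap is consistently close to its maximum, consider the procedure below to raise the limit.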
+ </p> + + <div class="note important note_important"><span class="note__title importanttitle">Important:</span> + <p class="p"> + Because of a change in the default heap size for the <span class="keyword cmdname">catalogd</span> daemon in + <span class="keyword">Impala 2.5</span> and higher, the following procedure to increase the <span class="keyword cmdname">catalogd</span> + memory limit might be required following an upgrade to <span class="keyword">Impala 2.5</span> even if not + needed previously. + </p> + </div> + + <div class="p"> + For schemas with large numbers of tables, partitions, and data files, the <span class="keyword cmdname">catalogd</span> + daemon might encounter an out-of-memory error. To increase the memory limit for the + <span class="keyword cmdname">catalogd</span> daemon: + + <ol class="ol"> + <li class="li"> + <p class="p"> + Check current memory usage for the <span class="keyword cmdname">catalogd</span> daemon by running the + following commands on the host where that daemon runs on your cluster: + </p> + <pre class="pre codeblock"><code> + jcmd <var class="keyword varname">catalogd_pid</var> VM.flags + jmap -heap <var class="keyword varname">catalogd_pid</var> + </code></pre> + </li> + <li class="li"> + <p class="p"> + Decide on a large enough value for the <span class="keyword cmdname">catalogd</span> heap. + You express it as an environment variable value as follows: + </p> + <pre class="pre codeblock"><code> + JAVA_TOOL_OPTIONS="-Xmx8g" + </code></pre> + </li> + <li class="li"> + <p class="p"> + On systems not using cluster management software, put this environment variable setting into the + startup script for the <span class="keyword cmdname">catalogd</span> daemon, then restart the <span class="keyword cmdname">catalogd</span> + daemon. + </p> + </li> + <li class="li"> + <p class="p"> + Use the same <span class="keyword cmdname">jcmd</span> and <span class="keyword cmdname">jmap</span> commands as earlier to + verify that the new settings are in effect. + </p> + </li> + </ol> + </div> + + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="scalability__statestore_scalability"> + + <h2 class="title topictitle2" id="ariaid-title3">Scalability Considerations for the Impala Statestore</h2> + + <div class="body conbody"> + + <p class="p"> + Before <span class="keyword">Impala 2.1</span>, the statestore sent only one kind of message to its subscribers. This message contained all + updates for any topics that a subscriber had subscribed to. It also served to let subscribers know that the + statestore had not failed, and conversely the statestore used the success of sending a heartbeat to a + subscriber to decide whether or not the subscriber had failed. + </p> + + <p class="p"> + Combining topic updates and failure detection in a single message led to bottlenecks in clusters with large + numbers of tables, partitions, and HDFS data blocks. When the statestore was overloaded with metadata + updates to transmit, heartbeat messages were sent less frequently, sometimes causing subscribers to time + out their connection with the statestore. Increasing the subscriber timeout and decreasing the frequency of + statestore heartbeats worked around the problem, but reduced responsiveness when the statestore failed or + restarted. + </p> + + <p class="p"> + As of <span class="keyword">Impala 2.1</span>, the statestore now sends topic updates and heartbeats in separate messages. 
This allows the + statestore to send and receive a steady stream of lightweight heartbeats, and removes the requirement to + send topic updates according to a fixed schedule, reducing statestore network overhead. + </p> + + <p class="p"> + The statestore now has the following relevant configuration flags for the <span class="keyword cmdname">statestored</span> + daemon: + </p> + + <dl class="dl"> + + + <dt class="dt dlterm" id="statestore_scalability__statestore_num_update_threads"> + <code class="ph codeph">-statestore_num_update_threads</code> + </dt> + + <dd class="dd"> + The number of threads inside the statestore dedicated to sending topic updates. You should not + typically need to change this value. + <p class="p"> + <strong class="ph b">Default:</strong> 10 + </p> + </dd> + + + + + + <dt class="dt dlterm" id="statestore_scalability__statestore_update_frequency_ms"> + <code class="ph codeph">-statestore_update_frequency_ms</code> + </dt> + + <dd class="dd"> + The frequency, in milliseconds, with which the statestore tries to send topic updates to each + subscriber. This is a best-effort value; if the statestore is unable to meet this frequency, it sends + topic updates as fast as it can. You should not typically need to change this value. + <p class="p"> + <strong class="ph b">Default:</strong> 2000 + </p> + </dd> + + + + + + <dt class="dt dlterm" id="statestore_scalability__statestore_num_heartbeat_threads"> + <code class="ph codeph">-statestore_num_heartbeat_threads</code> + </dt> + + <dd class="dd"> + The number of threads inside the statestore dedicated to sending heartbeats. You should not typically + need to change this value. + <p class="p"> + <strong class="ph b">Default:</strong> 10 + </p> + </dd> + + + + + + <dt class="dt dlterm" id="statestore_scalability__statestore_heartbeat_frequency_ms"> + <code class="ph codeph">-statestore_heartbeat_frequency_ms</code> + </dt> + + <dd class="dd"> + The frequency, in milliseconds, with which the statestore tries to send heartbeats to each subscriber. + This value should be good for large catalogs and clusters up to approximately 150 nodes. Beyond that, + you might need to increase this value to make the interval longer between heartbeat messages. + <p class="p"> + <strong class="ph b">Default:</strong> 1000 (one heartbeat message every second) + </p> + </dd> + + + </dl> + + <p class="p"> + If it takes a very long time for a cluster to start up, and <span class="keyword cmdname">impala-shell</span> consistently + displays <code class="ph codeph">This Impala daemon is not ready to accept user requests</code>, the statestore might be + taking too long to send the entire catalog topic to the cluster. In this case, consider adding + <code class="ph codeph">--load_catalog_in_background=false</code> to your catalog service configuration. This setting + stops the statestore from loading the entire catalog into memory at cluster startup. Instead, metadata for + each table is loaded when the table is accessed for the first time. + </p> + </div> + </article> + + + + + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="scalability__spill_to_disk"> + + <h2 class="title topictitle2" id="ariaid-title4">SQL Operations that Spill to Disk</h2> + + <div class="body conbody"> + + <p class="p"> + Certain memory-intensive operations write temporary data to disk (known as <dfn class="term">spilling</dfn> to disk) + when Impala is close to exceeding its memory limit on a particular host. 
+    </p>
+
+    <p class="p">
+      The result is a query that completes successfully, rather than failing with an out-of-memory error. The
+      tradeoff is decreased performance due to the extra disk I/O to write the temporary data and read it back
+      in. The slowdown could potentially be significant. Thus, while this feature improves reliability,
+      you should optimize your queries, system parameters, and hardware configuration to make this spilling a rare occurrence.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">What kinds of queries might spill to disk:</strong>
+    </p>
+
+    <p class="p">
+      Several SQL clauses and constructs require memory allocations that could activate the spilling mechanism:
+    </p>
+    <ul class="ul">
+      <li class="li">
+        <p class="p">
+          When a query uses a <code class="ph codeph">GROUP BY</code> clause for columns
+          with millions or billions of distinct values, Impala keeps a
+          similar number of temporary results in memory, to accumulate the
+          aggregate results for each value in the group.
+        </p>
+      </li>
+      <li class="li">
+        <p class="p">
+          When large tables are joined together, Impala keeps the values of
+          the join columns from one table in memory, to compare them to
+          incoming values from the other table.
+        </p>
+      </li>
+      <li class="li">
+        <p class="p">
+          When a large result set is sorted by the <code class="ph codeph">ORDER BY</code>
+          clause, each node sorts its portion of the result set in memory.
+        </p>
+      </li>
+      <li class="li">
+        <p class="p">
+          The <code class="ph codeph">DISTINCT</code> and <code class="ph codeph">UNION</code> operators
+          build in-memory data structures to represent all values found so
+          far, to eliminate duplicates as the query progresses.
+        </p>
+      </li>
+
+    </ul>
+
+    <p class="p">
+      When the spill-to-disk feature is activated for a join node within a query, Impala does not
+      produce any runtime filters for that join operation on that host. Other join nodes within
+      the query are not affected.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">How Impala handles scratch disk space for spilling:</strong>
+    </p>
+
+    <p class="p">
+      By default, intermediate files used during large sort, join, aggregation, or analytic function operations
+      are stored in the directory <span class="ph filepath">/tmp/impala-scratch</span>. These files are removed when the
+      operation finishes. (Multiple concurrent queries can perform operations that use the <span class="q">"spill to disk"</span>
+      technique, without any name conflicts for these temporary files.) You can specify a different location by
+      starting the <span class="keyword cmdname">impalad</span> daemon with the
+      <code class="ph codeph">--scratch_dirs="<var class="keyword varname">path_to_directory</var>"</code> configuration option.
+      You can specify a single directory, or a comma-separated list of directories. The scratch directories must
+      be on the local filesystem, not in HDFS. You might specify different directory paths for different hosts,
+      depending on the capacity and speed of the available storage devices. In <span class="keyword">Impala 2.3</span>
+      or higher, Impala successfully starts (with a warning written to the log) if it cannot create or read and write files
+      in one of the scratch directories. If there is less than 1 GB free on the filesystem where that directory resides,
+      Impala still runs, but writes a warning message to its log.
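+    </p>
+
+    <p class="p">
+      For example, the relevant part of an <span class="keyword cmdname">impalad</span> startup command might look like
+      the following sketch (the directory paths are illustrative; choose locations on fast local storage):
+    </p>
+
+<pre class="pre codeblock"><code>impalad --scratch_dirs="/data1/impala-scratch,/data2/impala-scratch" ...
+</code></pre>
+
+    <p class="p">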
If Impala encounters an error reading or writing
+      files in a scratch directory during a query, Impala logs the error and the query fails.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Memory usage for SQL operators:</strong>
+    </p>
+
+    <p class="p">
+      The infrastructure of the spilling feature affects the way the affected SQL operators, such as
+      <code class="ph codeph">GROUP BY</code>, <code class="ph codeph">DISTINCT</code>, and joins, use memory.
+      On each host that participates in the query, each such operator in a query accumulates memory
+      while building the data structure to process the aggregation or join operation. The amount
+      of memory used depends on the portion of the data being handled by that host, and thus might
+      be different from one host to another. When the amount of memory being used for the operator
+      on a particular host reaches a threshold amount, Impala reserves an additional memory buffer
+      to use as a work area in case that operator causes the query to exceed the memory limit for
+      that host. After allocating the memory buffer, the memory used by that operator remains
+      essentially stable or grows only slowly, until the point where the memory limit is reached
+      and the query begins writing temporary data to disk.
+    </p>
+
+    <p class="p">
+      Prior to Impala 2.2, the extra memory buffer for an operator that might spill to disk
+      was allocated when the data structure used by the applicable SQL operator reached 16 MB in size,
+      and the memory buffer itself was 512 MB. In Impala 2.2, these values are halved: the threshold value
+      is 8 MB and the memory buffer is 256 MB. <span class="ph">In <span class="keyword">Impala 2.3</span> and higher, the memory for the buffer
+      is allocated in pieces, only as needed, to avoid sudden large jumps in memory usage.</span> A query that uses
+      multiple such operators might allocate multiple such memory buffers, as the size of the data structure
+      for each operator crosses the threshold on a particular host.
+    </p>
+
+    <p class="p">
+      Therefore, a query that processes a relatively small amount of data on each host would likely
+      never reach the threshold for any operator, and would never allocate any extra memory buffers. A query
+      that processes millions of groups, distinct values, join keys, and so on might cross the threshold,
+      causing its memory requirement to rise suddenly and then flatten out. The larger the cluster, the less data is processed
+      on any particular host, thus reducing the chance of requiring the extra memory allocation.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Added in:</strong> This feature was added to the <code class="ph codeph">ORDER BY</code> clause in Impala 1.4.
+      This feature was extended to cover join queries, aggregation functions, and analytic
+      functions in Impala 2.0. The size of the memory work area required by
+      each operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Avoiding queries that spill to disk:</strong>
+    </p>
+
+    <p class="p">
+      Because the extra I/O can impose significant performance overhead on these types of queries, try to avoid
+      this situation by using the following steps:
+    </p>
+
+    <ol class="ol">
+      <li class="li">
+        Detect how often queries spill to disk, and how much temporary data is written.
Refer to the following + sources: + <ul class="ul"> + <li class="li"> + The output of the <code class="ph codeph">PROFILE</code> command in the <span class="keyword cmdname">impala-shell</span> + interpreter. This data shows the memory usage for each host and in total across the cluster. The + <code class="ph codeph">BlockMgr.BytesWritten</code> counter reports how much data was written to disk during the + query. + </li> + + <li class="li"> + The <span class="ph uicontrol">Queries</span> tab in the Impala debug web user interface. Select the query to + examine and click the corresponding <span class="ph uicontrol">Profile</span> link. This data breaks down the + memory usage for a single host within the cluster, the host whose web interface you are connected to. + </li> + </ul> + </li> + + <li class="li"> + Use one or more techniques to reduce the possibility of the queries spilling to disk: + <ul class="ul"> + <li class="li"> + Increase the Impala memory limit if practical, for example, if you can increase the available memory + by more than the amount of temporary data written to disk on a particular node. Remember that in + Impala 2.0 and later, you can issue <code class="ph codeph">SET MEM_LIMIT</code> as a SQL statement, which lets you + fine-tune the memory usage for queries from JDBC and ODBC applications. + </li> + + <li class="li"> + Increase the number of nodes in the cluster, to increase the aggregate memory available to Impala and + reduce the amount of memory required on each node. + </li> + + <li class="li"> + Increase the overall memory capacity of each DataNode at the hardware level. + </li> + + <li class="li"> + On a cluster with resources shared between Impala and other Hadoop components, use resource + management features to allocate more memory for Impala. See + <a class="xref" href="impala_resource_management.html#resource_management">Resource Management for Impala</a> for details. + </li> + + <li class="li"> + If the memory pressure is due to running many concurrent queries rather than a few memory-intensive + ones, consider using the Impala admission control feature to lower the limit on the number of + concurrent queries. By spacing out the most resource-intensive queries, you can avoid spikes in + memory usage and improve overall response times. See + <a class="xref" href="impala_admission.html#admission_control">Admission Control and Query Queuing</a> for details. + </li> + + <li class="li"> + Tune the queries with the highest memory requirements, using one or more of the following techniques: + <ul class="ul"> + <li class="li"> + Run the <code class="ph codeph">COMPUTE STATS</code> statement for all tables involved in large-scale joins and + aggregation queries. + </li> + + <li class="li"> + Minimize your use of <code class="ph codeph">STRING</code> columns in join columns. Prefer numeric values + instead. + </li> + + <li class="li"> + Examine the <code class="ph codeph">EXPLAIN</code> plan to understand the execution strategy being used for the + most resource-intensive queries. See <a class="xref" href="impala_explain_plan.html#perf_explain">Using the EXPLAIN Plan for Performance Tuning</a> for + details. + </li> + + <li class="li"> + If Impala still chooses a suboptimal execution strategy even with statistics available, or if it + is impractical to keep the statistics up to date for huge or rapidly changing tables, add hints + to the most resource-intensive queries to select the right execution strategy. 
See
+            <a class="xref" href="impala_hints.html#hints">Query Hints in Impala SELECT Statements</a> for details.
+          </li>
+        </ul>
+      </li>
+
+      <li class="li">
+        If your queries experience substantial performance overhead due to spilling, enable the
+        <code class="ph codeph">DISABLE_UNSAFE_SPILLS</code> query option. This option prevents queries whose memory usage
+        is likely to be exorbitant from spilling to disk. See
+        <a class="xref" href="impala_disable_unsafe_spills.html#disable_unsafe_spills">DISABLE_UNSAFE_SPILLS Query Option (Impala 2.0 or higher only)</a> for details. As you tune
+        problematic queries using the preceding steps, fewer and fewer will be cancelled by this option
+        setting.
+      </li>
+    </ul>
+  </li>
+</ol>
+
+    <p class="p">
+      <strong class="ph b">Testing performance implications of spilling to disk:</strong>
+    </p>
+
+    <p class="p">
+      To artificially provoke spilling, so that you can test this feature and understand the performance implications, use a
+      test environment with a memory limit of at least 2 GB. Issue the <code class="ph codeph">SET</code> command with no
+      arguments to check the current setting for the <code class="ph codeph">MEM_LIMIT</code> query option. Set the query
+      option <code class="ph codeph">DISABLE_UNSAFE_SPILLS=true</code>. This option limits the spill-to-disk feature to prevent
+      runaway disk usage from queries that are known in advance to be suboptimal. Within
+      <span class="keyword cmdname">impala-shell</span>, run a query that you expect to be memory-intensive, based on the criteria
+      explained earlier. A self-join of a large table is a good candidate:
+    </p>
+
+<pre class="pre codeblock"><code>select count(*) from big_table a join big_table b using (column_with_many_values);
+</code></pre>
+
+    <p class="p">
+      Issue the <code class="ph codeph">PROFILE</code> command to get a detailed breakdown of the memory usage on each node
+      during the query. The crucial part of the profile output concerning memory is the <code class="ph codeph">BlockMgr</code>
+      portion. For example, this profile shows that the query did not quite exceed the memory limit.
+    </p>
+
+<pre class="pre codeblock"><code>BlockMgr:
+   - BlockWritesIssued: 1
+   - BlockWritesOutstanding: 0
+   - BlocksCreated: 24
+   - BlocksRecycled: 1
+   - BufferedPins: 0
+   - MaxBlockSize: 8.00 MB (8388608)
+   <strong class="ph b">- MemoryLimit: 200.00 MB (209715200)</strong>
+   <strong class="ph b">- PeakMemoryUsage: 192.22 MB (201555968)</strong>
+   - TotalBufferWaitTime: 0ns
+   - TotalEncryptionTime: 0ns
+   - TotalIntegrityCheckTime: 0ns
+   - TotalReadBlockTime: 0ns
+</code></pre>
+
+    <p class="p">
+      In this case, because the memory limit was already below any recommended value, the next step is to increase
+      the volume of data for the query rather than to reduce the memory limit any further.
+    </p>
+
+    <p class="p">
+      Set the <code class="ph codeph">MEM_LIMIT</code> query option to a value that is smaller than the peak memory usage
+      reported in the profile output. Do not specify a memory limit lower than about 300 MB, because with such a
+      low limit, queries could fail to start for other reasons. Now try the memory-intensive query again.
+    </p>
+
+    <p class="p">
+      Check if the query fails with a message like the following:
+    </p>
+
+<pre class="pre codeblock"><code>WARNINGS: Spilling has been disabled for plans that do not have stats and are not hinted
+to prevent potentially bad plans from using too many cluster resources. Compute stats on
+these tables, hint the plan or disable this behavior via query options to enable spilling.
+</code></pre> + + <p class="p"> + If so, the query could have consumed substantial temporary disk space, slowing down so much that it would + not complete in any reasonable time. Rather than rely on the spill-to-disk feature in this case, issue the + <code class="ph codeph">COMPUTE STATS</code> statement for the table or tables in your sample query. Then run the query + again, check the peak memory usage again in the <code class="ph codeph">PROFILE</code> output, and adjust the memory + limit again if necessary to be lower than the peak memory usage. + </p> + + <p class="p"> + At this point, you have a query that is memory-intensive, but Impala can optimize it efficiently so that + the memory usage is not exorbitant. You have set an artificial constraint through the + <code class="ph codeph">MEM_LIMIT</code> option so that the query would normally fail with an out-of-memory error. But + the automatic spill-to-disk feature means that the query should actually succeed, at the expense of some + extra disk I/O to read and write temporary work data. + </p> + + <p class="p"> + Try the query again, and confirm that it succeeds. Examine the <code class="ph codeph">PROFILE</code> output again. This + time, look for lines of this form: + </p> + +<pre class="pre codeblock"><code>- SpilledPartitions: <var class="keyword varname">N</var> +</code></pre> + + <p class="p"> + If you see any such lines with <var class="keyword varname">N</var> greater than 0, that indicates the query would have + failed in Impala releases prior to 2.0, but now it succeeded because of the spill-to-disk feature. Examine + the total time taken by the <code class="ph codeph">AGGREGATION_NODE</code> or other query fragments containing non-zero + <code class="ph codeph">SpilledPartitions</code> values. Compare the times to similar fragments that did not spill, for + example in the <code class="ph codeph">PROFILE</code> output when the same query is run with a higher memory limit. This + gives you an idea of the performance penalty of the spill operation for a particular query with a + particular memory limit. If you make the memory limit just a little lower than the peak memory usage, the + query only needs to write a small amount of temporary data to disk. The lower you set the memory limit, the + more temporary data is written and the slower the query becomes. + </p> + + <p class="p"> + Now repeat this procedure for actual queries used in your environment. Use the + <code class="ph codeph">DISABLE_UNSAFE_SPILLS</code> setting to identify cases where queries used more memory than + necessary due to lack of statistics on the relevant tables and columns, and issue <code class="ph codeph">COMPUTE + STATS</code> where necessary. + </p> + + <p class="p"> + <strong class="ph b">When to use DISABLE_UNSAFE_SPILLS:</strong> + </p> + + <p class="p"> + You might wonder, why not leave <code class="ph codeph">DISABLE_UNSAFE_SPILLS</code> turned on all the time. Whether and + how frequently to use this option depends on your system environment and workload. + </p> + + <p class="p"> + <code class="ph codeph">DISABLE_UNSAFE_SPILLS</code> is suitable for an environment with ad hoc queries whose performance + characteristics and memory usage are not known in advance. It prevents <span class="q">"worst-case scenario"</span> queries + that use large amounts of memory unnecessarily. Thus, you might turn this option on within a session while + developing new SQL code, even though it is turned off for existing applications. 
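+  </p>
+
+  <p class="p">
+    For example, during interactive development you might enable the option for just the current
+    session (a sketch; the values shown are the standard Boolean settings for query options):
+  </p>
+
+<pre class="pre codeblock"><code>set DISABLE_UNSAFE_SPILLS=true;
+-- Develop and test the new queries...
+set DISABLE_UNSAFE_SPILLS=false;
+</code></pre>
+
+  <p class="p">
+    Any query blocked by the option is a signal to gather statistics or tune that query before it ships.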
+  </p>
+
+  <p class="p">
+    Organizations where table and column statistics are generally up-to-date might leave this option turned on
+    all the time, again to avoid worst-case scenarios for untested queries or if a problem in the ETL pipeline
+    results in a table with no statistics. Turning on <code class="ph codeph">DISABLE_UNSAFE_SPILLS</code> lets you <span class="q">"fail
+    fast"</span> in this case and immediately gather statistics or tune the problematic queries.
+  </p>
+
+  <p class="p">
+    Some organizations might leave this option turned off. For example, you might have tables large enough that
+    the <code class="ph codeph">COMPUTE STATS</code> statement takes substantial time to run, making it impractical to re-run after
+    loading new data. If you have examined the <code class="ph codeph">EXPLAIN</code> plans of your queries and know that
+    they are operating efficiently, you might leave <code class="ph codeph">DISABLE_UNSAFE_SPILLS</code> turned off. In that
+    case, you know that any queries that spill will not go overboard with their memory consumption.
+  </p>
+
+  </div>
+  </article>
+
+<article class="topic concept nested1" aria-labelledby="ariaid-title5" id="scalability__complex_query">
+<h2 class="title topictitle2" id="ariaid-title5">Limits on Query Size and Complexity</h2>
+<div class="body conbody">
+<p class="p">
+There are hardcoded limits on the maximum size and complexity of queries.
+Currently, the maximum number of expressions in a query is 2000.
+You might exceed the limits with large or deeply nested queries
+produced by business intelligence tools or other query generators.
+</p>
+<p class="p">
+If you have the ability to customize such queries or the query generation
+logic that produces them, replace sequences of repetitive expressions
+with single operators such as <code class="ph codeph">IN</code> or <code class="ph codeph">BETWEEN</code>
+that can represent multiple values or ranges.
+For example, instead of a large number of <code class="ph codeph">OR</code> clauses:
+</p>
+<pre class="pre codeblock"><code>WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ...
+</code></pre>
+<p class="p">
+use a single <code class="ph codeph">IN</code> clause:
+</p>
+<pre class="pre codeblock"><code>WHERE val IN (1,2,6,100,...)</code></pre>
+</div>
+</article>
+
+<article class="topic concept nested1" aria-labelledby="ariaid-title6" id="scalability__scalability_io">
+<h2 class="title topictitle2" id="ariaid-title6">Scalability Considerations for Impala I/O</h2>
+<div class="body conbody">
+<p class="p">
+Impala parallelizes its I/O operations aggressively; therefore, the more
+disks you can attach to each host, the better.
+Impala retrieves data from disk so quickly,
+using bulk read operations on large blocks, that most queries
+are CPU-bound rather than I/O-bound.
+</p>
+<p class="p">
+Because the kind of sequential scanning typically done by
+Impala queries does not benefit much from the random-access
+capabilities of SSDs, spinning disks typically provide
+the most cost-effective kind of storage for Impala data,
+with little or no performance penalty as compared to SSDs.
+</p>
+<p class="p">
+Resource management features such as YARN, Llama, and admission control
+typically constrain the amount of memory, CPU, or overall number of
+queries in a high-concurrency environment.
+Currently, there is no throttling mechanism for Impala I/O.
+</p> +</div> +</article> + +<article class="topic concept nested1" aria-labelledby="ariaid-title7" id="scalability__big_tables"> +<h2 class="title topictitle2" id="ariaid-title7">Scalability Considerations for Table Layout</h2> +<div class="body conbody"> +<p class="p"> +Due to the overhead of retrieving and updating table metadata +in the metastore database, try to limit the number of columns +in a table to a maximum of approximately 2000. +Although Impala can handle wider tables than this, the metastore overhead +can become significant, leading to query performance that is slower +than expected based on the actual data volume. +</p> +<p class="p"> +To minimize overhead related to the metastore database and Impala query planning, +try to limit the number of partitions for any partitioned table to a few tens of thousands. +</p> +</div> +</article> + +<article class="topic concept nested1" aria-labelledby="ariaid-title8" id="scalability__kerberos_overhead_cluster_size"> +<h2 class="title topictitle2" id="ariaid-title8">Kerberos-Related Network Overhead for Large Clusters</h2> +<div class="body conbody"> +<p class="p"> +When Impala starts up, or after each <code class="ph codeph">kinit</code> refresh, Impala sends a number of +simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might be able to process +all the requests within roughly 5 seconds. For a cluster with 1000 hosts, the time to process +the requests would be roughly 500 seconds. Impala also makes a number of DNS requests at the same +time as these Kerberos-related requests. +</p> +<p class="p"> +While these authentication requests are being processed, any submitted Impala queries will fail. +During this period, the KDC and DNS may be slow to respond to requests from components other than Impala, +so other secure services might be affected temporarily. +</p> + +<p class="p"> + To reduce the frequency of the <code class="ph codeph">kinit</code> renewal that initiates + a new set of authentication requests, increase the <code class="ph codeph">kerberos_reinit_interval</code> + configuration setting for the <span class="keyword cmdname">impalad</span> daemons. Currently, the default is 60 minutes. + Consider using a higher value such as 360 (6 hours). +</p> + +</div> +</article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title9" id="scalability__scalability_hotspots"> + <h2 class="title topictitle2" id="ariaid-title9">Avoiding CPU Hotspots for HDFS Cached Data</h2> + <div class="body conbody"> + <p class="p"> + You can use the HDFS caching feature, described in <a class="xref" href="impala_perf_hdfs_caching.html#hdfs_caching">Using HDFS Caching with Impala (Impala 2.1 or higher only)</a>, + with Impala to reduce I/O and memory-to-memory copying for frequently accessed tables or partitions. + </p> + <p class="p"> + In the early days of this feature, you might have found that enabling HDFS caching + resulted in little or no performance improvement, because it could result in + <span class="q">"hotspots"</span>: instead of the I/O to read the table data being parallelized across + the cluster, the I/O was reduced but the CPU load to process the data blocks + might be concentrated on a single host. + </p> + <p class="p"> + To avoid hotspots, include the <code class="ph codeph">WITH REPLICATION</code> clause with the + <code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">ALTER TABLE</code> statements for tables that use HDFS caching. 
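+      </p>
+
+      <p class="p">
+        For example, the following sketch caches an existing table with multiple cached replicas
+        (the table name, pool name, and replication factor are illustrative):
+      </p>
+
+<pre class="pre codeblock"><code>alter table census set cached in 'pool_name' with replication = 3;
+</code></pre>
+
+      <p class="p">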
+ This clause allows more than one host to cache the relevant data blocks, so the CPU load + can be shared, reducing the load on any one host. + See <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a> and <a class="xref" href="impala_alter_table.html#alter_table">ALTER TABLE Statement</a> + for details. + </p> + <p class="p"> + Hotspots with high CPU load for HDFS cached data could still arise in some cases, due to + the way that Impala schedules the work of processing data blocks on different hosts. + In <span class="keyword">Impala 2.5</span> and higher, scheduling improvements mean that the work for + HDFS cached data is divided better among all the hosts that have cached replicas + for a particular data block. When more than one host has a cached replica for a data block, + Impala assigns the work of processing that block to whichever host has done the least work + (in terms of number of bytes read) for the current query. If hotspots persist even with this + load-based scheduling algorithm, you can enable the query option <code class="ph codeph">SCHEDULE_RANDOM_REPLICA=TRUE</code> + to further distribute the CPU load. This setting causes Impala to randomly pick a host to process a cached + data block if the scheduling algorithm encounters a tie when deciding which host has done the + least work. + </p> + </div> + </article> + +</article></main></body></html> \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_scan_node_codegen_threshold.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_scan_node_codegen_threshold.html b/docs/build/html/topics/impala_scan_node_codegen_threshold.html
new file mode 100644
index 0000000..2e71e50
--- /dev/null
+++ b/docs/build/html/topics/impala_scan_node_codegen_threshold.html
@@ -0,0 +1,69 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="scan_node_codegen_threshold"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>SCAN_NODE_CODEGEN_THRESHOLD Query Option (Impala 2.5 or higher only)</title></head><body id="scan_node_codegen_threshold"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">SCAN_NODE_CODEGEN_THRESHOLD Query Option (<span class="keyword">Impala 2.5</span> or higher only)</h1>
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+
+      The <code class="ph codeph">SCAN_NODE_CODEGEN_THRESHOLD</code> query option
+      adjusts the aggressiveness of the code generation optimization process
+      when performing I/O read operations. It can help to work around performance problems
+      for queries where the table is small and the <code class="ph codeph">WHERE</code> clause is complicated.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Type:</strong> integer
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Default:</strong> 1800000 (1.8 million)
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.5.0</span>
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Usage notes:</strong>
+    </p>
+
+    <p class="p">
+      This query option is intended mainly for the case where a query with a very complicated
+      <code class="ph codeph">WHERE</code> clause, such as an <code class="ph codeph">IN</code> operator with thousands
+      of entries, is run against a small table, especially a small table using Parquet format.
+      The code generation phase can become the dominant factor in the query response time,
+      making the query take several seconds even though there is relatively little work to do.
+      In this case, increase the value of this option to a much larger amount, anything up to
+      the maximum for a 32-bit integer.
+    </p>
+
+    <p class="p">
+      Because this option only affects the code generation phase for the portion of the
+      query that performs I/O (the <dfn class="term">scan nodes</dfn> within the query plan), it
+      lets you continue to keep code generation enabled for other queries, and other parts
+      of the same query, that can benefit from it. In contrast, the
+      <code class="ph codeph">DISABLE_CODEGEN</code> query option turns off code generation entirely.
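+    </p>
+
+    <p class="p">
+      For example, the following sketch raises the threshold within a session so that code generation
+      is very unlikely to trigger for a scan of a small table (the table, column, and value are illustrative):
+    </p>
+
+<pre class="pre codeblock"><code>set SCAN_NODE_CODEGEN_THRESHOLD=2000000000;
+-- A small table with a long IN list; code generation for the scan
+-- could otherwise dominate the total query time.
+select count(*) from small_table where id in (0, 1, 2, 3 /* ... thousands of entries ... */);
+</code></pre>
+
+    <p class="p">
+      Because this is a query option, the change affects only the current session.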
+ </p> + + <p class="p"> + Because of the way the work for queries is divided internally, this option might not + affect code generation for all kinds of queries. If a plan fragment contains a scan + node and some other kind of plan node, code generation still occurs regardless of + this option setting. + </p> + + <p class="p"> + To use this option effectively, you should be familiar with reading query profile output + to determine the proportion of time spent in the code generation phase, and whether + code generation is enabled or not for specific plan fragments. + </p> + + + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_schedule_random_replica.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_schedule_random_replica.html b/docs/build/html/topics/impala_schedule_random_replica.html new file mode 100644 index 0000000..9826960 --- /dev/null +++ b/docs/build/html/topics/impala_schedule_random_replica.html @@ -0,0 +1,83 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="schedule_random_replica"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>SCHEDULE_RANDOM_REPLICA Query Option (Impala 2.5 or higher only)</title></head><body id="schedule_random_replica"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">SCHEDULE_RANDOM_REPLICA Query Option (<span class="keyword">Impala 2.5</span> or higher only)</h1> + + + + <div class="body conbody"> + + <p class="p"> + + </p> + + <p class="p"> + The <code class="ph codeph">SCHEDULE_RANDOM_REPLICA</code> query option fine-tunes the algorithm for deciding which host + processes each HDFS data block. It only applies to tables and partitions that are not enabled + for the HDFS caching feature. + </p> + + <p class="p"> + <strong class="ph b">Type:</strong> Boolean; recognized values are 1 and 0, or <code class="ph codeph">true</code> and <code class="ph codeph">false</code>; + any other value interpreted as <code class="ph codeph">false</code> + </p> + + <p class="p"> + <strong class="ph b">Default:</strong> <code class="ph codeph">false</code> + </p> + + <p class="p"> + <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.5.0</span> + </p> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + + <p class="p"> + In the presence of HDFS cached replicas, Impala randomizes + which host processes each cached data block. 
To ensure that HDFS data blocks are cached on more
+      than one host, use the <code class="ph codeph">WITH REPLICATION</code> clause along with
+      the <code class="ph codeph">CACHED IN</code> clause in a
+      <code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">ALTER TABLE</code> statement.
+      Specify a replication value greater than or equal to the HDFS block replication factor.
+    </p>
+
+    <p class="p">
+      The <code class="ph codeph">SCHEDULE_RANDOM_REPLICA</code> query option applies to tables and partitions
+      that <em class="ph i">do not</em> use HDFS caching.
+      By default, Impala estimates how much work each host has done for
+      the query, and selects the host that has the lowest workload.
+      This algorithm is intended to reduce CPU hotspots arising when the
+      same host is selected to process multiple data blocks, but hotspots
+      might still arise for some combinations of queries and data layout.
+      When the <code class="ph codeph">SCHEDULE_RANDOM_REPLICA</code> option is enabled,
+      Impala also randomizes the scheduling algorithm for non-HDFS cached blocks,
+      which can further reduce the chance of CPU hotspots.
+    </p>
+
+    <p class="p">
+      This query option works in conjunction with the work scheduling improvements
+      in <span class="keyword">Impala 2.5</span> and higher. The scheduling improvements
+      distribute the processing for cached HDFS data blocks to minimize hotspots:
+      if a data block is cached on more than one host, Impala chooses which host
+      to process each block based on which host has read the fewest bytes during
+      the current query. Enable the <code class="ph codeph">SCHEDULE_RANDOM_REPLICA</code> setting if CPU hotspots
+      still persist because of cases where hosts are <span class="q">"tied"</span> in terms of
+      the amount of work done; by default, Impala picks the first eligible host
+      in this case.
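+    </p>
+
+    <p class="p">
+      For example, to break such ties randomly for subsequent queries in the current session
+      (a sketch; enable it only after confirming that hotspots persist):
+    </p>
+
+<pre class="pre codeblock"><code>set SCHEDULE_RANDOM_REPLICA=true;
+</code></pre>
+
+    <p class="p">
+      Setting the option back to <code class="ph codeph">false</code> restores the default behavior.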
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Related information:</strong>
+    </p>
+    <p class="p">
+      <a class="xref" href="impala_perf_hdfs_caching.html#hdfs_caching">Using HDFS Caching with Impala (Impala 2.1 or higher only)</a>,
+      <a class="xref" href="impala_scalability.html#scalability_hotspots">Avoiding CPU Hotspots for HDFS Cached Data</a>,
+      <a class="xref" href="impala_replica_preference.html#replica_preference">REPLICA_PREFERENCE Query Option (Impala 2.7 or higher only)</a>
+    </p>
+
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_schema_design.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_schema_design.html b/docs/build/html/topics/impala_schema_design.html
new file mode 100644
index 0000000..6825c5d
--- /dev/null
+++ b/docs/build/html/topics/impala_schema_design.html
@@ -0,0 +1,184 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_planning.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="schema_design"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Guidelines for Designing Impala Schemas</title></head><body id="schema_design"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Guidelines for Designing Impala Schemas</h1>
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+      The guidelines in this topic help you to construct an optimized and scalable schema, one that integrates well
+      with your existing data management processes. Use these guidelines as a checklist when doing any
+      proof-of-concept work, porting exercise, or before deploying to production.
+    </p>
+
+    <p class="p">
+      If you are adapting an existing database or Hive schema for use with Impala, read the guidelines in this
+      section and then see <a class="xref" href="impala_porting.html#porting">Porting SQL from Other Database Systems to Impala</a> for specific porting and compatibility tips.
+    </p>
+
+    <p class="p toc inpage"></p>
+
+    <section class="section" id="schema_design__schema_design_text_vs_binary"><h2 class="title sectiontitle">Prefer binary file formats over text-based formats.</h2>
+
+      <p class="p">
+        To save space and improve memory usage and query performance, use binary file formats for any large or
+        intensively queried tables. Parquet file format is the most efficient for data warehouse-style analytic
+        queries. Avro is the other binary file format that Impala supports, which you might already have as part of
+        a Hadoop ETL pipeline.
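+      </p>
+
+      <p class="p">
+        For example, a minimal sketch of an analytic table stored in Parquet format
+        (the table and column names are illustrative):
+      </p>
+
+<pre class="pre codeblock"><code>create table sales_fact (id bigint, amount decimal(12,2), sale_date timestamp)
+  stored as parquet;
+</code></pre>
+
+      <p class="p">
+        A raw text staging table can be converted into such a table with an
+        <code class="ph codeph">INSERT ... SELECT</code> statement, as the guidelines below suggest.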
+ </p> + + <p class="p"> + Although Impala can create and query tables with the RCFile and SequenceFile file formats, such tables are + relatively bulky due to the text-based nature of those formats, and are not optimized for data + warehouse-style queries due to their row-oriented layout. Impala does not support <code class="ph codeph">INSERT</code> + operations for tables with these file formats. + </p> + + <p class="p"> + Guidelines: + </p> + + <ul class="ul"> + <li class="li"> + For an efficient and scalable format for large, performance-critical tables, use the Parquet file format. + </li> + + <li class="li"> + To deliver intermediate data during the ETL process, in a format that can also be used by other Hadoop + components, Avro is a reasonable choice. + </li> + + <li class="li"> + For convenient import of raw data, use a text table instead of RCFile or SequenceFile, and convert to + Parquet in a later stage of the ETL process. + </li> + </ul> + </section> + + <section class="section" id="schema_design__schema_design_compression"><h2 class="title sectiontitle">Use Snappy compression where practical.</h2> + + + + <p class="p"> + Snappy compression involves low CPU overhead to decompress, while still providing substantial space + savings. In cases where you have a choice of compression codecs, such as with the Parquet and Avro file + formats, use Snappy compression unless you find a compelling reason to use a different codec. + </p> + </section> + + <section class="section" id="schema_design__schema_design_numeric_types"><h2 class="title sectiontitle">Prefer numeric types over strings.</h2> + + + + <p class="p"> + If you have numeric values that you could treat as either strings or numbers (such as + <code class="ph codeph">YEAR</code>, <code class="ph codeph">MONTH</code>, and <code class="ph codeph">DAY</code> for partition key columns), define + them as the smallest applicable integer types. For example, <code class="ph codeph">YEAR</code> can be + <code class="ph codeph">SMALLINT</code>, <code class="ph codeph">MONTH</code> and <code class="ph codeph">DAY</code> can be <code class="ph codeph">TINYINT</code>. + Although you might not see any difference in the way partitioned tables or text files are laid out on disk, + using numeric types will save space in binary formats such as Parquet, and in memory when doing queries, + particularly resource-intensive queries such as joins. + </p> + </section> + + + + <section class="section" id="schema_design__schema_design_partitioning"><h2 class="title sectiontitle">Partition, but do not over-partition.</h2> + + + + <p class="p"> + Partitioning is an important aspect of performance tuning for Impala. Follow the procedures in + <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a> to set up partitioning for your biggest, most + intensively queried tables. + </p> + + <p class="p"> + If you are moving to Impala from a traditional database system, or just getting started in the Big Data + field, you might not have enough data volume to take advantage of Impala parallel queries with your + existing partitioning scheme. For example, if you have only a few tens of megabytes of data per day, + partitioning by <code class="ph codeph">YEAR</code>, <code class="ph codeph">MONTH</code>, and <code class="ph codeph">DAY</code> columns might be + too granular. Most of your cluster might be sitting idle during queries that target a single day, or each + node might have very little work to do. 
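+      </p>
+
+      <p class="p">
+        For example, the following hypothetical table is partitioned by year and month rather than by
+        individual day, so that each partition holds a month's worth of data (the schema is illustrative):
+      </p>
+
+<pre class="pre codeblock"><code>create table web_logs (event_ts timestamp, details string)
+  partitioned by (year smallint, month tinyint)
+  stored as parquet;
+</code></pre>
+
+      <p class="p">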
Consider reducing the number of partition key columns so that each
+        partition directory contains several gigabytes worth of data.
+      </p>
+
+      <p class="p">
+        For example, consider a Parquet table where each data file is 1 HDFS block, with a maximum block size of 1
+        GB. (In Impala 2.0 and later, the default Parquet block size is reduced to 256 MB. For this exercise, let's
+        assume you have bumped the size back up to 1 GB by setting the query option
+        <code class="ph codeph">PARQUET_FILE_SIZE=1g</code>.) If you have a 10-node cluster, you need 10 data files (up to 10 GB)
+        to give each node some work to do for a query. But each core on each machine can process a separate data
+        block in parallel. With 16-core machines on a 10-node cluster, a query could process up to 160 GB fully in
+        parallel. If there are only a few data files per partition, not only are most cluster nodes sitting idle
+        during queries, so are most cores on those machines.
+      </p>
+
+      <p class="p">
+        You can reduce the Parquet block size to as low as 128 MB or 64 MB to increase the number of files per
+        partition and improve parallelism. But also consider reducing the level of partitioning so that analytic
+        queries have enough data to work with.
+      </p>
+    </section>
+
+    <section class="section" id="schema_design__schema_design_compute_stats"><h2 class="title sectiontitle">Always compute stats after loading data.</h2>
+
+      <p class="p">
+        Impala makes extensive use of statistics about data in the overall table and in each column, to help plan
+        resource-intensive operations such as join queries and inserting into partitioned Parquet tables. Because
+        this information is only available after data is loaded, run the <code class="ph codeph">COMPUTE STATS</code> statement
+        on a table after loading or replacing data in a table or partition.
+      </p>
+
+      <p class="p">
+        Having accurate statistics can make the difference between a successful operation and one that fails due to
+        an out-of-memory error or a timeout. When you encounter performance or capacity issues, always use the
+        <code class="ph codeph">SHOW TABLE STATS</code> and <code class="ph codeph">SHOW COLUMN STATS</code> statements to check whether
+        the statistics are present and up-to-date for all tables in the query.
+      </p>
+
+      <p class="p">
+        When doing a join query, Impala consults the statistics for each joined table to determine their relative
+        sizes and to estimate the number of rows produced in each join stage. When doing an <code class="ph codeph">INSERT</code>
+        into a Parquet table, Impala consults the statistics for the source table to determine how to distribute
+        the work of constructing the data files for each partition.
+      </p>
+
+      <p class="p">
+        See <a class="xref" href="impala_compute_stats.html#compute_stats">COMPUTE STATS Statement</a> for the syntax of the <code class="ph codeph">COMPUTE
+        STATS</code> statement, and <a class="xref" href="impala_perf_stats.html#perf_stats">Table and Column Statistics</a> for all the performance
+        considerations for table and column statistics.
+      </p>
+    </section>
+
+    <section class="section" id="schema_design__schema_design_explain"><h2 class="title sectiontitle">Verify sensible execution plans with EXPLAIN and SUMMARY.</h2>
+
+      <p class="p">
+        Before executing a resource-intensive query, use the <code class="ph codeph">EXPLAIN</code> statement to get an overview
+        of how Impala intends to parallelize the query and distribute the work.
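+      </p>
+
+      <p class="p">
+        For example (the table and column names are illustrative):
+      </p>
+
+<pre class="pre codeblock"><code>explain select count(*) from sales_fact join store_dim using (store_id);
+</code></pre>
+
+      <p class="p">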
If you see that the query plan is
+        inefficient, you can take tuning steps such as changing file formats, using partitioned tables, running the
+        <code class="ph codeph">COMPUTE STATS</code> statement, or adding query hints. For information about all of these
+        techniques, see <a class="xref" href="impala_performance.html#performance">Tuning Impala for Performance</a>.
+      </p>
+
+      <p class="p">
+        After you run a query, you can see performance-related information about how it actually ran by issuing the
+        <code class="ph codeph">SUMMARY</code> command in <span class="keyword cmdname">impala-shell</span>. Prior to Impala 1.4, you would use
+        the <code class="ph codeph">PROFILE</code> command, but its highly technical output was only useful for the most
+        experienced users. <code class="ph codeph">SUMMARY</code>, new in Impala 1.4, summarizes the most useful information for
+        all stages of execution, for all nodes rather than splitting out figures for each node.
+      </p>
+    </section>
+
+
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_planning.html">Planning for Impala Deployment</a></div></div></nav></article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_schema_objects.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_schema_objects.html b/docs/build/html/topics/impala_schema_objects.html
new file mode 100644
index 0000000..b8ea7cd
--- /dev/null
+++ b/docs/build/html/topics/impala_schema_objects.html
@@ -0,0 +1,48 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_langref.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_aliases.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_databases.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_functions_overview.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_identifiers.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_tables.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_views.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="schema_objects"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Impala Schema Objects and Object Names</title></head><body id="schema_objects"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Impala Schema Objects and Object Names</h1>
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+
+      With Impala, you work with schema objects that are familiar to database users: primarily databases, tables, views,
+      and functions. The SQL syntax to work with these objects is explained in
+      <a class="xref" href="impala_langref_sql.html#langref_sql">Impala SQL Statements</a>.
This section explains the conceptual knowledge you need to + work with these objects and the various ways to specify their names. + </p> + + <p class="p"> + Within a table, partitions can also be considered a kind of object. Partitioning is an important subject for + Impala, with its own documentation section covering use cases and performance considerations. See + <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a> for details. + </p> + + <p class="p"> + Impala does not have a counterpart of the <span class="q">"tablespace"</span> notion from some database systems. By default, + all the data files for a database, table, or partition are located in nested directories within the HDFS file + system. You can also specify a particular HDFS location for a given Impala table or partition. The raw data + for these objects is represented as a collection of data files, providing the flexibility to load data by + simply moving files into the expected HDFS location. + </p> + + <p class="p"> + Information about the schema objects is held in the + <a class="xref" href="impala_hadoop.html#intro_metastore">metastore</a> database. This database is shared between + Impala and Hive, allowing each to create, drop, and query the other's databases, tables, and so on. When + Impala makes a change to schema objects through a <code class="ph codeph">CREATE</code>, <code class="ph codeph">ALTER</code>, + <code class="ph codeph">DROP</code>, <code class="ph codeph">INSERT</code>, or <code class="ph codeph">LOAD DATA</code> statement, it broadcasts those + changes to all nodes in the cluster through the <a class="xref" href="impala_components.html#intro_catalogd">catalog + service</a>. When you make such changes through Hive or by manipulating HDFS files directly, you issue + the <a class="xref" href="impala_refresh.html#refresh">REFRESH</a> or + <a class="xref" href="impala_invalidate_metadata.html#invalidate_metadata">INVALIDATE METADATA</a> statements on the + Impala side so that Impala recognizes the newly loaded data, new tables, and so on.
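+ </p> + + <p class="p"> + For example, the following statements are a minimal sketch (the table name <code class="ph codeph">t1</code> is hypothetical) + of making Impala aware of changes performed outside of Impala: + </p> + +<pre class="pre codeblock"><code>-- After adding new data files to an existing table through Hive or HDFS commands. +refresh t1; + +-- After creating a new table through Hive, or making other sweeping metadata changes. +invalidate metadata t1; +</code></pre>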
+ + <p class="p toc"></p> + </div> +<nav role="navigation" class="related-links"><ul class="ullinks"><li class="link ulchildlink"><strong><a href="../topics/impala_aliases.html">Overview of Impala Aliases</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_databases.html">Overview of Impala Databases</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_functions_overview.html">Overview of Impala Functions</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_identifiers.html">Overview of Impala Identifiers</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_tables.html">Overview of Impala Tables</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_views.html">Overview of Impala Views</a></strong><br></li></ul><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_langref.html">Impala SQL Language Reference</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_scratch_limit.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_scratch_limit.html b/docs/build/html/topics/impala_scratch_limit.html new file mode 100644 index 0000000..98bac93 --- /dev/null +++ b/docs/build/html/topics/impala_scratch_limit.html @@ -0,0 +1,77 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="scratch_limit"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>SCRATCH_LIMIT Query Option</title></head><body id="scratch_limit"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">SCRATCH_LIMIT Query Option</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Specifies the maximum amount of disk storage, in bytes, that any Impala query can consume + on any host using the <span class="q">"spill to disk"</span> mechanism for queries that exceed + the memory limit. + </p> + + <p class="p"> + <strong class="ph b">Syntax:</strong> + </p> + + <p class="p"> + Specify the size in bytes, or with a trailing <code class="ph codeph">m</code> or <code class="ph codeph">g</code> character to indicate + megabytes or gigabytes. For example: + </p> + + +<pre class="pre codeblock"><code>-- 128 megabytes. +set SCRATCH_LIMIT=134217728; + +-- 512 megabytes. +set SCRATCH_LIMIT=512m; + +-- 1 gigabyte. +set SCRATCH_LIMIT=1g; +</code></pre> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + + <p class="p"> + A value of zero turns off the spill to disk feature for queries + in the current session, causing them to fail immediately if they + exceed the memory limit.
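+ </p> + + <p class="p"> + For example, a minimal sketch of turning off spilling for the current session: + </p> + +<pre class="pre codeblock"><code>-- With no scratch space allowed, a query that exceeds the memory limit +-- now fails immediately instead of spilling to disk. +set SCRATCH_LIMIT=0; +</code></pre>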
+ + <p class="p"> + The amount of memory used per host for a query is limited by the + <code class="ph codeph">MEM_LIMIT</code> query option. + </p> + + <p class="p"> + The more Impala daemon hosts in the cluster, the less memory each host uses for a given query, + and therefore the less scratch space each host requires for queries that + exceed the memory limit. + </p> + + <p class="p"> + <strong class="ph b">Type:</strong> numeric, with optional unit specifier + </p> + + <p class="p"> + <strong class="ph b">Default:</strong> -1 (amount of spill space is unlimited) + </p> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + + <p class="p"> + <a class="xref" href="impala_scalability.html#spill_to_disk">SQL Operations that Spill to Disk</a>, + <a class="xref" href="impala_mem_limit.html#mem_limit">MEM_LIMIT Query Option</a> + </p> + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_security.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_security.html b/docs/build/html/topics/impala_security.html new file mode 100644 index 0000000..45d9923 --- /dev/null +++ b/docs/build/html/topics/impala_security.html @@ -0,0 +1,99 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_security_guidelines.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_security_files.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_security_install.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_security_metastore.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_security_webui.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_ssl.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_authorization.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_authentication.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_auditing.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_lineage.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="security"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Impala Security</title></head><body id="security"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1"><span class="ph">Impala Security</span></h1> + + + <div class="body conbody"> + + <p class="p"> + Impala includes a fine-grained authorization framework for Hadoop, based on Apache Sentry. + Sentry authorization was added in Impala 1.1.0.
Together with the Kerberos + authentication framework, Sentry takes Hadoop security to the level required by + highly regulated industries such as healthcare, financial services, and government. Impala also includes + an auditing capability, added in Impala 1.1.1: Impala generates audit data that can be + consumed, filtered, and visualized by cluster-management components focused on governance. + </p> + + <p class="p"> + The Impala security features have several objectives. At the most basic level, security prevents + accidents or mistakes that could disrupt application processing, delete or corrupt data, or reveal data to + unauthorized users. More advanced security features and practices can harden the system against malicious + users trying to gain unauthorized access or perform other disallowed operations. The auditing feature + provides a way to confirm that no unauthorized access occurred, and to detect whether any such attempts were + made. This is a critical set of features for production deployments in large organizations that handle + important or sensitive data. It sets the stage for multi-tenancy, where multiple applications run + concurrently and are prevented from interfering with each other. + </p> + + <p class="p"> + The material in this section presumes that you are already familiar with administering secure Linux systems. + That is, you should know the general security practices for Linux and Hadoop, and their associated commands + and configuration files. For example, you should know how to create Linux users and groups, manage Linux + group membership, set Linux and HDFS file permissions and ownership, and designate the default permissions + and ownership for new files. You should be familiar with the configuration of the nodes in your Hadoop + cluster, and know how to apply configuration changes or run a set of commands across all the nodes. + </p> + + <p class="p"> + The security features are divided into these broad categories: + </p> + + <dl class="dl"> + + + <dt class="dt dlterm"> + authorization + </dt> + + <dd class="dd"> + Which users are allowed to access which resources, and what operations are they allowed to perform? + Impala relies on the open source Sentry project for authorization. By default (when authorization is not + enabled), Impala does all read and write operations with the privileges of the <code class="ph codeph">impala</code> + user, which is suitable for a development/test environment but not for a secure production environment. + When authorization is enabled, Impala uses the OS user ID of the user who runs + <span class="keyword cmdname">impala-shell</span> or another client program, and associates various privileges with each + user. See <a class="xref" href="impala_authorization.html#authorization">Enabling Sentry Authorization for Impala</a> for details about setting up and managing + authorization. + </dd> + + + + + + <dt class="dt dlterm"> + authentication + </dt> + + <dd class="dd"> + How does Impala verify the identity of the user to confirm that they really are allowed to exercise the + privileges assigned to that user? Impala relies on the Kerberos subsystem for authentication. See + <a class="xref" href="impala_kerberos.html#kerberos">Enabling Kerberos Authentication for Impala</a> for details about setting up and managing authentication. + </dd> + + + + + + <dt class="dt dlterm"> + auditing + </dt> + + <dd class="dd"> + What operations were attempted, and did they succeed or not?
This feature provides a way to look back and + diagnose whether attempts were made to perform unauthorized operations. You use this information to track + down suspicious activity, and to see where changes are needed in authorization policies. The audit data + produced by this feature can be collected and presented in a user-friendly form by cluster-management + software. See <a class="xref" href="impala_auditing.html#auditing">Auditing Impala Operations</a> for details about setting up and managing + auditing. + </dd> + + + </dl> + + <p class="p toc"></p> + + + </div> +<nav role="navigation" class="related-links"><ul class="ullinks"><li class="link ulchildlink"><strong><a href="../topics/impala_security_guidelines.html">Security Guidelines for Impala</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_security_files.html">Securing Impala Data and Log Files</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_security_install.html">Installation Considerations for Impala Security</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_security_metastore.html">Securing the Hive Metastore Database</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_security_webui.html">Securing the Impala Web User Interface</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_ssl.html">Configuring TLS/SSL for Impala</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_authorization.html">Enabling Sentry Authorization for Impala</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_authentication.html">Impala Authentication</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_auditing.html">Auditing Impala Operations</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_lineage.html">Viewing Lineage Information for Impala Data</a></strong><br></li></ul></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_security_files.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_security_files.html b/docs/build/html/topics/impala_security_files.html new file mode 100644 index 0000000..e980e60 --- /dev/null +++ b/docs/build/html/topics/impala_security_files.html @@ -0,0 +1,58 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_security.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="secure_files"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Securing Impala Data and Log Files</title></head><body id="secure_files"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Securing Impala Data and Log Files</h1> + + + <div class="body conbody"> + + <p class="p"> + One
aspect of security is to protect files from unauthorized access at the filesystem level. For example, if + you store sensitive data in HDFS, you set permissions on the associated files and directories in HDFS to + restrict read and write access to the appropriate users and groups. + </p> + + <p class="p"> + If you issue queries containing sensitive values in the <code class="ph codeph">WHERE</code> clause, such as financial + account numbers, those values are stored in Impala log files in the Linux filesystem, and you must also secure + those files. For the locations of Impala log files, see <a class="xref" href="impala_logging.html#logging">Using Impala Logging</a>. + </p> + + <p class="p"> + All Impala read and write operations are performed under the filesystem privileges of the + <code class="ph codeph">impala</code> user. The <code class="ph codeph">impala</code> user must be able to read all directories and data + files that you query, and write into all the directories and data files for <code class="ph codeph">INSERT</code> and + <code class="ph codeph">LOAD DATA</code> statements. At a minimum, make sure the <code class="ph codeph">impala</code> user is in the + <code class="ph codeph">hive</code> group so that it can access files and directories shared between Impala and Hive. See + <a class="xref" href="impala_prereqs.html#prereqs_account">User Account Requirements</a> for more details. + </p> + + <p class="p"> + Setting file permissions is necessary for Impala to function correctly, but is not an effective security + practice by itself: + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + The way to ensure that users can query only the databases and tables they are allowed + to access is to set up Sentry authorization, as explained in + <a class="xref" href="impala_authorization.html#authorization">Enabling Sentry Authorization for Impala</a>. With authorization enabled, the checking of the user + ID and group is done by Impala, and unauthorized access is blocked by Impala itself. The actual low-level + read and write requests are still done by the <code class="ph codeph">impala</code> user, so you must have appropriate + file and directory permissions for that user ID. + </p> + </li> + + <li class="li"> + <p class="p"> + You must also set up Kerberos authentication, as described in <a class="xref" href="impala_kerberos.html#kerberos">Enabling Kerberos Authentication for Impala</a>, + so that users can connect only from trusted hosts. With Kerberos enabled, if someone connects a new host to + the network and creates user IDs that match your privileged IDs, they will be blocked from connecting to + Impala at all from that host. + </p> + </li> + </ul> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_security.html">Impala Security</a></div></div></nav></article></main></body></html> \ No newline at end of file