http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_scalability.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_scalability.xml b/docs/topics/impala_scalability.xml index 9abb1c5..ab4ef36 100644 --- a/docs/topics/impala_scalability.xml +++ b/docs/topics/impala_scalability.xml @@ -3,7 +3,19 @@ <title>Scalability Considerations for Impala</title> <titlealts audience="PDF"><navtitle>Scalability Considerations</navtitle></titlealts> - + <prolog> + <metadata> + <data name="Category" value="Performance"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Planning"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Memory"/> + <data name="Category" value="Scalability"/> + <!-- Using domain knowledge about Impala, sizing, etc. to decide what to mark as 'Proof of Concept'. --> + <data name="Category" value="Proof of Concept"/> + </metadata> + </prolog> <conbody> @@ -14,8 +26,808 @@ scalability issues, such as a single slow node that causes performance problems for queries. </p> - + <p outputclass="toc inpage"/> + + <p conref="../shared/impala_common.xml#common/cookbook_blurb"/> + + </conbody> + + <concept audience="Cloudera" id="scalability_memory"> + + <title>Overview and Guidelines for Impala Memory Usage</title> + <prolog> + <metadata> + <data name="Category" value="Memory"/> + <data name="Category" value="Concepts"/> + <data name="Category" value="Best Practices"/> + <data name="Category" value="Guidelines"/> + </metadata> + </prolog> + + <conbody> + +<!-- +Outline adapted from Alan Choi's "best practices" and/or "performance cookbook" papers. 
+--> + +<codeblock>Memory Usage – the Basics +* Memory is used by: +* Hash join – RHS tables after decompression, filtering and projection +* Group by – proportional to the #groups +* Parquet writer buffer – 1GB per partition +* IO buffer (shared across queries) +* Metadata cache (no more than 1GB typically) +* Memory held and reused by later query +* Impala releases memory from time to time starting in 1.4. + +Memory Usage – Estimating Memory Usage +* Use Explain Plan +* Requires statistics! Mem estimate without stats is meaningless. +* Reports per-host memory requirement for this cluster size. +* Re-run if you've re-sized the cluster! +[image of explain plan] + +Memory Usage – Estimating Memory Usage +* EXPLAIN's memory estimate issues +* Can be way off – much higher or much lower. +* group by's estimate can be particularly off – when there's a large number of group by columns. +* Mem estimate = NDV of group by column 1 * NDV of group by column 2 * ... NDV of group by column n +* Ignore EXPLAIN's estimate if it's too high! • Do your own estimate for group by +* GROUP BY mem usage = (total number of groups * size of each row) + (total number of groups * size of each row) / num nodes + +Memory Usage – Finding Actual Memory Usage +* Search for "Per Node Peak Memory Usage" in the profile. +This is accurate. Use it for production capacity planning. + +Memory Usage – Actual Memory Usage +* For complex queries, how do I know which part of my query is using too much memory? +* Use the ExecSummary from the query profile! +- But is that "Peak Mem" number aggregate or per-node? +[image of executive summary] + +Memory Usage – Hitting Mem-limit +* Top causes (in order) of hitting mem-limit even when running a single query: +1. Lack of statistics +2. Lots of joins within a single query +3. Big-table joining big-table +4.
Gigantic group by + +Memory Usage – Hitting Mem-limit +Lack of stats +* Wrong join order, wrong join strategy, wrong insert strategy +* Explain Plan tells you that! +[image of explain plan] +* Fix: Compute Stats table + +Memory Usage – Hitting Mem-limit +Lots of joins within a single query +* select...from fact, dim1, dim2,dim3,...dimN where ... +* Each dim tbl can fit in memory, but not all of them together +* As of Impala 1.4, Impala might choose the wrong plan – BROADCAST +FIX 1: use shuffle hint +select ... from fact join [shuffle] dim1 on ... join [shuffle] dim2 ... +FIX 2: pre-join the dim tables (if possible), for example: +create table dims as select ... from dim1 join dim2 on ...; +select ... from fact join [shuffle] dims on ... +* fewer joins => better perf! + +Memory Usage: Hitting Mem-limit +Big-table joining big-table +* Big-table (after decompression, filtering, and projection) is a table that is bigger than total cluster memory size. +* Impala 2.0 will do this (via disk-based join). Consider using Hive for now. +* (Advanced) For a simple query, you can try this advanced workaround – per-partition join +* Requires the partition key to be part of the join key +select ... from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (1,2,3) + union all +select ... from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (4,5,6) + +Memory Usage: Hitting Mem-limit +Gigantic group by +* The total number of distinct groups is huge, such as group by userid. +* Impala 2.0 will do this (via disk-based agg). Consider using Hive for now. +- Is this one of the cases where people were unhappy we recommended Hive? +* (Advanced) For a simple query, you can try this advanced workaround – per-partition agg +* Requires the partition key to be part of the group by +select part_key, col1, col2, ... agg(..) from tbl where + part_key in (1,2,3) + group by part_key, col1, col2, ... +union all +select part_key, col1, col2, ... agg(..) from tbl where + part_key in (4,5,6) + group by part_key, col1, col2, ...
+ +Memory Usage: Additional Notes +* Use explain plan for estimate; use profile for accurate measure +* Data skew can cause uneven memory usage +* Review previous common issues on out-of-memory +* Note: Even with disk-based joins, you'll want to review these steps to speed up queries and use memory more efficiently +</codeblock> + </conbody> + </concept> + + <concept id="scalability_catalog"> + + <title>Impact of Many Tables or Partitions on Impala Catalog Performance and Memory Usage</title> + + <conbody> + + <p audience="Cloudera"> + Details to fill in in future: Impact of <q>load catalog in background</q> option. + Changing timeouts. Related Cloudera Manager settings. + </p> + + <p> + Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized for tables + containing relatively few, large data files. Schemas containing thousands of tables, or tables containing + thousands of partitions, can encounter performance issues during startup or during DDL operations such as + <codeph>ALTER TABLE</codeph> statements. + </p> + + <note type="important" rev="TSB-168"> + <p> + Because of a change in the default heap size for the <cmdname>catalogd</cmdname> daemon in + CDH 5.7 / Impala 2.5 and higher, the following procedure to increase the <cmdname>catalogd</cmdname> + memory limit might be required following an upgrade to CDH 5.7 / Impala 2.5, even if not + needed previously. + </p> + </note> + + <p conref="../shared/impala_common.xml#common/increase_catalogd_heap_size"/> + + </conbody> + </concept> + + <concept rev="2.1.0" id="statestore_scalability"> + + <title>Scalability Considerations for the Impala Statestore</title> + + <conbody> + + <p> + Before CDH 5.3, the statestore sent only one kind of message to its subscribers. This message contained all + updates for any topics that a subscriber had subscribed to.
It also served to let subscribers know that the + statestore had not failed, and conversely the statestore used the success of sending a heartbeat to a + subscriber to decide whether or not the subscriber had failed. + </p> + + <p> + Combining topic updates and failure detection in a single message led to bottlenecks in clusters with large + numbers of tables, partitions, and HDFS data blocks. When the statestore was overloaded with metadata + updates to transmit, heartbeat messages were sent less frequently, sometimes causing subscribers to time + out their connection with the statestore. Increasing the subscriber timeout and decreasing the frequency of + statestore heartbeats worked around the problem, but reduced responsiveness when the statestore failed or + restarted. + </p> + + <p> + As of CDH 5.3, the statestore now sends topic updates and heartbeats in separate messages. This allows the + statestore to send and receive a steady stream of lightweight heartbeats, and removes the requirement to + send topic updates according to a fixed schedule, reducing statestore network overhead. + </p> + + <p> + The statestore now has the following relevant configuration flags for the <cmdname>statestored</cmdname> + daemon: + </p> + + <dl> + <dlentry id="statestore_num_update_threads"> + + <dt> + <codeph>-statestore_num_update_threads</codeph> + </dt> + + <dd> + The number of threads inside the statestore dedicated to sending topic updates. You should not + typically need to change this value. + <p> + <b>Default:</b> 10 + </p> + </dd> + + </dlentry> + + <dlentry id="statestore_update_frequency_ms"> + + <dt> + <codeph>-statestore_update_frequency_ms</codeph> + </dt> + + <dd> + The frequency, in milliseconds, with which the statestore tries to send topic updates to each + subscriber. This is a best-effort value; if the statestore is unable to meet this frequency, it sends + topic updates as fast as it can. You should not typically need to change this value. 
+ <p> + <b>Default:</b> 2000 + </p> + </dd> + + </dlentry> + + <dlentry id="statestore_num_heartbeat_threads"> + + <dt> + <codeph>-statestore_num_heartbeat_threads</codeph> + </dt> + + <dd> + The number of threads inside the statestore dedicated to sending heartbeats. You should not typically + need to change this value. + <p> + <b>Default:</b> 10 + </p> + </dd> + + </dlentry> + + <dlentry id="statestore_heartbeat_frequency_ms"> + + <dt> + <codeph>-statestore_heartbeat_frequency_ms</codeph> + </dt> + + <dd> + The frequency, in milliseconds, with which the statestore tries to send heartbeats to each subscriber. + This value should be good for large catalogs and clusters up to approximately 150 nodes. Beyond that, + you might need to increase this value to lengthen the interval between heartbeat messages. + <p> + <b>Default:</b> 1000 (one heartbeat message every second) + </p> + </dd> + + </dlentry> + </dl> + + <p> + As of CDH 5.3, not all of these flags are present in the Cloudera Manager user interface. Some must be set + using the <uicontrol>Advanced Configuration Snippet</uicontrol> fields for the statestore component. + </p> + + <p> + If it takes a very long time for a cluster to start up, and <cmdname>impala-shell</cmdname> consistently + displays <codeph>This Impala daemon is not ready to accept user requests</codeph>, the statestore might be + taking too long to send the entire catalog topic to the cluster. In this case, consider adding + <codeph>--load_catalog_in_background=false</codeph> to your catalog service configuration. This setting + stops the catalog service from loading the entire catalog into memory at cluster startup. Instead, metadata for + each table is loaded when the table is accessed for the first time.
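As a rough illustration of the on-first-access loading that <codeph>--load_catalog_in_background=false</codeph> selects, here is a toy sketch (the <codeph>LazyCatalog</codeph> class and <codeph>loader</codeph> callable are invented for illustration only; they are not Impala APIs):

```python
class LazyCatalog:
    """Toy model of load_catalog_in_background=false: metadata is
    fetched the first time a table is accessed, then cached."""

    def __init__(self, loader):
        self._loader = loader      # callable that fetches metadata for one table
        self._cache = {}           # table name -> metadata
        self.loads = 0             # count of (slow) metastore round trips

    def table_metadata(self, name):
        if name not in self._cache:    # first access triggers the load
            self._cache[name] = self._loader(name)
            self.loads += 1
        return self._cache[name]

# With background loading disabled, startup does no metadata work; only
# the two distinct tables actually touched below are ever loaded.
catalog = LazyCatalog(loader=lambda t: {"name": t, "partitions": []})
catalog.table_metadata("sales")
catalog.table_metadata("sales")      # cache hit, no extra load
catalog.table_metadata("customers")
```

The tradeoff this models: startup is fast, but the first query against each table pays the loading cost.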
+ </p> + </conbody> + </concept> + + <concept audience="Cloudera" id="scalability_cluster_size"> + + <title>Scalability Considerations for Impala Cluster Size and Topology</title> + + <conbody> + + <p> + </p> + </conbody> + </concept> + + <concept audience="Cloudera" id="concurrent_connections"> + + <title>Scaling the Number of Concurrent Connections</title> + + <conbody> + + <p></p> + </conbody> + </concept> + + <concept rev="2.0.0" id="spill_to_disk"> + + <title>SQL Operations that Spill to Disk</title> + + <conbody> + + <p> + Certain memory-intensive operations write temporary data to disk (known as <term>spilling</term> to disk) + when Impala is close to exceeding its memory limit on a particular host. + </p> + + <p> + The result is a query that completes successfully, rather than failing with an out-of-memory error. The + tradeoff is decreased performance due to the extra disk I/O to write the temporary data and read it back + in. The slowdown could potentially be significant. Thus, while this feature improves reliability, + you should optimize your queries, system parameters, and hardware configuration to make this spilling a rare occurrence. + </p> + + <p> + <b>What kinds of queries might spill to disk:</b> + </p> + + <p> + Several SQL clauses and constructs require memory allocations that could activate the spilling mechanism: + </p> + <ul> + <li> + <p> + When a query uses a <codeph>GROUP BY</codeph> clause for columns + with millions or billions of distinct values, Impala keeps a + similar number of temporary results in memory, to accumulate the + aggregate results for each value in the group. + </p> + </li> + <li> + <p> + When large tables are joined together, Impala keeps the values of + the join columns from one table in memory, to compare them to + incoming values from the other table. + </p> + </li> + <li> + <p> + When a large result set is sorted by the <codeph>ORDER BY</codeph> + clause, each node sorts its portion of the result set in memory.
+ </p> + </li> + <li> + <p> + The <codeph>DISTINCT</codeph> and <codeph>UNION</codeph> operators + build in-memory data structures to represent all values found so + far, to eliminate duplicates as the query progresses. + </p> + </li> + <!-- JIRA still in open state as of 5.8 / 2.6, commenting out. + <li> + <p rev="IMPALA-3471"> + In CDH 5.8 / Impala 2.6 and higher, <term>top-N</term> queries (those with + <codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph> clauses) can also spill. + Impala allocates enough memory to hold as many rows as specified by the <codeph>LIMIT</codeph> + clause, plus enough memory to hold as many rows as specified by any <codeph>OFFSET</codeph> clause. + </p> + </li> + --> + </ul> + + <p conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/> + + <p> + <b>How Impala handles scratch disk space for spilling:</b> + </p> + + <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_scratch_dir"/> + + <p> + <b>Memory usage for SQL operators:</b> + </p> + + <p> + The infrastructure of the spilling feature affects the way the affected SQL operators, such as + <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory. + On each host that participates in the query, each such operator in a query accumulates memory + while building the data structure to process the aggregation or join operation. The amount + of memory used depends on the portion of the data being handled by that host, and thus might + be different from one host to another. When the amount of memory being used for the operator + on a particular host reaches a threshold amount, Impala reserves an additional memory buffer + to use as a work area in case that operator causes the query to exceed the memory limit for + that host. 
After allocating the memory buffer, the memory used by that operator remains + essentially stable or grows only slowly, until the point where the memory limit is reached + and the query begins writing temporary data to disk. + </p> + + <p rev="2.2.0"> + Prior to Impala 2.2 (CDH 5.4), the extra memory buffer for an operator that might spill to disk + was allocated when the data structure used by the applicable SQL operator reached 16 MB in size, + and the memory buffer itself was 512 MB. In Impala 2.2, these values are halved: the threshold value + is 8 MB and the memory buffer is 256 MB. <ph rev="2.3.0">In Impala 2.3 / CDH 5.5 and higher, the memory for the buffer + is allocated in pieces, only as needed, to avoid sudden large jumps in memory usage.</ph> A query that uses + multiple such operators might allocate multiple such memory buffers, as the size of the data structure + for each operator crosses the threshold on a particular host. + </p> + + <p> + Therefore, a query that processes a relatively small amount of data on each host would likely + never reach the threshold for any operator, and would never allocate any extra memory buffers. A query + that did process millions of groups, distinct values, join keys, and so on might cross the threshold, + causing its memory requirement to rise suddenly and then flatten out. The larger the cluster, the less data is processed + on any particular host, thus reducing the chance of requiring the extra memory allocation. + </p> + + <p> + <b>Added in:</b> This feature was added to the <codeph>ORDER BY</codeph> clause in Impala 1.4 for CDH 4, + and in CDH 5.1. This feature was extended to cover join queries, aggregation functions, and analytic + functions in Impala 2.0 for CDH 4, and in CDH 5.2. The size of the memory work area required by + each operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2 (CDH 5.4).
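The threshold-and-buffer behavior described in this section can be modeled with a short sketch (illustrative Python; the function name and version tuple are our own conventions, not Impala code, and the model covers the pre-2.3 all-at-once reservation):

```python
MB = 1 << 20

def reserved_spill_buffer(operator_state_bytes, impala_version=(2, 2)):
    """Model of the pre-Impala-2.3 rule: once an operator's in-memory
    data structure crosses a threshold, a fixed work-area buffer is
    reserved (16 MB / 512 MB before 2.2; 8 MB / 256 MB from 2.2)."""
    if impala_version < (2, 2):
        threshold, work_area = 16 * MB, 512 * MB
    else:
        threshold, work_area = 8 * MB, 256 * MB
    return work_area if operator_state_bytes >= threshold else 0

# One host running a query with three spilling-capable operators: two
# have crossed the 8 MB threshold, so two 256 MB work areas are reserved.
states = [10 * MB, 9 * MB, 1 * MB]
total = sum(reserved_spill_buffer(s) for s in states)
```

This is why memory requirements can "rise suddenly and then flatten out": each operator contributes a step-function jump, not gradual growth.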
+ </p> + + <p> + <b>Avoiding queries that spill to disk:</b> + </p> + + <p> + Because the extra I/O can impose significant performance overhead on these types of queries, try to avoid + this situation by using the following steps: + </p> + + <ol> + <li> + Detect how often queries spill to disk, and how much temporary data is written. Refer to the following + sources: + <ul> + <li> + The output of the <codeph>PROFILE</codeph> command in the <cmdname>impala-shell</cmdname> + interpreter. This data shows the memory usage for each host and in total across the cluster. The + <codeph>BlockMgr.BytesWritten</codeph> counter reports how much data was written to disk during the + query. + </li> + + <li> + The <uicontrol>Impala Queries</uicontrol> dialog in Cloudera Manager. You can see the peak memory + usage for a query, combined across all nodes in the cluster. + </li> + + <li> + The <uicontrol>Queries</uicontrol> tab in the Impala debug web user interface. Select the query to + examine and click the corresponding <uicontrol>Profile</uicontrol> link. This data breaks down the + memory usage for a single host within the cluster, the host whose web interface you are connected to. + </li> + </ul> + </li> + + <li> + Use one or more techniques to reduce the possibility of the queries spilling to disk: + <ul> + <li> + Increase the Impala memory limit if practical, for example, if you can increase the available memory + by more than the amount of temporary data written to disk on a particular node. Remember that in + Impala 2.0 and later, you can issue <codeph>SET MEM_LIMIT</codeph> as a SQL statement, which lets you + fine-tune the memory usage for queries from JDBC and ODBC applications. + </li> + + <li> + Increase the number of nodes in the cluster, to increase the aggregate memory available to Impala and + reduce the amount of memory required on each node. + </li> + + <li> + Increase the overall memory capacity of each DataNode at the hardware level. 
+ </li> + + <li> + On a cluster with resources shared between Impala and other Hadoop components, use resource + management features to allocate more memory for Impala. See + <xref href="impala_resource_management.xml#resource_management"/> for details. + </li> + + <li> + If the memory pressure is due to running many concurrent queries rather than a few memory-intensive + ones, consider using the Impala admission control feature to lower the limit on the number of + concurrent queries. By spacing out the most resource-intensive queries, you can avoid spikes in + memory usage and improve overall response times. See + <xref href="impala_admission.xml#admission_control"/> for details. + </li> + + <li> + Tune the queries with the highest memory requirements, using one or more of the following techniques: + <ul> + <li> + Run the <codeph>COMPUTE STATS</codeph> statement for all tables involved in large-scale joins and + aggregation queries. + </li> + + <li> + Minimize your use of <codeph>STRING</codeph> columns in join columns. Prefer numeric values + instead. + </li> + + <li> + Examine the <codeph>EXPLAIN</codeph> plan to understand the execution strategy being used for the + most resource-intensive queries. See <xref href="impala_explain_plan.xml#perf_explain"/> for + details. + </li> + + <li> + If Impala still chooses a suboptimal execution strategy even with statistics available, or if it + is impractical to keep the statistics up to date for huge or rapidly changing tables, add hints + to the most resource-intensive queries to select the right execution strategy. See + <xref href="impala_hints.xml#hints"/> for details. + </li> + </ul> + </li> + + <li> + If your queries experience substantial performance overhead due to spilling, enable the + <codeph>DISABLE_UNSAFE_SPILLS</codeph> query option. This option prevents queries whose memory usage + is likely to be exorbitant from spilling to disk. 
See + <xref href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/> for details. As you tune + problematic queries using the preceding steps, fewer and fewer will be cancelled by this option + setting. + </li> + </ul> + </li> + </ol> + + <p> + <b>Testing performance implications of spilling to disk:</b> + </p> + + <p> + To artificially provoke spilling, to test this feature and understand the performance implications, use a + test environment with a memory limit of at least 2 GB. Issue the <codeph>SET</codeph> command with no + arguments to check the current setting for the <codeph>MEM_LIMIT</codeph> query option. Set the query + option <codeph>DISABLE_UNSAFE_SPILLS=true</codeph>. This option limits the spill-to-disk feature to prevent + runaway disk usage from queries that are known in advance to be suboptimal. Within + <cmdname>impala-shell</cmdname>, run a query that you expect to be memory-intensive, based on the criteria + explained earlier. A self-join of a large table is a good candidate: + </p> + +<codeblock>select count(*) from big_table a join big_table b using (column_with_many_values); +</codeblock> + + <p> + Issue the <codeph>PROFILE</codeph> command to get a detailed breakdown of the memory usage on each node + during the query. The crucial part of the profile output concerning memory is the <codeph>BlockMgr</codeph> + portion. For example, this profile shows that the query did not quite exceed the memory limit. 
+ </p> + +<codeblock>BlockMgr: + - BlockWritesIssued: 1 + - BlockWritesOutstanding: 0 + - BlocksCreated: 24 + - BlocksRecycled: 1 + - BufferedPins: 0 + - MaxBlockSize: 8.00 MB (8388608) + <b>- MemoryLimit: 200.00 MB (209715200)</b> + <b>- PeakMemoryUsage: 192.22 MB (201555968)</b> + - TotalBufferWaitTime: 0ns + - TotalEncryptionTime: 0ns + - TotalIntegrityCheckTime: 0ns + - TotalReadBlockTime: 0ns +</codeblock> + + <p> + In this case, because the memory limit was already below any recommended value, we increased the volume of + data for the query rather than reducing the memory limit any further. + </p> + + <p> + Set the <codeph>MEM_LIMIT</codeph> query option to a value that is smaller than the peak memory usage + reported in the profile output. Do not specify a memory limit lower than about 300 MB, because with such a + low limit, queries could fail to start for other reasons. Now try the memory-intensive query again. + </p> + + <p> + Check if the query fails with a message like the following: + </p> + +<codeblock>WARNINGS: Spilling has been disabled for plans that do not have stats and are not hinted +to prevent potentially bad plans from using too many cluster resources. Compute stats on +these tables, hint the plan or disable this behavior via query options to enable spilling. +</codeblock> + + <p> + If so, the query could have consumed substantial temporary disk space, slowing down so much that it would + not complete in any reasonable time. Rather than rely on the spill-to-disk feature in this case, issue the + <codeph>COMPUTE STATS</codeph> statement for the table or tables in your sample query. Then run the query + again, check the peak memory usage again in the <codeph>PROFILE</codeph> output, and adjust the memory + limit again if necessary to be lower than the peak memory usage. + </p> + + <p> + At this point, you have a query that is memory-intensive, but Impala can optimize it efficiently so that + the memory usage is not exorbitant.
You have set an artificial constraint through the + <codeph>MEM_LIMIT</codeph> option so that the query would normally fail with an out-of-memory error. But + the automatic spill-to-disk feature means that the query should actually succeed, at the expense of some + extra disk I/O to read and write temporary work data. + </p> + + <p> + Try the query again, and confirm that it succeeds. Examine the <codeph>PROFILE</codeph> output again. This + time, look for lines of this form: + </p> + +<codeblock>- SpilledPartitions: <varname>N</varname> +</codeblock> + + <p> + If you see any such lines with <varname>N</varname> greater than 0, that indicates the query would have + failed in Impala releases prior to 2.0, but now it succeeded because of the spill-to-disk feature. Examine + the total time taken by the <codeph>AGGREGATION_NODE</codeph> or other query fragments containing non-zero + <codeph>SpilledPartitions</codeph> values. Compare the times to similar fragments that did not spill, for + example in the <codeph>PROFILE</codeph> output when the same query is run with a higher memory limit. This + gives you an idea of the performance penalty of the spill operation for a particular query with a + particular memory limit. If you make the memory limit just a little lower than the peak memory usage, the + query only needs to write a small amount of temporary data to disk. The lower you set the memory limit, the + more temporary data is written and the slower the query becomes. + </p> + + <p> + Now repeat this procedure for actual queries used in your environment. Use the + <codeph>DISABLE_UNSAFE_SPILLS</codeph> setting to identify cases where queries used more memory than + necessary due to lack of statistics on the relevant tables and columns, and issue <codeph>COMPUTE + STATS</codeph> where necessary. + </p> + + <p> + <b>When to use DISABLE_UNSAFE_SPILLS:</b> + </p> + + <p> + You might wonder, why not leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned on all the time. 
Whether and + how frequently to use this option depends on your system environment and workload. + </p> + + <p> + <codeph>DISABLE_UNSAFE_SPILLS</codeph> is suitable for an environment with ad hoc queries whose performance + characteristics and memory usage are not known in advance. It prevents <q>worst-case scenario</q> queries + that use large amounts of memory unnecessarily. Thus, you might turn this option on within a session while + developing new SQL code, even though it is turned off for existing applications. + </p> + + <p> + Organizations where table and column statistics are generally up-to-date might leave this option turned on + all the time, again to avoid worst-case scenarios for untested queries or if a problem in the ETL pipeline + results in a table with no statistics. Turning on <codeph>DISABLE_UNSAFE_SPILLS</codeph> lets you <q>fail + fast</q> in this case and immediately gather statistics or tune the problematic queries. + </p> + + <p> + Some organizations might leave this option turned off. For example, you might have tables large enough that + the <codeph>COMPUTE STATS</codeph> takes substantial time to run, making it impractical to re-run after + loading new data. If you have examined the <codeph>EXPLAIN</codeph> plans of your queries and know that + they are operating efficiently, you might leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned off. In that + case, you know that any queries that spill will not go overboard with their memory consumption. + </p> + +<!-- + <p> + <b>Turning off the spill-to-disk feature: (CDH 5.6 / Impala 2.4 and lower only)</b> + </p> + + <p> + Prior to CDH 5.7 / Impala 2.5, certain conditions... + </p> + + <p> + You might turn off the spill-to-disk feature if you are in an environment with constraints on disk space, + or if you prefer for queries that exceed the memory capacity in your cluster to <q>fail fast</q> so that + you can tune and retry them. 
+ </p> + + <p> + To turn off this feature, set the following configuration options for each <cmdname>impalad</cmdname> + daemon, either through the <cmdname>impalad</cmdname> advanced configuration snippet in Cloudera Manager, + or during <cmdname>impalad</cmdname> startup on each DataNode on systems not managed by Cloudera Manager: + </p> + +<codeblock>--enable_partitioned_aggregation=false +--enable_partitioned_hash_join=false +</codeblock> +--> + </conbody> + </concept> + +<concept id="complex_query"> +<title>Limits on Query Size and Complexity</title> +<conbody> +<p> +There are hardcoded limits on the maximum size and complexity of queries. +Currently, the maximum number of expressions in a query is 2000. +You might exceed the limits with large or deeply nested queries +produced by business intelligence tools or other query generators. +</p> +<p> +If you have the ability to customize such queries or the query generation +logic that produces them, replace sequences of repetitive expressions +with single operators such as <codeph>IN</codeph> or <codeph>BETWEEN</codeph> +that can represent multiple values or ranges. +For example, instead of a large number of <codeph>OR</codeph> clauses: +</p> +<codeblock>WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ... +</codeblock> +<p> +use a single <codeph>IN</codeph> clause: +</p> +<codeblock>WHERE val IN (1,2,6,100,...)</codeblock> +</conbody> +</concept> + +<concept id="scalability_io"> +<title>Scalability Considerations for Impala I/O</title> +<conbody> +<p> +Impala parallelizes its I/O operations aggressively, +so the more disks you can attach to each host, the better. +Impala retrieves data from disk so quickly, using +bulk read operations on large blocks, that most queries +are CPU-bound rather than I/O-bound.
+</p> +<p> +Because the kind of sequential scanning typically done by +Impala queries does not benefit much from the random-access +capabilities of SSDs, spinning disks typically provide +the most cost-effective kind of storage for Impala data, +with little or no performance penalty as compared to SSDs. +</p> +<p> +Resource management features such as YARN, Llama, and admission control +typically constrain the amount of memory, CPU, or overall number of +queries in a high-concurrency environment. +Currently, there is no throttling mechanism for Impala I/O. +</p> +</conbody> +</concept> + +<concept id="big_tables"> +<title>Scalability Considerations for Table Layout</title> +<conbody> +<p> +Due to the overhead of retrieving and updating table metadata +in the metastore database, try to limit the number of columns +in a table to a maximum of approximately 2000. +Although Impala can handle wider tables than this, the metastore overhead +can become significant, leading to query performance that is slower +than expected based on the actual data volume. +</p> +<p> +To minimize overhead related to the metastore database and Impala query planning, +try to limit the number of partitions for any partitioned table to a few tens of thousands. +</p> +</conbody> +</concept> + +<concept rev="CDH-38321" id="kerberos_overhead_cluster_size"> +<title>Kerberos-Related Network Overhead for Large Clusters</title> +<conbody> +<p> +When Impala starts up, or after each <codeph>kinit</codeph> refresh, Impala sends a number of +simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might be able to process +all the requests within roughly 5 seconds. For a cluster with 1000 hosts, the time to process +the requests would be roughly 500 seconds. Impala also makes a number of DNS requests at the same +time as these Kerberos-related requests. +</p> +<p> +While these authentication requests are being processed, any submitted Impala queries will fail. 
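Taken together, the quoted figures (about 5 seconds at 100 hosts, about 500 seconds at 1000 hosts) suggest that KDC processing time grows roughly with the square of the cluster size. A toy model of that scaling assumption (the function and its parameters are illustrative only, not a published formula):

```python
def kdc_startup_seconds(num_hosts, base_hosts=100, base_seconds=5.0):
    """Rough model only: extrapolates the quoted 100-host figure
    quadratically, which is what the 100-host vs 1000-host numbers
    in the text imply (10x the hosts -> roughly 100x the time)."""
    return base_seconds * (num_hosts / base_hosts) ** 2

# Matches the figures quoted in the text.
estimate_1000 = kdc_startup_seconds(1000)   # roughly 500 seconds
```

The practical takeaway is that the burst of authentication traffic is a much bigger concern on large clusters than a linear extrapolation would suggest.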
+During this period, the KDC and DNS may be slow to respond to requests from components other than Impala, +so other secure services might be affected temporarily. +</p> +<p> +To reduce the frequency of the <codeph>kinit</codeph> renewal that initiates a new set of +authentication requests, increase the <codeph>kerberos_reinit_interval</codeph> configuration setting +for the <cmdname>impalad</cmdname> daemons. Currently, the default for a cluster not managed by +Cloudera Manager is 60 minutes, while the default under Cloudera Manager is 10 minutes. +Consider using a higher value such as 360 minutes (6 hours). +</p> +</conbody> +</concept> + <concept id="scalability_hotspots" rev="2.5.0 IMPALA-2696"> + <title>Avoiding CPU Hotspots for HDFS Cached Data</title> + <conbody> + <p> + You can use the HDFS caching feature, described in <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>, + with Impala to reduce I/O and memory-to-memory copying for frequently accessed tables or partitions. + </p> + <p> + In the early days of this feature, you might have found that enabling HDFS caching + resulted in little or no performance improvement, because it could result in + <q>hotspots</q>: instead of the I/O to read the table data being parallelized across + the cluster, the I/O was reduced but the CPU load to process the data blocks + might be concentrated on a single host. + </p> + <p> + To avoid hotspots, include the <codeph>WITH REPLICATION</codeph> clause with the + <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements for tables that use HDFS caching. + This clause allows more than one host to cache the relevant data blocks, so the CPU load + can be shared, reducing the load on any one host. + See <xref href="impala_create_table.xml#create_table"/> and <xref href="impala_alter_table.xml#alter_table"/> + for details.
+ </p> + <p> + Hotspots with high CPU load for HDFS cached data could still arise in some cases, due to + the way that Impala schedules the work of processing data blocks on different hosts. + In CDH 5.7 / Impala 2.5 and higher, scheduling improvements mean that the work for + HDFS cached data is divided better among all the hosts that have cached replicas + for a particular data block. When more than one host has a cached replica for a data block, + Impala assigns the work of processing that block to whichever host has done the least work + (in terms of number of bytes read) for the current query. If hotspots persist even with this + load-based scheduling algorithm, you can enable the query option <codeph>SCHEDULE_RANDOM_REPLICA=TRUE</codeph> + to further distribute the CPU load. This setting causes Impala to randomly pick a host to process a cached + data block if the scheduling algorithm encounters a tie when deciding which host has done the + least work. + </p> + </conbody> + </concept> </concept>
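The load-based scheduling and tie-breaking behavior described above can be sketched as follows. This is an illustrative model only, not Impala's actual scheduler; the host names and byte counts are hypothetical.

```python
import random

def pick_host(cached_hosts, bytes_read, schedule_random_replica=False):
    """Choose which host scans a cached block: the replica host that has done
    the least work (fewest bytes read) so far in the current query.
    schedule_random_replica=True models SCHEDULE_RANDOM_REPLICA=TRUE:
    ties are broken randomly instead of always the same way."""
    least = min(bytes_read[h] for h in cached_hosts)
    tied = [h for h in cached_hosts if bytes_read[h] == least]
    if schedule_random_replica:
        return random.choice(tied)
    return tied[0]  # a fixed tie-break can concentrate load on one host

# Hypothetical query state: host_a has already read more bytes than host_b,
# so the next cached block is assigned to host_b.
state = {"host_a": 4096, "host_b": 1024}
next_host = pick_host(["host_a", "host_b"], state)
```

With `schedule_random_replica=False` and all hosts tied at zero, the same host is always picked first, which is the kind of hotspot the `SCHEDULE_RANDOM_REPLICA` query option is meant to avoid.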
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_schema_objects.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_schema_objects.xml b/docs/topics/impala_schema_objects.xml index d8abe12..a9ddfcc 100644 --- a/docs/topics/impala_schema_objects.xml +++ b/docs/topics/impala_schema_objects.xml @@ -3,7 +3,7 @@ <concept id="schema_objects"> <title>Impala Schema Objects and Object Names</title> - <titlealts><navtitle>Schema Objects and Object Names</navtitle></titlealts> + <titlealts audience="PDF"><navtitle>Schema Objects and Object Names</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_security.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_security.xml b/docs/topics/impala_security.xml index 5c98fb1..bf4606d 100644 --- a/docs/topics/impala_security.xml +++ b/docs/topics/impala_security.xml @@ -3,7 +3,19 @@ <concept id="security"> <title><ph audience="standalone">Impala Security</ph><ph audience="integrated">Overview of Impala Security</ph></title> - + <prolog> + <metadata> + <data name="Category" value="Security"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Concepts"/> + <data name="Category" value="Auditing"/> + <data name="Category" value="Governance"/> + <data name="Category" value="Authentication"/> + <data name="Category" value="Authorization"/> + <data name="Category" value="Administrators"/> + </metadata> + </prolog> + <conbody> <p> @@ -16,6 +28,96 @@ reports. The auditing feature was added in Impala 1.1.1. </p> - + <p> + The Impala security features have several objectives. At the most basic level, security prevents + accidents or mistakes that could disrupt application processing, delete or corrupt data, or reveal data to + unauthorized users. 
More advanced security features and practices can harden the system against malicious + users trying to gain unauthorized access or perform other disallowed operations. The auditing feature + provides a way to confirm that no unauthorized access occurred, and detect whether any such attempts were + made. This is a critical set of features for production deployments in large organizations that handle + important or sensitive data. It sets the stage for multi-tenancy, where multiple applications run + concurrently and are prevented from interfering with each other. + </p> + + <p> + The material in this section presumes that you are already familiar with administering secure Linux systems. + That is, you should know the general security practices for Linux and Hadoop, and their associated commands + and configuration files. For example, you should know how to create Linux users and groups, manage Linux + group membership, set Linux and HDFS file permissions and ownership, and designate the default permissions + and ownership for new files. You should be familiar with the configuration of the nodes in your Hadoop + cluster, and know how to apply configuration changes or run a set of commands across all the nodes. + </p> + + <p> + The security features are divided into these broad categories: + </p> + + <dl> + <dlentry> + + <dt> + authorization + </dt> + + <dd> + Which users are allowed to access which resources, and what operations are they allowed to perform? + Impala relies on the open source Sentry project for authorization. By default (when authorization is not + enabled), Impala does all read and write operations with the privileges of the <codeph>impala</codeph> + user, which is suitable for a development/test environment but not for a secure production environment. + When authorization is enabled, Impala uses the OS user ID of the user who runs + <cmdname>impala-shell</cmdname> or other client program, and associates various privileges with each + user. 
See <xref href="impala_authorization.xml#authorization"/> for details about setting up and managing + authorization. + </dd> + + </dlentry> + + <dlentry> + + <dt> + authentication + </dt> + + <dd> + How does Impala verify the identity of the user to confirm that they really are allowed to exercise the + privileges assigned to that user? Impala relies on the Kerberos subsystem for authentication. See + <xref href="impala_kerberos.xml#kerberos"/> for details about setting up and managing authentication. + </dd> + + </dlentry> + + <dlentry> + + <dt> + auditing + </dt> + + <dd> + What operations were attempted, and did they succeed or not? This feature provides a way to look back and + diagnose whether attempts were made to perform unauthorized operations. You use this information to track + down suspicious activity, and to see where changes are needed in authorization policies. The audit data + produced by this feature is collected by the Cloudera Manager product and then presented in a + user-friendly form. See <xref href="impala_auditing.xml#auditing"/> for + details about setting up and managing auditing. 
+ </dd> + + </dlentry> + </dl> + + <p outputclass="toc"/> + + <p audience="integrated"> + These other topics in the <cite>Security Guide</cite> cover how Impala integrates with security frameworks + such as Kerberos, LDAP, and Sentry: + <ul> + <li> + <xref href="impala_authentication.xml#authentication"/> + </li> + + <li> + <xref href="impala_authorization.xml#authorization"/> + </li> + </ul> + </p> </conbody> </concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_select.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_select.xml b/docs/topics/impala_select.xml index db63f71..24ed896 100644 --- a/docs/topics/impala_select.xml +++ b/docs/topics/impala_select.xml @@ -3,7 +3,7 @@ <concept id="select"> <title>SELECT Statement</title> - <titlealts><navtitle>SELECT</navtitle></titlealts> + <titlealts audience="PDF"><navtitle>SELECT</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> @@ -155,6 +155,9 @@ LIMIT <i>expression</i> [OFFSET <i>expression</i>] <p conref="../shared/impala_common.xml#common/security_blurb"/> <p conref="../shared/impala_common.xml#common/redaction_yes"/> + <p conref="../shared/impala_common.xml#common/s3_blurb"/> + <p conref="../shared/impala_common.xml#common/s3_block_splitting"/> + <p conref="../shared/impala_common.xml#common/cancel_blurb_yes"/> <p conref="../shared/impala_common.xml#common/permissions_blurb"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_set.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_set.xml b/docs/topics/impala_set.xml index afa6777..2594af9 100644 --- a/docs/topics/impala_set.xml +++ b/docs/topics/impala_set.xml @@ -3,24 +3,34 @@ <concept rev="2.0.0" id="set"> <title>SET Statement</title> - <titlealts><navtitle>SET</navtitle></titlealts> + <titlealts 
audience="PDF"><navtitle>SET</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> <data name="Category" value="SQL"/> <data name="Category" value="Querying"/> <data name="Category" value="Configuring"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog> <conbody> - <p> + <p rev="2.0.0"> <indexterm audience="Cloudera">SET statement</indexterm> Specifies values for query options that control the runtime behavior of other statements within the same session. </p> + <p rev="2.5.0 IMPALA-2180"> + In CDH 5.7 / Impala 2.5 and higher, <codeph>SET</codeph> also defines user-specified substitution variables for + the <cmdname>impala-shell</cmdname> interpreter. This feature uses the <codeph>SET</codeph> command + built into <cmdname>impala-shell</cmdname> instead of the SQL <codeph>SET</codeph> statement. + Therefore the substitution mechanism only works with queries processed by <cmdname>impala-shell</cmdname>, + not with queries submitted through JDBC or ODBC. + </p> + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> <codeblock>SET [<varname>query_option</varname>=<varname>option_value</varname>] @@ -43,6 +53,98 @@ <xref href="impala_query_options.xml#query_options"/> for the details of each query option. </p> + <p> + <b>User-specified substitution variables:</b> + </p> + + <p rev="2.5.0 IMPALA-2180"> + In CDH 5.7 / Impala 2.5 and higher, you can specify your own names and string substitution values + within the <cmdname>impala-shell</cmdname> interpreter. Once a substitution variable is set up, + its value is inserted into any SQL statement in that same <cmdname>impala-shell</cmdname> session + that contains the notation <codeph>${var:<varname>varname</varname>}</codeph>. 
+ Using <codeph>SET</codeph> in an interactive <cmdname>impala-shell</cmdname> session overrides + any value for that same variable passed in through the <codeph>--var=<varname>varname</varname>=<varname>value</varname></codeph> + command-line option. + </p> + + <p rev="2.5.0 IMPALA-2180"> + For example, to set up some default parameters for report queries, but then override those defaults + within an <cmdname>impala-shell</cmdname> session, you might issue commands and statements such as + the following: + </p> + +<codeblock rev="2.5.0 IMPALA-2180"> +-- Initial setup for this example. +create table staging_table (s string); +insert into staging_table values ('foo'), ('bar'), ('bletch'); + +create table production_table (s string); +insert into production_table values ('North America'), ('EMEA'), ('Asia'); +quit; + +-- Start impala-shell with user-specified substitution variables, +-- run a query, then override the variables with SET and run the query again. +$ impala-shell --var=table_name=staging_table --var=cutoff=2 +... <varname>banner message</varname> ... 
+[localhost:21000] > select s from ${var:table_name} order by s limit ${var:cutoff}; +Query: select s from staging_table order by s limit 2 ++--------+ +| s | ++--------+ +| bar | +| bletch | ++--------+ +Fetched 2 row(s) in 1.06s + +[localhost:21000] > set var:table_name=production_table; +Variable TABLE_NAME set to production_table +[localhost:21000] > set var:cutoff=3; +Variable CUTOFF set to 3 + +[localhost:21000] > select s from ${var:table_name} order by s limit ${var:cutoff}; +Query: select s from production_table order by s limit 3 ++---------------+ +| s | ++---------------+ +| Asia | +| EMEA | +| North America | ++---------------+ +</codeblock> + + <p rev="2.5.0 IMPALA-2180"> + The following example shows how <codeph>SET</codeph> with no parameters displays + all user-specified substitution variables, and how <codeph>UNSET</codeph> removes + the substitution variable entirely: + </p> + +<codeblock rev="2.5.0 IMPALA-2180"> +[localhost:21000] > set; +Query options (defaults shown in []): + ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0] + ... + V_CPU_CORES: [0] + +Shell Options + LIVE_PROGRESS: False + LIVE_SUMMARY: False + +Variables: + CUTOFF: 3 + TABLE_NAME: staging_table + +[localhost:21000] > unset var:cutoff; +Unsetting variable CUTOFF +[localhost:21000] > select s from ${var:table_name} order by s limit ${var:cutoff}; +Error: Unknown variable CUTOFF +</codeblock> + + <p rev="2.5.0 IMPALA-2180"> + See <xref href="impala_shell_running_commands.xml"/> for more examples of using the + <codeph>--var</codeph>, <codeph>SET</codeph>, and <codeph>${var:<varname>varname</varname>}</codeph> + substitution technique in <cmdname>impala-shell</cmdname>. + </p> + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> <p> @@ -74,7 +176,7 @@ insert overwrite parquet_table select c1, c2, count(c3) from text_table group by a SQL statement lets you use this feature in client applications through the JDBC and ODBC APIs. 
</p> -<!-- <p conref="/Content/impala_common_xi44078.xml#common/jdbc_blurb"/> --> +<!-- <p conref="../shared/impala_common.xml#common/jdbc_blurb"/> --> <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_shell_options.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_shell_options.xml b/docs/topics/impala_shell_options.xml index 9f0bf60..26f7364 100644 --- a/docs/topics/impala_shell_options.xml +++ b/docs/topics/impala_shell_options.xml @@ -4,7 +4,15 @@ <title>impala-shell Configuration Options</title> <titlealts audience="PDF"><navtitle>Configuration Options</navtitle></titlealts> - + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Configuring"/> + <data name="Category" value="impala-shell"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> <conbody> @@ -14,7 +22,559 @@ line, or through the <filepath>$HOME/.impalarc</filepath> configuration file. </p> - + <note> + <p> + These options are different than the configuration options for the <codeph>impalad</codeph> daemon itself. + For the <codeph>impalad</codeph> options, see <xref href="impala_config_options.xml#config_options"/>. + </p> + </note> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="shell_option_summary"> + + <title>Summary of impala-shell Configuration Options</title> + + <conbody> + + <p> + The following table shows the names and allowed arguments for the <cmdname>impala-shell</cmdname> + configuration options. You can specify options on the command line, or in a configuration file as described + in <xref href="impala_shell_options.xml#shell_config_file"/>. 
+ </p> + + <table> + <tgroup cols="3"> + <colspec colname="1" colwidth="10*"/> + <colspec colname="2" colwidth="10*"/> + <colspec colname="3" colwidth="20*"/> + <thead> + <row> + <entry> + Command-Line Option + </entry> + <entry rev="2.0.0"> + Configuration File Setting + </entry> + <entry> + Explanation + </entry> + </row> + </thead> + <tbody> + <row> + <entry> + <p> + -B or --delimited + </p> + </entry> + <entry rev="2.0.0"> + <p> + write_delimited=true + </p> + </entry> + <entry> + <p> + Causes all query results to be printed in plain format as a delimited text file. Useful for + producing data files to be used with other Hadoop components. Also useful for avoiding the + performance overhead of pretty-printing all output, especially when running benchmark tests using + queries returning large result sets. Specify the delimiter character with the + <codeph>--output_delimiter</codeph> option. To store all query results in a file rather than + printing them to the screen, combine this option with the <codeph>-o</codeph> option. Added in Impala 1.0.1. + </p> + </entry> + </row> + <row> + <entry> + <p> + --print_header + </p> + </entry> + <entry rev="2.0.0"> + <p> + print_header=true + </p> + </entry> + <entry> + <p> + Prints the column names as the first row of query output. Primarily useful with the + <codeph>-B</codeph> option, whose plain-format output does not otherwise include a header row. + </p> + </entry> + </row> + <row> + <entry> + <p> + -o <varname>filename</varname> or --output_file <varname>filename</varname> + </p> + </entry> + <entry rev="2.0.0"> + <p> + output_file=<varname>filename</varname> + </p> + </entry> + <entry> + <p> + Stores all query results in the specified file. Typically used to store the results of a single + query issued from the command line with the <codeph>-q</codeph> option. Also works for + interactive sessions; you see messages such as the number of rows fetched, but not the actual + result set. To suppress these incidental messages when combining the <codeph>-q</codeph> and + <codeph>-o</codeph> options, redirect <codeph>stderr</codeph> to <codeph>/dev/null</codeph>. + Added in Impala 1.0.1. 
+ </p> + </entry> + </row> + <row> + <entry> + <p> + --output_delimiter=<varname>character</varname> + </p> + </entry> + <entry rev="2.0.0"> + <p> + output_delimiter=<varname>character</varname> + </p> + </entry> + <entry> + <p> + Specifies the character to use as a delimiter between fields when query results are printed in + plain format by the <codeph>-B</codeph> option. Defaults to tab (<codeph>'\t'</codeph>). If an + output value contains the delimiter character, that field is quoted, escaped by doubling quotation marks, or both. Added in + Impala 1.0.1. + </p> + </entry> + </row> + <row> + <entry> + <p> + -p or --show_profiles + </p> + </entry> + <entry rev="2.0.0"> + <p> + show_profiles=true + </p> + </entry> + <entry> + <p> + Displays the query execution plan (same output as the <codeph>EXPLAIN</codeph> statement) and a + more detailed low-level breakdown of execution steps, for every query executed by the shell. + </p> + </entry> + </row> + <row> + <entry> + <p> + -h or --help + </p> + </entry> + <entry rev="2.0.0"> + <p> + N/A + </p> + </entry> + <entry> + <p> + Displays help information. + </p> + </entry> + </row> + <row> + <entry> + <p> + -i <varname>hostname</varname> or + --impalad=<varname>hostname</varname>[:<varname>portnum</varname>] + </p> + </entry> + <entry rev="2.0.0"> + <p> + impalad=<varname>hostname</varname>[:<varname>portnum</varname>] + </p> + </entry> + <entry> + <p> + Connects to the <codeph>impalad</codeph> daemon on the specified host. The default port of 21000 + is assumed unless you provide another value. You can connect to any host in your cluster that is + running <codeph>impalad</codeph>. If you connect to an instance of <codeph>impalad</codeph> that + was started with an alternate port specified by the <codeph>--fe_port</codeph> flag, provide that + alternative port. 
+ </p> + </entry> + </row> + <row> + <entry> + <p> + -q <varname>query</varname> or --query=<varname>query</varname> + </p> + </entry> + <entry rev="2.0.0"> + <p> + query=<varname>query</varname> + </p> + </entry> + <entry> + <p> + Passes a query or other <cmdname>impala-shell</cmdname> command from the command line. The + <cmdname>impala-shell</cmdname> interpreter immediately exits after processing the statement. It + is limited to a single statement, which could be a <codeph>SELECT</codeph>, <codeph>CREATE + TABLE</codeph>, <codeph>SHOW TABLES</codeph>, or any other statement recognized in + <codeph>impala-shell</codeph>. Because you cannot pass a <codeph>USE</codeph> statement and + another query, fully qualify the names for any tables outside the <codeph>default</codeph> + database. (Or use the <codeph>-f</codeph> option to pass a file with a <codeph>USE</codeph> + statement followed by other queries.) + </p> + </entry> + </row> + <row> + <entry> + <p> + -f <varname>query_file</varname> or --query_file=<varname>query_file</varname> + </p> + </entry> + <entry rev="2.0.0"> + <p> + query_file=<varname>path_to_query_file</varname> + </p> + </entry> + <entry> + <p> + Passes a SQL query from a file. Multiple statements must be semicolon (;) delimited. + <ph rev="2.3.0">In CDH 5.5 / Impala 2.3 and higher, you can specify a filename of <codeph>-</codeph> + to represent standard input. This feature makes it convenient to use <cmdname>impala-shell</cmdname> + as part of a Unix pipeline where SQL statements are generated dynamically by other tools.</ph> + </p> + </entry> + </row> + <row> + <entry> + <p> + -k or --kerberos + </p> + </entry> + <entry rev="2.0.0"> + <p> + use_kerberos=true + </p> + </entry> + <entry> + <p> + Kerberos authentication is used when the shell connects to <codeph>impalad</codeph>. If Kerberos + is not enabled on the instance of <codeph>impalad</codeph> to which you are connecting, errors + are displayed. 
+ </p> + </entry> + </row> + <row> + <entry> + <p> + -s <varname>kerberos_service_name</varname> or --kerberos_service_name=<varname>name</varname> + </p> + </entry> + <entry rev="2.0.0"> + <p> + kerberos_service_name=<varname>name</varname> + </p> + </entry> + <entry> + <p> + Instructs <codeph>impala-shell</codeph> to authenticate to a particular <codeph>impalad</codeph> + service principal. If a <varname>kerberos_service_name</varname> is not specified, + <codeph>impala</codeph> is used by default. If this option is used in conjunction with a + connection in which Kerberos is not supported, errors are returned. + </p> + </entry> + </row> + <row> + <entry> + <p> + -V or --verbose + </p> + </entry> + <entry rev="2.0.0"> + <p> + verbose=true + </p> + </entry> + <entry> + <p> + Enables verbose output. + </p> + </entry> + </row> + <row> +<!-- Confirm verbose=true/false really is the same as verbose vs. quiet. --> + <entry> + <p> + --quiet + </p> + </entry> + <entry rev="2.0.0"> + <p> + verbose=false + </p> + </entry> + <entry> + <p> + Disables verbose output. + </p> + </entry> + </row> + <row> + <entry> + <p> + -v or --version + </p> + </entry> + <entry rev="2.0.0"> + <p> + version=true + </p> + </entry> + <entry> + <p> + Displays version information. + </p> + </entry> + </row> + <row> + <entry> + <p> + -c + </p> + </entry> + <entry rev="2.0.0"> + <p> + ignore_query_failure=true + </p> + </entry> + <entry> + <p> + Continues on query failure. + </p> + </entry> + </row> + <row> + <entry> + <p> + -r or --refresh_after_connect + </p> + </entry> + <entry rev="2.0.0"> + <p> + refresh_after_connect=true + </p> + </entry> + <entry> + <p> + Updates Impala metadata upon connection. Same as running the + <codeph><xref href="impala_invalidate_metadata.xml#invalidate_metadata">INVALIDATE + METADATA</xref></codeph> statement after connecting. 
(This option was originally named when the + <codeph>REFRESH</codeph> statement did the extensive metadata updates now performed by + <codeph>INVALIDATE METADATA</codeph>.) + </p> + </entry> + </row> + <row> + <entry> + <p> + -d <varname>default_db</varname> or --database=<varname>default_db</varname> + </p> + </entry> + <entry rev="2.0.0"> + <p> + default_db=<varname>default_db</varname> + </p> + </entry> + <entry> + <p> + Specifies the database to be used on startup. Same as running the + <codeph><xref href="impala_use.xml#use">USE</xref></codeph> statement after connecting. If not + specified, a database named <codeph>DEFAULT</codeph> is used. + </p> + </entry> + </row> + <row> + <entry> + -ssl + </entry> + <entry rev="2.0.0"> + ssl=true + </entry> + <entry> + Enables TLS/SSL for <cmdname>impala-shell</cmdname>. + </entry> + </row> + <row> + <entry> + --ca_cert=<varname>path_to_certificate</varname> + </entry> + <entry rev="2.0.0"> + ca_cert=<varname>path_to_certificate</varname> + </entry> + <entry> + The local pathname pointing to the third-party CA certificate, or to a copy of the server + certificate for self-signed server certificates. If <codeph>--ca_cert</codeph> is not set, + <cmdname>impala-shell</cmdname> enables TLS/SSL, but does not validate the server certificate. This is + useful for connecting to a known-good Impala that is only running over TLS/SSL, when a copy of the + certificate is not available (such as when debugging customer installations). + </entry> + </row> + <row rev="1.2.2"> + <entry> + -l + </entry> + <entry rev="2.0.0"> + use_ldap=true + </entry> + <entry> + Enables LDAP authentication. + </entry> + </row> + <row rev="1.2.2"> + <entry> + -u + </entry> + <entry rev="2.0.0"> + user=<varname>user_name</varname> + </entry> + <entry> + Supplies the username, when LDAP authentication is enabled by the <codeph>-l</codeph> option. + (Specify the short username, not the full LDAP distinguished name.) 
The shell then prompts + interactively for the password. + </entry> + </row> + <row rev="2.5.0 IMPALA-1934"> + <entry> + --ldap_password_cmd=<varname>command</varname> + </entry> + <entry> + N/A + </entry> + <entry> + Specifies a command to run to retrieve the LDAP password, + when LDAP authentication is enabled by the <codeph>-l</codeph> option. + If the command includes space-separated arguments, enclose the command and + its arguments in quotation marks. + </entry> + </row> + <row rev="2.0.0"> + <entry> + --config_file=<varname>path_to_config_file</varname> + </entry> + <entry> + N/A + </entry> + <entry> + Specifies the path of the file containing <cmdname>impala-shell</cmdname> configuration settings. + The default is <filepath>$HOME/.impalarc</filepath>. This setting can only be specified on the + command line. + </entry> + </row> + <row rev="2.3.0"> + <entry>--live_progress</entry> + <entry>N/A</entry> + <entry>Prints a progress bar showing roughly the percentage complete for each query. + The information is updated interactively as the query progresses. + See <xref href="impala_live_progress.xml#live_progress"/>.</entry> + </row> + <row rev="2.3.0"> + <entry>--live_summary</entry> + <entry>N/A</entry> + <entry>Prints a detailed report, similar to the <codeph>SUMMARY</codeph> command, showing progress details for each phase of query execution. + The information is updated interactively as the query progresses. + See <xref href="impala_live_summary.xml#live_summary"/>.</entry> + </row> + <row rev="2.5.0 IMPALA-1079"> + <entry>--var=<varname>variable_name</varname>=<varname>value</varname></entry> + <entry>N/A</entry> + <entry> + Defines a substitution variable that can be used within the <cmdname>impala-shell</cmdname> session. + The variable can be substituted into statements processed by the <codeph>-q</codeph> or <codeph>-f</codeph> options, + or in an interactive shell session. 
+ Within a SQL statement, you substitute the value by using the notation <codeph>${var:<varname>variable_name</varname>}</codeph>. + This feature is available in CDH 5.7 / Impala 2.5 and higher. + </entry> + </row> + </tbody> + </tgroup> + </table> </conbody> </concept> + <concept id="shell_config_file"> + + <title>impala-shell Configuration File</title> + + <conbody> + + <p> + You can define a set of default options for your <cmdname>impala-shell</cmdname> environment, stored in the + file <filepath>$HOME/.impalarc</filepath>. This file consists of key-value pairs, one option per line. + Everything after a <codeph>#</codeph> character on a line is treated as a comment and ignored. + </p> + + <p> + The configuration file must contain a header label <codeph>[impala]</codeph>, followed by the options + specific to <cmdname>impala-shell</cmdname>. (This standard convention for configuration files lets you + use a single file to hold configuration options for multiple applications.) + </p> + + <p> + To specify a different filename or path for the configuration file, specify the argument + <codeph>--config_file=<varname>path_to_config_file</varname></codeph> on the + <cmdname>impala-shell</cmdname> command line. + </p> + + <p> + The names of the options in the configuration file are similar (although not necessarily identical) to the + long-form command-line arguments to the <cmdname>impala-shell</cmdname> command. For the names to use, see + <xref href="impala_shell_options.xml#shell_option_summary"/>. + </p> + + <p> + Any options you specify on the <cmdname>impala-shell</cmdname> command line override any corresponding + options within the configuration file. + </p> + + <p> + The following example shows a configuration file that you might use during benchmarking tests. It sets + verbose mode, so that the output from each SQL query is followed by timing information. 
+ <cmdname>impala-shell</cmdname> starts inside the database containing the tables with the benchmark data, + avoiding the need to issue a <codeph>USE</codeph> statement or use fully qualified table names. + </p> + + <p> + In this example, the query output is formatted as delimited text rather than enclosed in ASCII art boxes, + and is stored in a file rather than printed to the screen. Those options are appropriate for benchmark + situations, so that the overhead of <cmdname>impala-shell</cmdname> formatting and printing the result set + does not factor into the timing measurements. It also enables the <codeph>show_profiles</codeph> option. + That option prints detailed performance information after each query, which might be valuable in + understanding the performance of benchmark queries. + </p> + +<codeblock>[impala] +verbose=true +default_db=tpc_benchmarking +write_delimited=true +output_delimiter=, +output_file=/home/tester1/benchmark_results.csv +show_profiles=true +</codeblock> + + <p> + The following example shows a configuration file that connects to a specific remote Impala node, runs a + single query within a particular database, then exits. You would typically use this kind of single-purpose + configuration setting with the <cmdname>impala-shell</cmdname> command-line option + <codeph>--config_file=<varname>path_to_config_file</varname></codeph>, to easily select between many + predefined queries that could be run against different databases, hosts, or even different clusters. To run + a sequence of statements instead of a single query, specify the configuration option + <codeph>query_file=<varname>path_to_query_file</varname></codeph> instead. + </p> + +<codeblock>[impala] +impalad=impala-test-node1.example.com +default_db=site_stats +# Issue a predefined query and immediately exit. +query=select count(*) from web_traffic where event_date = trunc(now(),'dd') +</codeblock> + </conbody> + </concept> +</concept>
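Because the configuration file described above follows the common INI convention (an `[impala]` section header, key-value pairs, `#` comments), standard tools can read it. Below is a minimal sketch using Python's `configparser`; the settings are taken from the benchmarking example above, but note that this is only an illustration of the file format, not how <cmdname>impala-shell</cmdname> itself parses the file.

```python
import configparser

# An .impalarc-style fragment following the convention described above:
# an [impala] section header followed by one key=value pair per line.
impalarc = """
# Benchmark defaults (comment lines start with '#').
[impala]
verbose=true
default_db=tpc_benchmarking
write_delimited=true
output_delimiter=,
"""

cfg = configparser.ConfigParser()
cfg.read_string(impalarc)

# All impala-shell settings live under the [impala] section.
settings = dict(cfg["impala"])
```

Command-line arguments override any corresponding values found in this file, as noted above.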
