http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_explain.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_explain.html b/docs/build/html/topics/impala_explain.html new file mode 100644 index 0000000..473a94d --- /dev/null +++ b/docs/build/html/topics/impala_explain.html @@ -0,0 +1,291 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_langref_sql.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="explain"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>EXPLAIN Statement</title></head><body id="explain"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">EXPLAIN Statement</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Returns the execution plan for a statement, showing the low-level mechanisms that Impala will use to read the + data, divide the work among nodes in the cluster, and transmit intermediate and final results across the + network. Use <code class="ph codeph">explain</code> followed by a complete <code class="ph codeph">SELECT</code> query. For example: + </p> + + <p class="p"> + <strong class="ph b">Syntax:</strong> + </p> + +<pre class="pre codeblock"><code>EXPLAIN { <var class="keyword varname">select_query</var> | <var class="keyword varname">ctas_stmt</var> | <var class="keyword varname">insert_stmt</var> } +</code></pre> + + <p class="p"> + The <var class="keyword varname">select_query</var> is a <code class="ph codeph">SELECT</code> statement, optionally prefixed by a + <code class="ph codeph">WITH</code> clause. See <a class="xref" href="impala_select.html#select">SELECT Statement</a> for details. + </p> + + <p class="p"> + The <var class="keyword varname">insert_stmt</var> is an <code class="ph codeph">INSERT</code> statement that inserts into or overwrites an + existing table. It can use either the <code class="ph codeph">INSERT ... SELECT</code> or <code class="ph codeph">INSERT ... + VALUES</code> syntax. See <a class="xref" href="impala_insert.html#insert">INSERT Statement</a> for details. + </p> + + <p class="p"> + The <var class="keyword varname">ctas_stmt</var> is a <code class="ph codeph">CREATE TABLE</code> statement using the <code class="ph codeph">AS + SELECT</code> clause, typically abbreviated as a <span class="q">"CTAS"</span> operation. See + <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a> for details. + </p> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + + <p class="p"> + You can interpret the output to judge whether the query is performing efficiently, and adjust the query + and/or the schema if not. For example, you might change the tests in the <code class="ph codeph">WHERE</code> clause, add + hints to make join operations more efficient, introduce subqueries, change the order of tables in a join, add + or change partitioning for a table, collect column statistics and/or table statistics in Hive, or any other + performance tuning steps. + </p> + + <p class="p"> + The <code class="ph codeph">EXPLAIN</code> output reminds you if table or column statistics are missing from any table + involved in the query. These statistics are important for optimizing queries involving large tables or + multi-table joins. See <a class="xref" href="impala_compute_stats.html#compute_stats">COMPUTE STATS Statement</a> for how to gather statistics, + and <a class="xref" href="impala_perf_stats.html#perf_stats">Table and Column Statistics</a> for how to use this information for query tuning. + </p> + + <div class="p"> + Read the <code class="ph codeph">EXPLAIN</code> plan from bottom to top: + <ul class="ul"> + <li class="li"> + The last part of the plan shows the low-level details such as the expected amount of data that will be + read, where you can judge the effectiveness of your partitioning strategy and estimate how long it will + take to scan a table based on total data size and the size of the cluster. + </li> + + <li class="li"> + As you work your way up, next you see the operations that will be parallelized and performed on each + Impala node. + </li> + + <li class="li"> + At the higher levels, you see how data flows when intermediate result sets are combined and transmitted + from one node to another. + </li> + + <li class="li"> + See <a class="xref" href="../shared/../topics/impala_explain_level.html#explain_level">EXPLAIN_LEVEL Query Option</a> for details about the + <code class="ph codeph">EXPLAIN_LEVEL</code> query option, which lets you customize how much detail to show in the + <code class="ph codeph">EXPLAIN</code> plan depending on whether you are doing high-level or low-level tuning, + dealing with logical or physical aspects of the query. + </li> + </ul> + </div> + + <p class="p"> + If you come from a traditional database background and are not familiar with data warehousing, keep in mind + that Impala is optimized for full table scans across very large tables. The structure and distribution of + this data is typically not suitable for the kind of indexing and single-row lookups that are common in OLTP + environments. Seeing a query scan entirely through a large table is common, not necessarily an indication of + an inefficient query. Of course, if you can reduce the volume of scanned data by orders of magnitude, for + example by using a query that affects only certain partitions within a partitioned table, then you might be + able to optimize a query so that it executes in seconds rather than minutes. + </p> + + <p class="p"> + For more information and examples to help you interpret <code class="ph codeph">EXPLAIN</code> output, see + <a class="xref" href="impala_explain_plan.html#perf_explain">Using the EXPLAIN Plan for Performance Tuning</a>. + </p> + + <p class="p"> + <strong class="ph b">Extended EXPLAIN output:</strong> + </p> + + <p class="p"> + For performance tuning of complex queries, and capacity planning (such as using the admission control and + resource management features), you can enable more detailed and informative output for the + <code class="ph codeph">EXPLAIN</code> statement. In the <span class="keyword cmdname">impala-shell</span> interpreter, issue the command + <code class="ph codeph">SET EXPLAIN_LEVEL=<var class="keyword varname">level</var></code>, where <var class="keyword varname">level</var> is an integer + from 0 to 3 or corresponding mnemonic values <code class="ph codeph">minimal</code>, <code class="ph codeph">standard</code>, + <code class="ph codeph">extended</code>, or <code class="ph codeph">verbose</code>. + </p> + + <p class="p"> + When extended <code class="ph codeph">EXPLAIN</code> output is enabled, <code class="ph codeph">EXPLAIN</code> statements print + information about estimated memory requirements, minimum number of virtual cores, and so on. + + </p> + + <p class="p"> + See <a class="xref" href="impala_explain_level.html#explain_level">EXPLAIN_LEVEL Query Option</a> for details and examples. + </p> + + <p class="p"> + <strong class="ph b">Examples:</strong> + </p> + + <p class="p"> + This example shows how the standard <code class="ph codeph">EXPLAIN</code> output moves from the lowest (physical) level to + the higher (logical) levels. The query begins by scanning a certain amount of data; each node performs an + aggregation operation (evaluating <code class="ph codeph">COUNT(*)</code>) on some subset of data that is local to that + node; the intermediate results are transmitted back to the coordinator node (labelled here as the + <code class="ph codeph">EXCHANGE</code> node); lastly, the intermediate results are summed to display the final result. + </p> + +<pre class="pre codeblock" id="explain__explain_plan_simple"><code>[impalad-host:21000] > explain select count(*) from customer_address; ++----------------------------------------------------------+ +| Explain String | ++----------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=42.00MB VCores=1 | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | | +| 01:AGGREGATE | +| | output: count(*) | +| | | +| 00:SCAN HDFS [default.customer_address] | +| partitions=1/1 size=5.25MB | ++----------------------------------------------------------+ +</code></pre> + + <p class="p"> + These examples show how the extended <code class="ph codeph">EXPLAIN</code> output becomes more accurate and informative as + statistics are gathered by the <code class="ph codeph">COMPUTE STATS</code> statement. Initially, much of the information + about data size and distribution is marked <span class="q">"unavailable"</span>. Impala can determine the raw data size, but + not the number of rows or number of distinct values for each column without additional analysis. The + <code class="ph codeph">COMPUTE STATS</code> statement performs this analysis, so a subsequent <code class="ph codeph">EXPLAIN</code> + statement has additional information to use in deciding how to optimize the distributed query. + </p> + + + +<pre class="pre codeblock"><code>[localhost:21000] > set explain_level=extended; +EXPLAIN_LEVEL set to extended +[localhost:21000] > explain select x from t1; +[localhost:21000] > explain select x from t1; ++----------------------------------------------------------+ +| Explain String | ++----------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=32.00MB VCores=1 | +| | +| 01:EXCHANGE [PARTITION=UNPARTITIONED] | +| | hosts=1 per-host-mem=unavailable | +<strong class="ph b">| | tuple-ids=0 row-size=4B cardinality=unavailable |</strong> +| | | +| 00:SCAN HDFS [default.t2, PARTITION=RANDOM] | +| partitions=1/1 size=36B | +<strong class="ph b">| table stats: unavailable |</strong> +<strong class="ph b">| column stats: unavailable |</strong> +| hosts=1 per-host-mem=32.00MB | +<strong class="ph b">| tuple-ids=0 row-size=4B cardinality=unavailable |</strong> ++----------------------------------------------------------+ +</code></pre> + +<pre class="pre codeblock"><code>[localhost:21000] > compute stats t1; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 1 column(s). | ++-----------------------------------------+ +[localhost:21000] > explain select x from t1; ++----------------------------------------------------------+ +| Explain String | ++----------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=64.00MB VCores=1 | +| | +| 01:EXCHANGE [PARTITION=UNPARTITIONED] | +| | hosts=1 per-host-mem=unavailable | +| | tuple-ids=0 row-size=4B cardinality=0 | +| | | +| 00:SCAN HDFS [default.t1, PARTITION=RANDOM] | +| partitions=1/1 size=36B | +<strong class="ph b">| table stats: 0 rows total |</strong> +<strong class="ph b">| column stats: all |</strong> +| hosts=1 per-host-mem=64.00MB | +<strong class="ph b">| tuple-ids=0 row-size=4B cardinality=0 |</strong> ++----------------------------------------------------------+ +</code></pre> + + <p class="p"> + <strong class="ph b">Security considerations:</strong> + </p> + <p class="p"> + If these statements in your environment contain sensitive literal values such as credit card numbers or tax + identifiers, Impala can redact this sensitive information when displaying the statements in log files and + other administrative contexts. See <span class="xref">the documentation for your Apache Hadoop distribution</span> for details. + </p> + + <p class="p"> + <strong class="ph b">Cancellation:</strong> Cannot be cancelled. + </p> + + <p class="p"> + <strong class="ph b">HDFS permissions:</strong> + </p> + <p class="p"> + + The user ID that the <span class="keyword cmdname">impalad</span> daemon runs under, + typically the <code class="ph codeph">impala</code> user, must have read + and execute permissions for all applicable directories in all source tables + for the query that is being explained. + (A <code class="ph codeph">SELECT</code> operation could read files from multiple different HDFS directories + if the source table is partitioned.) + </p> + + <p class="p"> + <strong class="ph b">Kudu considerations:</strong> + </p> + <p class="p"> + The <code class="ph codeph">EXPLAIN</code> statement displays equivalent plan + information for queries against Kudu tables as for queries + against HDFS-based tables. + </p> + + <div class="p"> + To see which predicates Impala can <span class="q">"push down"</span> to Kudu for + efficient evaluation, without transmitting unnecessary rows back + to Impala, look for the <code class="ph codeph">kudu predicates</code> item in + the scan phase of the query. The label <code class="ph codeph">kudu predicates</code> + indicates a condition that can be evaluated efficiently on the Kudu + side. The label <code class="ph codeph">predicates</code> in a <code class="ph codeph">SCAN KUDU</code> + node indicates a condition that is evaluated by Impala. + For example, in a table with primary key column <code class="ph codeph">X</code> + and non-primary key column <code class="ph codeph">Y</code>, you can see that + some operators in the <code class="ph codeph">WHERE</code> clause are evaluated + immediately by Kudu and others are evaluated later by Impala: +<pre class="pre codeblock"><code> +EXPLAIN SELECT x,y from kudu_table WHERE + x = 1 AND x NOT IN (2,3) AND y = 1 + AND x IS NOT NULL AND x > 0; ++---------------- +| Explain String ++---------------- +... +| 00:SCAN KUDU [jrussell.hash_only] +| predicates: x IS NOT NULL, x NOT IN (2, 3) +| kudu predicates: x = 1, x > 0, y = 1 +</code></pre> + Only binary predicates and <code class="ph codeph">IN</code> predicates containing + literal values that exactly match the types in the Kudu table, and do not + require any casting, can be pushed to Kudu. + </div> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + <p class="p"> + <a class="xref" href="impala_select.html#select">SELECT Statement</a>, + <a class="xref" href="impala_insert.html#insert">INSERT Statement</a>, + <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a>, + <a class="xref" href="impala_explain_plan.html#explain_plan">Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles</a> + </p> + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_langref_sql.html">Impala SQL Statements</a></div></div></nav></article></main></body></html> \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_explain_level.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_explain_level.html b/docs/build/html/topics/impala_explain_level.html new file mode 100644 index 0000000..c9f527b --- /dev/null +++ b/docs/build/html/topics/impala_explain_level.html @@ -0,0 +1,342 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="explain_level"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>EXPLAIN_LEVEL Query Option</title></head><body id="explain_level"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">EXPLAIN_LEVEL Query Option</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Controls the amount of detail provided in the output of the <code class="ph codeph">EXPLAIN</code> statement. The basic + output can help you identify high-level performance issues such as scanning a higher volume of data or more + partitions than you expect. The higher levels of detail show how intermediate results flow between nodes and + how different SQL operations such as <code class="ph codeph">ORDER BY</code>, <code class="ph codeph">GROUP BY</code>, joins, and + <code class="ph codeph">WHERE</code> clauses are implemented within a distributed query. + </p> + + <p class="p"> + <strong class="ph b">Type:</strong> <code class="ph codeph">STRING</code> or <code class="ph codeph">INT</code> + </p> + + <p class="p"> + <strong class="ph b">Default:</strong> <code class="ph codeph">1</code> + </p> + + <p class="p"> + <strong class="ph b">Arguments:</strong> + </p> + + <p class="p"> + The allowed range of numeric values for this option is 0 to 3: + </p> + + <ul class="ul"> + <li class="li"> + <code class="ph codeph">0</code> or <code class="ph codeph">MINIMAL</code>: A barebones list, one line per operation. Primarily useful + for checking the join order in very long queries where the regular <code class="ph codeph">EXPLAIN</code> output is too + long to read easily. + </li> + + <li class="li"> + <code class="ph codeph">1</code> or <code class="ph codeph">STANDARD</code>: The default level of detail, showing the logical way that + work is split up for the distributed query. + </li> + + <li class="li"> + <code class="ph codeph">2</code> or <code class="ph codeph">EXTENDED</code>: Includes additional detail about how the query planner + uses statistics in its decision-making process, to understand how a query could be tuned by gathering + statistics, using query hints, adding or removing predicates, and so on. + </li> + + <li class="li"> + <code class="ph codeph">3</code> or <code class="ph codeph">VERBOSE</code>: The maximum level of detail, showing how work is split up + within each node into <span class="q">"query fragments"</span> that are connected in a pipeline. This extra detail is + primarily useful for low-level performance testing and tuning within Impala itself, rather than for + rewriting the SQL code at the user level. + </li> + </ul> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + Prior to Impala 1.3, the allowed argument range for <code class="ph codeph">EXPLAIN_LEVEL</code> was 0 to 1: level 0 had + the mnemonic <code class="ph codeph">NORMAL</code>, and level 1 was <code class="ph codeph">VERBOSE</code>. In Impala 1.3 and higher, + <code class="ph codeph">NORMAL</code> is not a valid mnemonic value, and <code class="ph codeph">VERBOSE</code> still applies to the + highest level of detail but now corresponds to level 3. You might need to adjust the values if you have any + older <code class="ph codeph">impala-shell</code> script files that set the <code class="ph codeph">EXPLAIN_LEVEL</code> query option. + </div> + + <p class="p"> + Changing the value of this option controls the amount of detail in the output of the <code class="ph codeph">EXPLAIN</code> + statement. The extended information from level 2 or 3 is especially useful during performance tuning, when + you need to confirm whether the work for the query is distributed the way you expect, particularly for the + most resource-intensive operations such as join queries against large tables, queries against tables with + large numbers of partitions, and insert operations for Parquet tables. The extended information also helps to + check estimated resource usage when you use the admission control or resource management features explained + in <a class="xref" href="impala_resource_management.html#resource_management">Resource Management for Impala</a>. See + <a class="xref" href="impala_explain.html#explain">EXPLAIN Statement</a> for the syntax of the <code class="ph codeph">EXPLAIN</code> statement, and + <a class="xref" href="impala_explain_plan.html#perf_explain">Using the EXPLAIN Plan for Performance Tuning</a> for details about how to use the extended information. + </p> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + + <p class="p"> + As always, read the <code class="ph codeph">EXPLAIN</code> output from bottom to top. The lowest lines represent the + initial work of the query (scanning data files), the lines in the middle represent calculations done on each + node and how intermediate results are transmitted from one node to another, and the topmost lines represent + the final results being sent back to the coordinator node. + </p> + + <p class="p"> + The numbers in the left column are generated internally during the initial planning phase and do not + represent the actual order of operations, so it is not significant if they appear out of order in the + <code class="ph codeph">EXPLAIN</code> output. + </p> + + <p class="p"> + At all <code class="ph codeph">EXPLAIN</code> levels, the plan contains a warning if any tables in the query are missing + statistics. Use the <code class="ph codeph">COMPUTE STATS</code> statement to gather statistics for each table and suppress + this warning. See <a class="xref" href="impala_perf_stats.html#perf_stats">Table and Column Statistics</a> for details about how the statistics help + query performance. + </p> + + <p class="p"> + The <code class="ph codeph">PROFILE</code> command in <span class="keyword cmdname">impala-shell</span> always starts with an explain plan + showing full detail, the same as with <code class="ph codeph">EXPLAIN_LEVEL=3</code>. <span class="ph">After the explain + plan comes the executive summary, the same output as produced by the <code class="ph codeph">SUMMARY</code> command in + <span class="keyword cmdname">impala-shell</span>.</span> + </p> + + <p class="p"> + <strong class="ph b">Examples:</strong> + </p> + + <p class="p"> + These examples use a trivial, empty table to illustrate how the essential aspects of query planning are shown + in <code class="ph codeph">EXPLAIN</code> output: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create table t1 (x int, s string); +[localhost:21000] > set explain_level=1; +[localhost:21000] > explain select count(*) from t1; ++------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column | +| statistics. | +| explain_plan.t1 | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | | +| 01:AGGREGATE | +| | output: count(*) | +| | | +| 00:SCAN HDFS [explain_plan.t1] | +| partitions=1/1 size=0B | ++------------------------------------------------------------------------+ +[localhost:21000] > explain select * from t1; ++------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 | +| WARNING: The following tables are missing relevant table and/or column | +| statistics. | +| explain_plan.t1 | +| | +| 01:EXCHANGE [PARTITION=UNPARTITIONED] | +| | | +| 00:SCAN HDFS [explain_plan.t1] | +| partitions=1/1 size=0B | ++------------------------------------------------------------------------+ +[localhost:21000] > set explain_level=2; +[localhost:21000] > explain select * from t1; ++------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 | +| WARNING: The following tables are missing relevant table and/or column | +| statistics. | +| explain_plan.t1 | +| | +| 01:EXCHANGE [PARTITION=UNPARTITIONED] | +| | hosts=0 per-host-mem=unavailable | +| | tuple-ids=0 row-size=19B cardinality=unavailable | +| | | +| 00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM] | +| partitions=1/1 size=0B | +| table stats: unavailable | +| column stats: unavailable | +| hosts=0 per-host-mem=0B | +| tuple-ids=0 row-size=19B cardinality=unavailable | ++------------------------------------------------------------------------+ +[localhost:21000] > set explain_level=3; +[localhost:21000] > explain select * from t1; ++------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 | +<strong class="ph b">| WARNING: The following tables are missing relevant table and/or column |</strong> +<strong class="ph b">| statistics. |</strong> +<strong class="ph b">| explain_plan.t1 |</strong> +| | +| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED] | +| 01:EXCHANGE [PARTITION=UNPARTITIONED] | +| hosts=0 per-host-mem=unavailable | +| tuple-ids=0 row-size=19B cardinality=unavailable | +| | +| F00:PLAN FRAGMENT [PARTITION=RANDOM] | +| DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, PARTITION=UNPARTITIONED] | +| 00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM] | +| partitions=1/1 size=0B | +<strong class="ph b">| table stats: unavailable |</strong> +<strong class="ph b">| column stats: unavailable |</strong> +| hosts=0 per-host-mem=0B | +| tuple-ids=0 row-size=19B cardinality=unavailable | ++------------------------------------------------------------------------+ +</code></pre> + + <p class="p"> + As the warning message demonstrates, most of the information needed for Impala to do efficient query + planning, and for you to understand the performance characteristics of the query, requires running the + <code class="ph codeph">COMPUTE STATS</code> statement for the table: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > compute stats t1; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 2 column(s). | ++-----------------------------------------+ +[localhost:21000] > explain select * from t1; ++------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 | +| | +| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED] | +| 01:EXCHANGE [PARTITION=UNPARTITIONED] | +| hosts=0 per-host-mem=unavailable | +| tuple-ids=0 row-size=20B cardinality=0 | +| | +| F00:PLAN FRAGMENT [PARTITION=RANDOM] | +| DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, PARTITION=UNPARTITIONED] | +| 00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM] | +| partitions=1/1 size=0B | +<strong class="ph b">| table stats: 0 rows total |</strong> +<strong class="ph b">| column stats: all |</strong> +| hosts=0 per-host-mem=0B | +| tuple-ids=0 row-size=20B cardinality=0 | ++------------------------------------------------------------------------+ +</code></pre> + + <p class="p"> + Joins and other complicated, multi-part queries are the ones where you most commonly need to examine the + <code class="ph codeph">EXPLAIN</code> output and customize the amount of detail in the output. This example shows the + default <code class="ph codeph">EXPLAIN</code> output for a three-way join query, then the equivalent output with a + <code class="ph codeph">[SHUFFLE]</code> hint to change the join mechanism between the first two tables from a broadcast + join to a shuffle join. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > set explain_level=1; +[localhost:21000] > explain select one.*, two.*, three.* from t1 one, t1 two, t1 three where one.x = two.x and two.x = three.x; ++---------------------------------------------------------+ +| Explain String | ++---------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 | +| | +| 07:EXCHANGE [PARTITION=UNPARTITIONED] | +| | | +<strong class="ph b">| 04:HASH JOIN [INNER JOIN, BROADCAST] |</strong> +| | hash predicates: two.x = three.x | +| | | +<strong class="ph b">| |--06:EXCHANGE [BROADCAST] |</strong> +| | | | +| | 02:SCAN HDFS [explain_plan.t1 three] | +| | partitions=1/1 size=0B | +| | | +<strong class="ph b">| 03:HASH JOIN [INNER JOIN, BROADCAST] |</strong> +| | hash predicates: one.x = two.x | +| | | +<strong class="ph b">| |--05:EXCHANGE [BROADCAST] |</strong> +| | | | +| | 01:SCAN HDFS [explain_plan.t1 two] | +| | partitions=1/1 size=0B | +| | | +| 00:SCAN HDFS [explain_plan.t1 one] | +| partitions=1/1 size=0B | ++---------------------------------------------------------+ +[localhost:21000] > explain select one.*, two.*, three.* + > from t1 one join [shuffle] t1 two join t1 three + > where one.x = two.x and two.x = three.x; ++---------------------------------------------------------+ +| Explain String | ++---------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 | +| | +| 08:EXCHANGE [PARTITION=UNPARTITIONED] | +| | | +<strong class="ph b">| 04:HASH JOIN [INNER JOIN, BROADCAST] |</strong> +| | hash predicates: two.x = three.x | +| | | +<strong class="ph b">| |--07:EXCHANGE [BROADCAST] |</strong> +| | | | +| | 02:SCAN HDFS [explain_plan.t1 three] | +| | partitions=1/1 size=0B | +| | | +<strong class="ph b">| 03:HASH JOIN [INNER JOIN, PARTITIONED] |</strong> +| | hash predicates: one.x = two.x | +| | | +<strong class="ph b">| |--06:EXCHANGE [PARTITION=HASH(two.x)] |</strong> +| | | | +| | 01:SCAN HDFS [explain_plan.t1 two] | +| | partitions=1/1 size=0B | +| | | +<strong class="ph b">| 05:EXCHANGE [PARTITION=HASH(one.x)] |</strong> +| | | +| 00:SCAN HDFS [explain_plan.t1 one] | +| partitions=1/1 size=0B | ++---------------------------------------------------------+ +</code></pre> + + <p class="p"> + For a join involving many different tables, the default <code class="ph codeph">EXPLAIN</code> output might stretch over + several pages, and the only details you care about might be the join order and the mechanism (broadcast or + shuffle) for joining each pair of tables. In that case, you might set <code class="ph codeph">EXPLAIN_LEVEL</code> to its + lowest value of 0, to focus on just the join order and join mechanism for each stage. The following example + shows how the rows from the first and second joined tables are hashed and divided among the nodes of the + cluster for further filtering; then the entire contents of the third table are broadcast to all nodes for the + final stage of join processing. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > set explain_level=0; +[localhost:21000] > explain select one.*, two.*, three.* + > from t1 one join [shuffle] t1 two join t1 three + > where one.x = two.x and two.x = three.x; ++---------------------------------------------------------+ +| Explain String | ++---------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 | +| | +| 08:EXCHANGE [PARTITION=UNPARTITIONED] | +<strong class="ph b">| 04:HASH JOIN [INNER JOIN, BROADCAST] |</strong> +<strong class="ph b">| |--07:EXCHANGE [BROADCAST] |</strong> +| | 02:SCAN HDFS [explain_plan.t1 three] | +<strong class="ph b">| 03:HASH JOIN [INNER JOIN, PARTITIONED] |</strong> +<strong class="ph b">| |--06:EXCHANGE [PARTITION=HASH(two.x)] |</strong> +| | 01:SCAN HDFS [explain_plan.t1 two] | +<strong class="ph b">| 05:EXCHANGE [PARTITION=HASH(one.x)] |</strong> +| 00:SCAN HDFS [explain_plan.t1 one] | ++---------------------------------------------------------+ +</code></pre> + + + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_explain_plan.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_explain_plan.html b/docs/build/html/topics/impala_explain_plan.html new file mode 100644 index 0000000..bcd0855 --- /dev/null +++ b/docs/build/html/topics/impala_explain_plan.html @@ -0,0 +1,592 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_performance.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="explain_plan"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles</title> </head><body id="explain_plan"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles</h1> + + + + <div class="body conbody"> + + <p class="p"> + To understand the high-level performance considerations for Impala queries, read the output of the + <code class="ph codeph">EXPLAIN</code> statement for the query. You can get the <code class="ph codeph">EXPLAIN</code> plan without + actually running the query itself. + </p> + + <p class="p"> + For an overview of the physical performance characteristics for a query, issue the <code class="ph codeph">SUMMARY</code> + statement in <span class="keyword cmdname">impala-shell</span> immediately after executing a query. This condensed information + shows which phases of execution took the most time, and how the estimates for memory usage and number of rows + at each phase compare to the actual values. + </p> + + <p class="p"> + To understand the detailed performance characteristics for a query, issue the <code class="ph codeph">PROFILE</code> + statement in <span class="keyword cmdname">impala-shell</span> immediately after executing a query. This low-level information + includes physical details about memory, CPU, I/O, and network usage, and thus is only available after the + query is actually run. + </p> + + <p class="p toc inpage"></p> + + <p class="p"> + Also, see <a class="xref" href="impala_hbase.html#hbase_performance">Performance Considerations for the Impala-HBase Integration</a> + and <a class="xref" href="impala_s3.html#s3_performance">Understanding and Tuning Impala Query Performance for S3 Data</a> + for examples of interpreting + <code class="ph codeph">EXPLAIN</code> plans for queries against HBase tables + <span class="ph">and data stored in the Amazon Simple Storage System (S3)</span>. + </p> + </div> + + <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_performance.html">Tuning Impala for Performance</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="explain_plan__perf_explain"> + + <h2 class="title topictitle2" id="ariaid-title2">Using the EXPLAIN Plan for Performance Tuning</h2> + + <div class="body conbody"> + + <p class="p"> + The <code class="ph codeph"><a class="xref" href="impala_explain.html#explain">EXPLAIN</a></code> statement gives you an outline + of the logical steps that a query will perform, such as how the work will be distributed among the nodes + and how intermediate results will be combined to produce the final result set. You can see these details + before actually running the query. You can use this information to check that the query will not operate in + some very unexpected or inefficient way. + </p> + + + +<pre class="pre codeblock"><code>[impalad-host:21000] > explain select count(*) from customer_address; ++----------------------------------------------------------+ +| Explain String | ++----------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=42.00MB VCores=1 | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | | +| 01:AGGREGATE | +| | output: count(*) | +| | | +| 00:SCAN HDFS [default.customer_address] | +| partitions=1/1 size=5.25MB | ++----------------------------------------------------------+ +</code></pre> + + <div class="p"> + Read the <code class="ph codeph">EXPLAIN</code> plan from bottom to top: + <ul class="ul"> + <li class="li"> + The last part of the plan shows the low-level details such as the expected amount of data that will be + read, where you can judge the effectiveness of your partitioning strategy and estimate how long it will + take to scan a table based on total data size and the size of the cluster. + </li> + + <li class="li"> + As you work your way up, next you see the operations that will be parallelized and performed on each + Impala node. + </li> + + <li class="li"> + At the higher levels, you see how data flows when intermediate result sets are combined and transmitted + from one node to another. + </li> + + <li class="li"> + See <a class="xref" href="../shared/../topics/impala_explain_level.html#explain_level">EXPLAIN_LEVEL Query Option</a> for details about the + <code class="ph codeph">EXPLAIN_LEVEL</code> query option, which lets you customize how much detail to show in the + <code class="ph codeph">EXPLAIN</code> plan depending on whether you are doing high-level or low-level tuning, + dealing with logical or physical aspects of the query. + </li> + </ul> + </div> + + <p class="p"> + The <code class="ph codeph">EXPLAIN</code> plan is also printed at the beginning of the query profile report described in + <a class="xref" href="#perf_profile">Using the Query Profile for Performance Tuning</a>, for convenience in examining both the logical and physical aspects of the + query side-by-side. + </p> + + <p class="p"> + The amount of detail displayed in the <code class="ph codeph">EXPLAIN</code> output is controlled by the + <a class="xref" href="impala_explain_level.html#explain_level">EXPLAIN_LEVEL</a> query option. You typically + increase this setting from <code class="ph codeph">normal</code> to <code class="ph codeph">verbose</code> (or from <code class="ph codeph">0</code> + to <code class="ph codeph">1</code>) when doublechecking the presence of table and column statistics during performance + tuning, or when estimating query resource usage in conjunction with the resource management features. + </p> + + + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="explain_plan__perf_summary"> + + <h2 class="title topictitle2" id="ariaid-title3">Using the SUMMARY Report for Performance Tuning</h2> + + <div class="body conbody"> + + <p class="p"> + The <code class="ph codeph"><a class="xref" href="impala_shell_commands.html#shell_commands">SUMMARY</a></code> command within + the <span class="keyword cmdname">impala-shell</span> interpreter gives you an easy-to-digest overview of the timings for the + different phases of execution for a query. Like the <code class="ph codeph">EXPLAIN</code> plan, it is easy to see + potential performance bottlenecks. Like the <code class="ph codeph">PROFILE</code> output, it is available after the + query is run and so displays actual timing numbers. + </p> + + <p class="p"> + The <code class="ph codeph">SUMMARY</code> report is also printed at the beginning of the query profile report described + in <a class="xref" href="#perf_profile">Using the Query Profile for Performance Tuning</a>, for convenience in examining high-level and low-level aspects of the query + side-by-side. + </p> + + <p class="p"> + For example, here is a query involving an aggregate function, on a single-node VM. The different stages of + the query and their timings are shown (rolled up for all nodes), along with estimated and actual values + used in planning the query. In this case, the <code class="ph codeph">AVG()</code> function is computed for a subset of + data on each node (stage 01) and then the aggregated results from all nodes are combined at the end (stage + 03). You can see which stages took the most time, and whether any estimates were substantially different + than the actual data distribution. (When examining the time values, be sure to consider the suffixes such + as <code class="ph codeph">us</code> for microseconds and <code class="ph codeph">ms</code> for milliseconds, rather than just looking + for the largest numbers.) + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > select avg(ss_sales_price) from store_sales where ss_coupon_amt = 0; ++---------------------+ +| avg(ss_sales_price) | ++---------------------+ +| 37.80770926328327 | ++---------------------+ +[localhost:21000] > summary; ++--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | ++--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+ +| 03:AGGREGATE | 1 | 1.03ms | 1.03ms | 1 | 1 | 48.00 KB | -1 B | MERGE FINALIZE | +| 02:EXCHANGE | 1 | 0ns | 0ns | 1 | 1 | 0 B | -1 B | UNPARTITIONED | +| 01:AGGREGATE | 1 | 30.79ms | 30.79ms | 1 | 1 | 80.00 KB | 10.00 MB | | +| 00:SCAN HDFS | 1 | 5.45s | 5.45s | 2.21M | -1 | 64.05 MB | 432.00 MB | tpc.store_sales | ++--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+ +</code></pre> + + <p class="p"> + Notice how the longest initial phase of the query is measured in seconds (s), while later phases working on + smaller intermediate results are measured in milliseconds (ms) or even nanoseconds (ns). + </p> + + <p class="p"> + Here is an example from a more complicated query, as it would appear in the <code class="ph codeph">PROFILE</code> + output: + </p> + +<pre class="pre codeblock"><code>Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail +------------------------------------------------------------------------------------------------------------------------ +09:MERGING-EXCHANGE 1 79.738us 79.738us 5 5 0 -1.00 B UNPARTITIONED +05:TOP-N 3 84.693us 88.810us 5 5 12.00 KB 120.00 B +04:AGGREGATE 3 5.263ms 6.432ms 5 5 44.00 KB 10.00 MB MERGE FINALIZE +08:AGGREGATE 3 16.659ms 27.444ms 52.52K 600.12K 3.20 MB 15.11 MB MERGE +07:EXCHANGE 3 2.644ms 5.1ms 52.52K 600.12K 0 0 HASH(o_orderpriority) +03:AGGREGATE 3 342.913ms 966.291ms 52.52K 600.12K 10.80 MB 15.11 MB +02:HASH JOIN 3 2s165ms 2s171ms 144.87K 600.12K 13.63 MB 941.01 KB INNER JOIN, BROADCAST +|--06:EXCHANGE 3 8.296ms 8.692ms 57.22K 15.00K 0 0 BROADCAST +| 01:SCAN HDFS 2 1s412ms 1s978ms 57.22K 15.00K 24.21 MB 176.00 MB tpch.orders o +00:SCAN HDFS 3 8s032ms 8s558ms 3.79M 600.12K 32.29 MB 264.00 MB tpch.lineitem l +</code></pre> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="explain_plan__perf_profile"> + + <h2 class="title topictitle2" id="ariaid-title4">Using the Query Profile for Performance Tuning</h2> + + <div class="body conbody"> + + <p class="p"> + The <code class="ph codeph">PROFILE</code> statement, available in the <span class="keyword cmdname">impala-shell</span> interpreter, + produces a detailed low-level report showing how the most recent query was executed. Unlike the + <code class="ph codeph">EXPLAIN</code> plan described in <a class="xref" href="#perf_explain">Using the EXPLAIN Plan for Performance Tuning</a>, this information is only available + after the query has finished. It shows physical details such as the number of bytes read, maximum memory + usage, and so on for each node. You can use this information to determine if the query is I/O-bound or + CPU-bound, whether some network condition is imposing a bottleneck, whether a slowdown is affecting some + nodes but not others, and to check that recommended configuration settings such as short-circuit local + reads are in effect. + </p> + + <p class="p"> + By default, time values in the profile output reflect the wall-clock time taken by an operation. + For values denoting system time or user time, the measurement unit is reflected in the metric + name, such as <code class="ph codeph">ScannerThreadsSysTime</code> or <code class="ph codeph">ScannerThreadsUserTime</code>. + For example, a multi-threaded I/O operation might show a small figure for wall-clock time, + while the corresponding system time is larger, representing the sum of the CPU time taken by each thread. + Or a wall-clock time figure might be larger because it counts time spent waiting, while + the corresponding system and user time figures only measure the time while the operation + is actively using CPU cycles. + </p> + + <p class="p"> + The <a class="xref" href="impala_explain_plan.html#perf_explain"><code class="ph codeph">EXPLAIN</code> plan</a> is also printed + at the beginning of the query profile report, for convenience in examining both the logical and physical + aspects of the query side-by-side. The + <a class="xref" href="impala_explain_level.html#explain_level">EXPLAIN_LEVEL</a> query option also controls the + verbosity of the <code class="ph codeph">EXPLAIN</code> output printed by the <code class="ph codeph">PROFILE</code> command. + </p> + + + + <p class="p"> + Here is an example of a query profile, from a relatively straightforward query on a single-node + pseudo-distributed cluster to keep the output relatively brief. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > profile; +Query Runtime Profile: +Query (id=6540a03d4bee0691:4963d6269b210ebd): + Summary: + Session ID: ea4a197f1c7bf858:c74e66f72e3a33ba + Session Type: BEESWAX + Start Time: 2013-12-02 17:10:30.263067000 + End Time: 2013-12-02 17:10:50.932044000 + Query Type: QUERY + Query State: FINISHED + Query Status: OK + Impala Version: impalad version 1.2.1 RELEASE (build edb5af1bcad63d410bc5d47cc203df3a880e9324) + User: doc_demo + Network Address: 127.0.0.1:49161 + Default Db: stats_testing + Sql Statement: select t1.s, t2.s from t1 join t2 on (t1.id = t2.parent) + Plan: +---------------- +Estimated Per-Host Requirements: Memory=2.09GB VCores=2 + +PLAN FRAGMENT 0 + PARTITION: UNPARTITIONED + + 4:EXCHANGE + cardinality: unavailable + per-host memory: unavailable + tuple ids: 0 1 + +PLAN FRAGMENT 1 + PARTITION: RANDOM + + STREAM DATA SINK + EXCHANGE ID: 4 + UNPARTITIONED + + 2:HASH JOIN + | join op: INNER JOIN (BROADCAST) + | hash predicates: + | t1.id = t2.parent + | cardinality: unavailable + | per-host memory: 2.00GB + | tuple ids: 0 1 + | + |----3:EXCHANGE + | cardinality: unavailable + | per-host memory: 0B + | tuple ids: 1 + | + 0:SCAN HDFS + table=stats_testing.t1 #partitions=1/1 size=33B + table stats: unavailable + column stats: unavailable + cardinality: unavailable + per-host memory: 32.00MB + tuple ids: 0 + +PLAN FRAGMENT 2 + PARTITION: RANDOM + + STREAM DATA SINK + EXCHANGE ID: 3 + UNPARTITIONED + + 1:SCAN HDFS + table=stats_testing.t2 #partitions=1/1 size=960.00KB + table stats: unavailable + column stats: unavailable + cardinality: unavailable + per-host memory: 96.00MB + tuple ids: 1 +---------------- + Query Timeline: 20s670ms + - Start execution: 2.559ms (2.559ms) + - Planning finished: 23.587ms (21.27ms) + - Rows available: 666.199ms (642.612ms) + - First row fetched: 668.919ms (2.719ms) + - Unregister query: 20s668ms (20s000ms) + ImpalaServer: + - ClientFetchWaitTimer: 19s637ms + - RowMaterializationTimer: 167.121ms + Execution Profile 6540a03d4bee0691:4963d6269b210ebd:(Active: 837.815ms, % non-child: 0.00%) + Per Node Peak Memory Usage: impala-1.example.com:22000(7.42 MB) + - FinalizationTimer: 0ns + Coordinator Fragment:(Active: 195.198ms, % non-child: 0.00%) + MemoryUsage(500.0ms): 16.00 KB, 7.42 MB, 7.33 MB, 7.10 MB, 6.94 MB, 6.71 MB, 6.56 MB, 6.40 MB, 6.17 MB, 6.02 MB, 5.79 MB, 5.63 MB, 5.48 MB, 5.25 MB, 5.09 MB, 4.86 MB, 4.71 MB, 4.47 MB, 4.32 MB, 4.09 MB, 3.93 MB, 3.78 MB, 3.55 MB, 3.39 MB, 3.16 MB, 3.01 MB, 2.78 MB, 2.62 MB, 2.39 MB, 2.24 MB, 2.08 MB, 1.85 MB, 1.70 MB, 1.54 MB, 1.31 MB, 1.16 MB, 948.00 KB, 790.00 KB, 553.00 KB, 395.00 KB, 237.00 KB + ThreadUsage(500.0ms): 1 + - AverageThreadTokens: 1.00 + - PeakMemoryUsage: 7.42 MB + - PrepareTime: 36.144us + - RowsProduced: 98.30K (98304) + - TotalCpuTime: 20s449ms + - TotalNetworkWaitTime: 191.630ms + - TotalStorageWaitTime: 0ns + CodeGen:(Active: 150.679ms, % non-child: 77.19%) + - CodegenTime: 0ns + - CompileTime: 139.503ms + - LoadTime: 10.7ms + - ModuleFileSize: 95.27 KB + EXCHANGE_NODE (id=4):(Active: 194.858ms, % non-child: 99.83%) + - BytesReceived: 2.33 MB + - ConvertRowBatchTime: 2.732ms + - DataArrivalWaitTime: 191.118ms + - DeserializeRowBatchTimer: 14.943ms + - FirstBatchArrivalWaitTime: 191.117ms + - PeakMemoryUsage: 7.41 MB + - RowsReturned: 98.30K (98304) + - RowsReturnedRate: 504.49 K/sec + - SendersBlockedTimer: 0ns + - SendersBlockedTotalTimer(*): 0ns + Averaged Fragment 1:(Active: 442.360ms, % non-child: 0.00%) + split sizes: min: 33.00 B, max: 33.00 B, avg: 33.00 B, stddev: 0.00 + completion times: min:443.720ms max:443.720ms mean: 443.720ms stddev:0ns + execution rates: min:74.00 B/sec max:74.00 B/sec mean:74.00 B/sec stddev:0.00 /sec + num instances: 1 + - AverageThreadTokens: 1.00 + - PeakMemoryUsage: 6.06 MB + - PrepareTime: 7.291ms + - RowsProduced: 98.30K (98304) + - TotalCpuTime: 784.259ms + - TotalNetworkWaitTime: 388.818ms + - TotalStorageWaitTime: 3.934ms + CodeGen:(Active: 312.862ms, % non-child: 70.73%) + - CodegenTime: 2.669ms + - CompileTime: 302.467ms + - LoadTime: 9.231ms + - ModuleFileSize: 95.27 KB + DataStreamSender (dst_id=4):(Active: 80.63ms, % non-child: 18.10%) + - BytesSent: 2.33 MB + - NetworkThroughput(*): 35.89 MB/sec + - OverallThroughput: 29.06 MB/sec + - PeakMemoryUsage: 5.33 KB + - SerializeBatchTime: 26.487ms + - ThriftTransmitTime(*): 64.814ms + - UncompressedRowBatchSize: 6.66 MB + HASH_JOIN_NODE (id=2):(Active: 362.25ms, % non-child: 3.92%) + - BuildBuckets: 1.02K (1024) + - BuildRows: 98.30K (98304) + - BuildTime: 12.622ms + - LoadFactor: 0.00 + - PeakMemoryUsage: 6.02 MB + - ProbeRows: 3 + - ProbeTime: 3.579ms + - RowsReturned: 98.30K (98304) + - RowsReturnedRate: 271.54 K/sec + EXCHANGE_NODE (id=3):(Active: 344.680ms, % non-child: 77.92%) + - BytesReceived: 1.15 MB + - ConvertRowBatchTime: 2.792ms + - DataArrivalWaitTime: 339.936ms + - DeserializeRowBatchTimer: 9.910ms + - FirstBatchArrivalWaitTime: 199.474ms + - PeakMemoryUsage: 156.00 KB + - RowsReturned: 98.30K (98304) + - RowsReturnedRate: 285.20 K/sec + - SendersBlockedTimer: 0ns + - SendersBlockedTotalTimer(*): 0ns + HDFS_SCAN_NODE (id=0):(Active: 13.616us, % non-child: 0.00%) + - AverageHdfsReadThreadConcurrency: 0.00 + - AverageScannerThreadConcurrency: 0.00 + - BytesRead: 33.00 B + - BytesReadLocal: 33.00 B + - BytesReadShortCircuit: 33.00 B + - NumDisksAccessed: 1 + - NumScannerThreadsStarted: 1 + - PeakMemoryUsage: 46.00 KB + - PerReadThreadRawHdfsThroughput: 287.52 KB/sec + - RowsRead: 3 + - RowsReturned: 3 + - RowsReturnedRate: 220.33 K/sec + - ScanRangesComplete: 1 + - ScannerThreadsInvoluntaryContextSwitches: 26 + - ScannerThreadsTotalWallClockTime: 55.199ms + - DelimiterParseTime: 2.463us + - MaterializeTupleTime(*): 1.226us + - ScannerThreadsSysTime: 0ns + - ScannerThreadsUserTime: 42.993ms + - ScannerThreadsVoluntaryContextSwitches: 1 + - TotalRawHdfsReadTime(*): 112.86us + - TotalReadThroughput: 0.00 /sec + Averaged Fragment 2:(Active: 190.120ms, % non-child: 0.00%) + split sizes: min: 960.00 KB, max: 960.00 KB, avg: 960.00 KB, stddev: 0.00 + completion times: min:191.736ms max:191.736ms mean: 191.736ms stddev:0ns + execution rates: min:4.89 MB/sec max:4.89 MB/sec mean:4.89 MB/sec stddev:0.00 /sec + num instances: 1 + - AverageThreadTokens: 0.00 + - PeakMemoryUsage: 906.33 KB + - PrepareTime: 3.67ms + - RowsProduced: 98.30K (98304) + - TotalCpuTime: 403.351ms + - TotalNetworkWaitTime: 34.999ms + - TotalStorageWaitTime: 108.675ms + CodeGen:(Active: 162.57ms, % non-child: 85.24%) + - CodegenTime: 3.133ms + - CompileTime: 148.316ms + - LoadTime: 12.317ms + - ModuleFileSize: 95.27 KB + DataStreamSender (dst_id=3):(Active: 70.620ms, % non-child: 37.14%) + - BytesSent: 1.15 MB + - NetworkThroughput(*): 23.30 MB/sec + - OverallThroughput: 16.23 MB/sec + - PeakMemoryUsage: 5.33 KB + - SerializeBatchTime: 22.69ms + - ThriftTransmitTime(*): 49.178ms + - UncompressedRowBatchSize: 3.28 MB + HDFS_SCAN_NODE (id=1):(Active: 118.839ms, % non-child: 62.51%) + - AverageHdfsReadThreadConcurrency: 0.00 + - AverageScannerThreadConcurrency: 0.00 + - BytesRead: 960.00 KB + - BytesReadLocal: 960.00 KB + - BytesReadShortCircuit: 960.00 KB + - NumDisksAccessed: 1 + - NumScannerThreadsStarted: 1 + - PeakMemoryUsage: 869.00 KB + - PerReadThreadRawHdfsThroughput: 130.21 MB/sec + - RowsRead: 98.30K (98304) + - RowsReturned: 98.30K (98304) + - RowsReturnedRate: 827.20 K/sec + - ScanRangesComplete: 15 + - ScannerThreadsInvoluntaryContextSwitches: 34 + - ScannerThreadsTotalWallClockTime: 189.774ms + - DelimiterParseTime: 15.703ms + - MaterializeTupleTime(*): 3.419ms + - ScannerThreadsSysTime: 1.999ms + - ScannerThreadsUserTime: 44.993ms + - ScannerThreadsVoluntaryContextSwitches: 118 + - TotalRawHdfsReadTime(*): 7.199ms + - TotalReadThroughput: 0.00 /sec + Fragment 1: + Instance 6540a03d4bee0691:4963d6269b210ebf (host=impala-1.example.com:22000):(Active: 442.360ms, % non-child: 0.00%) + Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/33.00 B + MemoryUsage(500.0ms): 69.33 KB + ThreadUsage(500.0ms): 1 + - AverageThreadTokens: 1.00 + - PeakMemoryUsage: 6.06 MB + - PrepareTime: 7.291ms + - RowsProduced: 98.30K (98304) + - TotalCpuTime: 784.259ms + - TotalNetworkWaitTime: 388.818ms + - TotalStorageWaitTime: 3.934ms + CodeGen:(Active: 312.862ms, % non-child: 70.73%) + - CodegenTime: 2.669ms + - CompileTime: 302.467ms + - LoadTime: 9.231ms + - ModuleFileSize: 95.27 KB + DataStreamSender (dst_id=4):(Active: 80.63ms, % non-child: 18.10%) + - BytesSent: 2.33 MB + - NetworkThroughput(*): 35.89 MB/sec + - OverallThroughput: 29.06 MB/sec + - PeakMemoryUsage: 5.33 KB + - SerializeBatchTime: 26.487ms + - ThriftTransmitTime(*): 64.814ms + - UncompressedRowBatchSize: 6.66 MB + HASH_JOIN_NODE (id=2):(Active: 362.25ms, % non-child: 3.92%) + ExecOption: Build Side Codegen Enabled, Probe Side Codegen Enabled, Hash Table Built Asynchronously + - BuildBuckets: 1.02K (1024) + - BuildRows: 98.30K (98304) + - BuildTime: 12.622ms + - LoadFactor: 0.00 + - PeakMemoryUsage: 6.02 MB + - ProbeRows: 3 + - ProbeTime: 3.579ms + - RowsReturned: 98.30K (98304) + - RowsReturnedRate: 271.54 K/sec + EXCHANGE_NODE (id=3):(Active: 344.680ms, % non-child: 77.92%) + - BytesReceived: 1.15 MB + - ConvertRowBatchTime: 2.792ms + - DataArrivalWaitTime: 339.936ms + - DeserializeRowBatchTimer: 9.910ms + - FirstBatchArrivalWaitTime: 199.474ms + - PeakMemoryUsage: 156.00 KB + - RowsReturned: 98.30K (98304) + - RowsReturnedRate: 285.20 K/sec + - SendersBlockedTimer: 0ns + - SendersBlockedTotalTimer(*): 0ns + HDFS_SCAN_NODE (id=0):(Active: 13.616us, % non-child: 0.00%) + Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/33.00 B + Hdfs Read Thread Concurrency Bucket: 0:0% 1:0% + File Formats: TEXT/NONE:1 + ExecOption: Codegen enabled: 1 out of 1 + - AverageHdfsReadThreadConcurrency: 0.00 + - AverageScannerThreadConcurrency: 0.00 + - BytesRead: 33.00 B + - BytesReadLocal: 33.00 B + - BytesReadShortCircuit: 33.00 B + - NumDisksAccessed: 1 + - NumScannerThreadsStarted: 1 + - PeakMemoryUsage: 46.00 KB + - PerReadThreadRawHdfsThroughput: 287.52 KB/sec + - RowsRead: 3 + - RowsReturned: 3 + - RowsReturnedRate: 220.33 K/sec + - ScanRangesComplete: 1 + - ScannerThreadsInvoluntaryContextSwitches: 26 + - ScannerThreadsTotalWallClockTime: 55.199ms + - DelimiterParseTime: 2.463us + - MaterializeTupleTime(*): 1.226us + - ScannerThreadsSysTime: 0ns + - ScannerThreadsUserTime: 42.993ms + - ScannerThreadsVoluntaryContextSwitches: 1 + - TotalRawHdfsReadTime(*): 112.86us + - TotalReadThroughput: 0.00 /sec + Fragment 2: + Instance 6540a03d4bee0691:4963d6269b210ec0 (host=impala-1.example.com:22000):(Active: 190.120ms, % non-child: 0.00%) + Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:15/960.00 KB + - AverageThreadTokens: 0.00 + - PeakMemoryUsage: 906.33 KB + - PrepareTime: 3.67ms + - RowsProduced: 98.30K (98304) + - TotalCpuTime: 403.351ms + - TotalNetworkWaitTime: 34.999ms + - TotalStorageWaitTime: 108.675ms + CodeGen:(Active: 162.57ms, % non-child: 85.24%) + - CodegenTime: 3.133ms + - CompileTime: 148.316ms + - LoadTime: 12.317ms + - ModuleFileSize: 95.27 KB + DataStreamSender (dst_id=3):(Active: 70.620ms, % non-child: 37.14%) + - BytesSent: 1.15 MB + - NetworkThroughput(*): 23.30 MB/sec + - OverallThroughput: 16.23 MB/sec + - PeakMemoryUsage: 5.33 KB + - SerializeBatchTime: 22.69ms + - ThriftTransmitTime(*): 49.178ms + - UncompressedRowBatchSize: 3.28 MB + HDFS_SCAN_NODE (id=1):(Active: 118.839ms, % non-child: 62.51%) + Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:15/960.00 KB + Hdfs Read Thread Concurrency Bucket: 0:0% 1:0% + File Formats: TEXT/NONE:15 + ExecOption: Codegen enabled: 15 out of 15 + - AverageHdfsReadThreadConcurrency: 0.00 + - AverageScannerThreadConcurrency: 0.00 + - BytesRead: 960.00 KB + - BytesReadLocal: 960.00 KB + - BytesReadShortCircuit: 960.00 KB + - NumDisksAccessed: 1 + - NumScannerThreadsStarted: 1 + - PeakMemoryUsage: 869.00 KB + - PerReadThreadRawHdfsThroughput: 130.21 MB/sec + - RowsRead: 98.30K (98304) + - RowsReturned: 98.30K (98304) + - RowsReturnedRate: 827.20 K/sec + - ScanRangesComplete: 15 + - ScannerThreadsInvoluntaryContextSwitches: 34 + - ScannerThreadsTotalWallClockTime: 189.774ms + - DelimiterParseTime: 15.703ms + - MaterializeTupleTime(*): 3.419ms + - ScannerThreadsSysTime: 1.999ms + - ScannerThreadsUserTime: 44.993ms + - ScannerThreadsVoluntaryContextSwitches: 118 + - TotalRawHdfsReadTime(*): 7.199ms + - TotalReadThroughput: 0.00 /sec</code></pre> + </div> + </article> +</article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_faq.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_faq.html b/docs/build/html/topics/impala_faq.html new file mode 100644 index 0000000..b85bb8a --- /dev/null +++ b/docs/build/html/topics/impala_faq.html @@ -0,0 +1,21 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="faq"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Impala Frequently Asked Questions</title></head><body id="faq"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Impala Frequently Asked Questions</h1> + + + <div class="body conbody"> + + <p class="p"> + This section lists frequently asked questions for Apache Impala (incubating), + the interactive SQL engine for Hadoop. + </p> + + <p class="p"> + This section is under construction. + </p> + + </div> + +</article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_file_formats.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_file_formats.html b/docs/build/html/topics/impala_file_formats.html new file mode 100644 index 0000000..d9ccbca --- /dev/null +++ b/docs/build/html/topics/impala_file_formats.html @@ -0,0 +1,236 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_txtfile.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_parquet.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_avro.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_rcfile.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_seqfile.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="file_formats"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>How Impala Works with Hado op File Formats</title></head><body id="file_formats"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">How Impala Works with Hadoop File Formats</h1> + + + + <div class="body conbody"> + + <p class="p"> + + + Impala supports several familiar file formats used in Apache Hadoop. Impala can load and query data files + produced by other Hadoop components such as Pig or MapReduce, and data files produced by Impala can be used + by other components also. The following sections discuss the procedures, limitations, and performance + considerations for using each file format with Impala. + </p> + + <p class="p"> + The file format used for an Impala table has significant performance consequences. Some file formats include + compression support that affects the size of data on the disk and, consequently, the amount of I/O and CPU + resources required to deserialize data. The amounts of I/O and CPU resources required can be a limiting + factor in query performance since querying often begins with moving and decompressing data. To reduce the + potential impact of this part of the process, data is often compressed. By compressing data, a smaller total + number of bytes are transferred from disk to memory. This reduces the amount of time taken to transfer the + data, but a tradeoff occurs when the CPU decompresses the content. + </p> + + <p class="p"> + Impala can query files encoded with most of the popular file formats and compression codecs used in Hadoop. + Impala can create and insert data into tables that use some file formats but not others; for file formats + that Impala cannot write to, create the table in Hive, issue the <code class="ph codeph">INVALIDATE METADATA <var class="keyword varname">table_name</var></code> + statement in <code class="ph codeph">impala-shell</code>, and query the table through Impala. File formats can be + structured, in which case they may include metadata and built-in compression. Supported formats include: + </p> + + <table class="table"><caption><span class="table--title-label">Table 1. </span><span class="title">File Format Support in Impala</span></caption><colgroup><col style="width:10%"><col style="width:10%"><col style="width:20%"><col style="width:30%"><col style="width:30%"></colgroup><thead class="thead"> + <tr class="row"> + <th class="entry nocellnorowborder" id="file_formats__entry__1"> + File Type + </th> + <th class="entry nocellnorowborder" id="file_formats__entry__2"> + Format + </th> + <th class="entry nocellnorowborder" id="file_formats__entry__3"> + Compression Codecs + </th> + <th class="entry nocellnorowborder" id="file_formats__entry__4"> + Impala Can CREATE? + </th> + <th class="entry nocellnorowborder" id="file_formats__entry__5"> + Impala Can INSERT? + </th> + </tr> + </thead><tbody class="tbody"> + <tr class="row" id="file_formats__parquet_support"> + <td class="entry nocellnorowborder" headers="file_formats__entry__1 "> + <a class="xref" href="impala_parquet.html#parquet">Parquet</a> + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__2 "> + Structured + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__3 "> + Snappy, gzip; currently Snappy by default + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__4 "> + Yes. + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__5 "> + Yes: <code class="ph codeph">CREATE TABLE</code>, <code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, and query. + </td> + </tr> + <tr class="row" id="file_formats__txtfile_support"> + <td class="entry nocellnorowborder" headers="file_formats__entry__1 "> + <a class="xref" href="impala_txtfile.html#txtfile">Text</a> + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__2 "> + Unstructured + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__3 "> + LZO, gzip, bzip2, Snappy + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__4 "> + Yes. For <code class="ph codeph">CREATE TABLE</code> with no <code class="ph codeph">STORED AS</code> clause, the default file + format is uncompressed text, with values separated by ASCII <code class="ph codeph">0x01</code> characters + (typically represented as Ctrl-A). + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__5 "> + Yes: <code class="ph codeph">CREATE TABLE</code>, <code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, and query. + If LZO compression is used, you must create the table and load data in Hive. If other kinds of + compression are used, you must load data through <code class="ph codeph">LOAD DATA</code>, Hive, or manually in + HDFS. + + + </td> + </tr> + <tr class="row" id="file_formats__avro_support"> + <td class="entry nocellnorowborder" headers="file_formats__entry__1 "> + <a class="xref" href="impala_avro.html#avro">Avro</a> + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__2 "> + Structured + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__3 "> + Snappy, gzip, deflate, bzip2 + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__4 "> + Yes, in Impala 1.4.0 and higher. Before that, create the table using Hive. + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__5 "> + No. Import data by using <code class="ph codeph">LOAD DATA</code> on data files already in the right format, or use + <code class="ph codeph">INSERT</code> in Hive followed by <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> in Impala. + </td> + + </tr> + <tr class="row" id="file_formats__rcfile_support"> + <td class="entry nocellnorowborder" headers="file_formats__entry__1 "> + <a class="xref" href="impala_rcfile.html#rcfile">RCFile</a> + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__2 "> + Structured + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__3 "> + Snappy, gzip, deflate, bzip2 + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__4 "> + Yes. + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__5 "> + No. Import data by using <code class="ph codeph">LOAD DATA</code> on data files already in the right format, or use + <code class="ph codeph">INSERT</code> in Hive followed by <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> in Impala. + </td> + + </tr> + <tr class="row" id="file_formats__sequencefile_support"> + <td class="entry nocellnorowborder" headers="file_formats__entry__1 "> + <a class="xref" href="impala_seqfile.html#seqfile">SequenceFile</a> + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__2 "> + Structured + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__3 "> + Snappy, gzip, deflate, bzip2 + </td> + <td class="entry nocellnorowborder" headers="file_formats__entry__4 ">Yes.</td> + <td class="entry nocellnorowborder" headers="file_formats__entry__5 "> + No. Import data by using <code class="ph codeph">LOAD DATA</code> on data files already in the right format, or use + <code class="ph codeph">INSERT</code> in Hive followed by <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> in Impala. + </td> + + </tr> + </tbody></table> + + <p class="p"> + Impala can only query the file formats listed in the preceding table. + In particular, Impala does not support the ORC file format. + </p> + + <p class="p"> + Impala supports the following compression codecs: + </p> + + <ul class="ul"> + <li class="li"> + Snappy. Recommended for its effective balance between compression ratio and decompression speed. Snappy + compression is very fast, but gzip provides greater space savings. Supported for text files in Impala 2.0 + and higher. + + </li> + + <li class="li"> + Gzip. Recommended when achieving the highest level of compression (and therefore greatest disk-space + savings) is desired. Supported for text files in Impala 2.0 and higher. + </li> + + <li class="li"> + Deflate. Not supported for text files. + </li> + + <li class="li"> + Bzip2. Supported for text files in Impala 2.0 and higher. + + </li> + + <li class="li"> + <p class="p"> LZO, for text files only. Impala can query + LZO-compressed text tables, but currently cannot create them or insert + data into them; perform these operations in Hive. </p> + </li> + </ul> + </div> + + <nav role="navigation" class="related-links"><ul class="ullinks"><li class="link ulchildlink"><strong><a href="../topics/impala_txtfile.html">Using Text Data Files with Impala Tables</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_parquet.html">Using the Parquet File Format with Impala Tables</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_avro.html">Using the Avro File Format with Impala Tables</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_rcfile.html">Using the RCFile File Format with Impala Tables</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_seqfile.html">Using the SequenceFile File Format with Impala Tables</a></strong><br></li></ul></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="file_formats__file_format_choosing"> + + <h2 class="title topictitle2" id="ariaid-title2">Choosing the File Format for a Table</h2> + + + <div class="body conbody"> + + <p class="p"> + Different file formats and compression codecs work better for different data sets. While Impala typically + provides performance gains regardless of file format, choosing the proper format for your data can yield + further performance improvements. Use the following considerations to decide which combination of file + format and compression to use for a particular table: + </p> + + <ul class="ul"> + <li class="li"> + If you are working with existing files that are already in a supported file format, use the same format + for the Impala table where practical. If the original format does not yield acceptable query performance + or resource usage, consider creating a new Impala table with different file format or compression + characteristics, and doing a one-time conversion by copying the data to the new table using the + <code class="ph codeph">INSERT</code> statement. Depending on the file format, you might run the + <code class="ph codeph">INSERT</code> statement in <code class="ph codeph">impala-shell</code> or in Hive. + </li> + + <li class="li"> + Text files are convenient to produce through many different tools, and are human-readable for ease of + verification and debugging. Those characteristics are why text is the default format for an Impala + <code class="ph codeph">CREATE TABLE</code> statement. When performance and resource usage are the primary + considerations, use one of the other file formats and consider using compression. A typical workflow + might involve bringing data into an Impala table by copying CSV or TSV files into the appropriate data + directory, and then using the <code class="ph codeph">INSERT ... SELECT</code> syntax to copy the data into a table + using a different, more compact file format. + </li> + + <li class="li"> + If your architecture involves storing data to be queried in memory, do not compress the data. There is no + I/O savings since the data does not need to be moved from disk, but there is a CPU cost to decompress the + data. + </li> + </ul> + </div> + </article> +</article></main></body></html> \ No newline at end of file
