http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_partitioning.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_partitioning.html b/docs/build/html/topics/impala_partitioning.html new file mode 100644 index 0000000..b361083 --- /dev/null +++ b/docs/build/html/topics/impala_partitioning.html @@ -0,0 +1,653 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="partitioning"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Partitioning for Impala Tables</title></head><body id="partitioning"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Partitioning for Impala Tables</h1> + + + + + + <div class="body conbody"> + + <p class="p"> + + By default, all the data files for a table are located in a single directory. Partitioning is a technique for physically dividing the + data during loading, based on values from one or more columns, to speed up queries that test those columns. For example, with a + <code class="ph codeph">school_records</code> table partitioned on a <code class="ph codeph">year</code> column, there is a separate data directory for each + different year value, and all the data for that year is stored in a data file in that directory. A query that includes a + <code class="ph codeph">WHERE</code> condition such as <code class="ph codeph">YEAR=1966</code>, <code class="ph codeph">YEAR IN (1989,1999)</code>, or <code class="ph codeph">YEAR BETWEEN + 1984 AND 1989</code> can examine only the data files from the appropriate directory or directories, greatly reducing the amount of + data to read and test. + </p> + + <p class="p toc inpage"></p> + + <p class="p"> + See <a class="xref" href="impala_tutorial.html#tut_external_partition_data">Attaching an External Partitioned Table to an HDFS Directory Structure</a> for an example that illustrates the syntax for creating partitioned + tables, the underlying directory structure in HDFS, and how to attach a partitioned Impala external table to data files stored + elsewhere in HDFS. + </p> + + <p class="p"> + Parquet is a popular format for partitioned Impala tables because it is well suited to handle huge data volumes. See + <a class="xref" href="impala_parquet.html#parquet_performance">Query Performance for Impala Parquet Tables</a> for performance considerations for partitioned Parquet tables. + </p> + + <p class="p"> + See <a class="xref" href="impala_literals.html#null">NULL</a> for details about how <code class="ph codeph">NULL</code> values are represented in partitioned tables. + </p> + + <p class="p"> + See <a class="xref" href="impala_s3.html#s3">Using Impala with the Amazon S3 Filesystem</a> for details about setting up tables where some or all partitions reside on the Amazon Simple + Storage Service (S3). + </p> + + </div> + + <article class="topic concept nested1" aria-labelledby="ariaid-title2" id="partitioning__partitioning_choosing"> + + <h2 class="title topictitle2" id="ariaid-title2">When to Use Partitioned Tables</h2> + + <div class="body conbody"> + + <p class="p"> + Partitioning is typically appropriate for: + </p> + + <ul class="ul"> + <li class="li"> + Tables that are very large, where reading the entire data set takes an impractical amount of time. + </li> + + <li class="li"> + Tables that are always or almost always queried with conditions on the partitioning columns. In our example of a table partitioned + by year, <code class="ph codeph">SELECT COUNT(*) FROM school_records WHERE year = 1985</code> is efficient, only examining a small fraction of + the data; but <code class="ph codeph">SELECT COUNT(*) FROM school_records</code> has to process a separate data file for each year, resulting in + more overall work than in an unpartitioned table. You would probably not partition this way if you frequently queried the table + based on last name, student ID, and so on without testing the year. + </li> + + <li class="li"> + Columns that have reasonable cardinality (number of different values). If a column only has a small number of values, for example + <code class="ph codeph">Male</code> or <code class="ph codeph">Female</code>, you do not gain much efficiency by eliminating only about 50% of the data to + read for each query. If a column has only a few rows matching each value, the number of directories to process can become a + limiting factor, and the data file in each directory could be too small to take advantage of the Hadoop mechanism for transmitting + data in multi-megabyte blocks. For example, you might partition census data by year, store sales data by year and month, and web + traffic data by year, month, and day. (Some users with high volumes of incoming data might even partition down to the individual + hour and minute.) + </li> + + <li class="li"> + Data that already passes through an extract, transform, and load (ETL) pipeline. The values of the partitioning columns are + stripped from the original data files and represented by directory names, so loading data into a partitioned table involves some + sort of transformation or preprocessing. + </li> + </ul> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="partitioning__partition_sql"> + + <h2 class="title topictitle2" id="ariaid-title3">SQL Statements for Partitioned Tables</h2> + + <div class="body conbody"> + + <p class="p"> + In terms of Impala SQL syntax, partitioning affects these statements: + </p> + + <ul class="ul"> + <li class="li"> + <code class="ph codeph"><a class="xref" href="impala_create_table.html#create_table">CREATE TABLE</a></code>: you specify a <code class="ph codeph">PARTITIONED + BY</code> clause when creating the table to identify names and data types of the partitioning columns. These columns are not + included in the main list of columns for the table. + </li> + + <li class="li"> + In <span class="keyword">Impala 2.5</span> and higher, you can also use the <code class="ph codeph">PARTITIONED BY</code> clause in a <code class="ph codeph">CREATE TABLE AS + SELECT</code> statement. This syntax lets you use a single statement to create a partitioned table, copy data into it, and + create new partitions based on the values in the inserted data. + </li> + + <li class="li"> + <code class="ph codeph"><a class="xref" href="impala_alter_table.html#alter_table">ALTER TABLE</a></code>: you can add or drop partitions, to work with + different portions of a huge data set. You can designate the HDFS directory that holds the data files for a specific partition. + With data partitioned by date values, you might <span class="q">"age out"</span> data that is no longer relevant. + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + If you are creating a partition for the first time and specifying its location, for maximum efficiency, use + a single <code class="ph codeph">ALTER TABLE</code> statement including both the <code class="ph codeph">ADD PARTITION</code> and + <code class="ph codeph">LOCATION</code> clauses, rather than separate statements with <code class="ph codeph">ADD PARTITION</code> and + <code class="ph codeph">SET LOCATION</code> clauses. + </div> + </li> + + <li class="li"> + <code class="ph codeph"><a class="xref" href="impala_insert.html#insert">INSERT</a></code>: When you insert data into a partitioned table, you identify + the partitioning columns. One or more values from each inserted row are not stored in data files, but instead determine the + directory where that row value is stored. You can also specify which partition to load a set of data into, with <code class="ph codeph">INSERT + OVERWRITE</code> statements; you can replace the contents of a specific partition but you cannot append data to a specific + partition. + <p class="p"> + By default, if an <code class="ph codeph">INSERT</code> statement creates any new subdirectories underneath a partitioned + table, those subdirectories are assigned default HDFS permissions for the <code class="ph codeph">impala</code> user. To + make each subdirectory have the same permissions as its parent directory in HDFS, specify the + <code class="ph codeph">--insert_inherit_permissions</code> startup option for the <span class="keyword cmdname">impalad</span> daemon. + </p> + </li> + + <li class="li"> + Although the syntax of the <code class="ph codeph"><a class="xref" href="impala_select.html#select">SELECT</a></code> statement is the same whether or + not the table is partitioned, the way queries interact with partitioned tables can have a dramatic impact on performance and + scalability. The mechanism that lets queries skip certain partitions during a query is known as partition pruning; see + <a class="xref" href="impala_partitioning.html#partition_pruning">Partition Pruning for Queries</a> for details. + </li> + + <li class="li"> + In Impala 1.4 and later, there is a <code class="ph codeph">SHOW PARTITIONS</code> statement that displays information about each partition in a + table. See <a class="xref" href="impala_show.html#show">SHOW Statement</a> for details. + </li> + </ul> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="partitioning__partition_static_dynamic"> + + <h2 class="title topictitle2" id="ariaid-title4">Static and Dynamic Partitioning Clauses</h2> + + <div class="body conbody"> + + <p class="p"> + Specifying all the partition columns in a SQL statement is called <dfn class="term">static partitioning</dfn>, because the statement affects a + single predictable partition. For example, you use static partitioning with an <code class="ph codeph">ALTER TABLE</code> statement that affects + only one partition, or with an <code class="ph codeph">INSERT</code> statement that inserts all values into the same partition: + </p> + +<pre class="pre codeblock"><code>insert into t1 <strong class="ph b">partition(x=10, y='a')</strong> select c1 from some_other_table; +</code></pre> + + <p class="p"> + When you specify some partition key columns in an <code class="ph codeph">INSERT</code> statement, but leave out the values, Impala determines + which partition to insert. This technique is called <dfn class="term">dynamic partitioning</dfn>: + </p> + +<pre class="pre codeblock"><code>insert into t1 <strong class="ph b">partition(x, y='b')</strong> select c1, c2 from some_other_table; +-- Create new partition if necessary based on variable year, month, and day; insert a single value. +insert into weather <strong class="ph b">partition (year, month, day)</strong> select 'cloudy',2014,4,21; +-- Create new partition if necessary for specified year and month but variable day; insert a single value. +insert into weather <strong class="ph b">partition (year=2014, month=04, day)</strong> select 'sunny',22; +</code></pre> + + <p class="p"> + The more key columns you specify in the <code class="ph codeph">PARTITION</code> clause, the fewer columns you need in the <code class="ph codeph">SELECT</code> + list. The trailing columns in the <code class="ph codeph">SELECT</code> list are substituted in order for the partition key columns with no + specified value. + </p> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="partitioning__partition_refresh"> + + <h2 class="title topictitle2" id="ariaid-title5">Refreshing a Single Partition</h2> + + <div class="body conbody"> + + <p class="p"> + The <code class="ph codeph">REFRESH</code> statement is typically used with partitioned tables when new data files are loaded into a partition by + some non-Impala mechanism, such as a Hive or Spark job. The <code class="ph codeph">REFRESH</code> statement makes Impala aware of the new data + files so that they can be used in Impala queries. Because partitioned tables typically contain a high volume of data, the + <code class="ph codeph">REFRESH</code> operation for a full partitioned table can take significant time. + </p> + + <p class="p"> + In <span class="keyword">Impala 2.7</span> and higher, you can include a <code class="ph codeph">PARTITION (<var class="keyword varname">partition_spec</var>)</code> clause in the + <code class="ph codeph">REFRESH</code> statement so that only a single partition is refreshed. For example, <code class="ph codeph">REFRESH big_table PARTITION + (year=2017, month=9, day=30)</code>. The partition spec must include all the partition key columns. See + <a class="xref" href="impala_refresh.html#refresh">REFRESH Statement</a> for more details and examples of <code class="ph codeph">REFRESH</code> syntax and usage. + </p> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="partitioning__partition_permissions"> + + <h2 class="title topictitle2" id="ariaid-title6">Permissions for Partition Subdirectories</h2> + + <div class="body conbody"> + + <p class="p"> + By default, if an <code class="ph codeph">INSERT</code> statement creates any new subdirectories underneath a partitioned + table, those subdirectories are assigned default HDFS permissions for the <code class="ph codeph">impala</code> user. To + make each subdirectory have the same permissions as its parent directory in HDFS, specify the + <code class="ph codeph">--insert_inherit_permissions</code> startup option for the <span class="keyword cmdname">impalad</span> daemon. + </p> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title7" id="partitioning__partition_pruning"> + + <h2 class="title topictitle2" id="ariaid-title7">Partition Pruning for Queries</h2> + + <div class="body conbody"> + + <p class="p"> + Partition pruning refers to the mechanism where a query can skip reading the data files corresponding to one or more partitions. If + you can arrange for queries to prune large numbers of unnecessary partitions from the query execution plan, the queries use fewer + resources and are thus proportionally faster and more scalable. + </p> + + <p class="p"> + For example, if a table is partitioned by columns <code class="ph codeph">YEAR</code>, <code class="ph codeph">MONTH</code>, and <code class="ph codeph">DAY</code>, then + <code class="ph codeph">WHERE</code> clauses such as <code class="ph codeph">WHERE year = 2013</code>, <code class="ph codeph">WHERE year < 2010</code>, or <code class="ph codeph">WHERE + year BETWEEN 1995 AND 1998</code> allow Impala to skip the data files in all partitions outside the specified range. Likewise, + <code class="ph codeph">WHERE year = 2013 AND month BETWEEN 1 AND 3</code> could prune even more partitions, reading the data files for only a + portion of one year. + </p> + + <p class="p toc inpage"></p> + + </div> + + <article class="topic concept nested2" aria-labelledby="ariaid-title8" id="partition_pruning__partition_pruning_checking"> + + <h3 class="title topictitle3" id="ariaid-title8">Checking if Partition Pruning Happens for a Query</h3> + + <div class="body conbody"> + + <p class="p"> + To check the effectiveness of partition pruning for a query, check the <code class="ph codeph">EXPLAIN</code> output for the query before + running it. For example, this example shows a table with 3 partitions, where the query only reads 1 of them. The notation + <code class="ph codeph">#partitions=1/3</code> in the <code class="ph codeph">EXPLAIN</code> plan confirms that Impala can do the appropriate partition + pruning. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > insert into census partition (year=2010) values ('Smith'),('Jones'); +[localhost:21000] > insert into census partition (year=2011) values ('Smith'),('Jones'),('Doe'); +[localhost:21000] > insert into census partition (year=2012) values ('Smith'),('Doe'); +[localhost:21000] > select name from census where year=2010; ++-------+ +| name | ++-------+ +| Smith | +| Jones | ++-------+ +[localhost:21000] > explain select name from census <strong class="ph b">where year=2010</strong>; ++------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| PARTITION: UNPARTITIONED | +| | +| 1:EXCHANGE | +| | +| PLAN FRAGMENT 1 | +| PARTITION: RANDOM | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 1 | +| UNPARTITIONED | +| | +| 0:SCAN HDFS | +| table=predicate_propagation.census <strong class="ph b">#partitions=1/3</strong> size=12B | ++------------------------------------------------------------------+</code></pre> + + <p class="p"> + For a report of the volume of data that was actually read and processed at each stage of the query, check the output of the + <code class="ph codeph">SUMMARY</code> command immediately after running the query. For a more detailed analysis, look at the output of the + <code class="ph codeph">PROFILE</code> command; it includes this same summary report near the start of the profile output. + </p> + + </div> + + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title9" id="partition_pruning__partition_pruning_sql"> + + <h3 class="title topictitle3" id="ariaid-title9">What SQL Constructs Work with Partition Pruning</h3> + + <div class="body conbody"> + + <p class="p"> + + Impala can even do partition pruning in cases where the partition key column is not directly compared to a constant, by applying + the transitive property to other parts of the <code class="ph codeph">WHERE</code> clause. This technique is known as predicate propagation, and + is available in Impala 1.2.2 and later. In this example, the census table includes another column indicating when the data was + collected, which happens in 10-year intervals. Even though the query does not compare the partition key column + (<code class="ph codeph">YEAR</code>) to a constant value, Impala can deduce that only the partition <code class="ph codeph">YEAR=2010</code> is required, and + again only reads 1 out of 3 partitions. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > drop table census; +[localhost:21000] > create table census (name string, census_year int) partitioned by (year int); +[localhost:21000] > insert into census partition (year=2010) values ('Smith',2010),('Jones',2010); +[localhost:21000] > insert into census partition (year=2011) values ('Smith',2020),('Jones',2020),('Doe',2020); +[localhost:21000] > insert into census partition (year=2012) values ('Smith',2020),('Doe',2020); +[localhost:21000] > select name from census where year = census_year and census_year=2010; ++-------+ +| name | ++-------+ +| Smith | +| Jones | ++-------+ +[localhost:21000] > explain select name from census <strong class="ph b">where year = census_year and census_year=2010</strong>; ++------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| PARTITION: UNPARTITIONED | +| | +| 1:EXCHANGE | +| | +| PLAN FRAGMENT 1 | +| PARTITION: RANDOM | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 1 | +| UNPARTITIONED | +| | +| 0:SCAN HDFS | +| table=predicate_propagation.census <strong class="ph b">#partitions=1/3</strong> size=22B | +| predicates: census_year = 2010, year = census_year | ++------------------------------------------------------------------+ +</code></pre> + + <p class="p"> + If a view applies to a partitioned table, any partition pruning considers the clauses on both + the original query and any additional <code class="ph codeph">WHERE</code> predicates in the query that refers to the view. + Prior to Impala 1.4, only the <code class="ph codeph">WHERE</code> clauses on the original query from the + <code class="ph codeph">CREATE VIEW</code> statement were used for partition pruning. + </p> + + <p class="p"> + In queries involving both analytic functions and partitioned tables, partition pruning only occurs for columns named in the <code class="ph codeph">PARTITION BY</code> + clause of the analytic function call. For example, if an analytic function query has a clause such as <code class="ph codeph">WHERE year=2016</code>, + the way to make the query prune all other <code class="ph codeph">YEAR</code> partitions is to include <code class="ph codeph">PARTITION BY year</code>in the analytic function call; + for example, <code class="ph codeph">OVER (PARTITION BY year,<var class="keyword varname">other_columns</var> <var class="keyword varname">other_analytic_clauses</var>)</code>. + + </p> + + </div> + + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title10" id="partition_pruning__dynamic_partition_pruning"> + + <h3 class="title topictitle3" id="ariaid-title10">Dynamic Partition Pruning</h3> + + <div class="body conbody"> + + <p class="p"> + The original mechanism uses to prune partitions is <dfn class="term">static partition pruning</dfn>, in which the conditions in the + <code class="ph codeph">WHERE</code> clause are analyzed to determine in advance which partitions can be safely skipped. In <span class="keyword">Impala 2.5</span> + and higher, Impala can perform <dfn class="term">dynamic partition pruning</dfn>, where information about the partitions is collected during + the query, and Impala prunes unnecessary partitions in ways that were impractical to predict in advance. + </p> + + <p class="p"> + For example, if partition key columns are compared to literal values in a <code class="ph codeph">WHERE</code> clause, Impala can perform static + partition pruning during the planning phase to only read the relevant partitions: + </p> + +<pre class="pre codeblock"><code> +-- The query only needs to read 3 partitions whose key values are known ahead of time. +-- That's static partition pruning. +SELECT COUNT(*) FROM sales_table WHERE year IN (2005, 2010, 2015); +</code></pre> + + <p class="p"> + Dynamic partition pruning involves using information only available at run time, such as the result of a subquery: + </p> + +<pre class="pre codeblock"><code> +create table yy (s string) partitioned by (year int) stored as parquet; +insert into yy partition (year) values ('1999', 1999), ('2000', 2000), + ('2001', 2001), ('2010',2010); +compute stats yy; + +create table yy2 (s string) partitioned by (year int) stored as parquet; +insert into yy2 partition (year) values ('1999', 1999), ('2000', 2000), + ('2001', 2001); +compute stats yy2; + +-- The query reads an unknown number of partitions, whose key values are only +-- known at run time. The 'runtime filters' lines show how the information about +-- the partitions is calculated in query fragment 02, and then used in query +-- fragment 00 to decide which partitions to skip. +explain select s from yy2 where year in (select year from yy where year between 2000 and 2005); ++----------------------------------------------------------+ +| Explain String | ++----------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=16.00MB VCores=2 | +| | +| 04:EXCHANGE [UNPARTITIONED] | +| | | +| 02:HASH JOIN [LEFT SEMI JOIN, BROADCAST] | +| | hash predicates: year = year | +| | <strong class="ph b">runtime filters: RF000 <- year</strong> | +| | | +| |--03:EXCHANGE [BROADCAST] | +| | | | +| | 01:SCAN HDFS [dpp.yy] | +| | partitions=2/4 files=2 size=468B | +| | | +| 00:SCAN HDFS [dpp.yy2] | +| partitions=2/3 files=2 size=468B | +| <strong class="ph b">runtime filters: RF000 -> year</strong> | ++----------------------------------------------------------+ +</code></pre> + + + + <p class="p"> + In this case, Impala evaluates the subquery, sends the subquery results to all Impala nodes participating in the query, and then + each <span class="keyword cmdname">impalad</span> daemon uses the dynamic partition pruning optimization to read only the partitions with the + relevant key values. + </p> + + <p class="p"> + Dynamic partition pruning is especially effective for queries involving joins of several large partitioned tables. Evaluating the + <code class="ph codeph">ON</code> clauses of the join predicates might normally require reading data from all partitions of certain tables. If + the <code class="ph codeph">WHERE</code> clauses of the query refer to the partition key columns, Impala can now often skip reading many of the + partitions while evaluating the <code class="ph codeph">ON</code> clauses. The dynamic partition pruning optimization reduces the amount of I/O + and the amount of intermediate data stored and transmitted across the network during the query. + </p> + + <p class="p"> + When the spill-to-disk feature is activated for a join node within a query, Impala does not + produce any runtime filters for that join operation on that host. Other join nodes within + the query are not affected. + </p> + + <p class="p"> + Dynamic partition pruning is part of the runtime filtering feature, which applies to other kinds of queries in addition to queries + against partitioned tables. See <a class="xref" href="impala_runtime_filtering.html#runtime_filtering">Runtime Filtering for Impala Queries (Impala 2.5 or higher only)</a> for full details about this feature. + </p> + + </div> + + </article> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title11" id="partitioning__partition_key_columns"> + + <h2 class="title topictitle2" id="ariaid-title11">Partition Key Columns</h2> + + <div class="body conbody"> + + <p class="p"> + The columns you choose as the partition keys should be ones that are frequently used to filter query results in important, + large-scale queries. Popular examples are some combination of year, month, and day when the data has associated time values, and + geographic region when the data is associated with some place. + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + For time-based data, split out the separate parts into their own columns, because Impala cannot partition based on a + <code class="ph codeph">TIMESTAMP</code> column. + </p> + </li> + + <li class="li"> + <p class="p"> + The data type of the partition columns does not have a significant effect on the storage required, because the values from those + columns are not stored in the data files, rather they are represented as strings inside HDFS directory names. + </p> + </li> + + <li class="li"> + <p class="p"> + In <span class="keyword">Impala 2.5</span> and higher, you can enable the <code class="ph codeph">OPTIMIZE_PARTITION_KEY_SCANS</code> query option to speed up + queries that only refer to partition key columns, such as <code class="ph codeph">SELECT MAX(year)</code>. This setting is not enabled by + default because the query behavior is slightly different if the table contains partition directories without actual data inside. + See <a class="xref" href="impala_optimize_partition_key_scans.html#optimize_partition_key_scans">OPTIMIZE_PARTITION_KEY_SCANS Query Option (Impala 2.5 or higher only)</a> for details. + </p> + </li> + + <li class="li"> + <p class="p"> + Partitioned tables can contain complex type columns. + All the partition key columns must be scalar types. + </p> + </li> + + <li class="li"> + <p class="p"> + Remember that when Impala queries data stored in HDFS, it is most efficient to use multi-megabyte files to take advantage of the + HDFS block size. For Parquet tables, the block size (and ideal size of the data files) is <span class="ph">256 MB in + Impala 2.0 and later</span>. Therefore, avoid specifying too many partition key columns, which could result in individual + partitions containing only small amounts of data. For example, if you receive 1 GB of data per day, you might partition by year, + month, and day; while if you receive 5 GB of data per minute, you might partition by year, month, day, hour, and minute. If you + have data with a geographic component, you might partition based on postal code if you have many megabytes of data for each + postal code, but if not, you might partition by some larger region such as city, state, or country. state + </p> + </li> + </ul> + + <p class="p"> + If you frequently run aggregate functions such as <code class="ph codeph">MIN()</code>, <code class="ph codeph">MAX()</code>, and + <code class="ph codeph">COUNT(DISTINCT)</code> on partition key columns, consider enabling the <code class="ph codeph">OPTIMIZE_PARTITION_KEY_SCANS</code> + query option, which optimizes such queries. This feature is available in <span class="keyword">Impala 2.5</span> and higher. + See <a class="xref" href="../shared/../topics/impala_optimize_partition_key_scans.html">OPTIMIZE_PARTITION_KEY_SCANS Query Option (Impala 2.5 or higher only)</a> + for the kinds of queries that this option applies to, and slight differences in how partitions are + evaluated when this query option is enabled. + </p> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title12" id="partitioning__mixed_format_partitions"> + + <h2 class="title topictitle2" id="ariaid-title12">Setting Different File Formats for Partitions</h2> + + <div class="body conbody"> + + <p class="p"> + Partitioned tables have the flexibility to use different file formats for different partitions. (For background information about + the different file formats Impala supports, see <a class="xref" href="impala_file_formats.html#file_formats">How Impala Works with Hadoop File Formats</a>.) For example, if you originally + received data in text format, then received new data in RCFile format, and eventually began receiving data in Parquet format, all + that data could reside in the same table for queries. You just need to ensure that the table is structured so that the data files + that use different file formats reside in separate partitions. + </p> + + <p class="p"> + For example, here is how you might switch from text to Parquet data as you receive data for different years: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create table census (name string) partitioned by (year smallint); +[localhost:21000] > alter table census add partition (year=2012); -- Text format; + +[localhost:21000] > alter table census add partition (year=2013); -- Text format switches to Parquet before data loaded; +[localhost:21000] > alter table census partition (year=2013) set fileformat parquet; + +[localhost:21000] > insert into census partition (year=2012) values ('Smith'),('Jones'),('Lee'),('Singh'); +[localhost:21000] > insert into census partition (year=2013) values ('Flores'),('Bogomolov'),('Cooper'),('Appiah');</code></pre> + + <p class="p"> + At this point, the HDFS directory for <code class="ph codeph">year=2012</code> contains a text-format data file, while the HDFS directory for + <code class="ph codeph">year=2013</code> contains a Parquet data file. As always, when loading non-trivial data, you would use <code class="ph codeph">INSERT ... + SELECT</code> or <code class="ph codeph">LOAD DATA</code> to import data in large batches, rather than <code class="ph codeph">INSERT ... VALUES</code> which + produces small files that are inefficient for real-world queries. + </p> + + <p class="p"> + For other file types that Impala cannot create natively, you can switch into Hive and issue the <code class="ph codeph">ALTER TABLE ... SET + FILEFORMAT</code> statements and <code class="ph codeph">INSERT</code> or <code class="ph codeph">LOAD DATA</code> statements there. After switching back to + Impala, issue a <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> statement so that Impala recognizes any partitions or new + data added through Hive. + </p> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title13" id="partitioning__partition_management"> + + <h2 class="title topictitle2" id="ariaid-title13">Managing Partitions</h2> + + <div class="body conbody"> + + <p class="p"> + You can add, drop, set the expected file format, or set the HDFS location of the data files for individual partitions within an + Impala table. See <a class="xref" href="impala_alter_table.html#alter_table">ALTER TABLE Statement</a> for syntax details, and + <a class="xref" href="impala_partitioning.html#mixed_format_partitions">Setting Different File Formats for Partitions</a> for tips on managing tables containing partitions with different file + formats. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + If you are creating a partition for the first time and specifying its location, for maximum efficiency, use + a single <code class="ph codeph">ALTER TABLE</code> statement including both the <code class="ph codeph">ADD PARTITION</code> and + <code class="ph codeph">LOCATION</code> clauses, rather than separate statements with <code class="ph codeph">ADD PARTITION</code> and + <code class="ph codeph">SET LOCATION</code> clauses. + </div> + + <p class="p"> + What happens to the data files when a partition is dropped depends on whether the partitioned table is designated as internal or + external. For an internal (managed) table, the data files are deleted. For example, if data in the partitioned table is a copy of + raw data files stored elsewhere, you might save disk space by dropping older partitions that are no longer required for reporting, + knowing that the original data is still available if needed later. For an external table, the data files are left alone. For + example, dropping a partition without deleting the associated files lets Impala consider a smaller set of partitions, improving + query efficiency and reducing overhead for DDL operations on the table; if the data is needed again later, you can add the partition + again. See <a class="xref" href="impala_tables.html#tables">Overview of Impala Tables</a> for details and examples. + </p> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title14" id="partitioning__partition_kudu"> + + <h2 class="title topictitle2" id="ariaid-title14">Using Partitioning with Kudu Tables</h2> + + + + <div class="body conbody"> + + <p class="p"> + Kudu tables use a more fine-grained partitioning scheme than tables containing HDFS data files. You specify a <code class="ph codeph">PARTITION + BY</code> clause with the <code class="ph codeph">CREATE TABLE</code> statement to identify how to divide the values from the partition key + columns. + </p> + + <p class="p"> + See <a class="xref" href="impala_kudu.html#kudu_partitioning">Partitioning for Kudu Tables</a> for + details and examples of the partitioning techniques + for Kudu tables. + </p> + + </div> + + </article> + +</article></main></body></html> \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_perf_benchmarking.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_perf_benchmarking.html b/docs/build/html/topics/impala_perf_benchmarking.html new file mode 100644 index 0000000..ce9d995 --- /dev/null +++ b/docs/build/html/topics/impala_perf_benchmarking.html @@ -0,0 +1,27 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_performance.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="perf_benchmarks"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Benchmarking Impala Queries</title></head><body id="perf_benchmarks"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Benchmarking Impala Queries</h1> + + + + <div class="body conbody"> + + <p class="p"> + Because Impala, like other Hadoop components, is designed to handle large data volumes in a distributed + environment, conduct any performance tests using realistic data and cluster configurations. Use a multi-node + cluster rather than a single node; run queries against tables containing terabytes of data rather than tens + of gigabytes. The parallel processing techniques used by Impala are most appropriate for workloads that are + beyond the capacity of a single server. + </p> + + <p class="p"> + When you run queries returning large numbers of rows, the CPU time to pretty-print the output can be + substantial, giving an inaccurate measurement of the actual query time. Consider using the + <code class="ph codeph">-B</code> option on the <code class="ph codeph">impala-shell</code> command to turn off the pretty-printing, and + optionally the <code class="ph codeph">-o</code> option to store query results in a file rather than printing to the + screen. See <a class="xref" href="impala_shell_options.html#shell_options">impala-shell Configuration Options</a> for details. + </p> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_performance.html">Tuning Impala for Performance</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_perf_cookbook.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_perf_cookbook.html b/docs/build/html/topics/impala_perf_cookbook.html new file mode 100644 index 0000000..fc68b45 --- /dev/null +++ b/docs/build/html/topics/impala_perf_cookbook.html @@ -0,0 +1,256 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_performance.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="perf_cookbook"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Impala Performance Guidelines and Best Practices</title></head><body id="perf_cookbook"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Impala Performance Guidelines and Best Practices</h1> + + + + <div class="body conbody"> + + <p class="p"> + Here are performance guidelines and best practices that you can use during planning, experimentation, and + performance tuning for an Impala-enabled <span class="keyword"></span> cluster. All of this information is also available in more + detail elsewhere in the Impala documentation; it is gathered together here to serve as a cookbook and + emphasize which performance techniques typically provide the highest return on investment + </p> + + <p class="p toc inpage"></p> + + <section class="section" id="perf_cookbook__perf_cookbook_file_format"><h2 class="title sectiontitle">Choose the appropriate file format for the data.</h2> + + + + <p class="p"> + Typically, for large volumes of data (multiple gigabytes per table or partition), the Parquet file format + performs best because of its combination of columnar storage layout, large I/O request size, and + compression and encoding. See <a class="xref" href="impala_file_formats.html#file_formats">How Impala Works with Hadoop File Formats</a> for comparisons of all + file formats supported by Impala, and <a class="xref" href="impala_parquet.html#parquet">Using the Parquet File Format with Impala Tables</a> for details about the + Parquet file format. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + For smaller volumes of data, a few gigabytes or less for each table or partition, you might not see + significant performance differences between file formats. At small data volumes, reduced I/O from an + efficient compressed file format can be counterbalanced by reduced opportunity for parallel execution. When + planning for a production deployment or conducting benchmarks, always use realistic data volumes to get a + true picture of performance and scalability. + </div> + </section> + + <section class="section" id="perf_cookbook__perf_cookbook_small_files"><h2 class="title sectiontitle">Avoid data ingestion processes that produce many small files.</h2> + + + + <p class="p"> + When producing data files outside of Impala, prefer either text format or Avro, where you can build up the + files row by row. Once the data is in Impala, you can convert it to the more efficient Parquet format and + split into multiple data files using a single <code class="ph codeph">INSERT ... SELECT</code> statement. Or, if you have + the infrastructure to produce multi-megabyte Parquet files as part of your data preparation process, do + that and skip the conversion step inside Impala. + </p> + + <p class="p"> + Always use <code class="ph codeph">INSERT ... SELECT</code> to copy significant volumes of data from table to table + within Impala. Avoid <code class="ph codeph">INSERT ... VALUES</code> for any substantial volume of data or + performance-critical tables, because each such statement produces a separate tiny data file. See + <a class="xref" href="impala_insert.html#insert">INSERT Statement</a> for examples of the <code class="ph codeph">INSERT ... SELECT</code> syntax. + </p> + + <p class="p"> + For example, if you have thousands of partitions in a Parquet table, each with less than + <span class="ph">256 MB</span> of data, consider partitioning in a less granular way, such as by + year / month rather than year / month / day. If an inefficient data ingestion process produces thousands of + data files in the same table or partition, consider compacting the data by performing an <code class="ph codeph">INSERT ... + SELECT</code> to copy all the data to a different table; the data will be reorganized into a smaller + number of larger files by this process. + </p> + </section> + + <section class="section" id="perf_cookbook__perf_cookbook_partitioning"><h2 class="title sectiontitle">Choose partitioning granularity based on actual data volume.</h2> + + + + <p class="p"> + Partitioning is a technique that physically divides the data based on values of one or more columns, such + as by year, month, day, region, city, section of a web site, and so on. When you issue queries that request + a specific value or range of values for the partition key columns, Impala can avoid reading the irrelevant + data, potentially yielding a huge savings in disk I/O. + </p> + + <p class="p"> + When deciding which column(s) to use for partitioning, choose the right level of granularity. For example, + should you partition by year, month, and day, or only by year and month? Choose a partitioning strategy + that puts at least <span class="ph">256 MB</span> of data in each partition, to take advantage of + HDFS bulk I/O and Impala distributed queries. + </p> + + <p class="p"> + Over-partitioning can also cause query planning to take longer than necessary, as Impala prunes the + unnecessary partitions. Ideally, keep the number of partitions in the table under 30 thousand. + </p> + + <p class="p"> + When preparing data files to go in a partition directory, create several large files rather than many small + ones. If you receive data in the form of many small files and have no control over the input format, + consider using the <code class="ph codeph">INSERT ... SELECT</code> syntax to copy data from one table or partition to + another, which compacts the files into a relatively small number (based on the number of nodes in the + cluster). + </p> + + <p class="p"> + If you need to reduce the overall number of partitions and increase the amount of data in each partition, + first look for partition key columns that are rarely referenced or are referenced in non-critical queries + (not subject to an SLA). For example, your web site log data might be partitioned by year, month, day, and + hour, but if most queries roll up the results by day, perhaps you only need to partition by year, month, + and day. + </p> + + <p class="p"> + If you need to reduce the granularity even more, consider creating <span class="q">"buckets"</span>, computed values + corresponding to different sets of partition key values. For example, you can use the + <code class="ph codeph">TRUNC()</code> function with a <code class="ph codeph">TIMESTAMP</code> column to group date and time values + based on intervals such as week or quarter. See + <a class="xref" href="impala_datetime_functions.html#datetime_functions">Impala Date and Time Functions</a> for details. + </p> + + <p class="p"> + See <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a> for full details and performance considerations for + partitioning. + </p> + </section> + + <section class="section" id="perf_cookbook__perf_cookbook_partition_keys"><h2 class="title sectiontitle">Use smallest appropriate integer types for partition key columns.</h2> + + + + <p class="p"> + Although it is tempting to use strings for partition key columns, since those values are turned into HDFS + directory names anyway, you can minimize memory usage by using numeric values for common partition key + fields such as <code class="ph codeph">YEAR</code>, <code class="ph codeph">MONTH</code>, and <code class="ph codeph">DAY</code>. Use the smallest + integer type that holds the appropriate range of values, typically <code class="ph codeph">TINYINT</code> for + <code class="ph codeph">MONTH</code> and <code class="ph codeph">DAY</code>, and <code class="ph codeph">SMALLINT</code> for <code class="ph codeph">YEAR</code>. + Use the <code class="ph codeph">EXTRACT()</code> function to pull out individual date and time fields from a + <code class="ph codeph">TIMESTAMP</code> value, and <code class="ph codeph">CAST()</code> the return value to the appropriate integer + type. + </p> + </section> + + <section class="section" id="perf_cookbook__perf_cookbook_parquet_block_size"><h2 class="title sectiontitle">Choose an appropriate Parquet block size.</h2> + + + + <p class="p"> + By default, the Impala <code class="ph codeph">INSERT ... SELECT</code> statement creates Parquet files with a 256 MB + block size. (This default was changed in Impala 2.0. Formerly, the limit was 1 GB, but Impala made + conservative estimates about compression, resulting in files that were smaller than 1 GB.) + </p> + + <p class="p"> + Each Parquet file written by Impala is a single block, allowing the whole file to be processed as a unit by a single host. + As you copy Parquet files into HDFS or between HDFS filesystems, use <code class="ph codeph">hdfs dfs -pb</code> to preserve the original + block size. + </p> + + <p class="p"> + If there is only one or a few data block in your Parquet table, or in a partition that is the only one + accessed by a query, then you might experience a slowdown for a different reason: not enough data to take + advantage of Impala's parallel distributed queries. Each data block is processed by a single core on one of + the DataNodes. In a 100-node cluster of 16-core machines, you could potentially process thousands of data + files simultaneously. You want to find a sweet spot between <span class="q">"many tiny files"</span> and <span class="q">"single giant + file"</span> that balances bulk I/O and parallel processing. You can set the <code class="ph codeph">PARQUET_FILE_SIZE</code> + query option before doing an <code class="ph codeph">INSERT ... SELECT</code> statement to reduce the size of each + generated Parquet file. <span class="ph">(Specify the file size as an absolute number of bytes, or in Impala + 2.0 and later, in units ending with <code class="ph codeph">m</code> for megabytes or <code class="ph codeph">g</code> for + gigabytes.)</span> Run benchmarks with different file sizes to find the right balance point for your + particular data volume. + </p> + </section> + + <section class="section" id="perf_cookbook__perf_cookbook_stats"><h2 class="title sectiontitle">Gather statistics for all tables used in performance-critical or high-volume join queries.</h2> + + + + <p class="p"> + Gather the statistics with the <code class="ph codeph">COMPUTE STATS</code> statement. See + <a class="xref" href="impala_perf_joins.html#perf_joins">Performance Considerations for Join Queries</a> for details. + </p> + </section> + + <section class="section" id="perf_cookbook__perf_cookbook_network"><h2 class="title sectiontitle">Minimize the overhead of transmitting results back to the client.</h2> + + + + <p class="p"> + Use techniques such as: + </p> + + <ul class="ul"> + <li class="li"> + Aggregation. If you need to know how many rows match a condition, the total values of matching values + from some column, the lowest or highest matching value, and so on, call aggregate functions such as + <code class="ph codeph">COUNT()</code>, <code class="ph codeph">SUM()</code>, and <code class="ph codeph">MAX()</code> in the query rather than + sending the result set to an application and doing those computations there. Remember that the size of an + unaggregated result set could be huge, requiring substantial time to transmit across the network. + </li> + + <li class="li"> + Filtering. Use all applicable tests in the <code class="ph codeph">WHERE</code> clause of a query to eliminate rows + that are not relevant, rather than producing a big result set and filtering it using application logic. + </li> + + <li class="li"> + <code class="ph codeph">LIMIT</code> clause. If you only need to see a few sample values from a result set, or the top + or bottom values from a query using <code class="ph codeph">ORDER BY</code>, include the <code class="ph codeph">LIMIT</code> clause + to reduce the size of the result set rather than asking for the full result set and then throwing most of + the rows away. + </li> + + <li class="li"> + Avoid overhead from pretty-printing the result set and displaying it on the screen. When you retrieve the + results through <span class="keyword cmdname">impala-shell</span>, use <span class="keyword cmdname">impala-shell</span> options such as + <code class="ph codeph">-B</code> and <code class="ph codeph">--output_delimiter</code> to produce results without special + formatting, and redirect output to a file rather than printing to the screen. Consider using + <code class="ph codeph">INSERT ... SELECT</code> to write the results directly to new files in HDFS. See + <a class="xref" href="impala_shell_options.html#shell_options">impala-shell Configuration Options</a> for details about the + <span class="keyword cmdname">impala-shell</span> command-line options. + </li> + </ul> + </section> + + <section class="section" id="perf_cookbook__perf_cookbook_explain"><h2 class="title sectiontitle">Verify that your queries are planned in an efficient logical manner.</h2> + + + + <p class="p"> + Examine the <code class="ph codeph">EXPLAIN</code> plan for a query before actually running it. See + <a class="xref" href="impala_explain.html#explain">EXPLAIN Statement</a> and <a class="xref" href="impala_explain_plan.html#perf_explain">Using the EXPLAIN Plan for Performance Tuning</a> for + details. + </p> + </section> + + <section class="section" id="perf_cookbook__perf_cookbook_profile"><h2 class="title sectiontitle">Verify performance characteristics of queries.</h2> + + + + <p class="p"> + Verify that the low-level aspects of I/O, memory usage, network bandwidth, CPU utilization, and so on are + within expected ranges by examining the query profile for a query after running it. See + <a class="xref" href="impala_explain_plan.html#perf_profile">Using the Query Profile for Performance Tuning</a> for details. + </p> + </section> + + <section class="section" id="perf_cookbook__perf_cookbook_os"><h2 class="title sectiontitle">Use appropriate operating system settings.</h2> + + + + <p class="p"> + See <span class="xref">the documentation for your Apache Hadoop distribution</span> for recommendations about operating system + settings that you can change to influence Impala performance. In particular, you might find + that changing the <code class="ph codeph">vm.swappiness</code> Linux kernel setting to a non-zero value improves + overall performance. + </p> + </section> + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_performance.html">Tuning Impala for Performance</a></div></div></nav></article></main></body></html> \ No newline at end of file
