http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_perf_hdfs_caching.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_perf_hdfs_caching.html b/docs/build/html/topics/impala_perf_hdfs_caching.html new file mode 100644 index 0000000..9de003e --- /dev/null +++ b/docs/build/html/topics/impala_perf_hdfs_caching.html @@ -0,0 +1,578 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_performance.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="hdfs_caching"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using HDFS Caching with Impala (Impala 2.1 or higher only)</title></head><body id="hdfs_caching"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Using HDFS Caching with Impala (<span class="keyword">Impala 2.1</span> or higher only)</h1> + + + + <div class="body conbody"> + + <p class="p"> + HDFS caching provides performance and scalability benefits in production environments where Impala queries + and other Hadoop jobs operate on quantities of data much larger than the physical RAM on the DataNodes, + making it impractical to rely on the Linux OS cache, which only keeps the most recently used data in memory. + Data read from the HDFS cache avoids the overhead of checksumming and memory-to-memory copying involved when + using data from the Linux OS cache. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + <p class="p"> + On a small or lightly loaded cluster, HDFS caching might not produce any speedup. It might even lead to + slower queries, if I/O read operations that were performed in parallel across the entire cluster are replaced by in-memory + operations operating on a smaller number of hosts. The hosts where the HDFS blocks are cached can become + bottlenecks because they experience high CPU load while processing the cached data blocks, while other hosts remain idle. + Therefore, always compare performance with and without this feature enabled, using a realistic workload. + </p> + <p class="p"> + In <span class="keyword">Impala 2.2</span> and higher, you can spread the CPU load more evenly by specifying the <code class="ph codeph">WITH REPLICATION</code> + clause of the <code class="ph codeph">CREATE TABLE</code> and <code class="ph codeph">ALTER TABLE</code> statements. + This clause lets you control the replication factor for + HDFS caching for a specific table or partition. By default, each cached block is + only present on a single host, which can lead to CPU contention if the same host + processes each cached block. Increasing the replication factor lets Impala choose + different hosts to process different cached blocks, to better distribute the CPU load. 
+ Always use a <code class="ph codeph">WITH REPLICATION</code> setting of at least 3, and adjust upward + if necessary to match the replication factor for the underlying HDFS data files. + </p> + <p class="p"> + In <span class="keyword">Impala 2.5</span> and higher, Impala automatically randomizes which host processes + a cached HDFS block, to avoid CPU hotspots. For tables where HDFS caching is not applied, + Impala designates which host to process a data block using an algorithm that estimates + the load on each host. If CPU hotspots still arise during queries, + you can enable additional randomization for the scheduling algorithm for non-HDFS cached data + by setting the <code class="ph codeph">SCHEDULE_RANDOM_REPLICA</code> query option. + </p> + </div> + + <p class="p toc inpage"></p> + + + + <p class="p"> + For background information about how to set up and manage HDFS caching for a <span class="keyword"></span> cluster, see + <span class="xref">the documentation for your Apache Hadoop distribution</span>. + </p> + </div> + + <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_performance.html">Tuning Impala for Performance</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="hdfs_caching__hdfs_caching_overview"> + + <h2 class="title topictitle2" id="ariaid-title2">Overview of HDFS Caching for Impala</h2> + + + <div class="body conbody"> + + <p class="p"> + In <span class="keyword">Impala 1.4</span> and higher, Impala can use the HDFS caching feature to make more effective use of RAM, so that + repeated queries can take advantage of data <span class="q">"pinned"</span> in memory regardless of how much data is + processed overall. The HDFS caching feature lets you designate a subset of frequently accessed data to be + pinned permanently in memory, remaining in the cache across multiple queries and never being evicted. This + technique is suitable for tables or partitions that are frequently accessed and are small enough to fit + entirely within the HDFS memory cache. For example, you might designate several dimension tables to be + pinned in the cache, to speed up many different join queries that reference them. Or in a partitioned + table, you might pin a partition holding data from the most recent time period because that data will be + queried intensively; then when the next set of data arrives, you could unpin the previous partition and pin + the partition holding the new data. + </p> + + <p class="p"> + Because this Impala performance feature relies on HDFS infrastructure, it only applies to Impala tables + that use HDFS data files. HDFS caching for Impala does not apply to HBase tables, S3 tables, + Kudu tables, + or Isilon tables. + </p> + + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="hdfs_caching__hdfs_caching_prereqs"> + + <h2 class="title topictitle2" id="ariaid-title3">Setting Up HDFS Caching for Impala</h2> + + <div class="body conbody"> + + <p class="p"> + To use HDFS caching with Impala, first set up that feature for your <span class="keyword"></span> cluster: + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + Decide how much memory to devote to the HDFS cache on each host. Remember that the total memory available + for cached data is the sum of the cache sizes on all the hosts. 
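+ As a sketch only (the property name is the standard HDFS setting for per-DataNode cache capacity,
+ but the value shown is a hypothetical placeholder), the capacity is set through
+ <code class="ph codeph">dfs.datanode.max.locked.memory</code> in <code class="ph codeph">hdfs-site.xml</code>,
+ and the <code class="ph codeph">memlock</code> ulimit for the DataNode user must be at least as large:
+<pre class="pre codeblock"><code><property>
+  <name>dfs.datanode.max.locked.memory</name>
+  <value>4294967296</value>   <!-- 4 GB of HDFS cache per DataNode; adjust for your hardware -->
+</property>
+
+# The OS limit is reported in kilobytes; it must cover the value above.
+$ ulimit -l
+</code></pre>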
By default, any data block is only cached on one + host, although you can cache a block across multiple hosts by increasing the replication factor. + + </p> + </li> + + <li class="li"> + <div class="p"> + Issue <span class="keyword cmdname">hdfs cacheadmin</span> commands to set up one or more cache pools, owned by the same + user as the <span class="keyword cmdname">impalad</span> daemon (typically <code class="ph codeph">impala</code>). For example: +<pre class="pre codeblock"><code>hdfs cacheadmin -addPool four_gig_pool -owner impala -limit 4000000000 +</code></pre> + For details about the <span class="keyword cmdname">hdfs cacheadmin</span> command, see + <span class="xref">the documentation for your Apache Hadoop distribution</span>. + </div> + </li> + </ul> + + <p class="p"> + Once HDFS caching is enabled and one or more pools are available, see + <a class="xref" href="impala_perf_hdfs_caching.html#hdfs_caching_ddl">Enabling HDFS Caching for Impala Tables and Partitions</a> for how to choose which Impala data to load + into the HDFS cache. On the Impala side, you specify the cache pool name defined by the <code class="ph codeph">hdfs + cacheadmin</code> command in the Impala DDL statements that enable HDFS caching for a table or partition, + such as <code class="ph codeph">CREATE TABLE ... CACHED IN <var class="keyword varname">pool</var></code> or <code class="ph codeph">ALTER TABLE ... SET + CACHED IN <var class="keyword varname">pool</var></code>. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="hdfs_caching__hdfs_caching_ddl"> + + <h2 class="title topictitle2" id="ariaid-title4">Enabling HDFS Caching for Impala Tables and Partitions</h2> + + <div class="body conbody"> + + <p class="p"> + Begin by choosing which tables or partitions to cache. For example, these might be lookup tables that are + accessed by many different join queries, or partitions corresponding to the most recent time period that + are analyzed by different reports or ad hoc queries. + </p> + + <p class="p"> + In your SQL statements, you specify logical divisions such as tables and partitions to be cached. Impala + translates these requests into HDFS-level directives that apply to particular directories and files. For + example, given a partitioned table <code class="ph codeph">CENSUS</code> with a partition key column + <code class="ph codeph">YEAR</code>, you could choose to cache all or part of the data as follows: + </p> + + <p class="p"> + In <span class="keyword">Impala 2.2</span> and higher, the optional <code class="ph codeph">WITH REPLICATION</code> clause + for <code class="ph codeph">CREATE TABLE</code> and <code class="ph codeph">ALTER TABLE</code> lets you specify + a <dfn class="term">replication factor</dfn>, the number of hosts on which to cache the same data blocks. + When Impala processes a cached data block, where the cache replication factor is greater than 1, Impala randomly + selects a host that has a cached copy of that data block. This optimization avoids excessive CPU + usage on a single host when the same cached data block is processed multiple times. + Where practical, specify a value greater than or equal to the HDFS block replication factor. + </p> + +<pre class="pre codeblock"><code>-- Cache the entire table (all partitions). +alter table census set cached in '<var class="keyword varname">pool_name</var>'; + +-- Remove the entire table from the cache. 
+alter table census set uncached; + +-- Cache a portion of the table (a single partition). +-- If the table is partitioned by multiple columns (such as year, month, day), +-- the ALTER TABLE command must specify values for all those columns. +alter table census partition (year=1960) set cached in '<var class="keyword varname">pool_name</var>'; + +<span class="ph">-- Cache the data from one partition on up to 4 hosts, to minimize CPU load on any +-- single host when the same data block is processed multiple times. +alter table census partition (year=1970) + set cached in '<var class="keyword varname">pool_name</var>' with replication = 4;</span> + +-- At each stage, check the volume of cached data. +-- For large tables or partitions, the background loading might take some time, +-- so you might have to wait and reissue the statement until all the data +-- has finished being loaded into the cache. +show table stats census; ++-------+-------+--------+------+--------------+--------+ +| year | #Rows | #Files | Size | Bytes Cached | Format | ++-------+-------+--------+------+--------------+--------+ +| 1900 | -1 | 1 | 11B | NOT CACHED | TEXT | +| 1940 | -1 | 1 | 11B | NOT CACHED | TEXT | +| 1960 | -1 | 1 | 11B | 11B | TEXT | +| 1970 | -1 | 1 | 11B | NOT CACHED | TEXT | +| Total | -1 | 4 | 44B | 11B | | ++-------+-------+--------+------+--------------+--------+ +</code></pre> + + <p class="p"> + <strong class="ph b">CREATE TABLE considerations:</strong> + </p> + + <p class="p"> + The HDFS caching feature affects the Impala <code class="ph codeph">CREATE TABLE</code> statement as follows: + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + You can put a <code class="ph codeph">CACHED IN '<var class="keyword varname">pool_name</var>'</code> clause + <span class="ph">and optionally a <code class="ph codeph">WITH REPLICATION = <var class="keyword varname">number_of_hosts</var></code> clause</span> + at the end of a + <code class="ph codeph">CREATE TABLE</code> statement to automatically cache the entire contents of the table, + including any partitions added later. The <var class="keyword varname">pool_name</var> is a pool that you previously set + up with the <span class="keyword cmdname">hdfs cacheadmin</span> command. + </p> + </li> + + <li class="li"> + <p class="p"> + Once a table is designated for HDFS caching through the <code class="ph codeph">CREATE TABLE</code> statement, if new + partitions are added later through <code class="ph codeph">ALTER TABLE ... ADD PARTITION</code> statements, the data in + those new partitions is automatically cached in the same pool. + </p> + </li> + + <li class="li"> + <p class="p"> + If you want to perform repetitive queries on a subset of data from a large table, and it is not practical + to designate the entire table or specific partitions for HDFS caching, you can create a new cached table + with just a subset of the data by using <code class="ph codeph">CREATE TABLE ... CACHED IN '<var class="keyword varname">pool_name</var>' + AS SELECT ... WHERE ...</code>. When you are finished with generating reports from this subset of data, + drop the table and both the data files and the data cached in RAM are automatically deleted. + </p> + </li> + </ul> + + <p class="p"> + See <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a> for the full syntax. 
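+ For illustration, here is a sketch of the two forms described above, using hypothetical table names
+ and the <code class="ph codeph">four_gig_pool</code> pool from the earlier setup example:
+<pre class="pre codeblock"><code>-- Cache the entire table, including any partitions added later,
+-- with each cached block replicated on up to 3 hosts.
+create table lookup_dim (id bigint, name string)
+  cached in 'four_gig_pool' with replication = 3;
+
+-- Create and cache only a working subset of a large table.
+create table recent_sales_cached cached in 'four_gig_pool'
+  as select * from sales where trans_date >= '2016-01-01';
+
+-- Dropping the subset later removes its data files and cache directives.
+drop table recent_sales_cached;
+</code></pre>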
+ </p> + + <p class="p"> + <strong class="ph b">Other memory considerations:</strong> + </p> + + <p class="p"> + Certain DDL operations, such as <code class="ph codeph">ALTER TABLE ... SET LOCATION</code>, are blocked while the + underlying HDFS directories contain cached files. You must uncache the files first, before changing the + location, dropping the table, and so on. + </p> + + <p class="p"> + When data is requested to be pinned in memory, that process happens in the background without blocking + access to the data while the caching is in progress. Loading the data from disk could take some time. + Impala reads each HDFS data block from memory if it has been pinned already, or from disk if it has not + been pinned yet. When files are added to a table or partition whose contents are cached, Impala + automatically detects those changes and performs a <code class="ph codeph">REFRESH</code> automatically once the relevant + data is cached. + </p> + + <p class="p"> + The amount of data that you can pin on each node through the HDFS caching mechanism is subject to a quota + that is enforced by the underlying HDFS service. Before requesting to pin an Impala table or partition in + memory, check that its size does not exceed this quota. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + Because the HDFS cache consists of combined memory from all the DataNodes in the cluster, cached tables or + partitions can be bigger than the amount of HDFS cache memory on any single host. + </div> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="hdfs_caching__hdfs_caching_etl"> + + <h2 class="title topictitle2" id="ariaid-title5">Loading and Removing Data with HDFS Caching Enabled</h2> + + + <div class="body conbody"> + + <p class="p"> + When HDFS caching is enabled, extra processing happens in the background when you add or remove data + through statements such as <code class="ph codeph">INSERT</code> and <code class="ph codeph">DROP TABLE</code>. + </p> + + <p class="p"> + <strong class="ph b">Inserting or loading data:</strong> + </p> + + <ul class="ul"> + <li class="li"> + When Impala performs an <code class="ph codeph"><a class="xref" href="impala_insert.html#insert">INSERT</a></code> or + <code class="ph codeph"><a class="xref" href="impala_load_data.html#load_data">LOAD DATA</a></code> statement for a table or + partition that is cached, the new data files are automatically cached and Impala recognizes that fact + automatically. + </li> + + <li class="li"> + If you perform an <code class="ph codeph">INSERT</code> or <code class="ph codeph">LOAD DATA</code> through Hive, as always, Impala + only recognizes the new data files after a <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> + statement in Impala. + </li> + + <li class="li"> + If the cache pool is entirely full, or becomes full before all the requested data can be cached, the + Impala DDL statement returns an error. This is to avoid situations where only some of the requested data + could be cached. + </li> + + <li class="li"> + When HDFS caching is enabled for a table or partition, new data files are cached automatically when they + are added to the appropriate directory in HDFS, without the need for a <code class="ph codeph">REFRESH</code> statement + in Impala. Impala automatically performs a <code class="ph codeph">REFRESH</code> once the new data is loaded into the + HDFS cache. 
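+ As an illustrative sketch (the staging table and its column are hypothetical), you can watch the
+ <code class="ph codeph">Bytes Cached</code> column catch up with <code class="ph codeph">Size</code> after new data arrives:
+<pre class="pre codeblock"><code>-- New files produced by the INSERT are cached in the background.
+insert into census partition (year=1980) select name from staging_1980;
+
+-- Reissue until Bytes Cached matches Size for the new partition.
+show table stats census;
+</code></pre>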
+ </li> + </ul> + + <p class="p"> + <strong class="ph b">Dropping tables, partitions, or cache pools:</strong> + </p> + + <p class="p"> + The HDFS caching feature interacts with the Impala + <code class="ph codeph"><a class="xref" href="impala_drop_table.html#drop_table">DROP TABLE</a></code> and + <code class="ph codeph"><a class="xref" href="impala_alter_table.html#alter_table">ALTER TABLE ... DROP PARTITION</a></code> + statements as follows: + </p> + + <ul class="ul"> + <li class="li"> + When you issue a <code class="ph codeph">DROP TABLE</code> for a table that is entirely cached, or has some partitions + cached, the <code class="ph codeph">DROP TABLE</code> succeeds and all the cache directives Impala submitted for that + table are removed from the HDFS cache system. + </li> + + <li class="li"> + The same applies to <code class="ph codeph">ALTER TABLE ... DROP PARTITION</code>. The operation succeeds and any cache + directives are removed. + </li> + + <li class="li"> + As always, the underlying data files are removed if the dropped table is an internal table, or the + dropped partition is in its default location underneath an internal table. The data files are left alone + if the dropped table is an external table, or if the dropped partition is in a non-default location. + </li> + + <li class="li"> + If you designated the data files as cached through the <span class="keyword cmdname">hdfs cacheadmin</span> command, and + the data files are left behind as described in the previous item, the data files remain cached. Impala + only removes the cache directives submitted by Impala through the <code class="ph codeph">CREATE TABLE</code> or + <code class="ph codeph">ALTER TABLE</code> statements. It is OK to have multiple redundant cache directives pertaining + to the same files; the directives all have unique IDs and owners so that the system can tell them apart. + </li> + + <li class="li"> + If you drop an HDFS cache pool through the <span class="keyword cmdname">hdfs cacheadmin</span> command, all the Impala + data files are preserved, just no longer cached. After a subsequent <code class="ph codeph">REFRESH</code>, + <code class="ph codeph">SHOW TABLE STATS</code> reports 0 bytes cached for each associated Impala table or partition. + </li> + </ul> + + <p class="p"> + <strong class="ph b">Relocating a table or partition:</strong> + </p> + + <p class="p"> + The HDFS caching feature interacts with the Impala + <code class="ph codeph"><a class="xref" href="impala_alter_table.html#alter_table">ALTER TABLE ... SET LOCATION</a></code> + statement as follows: + </p> + + <ul class="ul"> + <li class="li"> + If you have designated a table or partition as cached through the <code class="ph codeph">CREATE TABLE</code> or + <code class="ph codeph">ALTER TABLE</code> statements, subsequent attempts to relocate the table or partition through + an <code class="ph codeph">ALTER TABLE ... SET LOCATION</code> statement will fail. You must issue an <code class="ph codeph">ALTER + TABLE ... SET UNCACHED</code> statement for the table or partition first. Otherwise, Impala would lose + track of some cached data files and have no way to uncache them later. 
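+ A minimal sketch of the required sequence (the path is a hypothetical placeholder):
+<pre class="pre codeblock"><code>-- Uncache first, then relocate, then optionally cache again at the new location.
+alter table census partition (year=1960) set uncached;
+alter table census partition (year=1960) set location '/user/impala/census_archive/1960';
+alter table census partition (year=1960) set cached in 'pool_name';
+</code></pre>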
+ </li> + </ul> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="hdfs_caching__hdfs_caching_admin"> + + <h2 class="title topictitle2" id="ariaid-title6">Administration for HDFS Caching with Impala</h2> + + <div class="body conbody"> + + <p class="p"> + Here are the guidelines and steps to check or change the status of HDFS caching for Impala data: + </p> + + <p class="p"> + <strong class="ph b">hdfs cacheadmin command:</strong> + </p> + + <ul class="ul"> + <li class="li"> + If you drop a cache pool with the <span class="keyword cmdname">hdfs cacheadmin</span> command, Impala queries against the + associated data files will still work, by falling back to reading the files from disk. After performing a + <code class="ph codeph">REFRESH</code> on the table, Impala reports the number of bytes cached as 0 for all associated + tables and partitions. + </li> + + <li class="li"> + You might use <span class="keyword cmdname">hdfs cacheadmin</span> to get a list of existing cache pools, or detailed + information about the pools, as follows: +<pre class="pre codeblock"><code>hdfs cacheadmin -listDirectives # Basic info +Found 122 entries + ID POOL REPL EXPIRY PATH + 123 testPool 1 never /user/hive/warehouse/tpcds.store_sales + 124 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-01-15 + 125 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-02-01 +... + +hdfs cacheadmin -listDirectives -stats # More details +Found 122 entries + ID POOL REPL EXPIRY PATH BYTES_NEEDED BYTES_CACHED FILES_NEEDED FILES_CACHED + 123 testPool 1 never /user/hive/warehouse/tpcds.store_sales 0 0 0 0 + 124 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-01-15 143169 143169 1 1 + 125 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-02-01 112447 112447 1 1 +... +</code></pre> + </li> + </ul> + + <p class="p"> + <strong class="ph b">Impala SHOW statement:</strong> + </p> + + <ul class="ul"> + <li class="li"> + For each table or partition, the <code class="ph codeph">SHOW TABLE STATS</code> or <code class="ph codeph">SHOW PARTITIONS</code> + statement displays the number of bytes currently cached by the HDFS caching feature. If there are no + cache directives in place for that table or partition, the result set displays <code class="ph codeph">NOT + CACHED</code>. A value of 0, or a smaller number than the overall size of the table or partition, + indicates that the cache request has been submitted but the data has not been entirely loaded into memory + yet. See <a class="xref" href="impala_show.html#show">SHOW Statement</a> for details. + </li> + </ul> + + <p class="p"> + <strong class="ph b">Impala memory limits:</strong> + </p> + + <p class="p"> + The Impala HDFS caching feature interacts with the Impala memory limits as follows: + </p> + + <ul class="ul"> + <li class="li"> + The maximum size of each HDFS cache pool is specified externally to Impala, through the <span class="keyword cmdname">hdfs + cacheadmin</span> command. + </li> + + <li class="li"> + All the memory used for HDFS caching is separate from the <span class="keyword cmdname">impalad</span> daemon address space + and does not count towards the limits of the <code class="ph codeph">--mem_limit</code> startup option, + <code class="ph codeph">MEM_LIMIT</code> query option, or further limits imposed through YARN resource management or + the Linux <code class="ph codeph">cgroups</code> mechanism. 
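+ For example, pool capacity is managed entirely on the HDFS side, with no change to any Impala
+ setting; a sketch using the pool from the earlier setup example (the new limit is arbitrary):
+<pre class="pre codeblock"><code># Raise the cache pool limit from 4 GB to 8 GB.
+hdfs cacheadmin -modifyPool four_gig_pool -limit 8000000000
+
+# Confirm the new limit and current usage for that pool.
+hdfs cacheadmin -listPools -stats four_gig_pool
+</code></pre>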
+ </li> + + <li class="li"> + Because accessing HDFS cached data avoids a memory-to-memory copy operation, queries involving cached + data require less memory on the Impala side than the equivalent queries on uncached data. In addition to + any performance benefits in a single-user environment, the reduced memory helps to improve scalability + under high-concurrency workloads. + </li> + </ul> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title7" id="hdfs_caching__hdfs_caching_performance"> + + <h2 class="title topictitle2" id="ariaid-title7">Performance Considerations for HDFS Caching with Impala</h2> + + <div class="body conbody"> + + <p class="p"> + In Impala 1.4.0 and higher, Impala supports efficient reads from data that is pinned in memory through HDFS + caching. Impala takes advantage of the HDFS API and reads the data from memory rather than from disk + whether the data files are pinned using Impala DDL statements, or using the command-line mechanism where + you specify HDFS paths. + </p> + + <p class="p"> + When you examine the output of the <span class="keyword cmdname">impala-shell</span> <span class="keyword cmdname">SUMMARY</span> command, or + look in the metrics report for the <span class="keyword cmdname">impalad</span> daemon, you see how many bytes are read from + the HDFS cache. For example, this excerpt from a query profile illustrates that all the data read during a + particular phase of the query came from the HDFS cache, because the <code class="ph codeph">BytesRead</code> and + <code class="ph codeph">BytesReadDataNodeCache</code> values are identical. + </p> + +<pre class="pre codeblock"><code>HDFS_SCAN_NODE (id=0):(Total: 11s114ms, non-child: 11s114ms, % non-child: 100.00%) + - AverageHdfsReadThreadConcurrency: 0.00 + - AverageScannerThreadConcurrency: 32.75 +<strong class="ph b"> - BytesRead: 10.47 GB (11240756479) + - BytesReadDataNodeCache: 10.47 GB (11240756479)</strong> + - BytesReadLocal: 10.47 GB (11240756479) + - BytesReadShortCircuit: 10.47 GB (11240756479) + - DecompressionTime: 27s572ms +</code></pre> + + <p class="p"> + For queries involving smaller amounts of data, or in single-user workloads, you might not notice a + significant difference in query response time with or without HDFS caching. Even with HDFS caching turned + off, the data for the query might still be in the Linux OS buffer cache. The benefits become clearer as + data volume increases, and especially as the system processes more concurrent queries. HDFS caching + improves the scalability of the overall system. That is, it prevents query performance from declining when + the workload outstrips the capacity of the Linux OS cache. + </p> + + <p class="p"> + Due to a limitation of HDFS, zero-copy reads are not supported with + encryption. Where practical, avoid HDFS caching for Impala data + files in encryption zones. The queries fall back to the normal read + path during query execution, which might cause some performance overhead. + </p> + + <p class="p"> + <strong class="ph b">SELECT considerations:</strong> + </p> + + <p class="p"> + The Impala HDFS caching feature interacts with the + <code class="ph codeph"><a class="xref" href="impala_select.html#select">SELECT</a></code> statement and query performance as + follows: + </p> + + <ul class="ul"> + <li class="li"> + Impala automatically reads from memory any data that has been designated as cached and actually loaded + into the HDFS cache. 
(It could take some time after the initial request to fully populate the cache for a + table with large size or many partitions.) The speedup comes from two aspects: reading from RAM instead + of disk, and accessing the data straight from the cache area instead of copying from one RAM area to + another. This second aspect yields further performance improvement over the standard OS caching + mechanism, which still results in memory-to-memory copying of cached data. + </li> + + <li class="li"> + For small amounts of data, the query speedup might not be noticeable in terms of wall clock time. The + performance might be roughly the same with HDFS caching turned on or off, due to recently used data being + held in the Linux OS cache. The difference is more pronounced with: + <ul class="ul"> + <li class="li"> + Data volumes (for all queries running concurrently) that exceed the size of the Linux OS cache. + </li> + + <li class="li"> + A busy cluster running many concurrent queries, where the reduction in memory-to-memory copying and + overall memory usage during queries results in greater scalability and throughput. + </li> + + <li class="li"> + Thus, to really exercise and benchmark this feature in a development environment, you might need to + simulate realistic workloads and concurrent queries that match your production environment. + </li> + + <li class="li"> + One way to simulate a heavy workload on a lightly loaded system is to flush the OS buffer cache (on + each DataNode) between iterations of queries against the same tables or partitions: +<pre class="pre codeblock"><code>$ sync +$ echo 1 > /proc/sys/vm/drop_caches +</code></pre> + </li> + </ul> + </li> + + <li class="li"> + Impala queries take advantage of HDFS cached data regardless of whether the cache directive was issued by + Impala or externally through the <span class="keyword cmdname">hdfs cacheadmin</span> command, for example for an external + table where the cached data files might be accessed by several different Hadoop components. + </li> + + <li class="li"> + If your query returns a large result set, the time reported for the query could be dominated by the time + needed to print the results on the screen. To measure the time for the underlying query processing, query + the <code class="ph codeph">COUNT()</code> of the big result set, which does all the same processing but only prints a + single line to the screen. + </li> + </ul> + </div> + </article> +</article></main></body></html> \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_perf_joins.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_perf_joins.html b/docs/build/html/topics/impala_perf_joins.html new file mode 100644 index 0000000..064a8c5 --- /dev/null +++ b/docs/build/html/topics/impala_perf_joins.html @@ -0,0 +1,493 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_performance.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="perf_joins"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Performance Considerations for Join Queries</title></head><body id="perf_joins"><ma in role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Performance Considerations for Join Queries</h1> + + + + <div class="body conbody"> + + <p class="p"> + Queries involving join operations often require more tuning than queries that refer to only one table. The + maximum size of the result set from a join query is the product of the number of rows in all the joined + tables. When joining several tables with millions or billions of rows, any missed opportunity to filter the + result set, or other inefficiency in the query, could lead to an operation that does not finish in a + practical time and has to be cancelled. + </p> + + <p class="p"> + The simplest technique for tuning an Impala join query is to collect statistics on each table involved in the + join using the <code class="ph codeph"><a class="xref" href="impala_compute_stats.html#compute_stats">COMPUTE STATS</a></code> + statement, and then let Impala automatically optimize the query based on the size of each table, number of + distinct values of each column, and so on. The <code class="ph codeph">COMPUTE STATS</code> statement and the join + optimization are new features introduced in Impala 1.2.2. For accurate statistics about each table, issue the + <code class="ph codeph">COMPUTE STATS</code> statement after loading the data into that table, and again if the amount of + data changes substantially due to an <code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, adding a partition, + and so on. + </p> + + <p class="p"> + If statistics are not available for all the tables in the join query, or if Impala chooses a join order that + is not the most efficient, you can override the automatic join order optimization by specifying the + <code class="ph codeph">STRAIGHT_JOIN</code> keyword immediately after the <code class="ph codeph">SELECT</code> keyword. In this case, + Impala uses the order the tables appear in the query to guide how the joins are processed. 
+ </p> + + <p class="p"> + When you use the <code class="ph codeph">STRAIGHT_JOIN</code> technique, you must order the tables in the join query + manually instead of relying on the Impala optimizer. The optimizer uses sophisticated techniques to estimate + the size of the result set at each stage of the join. For manual ordering, use this heuristic approach to + start with, and then experiment to fine-tune the order: + </p> + + <ul class="ul"> + <li class="li"> + Specify the largest table first. This table is read from disk by each Impala node and so its size is not + significant in terms of memory usage during the query. + </li> + + <li class="li"> + Next, specify the smallest table. The contents of the second, third, and so on tables are all transmitted + across the network. You want to minimize the size of the result set from each subsequent stage of the join + query. The most likely approach involves joining a small table first, so that the result set remains small + even as subsequent larger tables are processed. + </li> + + <li class="li"> + Join the next smallest table, then the next smallest, and so on. + </li> + + <li class="li"> + For example, if you had tables <code class="ph codeph">BIG</code>, <code class="ph codeph">MEDIUM</code>, <code class="ph codeph">SMALL</code>, and + <code class="ph codeph">TINY</code>, the logical join order to try would be <code class="ph codeph">BIG</code>, <code class="ph codeph">TINY</code>, + <code class="ph codeph">SMALL</code>, <code class="ph codeph">MEDIUM</code>. + </li> + </ul> + + <p class="p"> + The terms <span class="q">"largest"</span> and <span class="q">"smallest"</span> refers to the size of the intermediate result set based on the + number of rows and columns from each table that are part of the result set. For example, if you join one + table <code class="ph codeph">sales</code> with another table <code class="ph codeph">customers</code>, a query might find results from + 100 different customers who made a total of 5000 purchases. In that case, you would specify <code class="ph codeph">SELECT + ... FROM sales JOIN customers ...</code>, putting <code class="ph codeph">customers</code> on the right side because it + is smaller in the context of this query. + </p> + + <p class="p"> + The Impala query planner chooses between different techniques for performing join queries, depending on the + absolute and relative sizes of the tables. <strong class="ph b">Broadcast joins</strong> are the default, where the right-hand table + is considered to be smaller than the left-hand table, and its contents are sent to all the other nodes + involved in the query. The alternative technique is known as a <strong class="ph b">partitioned join</strong> (not related to a + partitioned table), which is more suitable for large tables of roughly equal size. With this technique, + portions of each table are sent to appropriate other nodes where those subsets of rows can be processed in + parallel. The choice of broadcast or partitioned join also depends on statistics being available for all + tables in the join, gathered by the <code class="ph codeph">COMPUTE STATS</code> statement. + </p> + + <p class="p"> + To see which join strategy is used for a particular query, issue an <code class="ph codeph">EXPLAIN</code> statement for + the query. If you find that a query uses a broadcast join when you know through benchmarking that a + partitioned join would be more efficient, or vice versa, add a hint to the query to specify the precise join + mechanism to use. 
See <a class="xref" href="impala_hints.html#hints">Query Hints in Impala SELECT Statements</a> for details. + </p> + </div> + + <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_performance.html">Tuning Impala for Performance</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="perf_joins__joins_no_stats"> + + <h2 class="title topictitle2" id="ariaid-title2">How Joins Are Processed when Statistics Are Unavailable</h2> + + + <div class="body conbody"> + + <p class="p"> + If table or column statistics are not available for some tables in a join, Impala still reorders the tables + using the information that is available. Tables with statistics are placed on the left side of the join + order, in descending order of cost based on overall size and cardinality. Tables without statistics are + treated as zero-size, that is, they are always placed on the right side of the join order. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="perf_joins__straight_join"> + + <h2 class="title topictitle2" id="ariaid-title3">Overriding Join Reordering with STRAIGHT_JOIN</h2> + + <div class="body conbody"> + + <p class="p"> + If an Impala join query is inefficient because of outdated statistics or unexpected data distribution, you + can keep Impala from reordering the joined tables by using the <code class="ph codeph">STRAIGHT_JOIN</code> keyword + immediately after the <code class="ph codeph">SELECT</code> keyword. The <code class="ph codeph">STRAIGHT_JOIN</code> keyword turns off + the reordering of join clauses that Impala does internally, and produces a plan that relies on the join + clauses being ordered optimally in the query text. In this case, rewrite the query so that the largest + table is on the left, followed by the next largest, and so on until the smallest table is on the right. + </p> + + <p class="p"> + In this example, the subselect from the <code class="ph codeph">BIG</code> table produces a very small result set, but + the table might still be treated as if it were the biggest and placed first in the join order. Using + <code class="ph codeph">STRAIGHT_JOIN</code> for the last join clause prevents the final table from being reordered, + keeping it as the rightmost table in the join order. + </p> + +<pre class="pre codeblock"><code>select straight_join x from medium join small join (select * from big where c1 < 10) as big + where medium.id = small.id and small.id = big.id;</code></pre> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="perf_joins__perf_joins_examples"> + + <h2 class="title topictitle2" id="ariaid-title4">Examples of Join Order Optimization</h2> + + <div class="body conbody"> + + <p class="p"> + Here are examples showing joins between tables with 1 billion, 200 million, and 1 million rows. (In this + case, the tables are unpartitioned and using Parquet format.) The smaller tables contain subsets of data + from the largest one, for convenience of joining on the unique <code class="ph codeph">ID</code> column. The smallest + table only contains a subset of columns from the others. 
+ </p> + + <p class="p"></p> + +<pre class="pre codeblock"><code>[localhost:21000] > create table big stored as parquet as select * from raw_data; ++----------------------------+ +| summary | ++----------------------------+ +| Inserted 1000000000 row(s) | ++----------------------------+ +Returned 1 row(s) in 671.56s +[localhost:21000] > desc big; ++-----------+---------+---------+ +| name | type | comment | ++-----------+---------+---------+ +| id | int | | +| val | int | | +| zfill | string | | +| name | string | | +| assertion | boolean | | ++-----------+---------+---------+ +Returned 5 row(s) in 0.01s +[localhost:21000] > create table medium stored as parquet as select * from big limit 200 * floor(1e6); ++---------------------------+ +| summary | ++---------------------------+ +| Inserted 200000000 row(s) | ++---------------------------+ +Returned 1 row(s) in 138.31s +[localhost:21000] > create table small stored as parquet as select id,val,name from big where assertion = true limit 1 * floor(1e6); ++-------------------------+ +| summary | ++-------------------------+ +| Inserted 1000000 row(s) | ++-------------------------+ +Returned 1 row(s) in 6.32s</code></pre> + + <p class="p"> + For any kind of performance experimentation, use the <code class="ph codeph">EXPLAIN</code> statement to see how any + expensive query will be performed without actually running it, and enable verbose <code class="ph codeph">EXPLAIN</code> + plans containing more performance-oriented detail: The most interesting plan lines are highlighted in bold, + showing that without statistics for the joined tables, Impala cannot make a good estimate of the number of + rows involved at each stage of processing, and is likely to stick with the <code class="ph codeph">BROADCAST</code> join + mechanism that sends a complete copy of one of the tables to each node. 
+ </p> + +<pre class="pre codeblock"><code>[localhost:21000] > set explain_level=verbose; +EXPLAIN_LEVEL set to verbose +[localhost:21000] > explain select count(*) from big join medium where big.id = medium.id; ++----------------------------------------------------------+ +| Explain String | ++----------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=2.10GB VCores=2 | +| | +| PLAN FRAGMENT 0 | +| PARTITION: UNPARTITIONED | +| | +| 6:AGGREGATE (merge finalize) | +| | output: SUM(COUNT(*)) | +| | cardinality: 1 | +| | per-host memory: unavailable | +| | tuple ids: 2 | +| | | +| 5:EXCHANGE | +| cardinality: 1 | +| per-host memory: unavailable | +| tuple ids: 2 | +| | +| PLAN FRAGMENT 1 | +| PARTITION: RANDOM | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 5 | +| UNPARTITIONED | +| | +| 3:AGGREGATE | +| | output: COUNT(*) | +| | cardinality: 1 | +| | per-host memory: 10.00MB | +| | tuple ids: 2 | +| | | +| 2:HASH JOIN | +<strong class="ph b">| | join op: INNER JOIN (BROADCAST) |</strong> +| | hash predicates: | +| | big.id = medium.id | +<strong class="ph b">| | cardinality: unavailable |</strong> +| | per-host memory: 2.00GB | +| | tuple ids: 0 1 | +| | | +| |----4:EXCHANGE | +| | cardinality: unavailable | +| | per-host memory: 0B | +| | tuple ids: 1 | +| | | +| 0:SCAN HDFS | +<strong class="ph b">| table=join_order.big #partitions=1/1 size=23.12GB | +| table stats: unavailable | +| column stats: unavailable | +| cardinality: unavailable |</strong> +| per-host memory: 88.00MB | +| tuple ids: 0 | +| | +| PLAN FRAGMENT 2 | +| PARTITION: RANDOM | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 4 | +| UNPARTITIONED | +| | +| 1:SCAN HDFS | +<strong class="ph b">| table=join_order.medium #partitions=1/1 size=4.62GB | +| table stats: unavailable | +| column stats: unavailable | +| cardinality: unavailable |</strong> +| per-host memory: 88.00MB | +| tuple ids: 1 | ++----------------------------------------------------------+ +Returned 64 row(s) in 0.04s</code></pre> + + <p class="p"> + Gathering statistics for all the tables is straightforward, one <code class="ph codeph">COMPUTE STATS</code> statement + per table: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > compute stats small; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 3 column(s). | ++-----------------------------------------+ +Returned 1 row(s) in 4.26s +[localhost:21000] > compute stats medium; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 5 column(s). | ++-----------------------------------------+ +Returned 1 row(s) in 42.11s +[localhost:21000] > compute stats big; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 5 column(s). 
| ++-----------------------------------------+ +Returned 1 row(s) in 165.44s</code></pre> + + <p class="p"> + With statistics in place, Impala can choose a more effective join order rather than following the + left-to-right sequence of tables in the query, and can choose <code class="ph codeph">BROADCAST</code> or + <code class="ph codeph">PARTITIONED</code> join strategies based on the overall sizes and number of rows in the table: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > explain select count(*) from medium join big where big.id = medium.id; +Query: explain select count(*) from medium join big where big.id = medium.id ++-----------------------------------------------------------+ +| Explain String | ++-----------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=937.23MB VCores=2 | +| | +| PLAN FRAGMENT 0 | +| PARTITION: UNPARTITIONED | +| | +| 6:AGGREGATE (merge finalize) | +| | output: SUM(COUNT(*)) | +| | cardinality: 1 | +| | per-host memory: unavailable | +| | tuple ids: 2 | +| | | +| 5:EXCHANGE | +| cardinality: 1 | +| per-host memory: unavailable | +| tuple ids: 2 | +| | +| PLAN FRAGMENT 1 | +| PARTITION: RANDOM | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 5 | +| UNPARTITIONED | +| | +| 3:AGGREGATE | +| | output: COUNT(*) | +| | cardinality: 1 | +| | per-host memory: 10.00MB | +| | tuple ids: 2 | +| | | +| 2:HASH JOIN | +| | join op: INNER JOIN (BROADCAST) | +| | hash predicates: | +| | big.id = medium.id | +| | cardinality: 1443004441 | +| | per-host memory: 839.23MB | +| | tuple ids: 1 0 | +| | | +| |----4:EXCHANGE | +| | cardinality: 200000000 | +| | per-host memory: 0B | +| | tuple ids: 0 | +| | | +| 1:SCAN HDFS | +| table=join_order.big #partitions=1/1 size=23.12GB | +| table stats: 1000000000 rows total | +| column stats: all | +| cardinality: 1000000000 | +| per-host memory: 88.00MB | +| tuple ids: 1 | +| | +| PLAN FRAGMENT 2 | +| PARTITION: RANDOM | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 4 | +| UNPARTITIONED | +| | +| 0:SCAN HDFS | +| table=join_order.medium #partitions=1/1 size=4.62GB | +| table stats: 200000000 rows total | +| column stats: all | +| cardinality: 200000000 | +| per-host memory: 88.00MB | +| tuple ids: 0 | ++-----------------------------------------------------------+ +Returned 64 row(s) in 0.04s + +[localhost:21000] > explain select count(*) from small join big where big.id = small.id; +Query: explain select count(*) from small join big where big.id = small.id ++-----------------------------------------------------------+ +| Explain String | ++-----------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=101.15MB VCores=2 | +| | +| PLAN FRAGMENT 0 | +| PARTITION: UNPARTITIONED | +| | +| 6:AGGREGATE (merge finalize) | +| | output: SUM(COUNT(*)) | +| | cardinality: 1 | +| | per-host memory: unavailable | +| | tuple ids: 2 | +| | | +| 5:EXCHANGE | +| cardinality: 1 | +| per-host memory: unavailable | +| tuple ids: 2 | +| | +| PLAN FRAGMENT 1 | +| PARTITION: RANDOM | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 5 | +| UNPARTITIONED | +| | +| 3:AGGREGATE | +| | output: COUNT(*) | +| | cardinality: 1 | +| | per-host memory: 10.00MB | +| | tuple ids: 2 | +| | | +| 2:HASH JOIN | +| | join op: INNER JOIN (BROADCAST) | +| | hash predicates: | +| | big.id = small.id | +| | cardinality: 1000000000 | +| | per-host memory: 3.15MB | +| | tuple ids: 1 0 | +| | | +| |----4:EXCHANGE | +| | cardinality: 1000000 | +| | per-host memory: 0B | +| | tuple ids: 0 | +| | | +| 
1:SCAN HDFS | +| table=join_order.big #partitions=1/1 size=23.12GB | +| table stats: 1000000000 rows total | +| column stats: all | +| cardinality: 1000000000 | +| per-host memory: 88.00MB | +| tuple ids: 1 | +| | +| PLAN FRAGMENT 2 | +| PARTITION: RANDOM | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 4 | +| UNPARTITIONED | +| | +| 0:SCAN HDFS | +| table=join_order.small #partitions=1/1 size=17.93MB | +| table stats: 1000000 rows total | +| column stats: all | +| cardinality: 1000000 | +| per-host memory: 32.00MB | +| tuple ids: 0 | ++-----------------------------------------------------------+ +Returned 64 row(s) in 0.03s</code></pre> + + <p class="p"> + When queries like these are actually run, the execution times are relatively consistent regardless of the + table order in the query text. Here are examples using both the unique <code class="ph codeph">ID</code> column and the + <code class="ph codeph">VAL</code> column containing duplicate values: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > select count(*) from big join small on (big.id = small.id); +Query: select count(*) from big join small on (big.id = small.id) ++----------+ +| count(*) | ++----------+ +| 1000000 | ++----------+ +Returned 1 row(s) in 21.68s +[localhost:21000] > select count(*) from small join big on (big.id = small.id); +Query: select count(*) from small join big on (big.id = small.id) ++----------+ +| count(*) | ++----------+ +| 1000000 | ++----------+ +Returned 1 row(s) in 20.45s + +[localhost:21000] > select count(*) from big join small on (big.val = small.val); ++------------+ +| count(*) | ++------------+ +| 2000948962 | ++------------+ +Returned 1 row(s) in 108.85s +[localhost:21000] > select count(*) from small join big on (big.val = small.val); ++------------+ +| count(*) | ++------------+ +| 2000948962 | ++------------+ +Returned 1 row(s) in 100.76s</code></pre> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + When examining the performance of join queries and the effectiveness of the join order optimization, make + sure the query involves enough data and cluster resources to see a difference depending on the query plan. + For example, a single data file of just a few megabytes will reside in a single HDFS block and be processed + on a single node. Likewise, if you use a single-node or two-node cluster, there might not be much + difference in efficiency for the broadcast or partitioned join strategies. 
+ </div> + </div> + </article> +</article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_perf_resources.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_perf_resources.html b/docs/build/html/topics/impala_perf_resources.html new file mode 100644 index 0000000..ab0fadb --- /dev/null +++ b/docs/build/html/topics/impala_perf_resources.html @@ -0,0 +1,47 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_performance.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="mem_limits"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Controlling Impala Resource Usage</title></head><body id="mem_limits"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Controlling Impala Resource Usage</h1> + + + + <div class="body conbody"> + + <p class="p"> + Sometimes, balancing raw query performance against scalability requires limiting the amount of resources, + such as memory or CPU, used by a single query or group of queries. Impala can use several mechanisms that + help to smooth out the load during heavy concurrent usage, resulting in faster overall query times and + sharing of resources across Impala queries, MapReduce jobs, and other kinds of workloads across a <span class="keyword"></span> + cluster: + </p> + + <ul class="ul"> + <li class="li"> + The Impala admission control feature uses a fast, distributed mechanism to hold back queries that exceed + limits on the number of concurrent queries or the amount of memory used. The queries are queued, and + executed as other queries finish and resources become available. You can control the concurrency limits, + and specify different limits for different groups of users to divide cluster resources according to the + priorities of different classes of users. This feature is new in Impala 1.3. + See <a class="xref" href="impala_admission.html#admission_control">Admission Control and Query Queuing</a> for details. + </li> + + <li class="li"> + <p class="p"> + You can restrict the amount of memory Impala reserves during query execution by specifying the + <code class="ph codeph">-mem_limit</code> option for the <code class="ph codeph">impalad</code> daemon. See + <a class="xref" href="impala_config_options.html#config_options">Modifying Impala Startup Options</a> for details. This limit applies only to the + memory that is directly consumed by queries; Impala reserves additional memory at startup, for example to + hold cached metadata. + </p> + </li> + + <li class="li"> + <p class="p"> + For production deployments, implement resource isolation using your cluster management + tool. 
+ </p> + </li> + </ul> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_performance.html">Tuning Impala for Performance</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_perf_skew.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_perf_skew.html b/docs/build/html/topics/impala_perf_skew.html new file mode 100644 index 0000000..cb4726e --- /dev/null +++ b/docs/build/html/topics/impala_perf_skew.html @@ -0,0 +1,139 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_performance.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="perf_skew"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Detecting and Correcting HDFS Block Skew Conditions</title></head><body id="perf_skew"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Detecting and Correcting HDFS Block Skew Conditions</h1> + + + + <div class="body conbody"> + + <p class="p"> + For best performance of Impala parallel queries, the work is divided equally across hosts in the cluster, and + all hosts take approximately equal time to finish their work. If one host takes substantially longer than + others, the extra time needed for the slow host can become the dominant factor in query performance. + Therefore, one of the first steps in performance tuning for Impala is to detect and correct such conditions. + </p> + + <p class="p"> + The main cause of uneven performance that you can correct within Impala is <dfn class="term">skew</dfn> in the number of + HDFS data blocks processed by each host, where some hosts process substantially more data blocks than others. + This condition can occur because of uneven distribution of the data values themselves, for example causing + certain data files or partitions to be large while others are very small. (Although it is possible to have + unevenly distributed data without any problems with the distribution of HDFS blocks.) Block skew could also + be due to the underlying block allocation policies within HDFS, the replication factor of the data files, and + the way that Impala chooses the host to process each data block. + </p> + + <p class="p"> + The most convenient way to detect block skew, or slow-host issues in general, is to examine the <span class="q">"executive + summary"</span> information from the query profile after running a query: + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + In <span class="keyword cmdname">impala-shell</span>, issue the <code class="ph codeph">SUMMARY</code> command immediately after the + query is complete, to see just the summary information. 
If you detect issues involving skew, you might + switch to issuing the <code class="ph codeph">PROFILE</code> command, which displays the summary information followed + by a detailed performance analysis. + </p> + </li> + + <li class="li"> + <p class="p"> + In the Impala debug web UI, click on the <span class="ph uicontrol">Profile</span> link associated with the query after it is + complete. The executive summary information is displayed early in the profile output. + </p> + </li> + </ul> + + <p class="p"> + For each phase of the query, you see an <span class="ph uicontrol">Avg Time</span> and a <span class="ph uicontrol">Max Time</span> + value, along with <span class="ph uicontrol">#Hosts</span> indicating how many hosts are involved in that query phase. + For all the phases with <span class="ph uicontrol">#Hosts</span> greater than one, look for cases where the maximum time + is substantially greater than the average time. Focus on the phases that took the longest, for example, those + taking multiple seconds rather than milliseconds or microseconds. + </p> + + <p class="p"> + If you detect that some hosts take longer than others, first rule out non-Impala causes. One reason that some + hosts could be slower than others is if those hosts have less capacity than the others, or if they are + substantially busier due to unevenly distributed non-Impala workloads: + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + For clusters running Impala, keep the relative capacities of all hosts roughly equal. Any cost savings + from including some underpowered hosts in the cluster will likely be outweighed by poor or uneven + performance, and the time spent diagnosing performance issues. + </p> + </li> + + <li class="li"> + <p class="p"> + If non-Impala workloads cause slowdowns on some hosts but not others, use the appropriate load-balancing + techniques for the non-Impala components to smooth out the load across the cluster. + </p> + </li> + </ul> + + <p class="p"> + If the hosts on your cluster are evenly powered and evenly loaded, examine the detailed profile output to + determine which host is taking longer than others for the query phase in question. Examine how many bytes are + processed during that phase on that host, how much memory is used, and how many bytes are transmitted across + the network. + </p> + + <p class="p"> + The most common symptom is a higher number of bytes read on one host than others, due to one host being + requested to process a higher number of HDFS data blocks. This condition is more likely to occur when the + number of blocks accessed by the query is relatively small. For example, if you have a 10-node cluster and + the query processes 10 HDFS blocks, each node might not process exactly one block. If one node sits idle + while another node processes two blocks, the query could take twice as long as if the data was perfectly + distributed. + </p> + + <p class="p"> + Possible solutions in this case include: + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + If the query is artificially small, perhaps for benchmarking purposes, scale it up to process a larger + data set. For example, if some nodes read 10 HDFS data blocks while others read 11, the overall effect of + the uneven distribution is much lower than when some nodes did twice as much work as others. As a + guideline, aim for a <span class="q">"sweet spot"</span> where each node reads 2 GB or more from HDFS per query. 
Queries + that process lower volumes than that could experience inconsistent performance that smooths out as + queries become more data-intensive. + </p> + </li> + + <li class="li"> + <p class="p"> + If the query processes only a few large blocks, so that many nodes sit idle and cannot help to + parallelize the query, consider reducing the overall block size. For example, you might adjust the + <code class="ph codeph">PARQUET_FILE_SIZE</code> query option before copying or converting data into a Parquet table. + Or you might adjust the granularity of data files produced earlier in the ETL pipeline by non-Impala + components. In Impala 2.0 and later, the default Parquet block size is 256 MB, reduced from 1 GB, to + improve parallelism for common cluster sizes and data volumes. + </p> + </li> + + <li class="li"> + <p class="p"> + Reduce the amount of compression applied to the data. For text data files, the highest degree of + compression (gzip) produces unsplittable files that are more difficult for Impala to process in parallel, + and require extra memory during processing to hold the compressed and uncompressed data simultaneously. + For binary formats such as Parquet and Avro, compression can result in fewer data blocks overall, but + remember that when queries process relatively few blocks, there is less opportunity for parallel + execution and many nodes in the cluster might sit idle. Note that when Impala writes Parquet data with + the query option <code class="ph codeph">COMPRESSION_CODEC=NONE</code> enabled, the data is still typically compact due + to the encoding schemes used by Parquet, independent of the final compression step. + </p> + </li> + </ul> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_performance.html">Tuning Impala for Performance</a></div></div></nav></article></main></body></html> \ No newline at end of file