http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_partitioning.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_partitioning.xml b/docs/topics/impala_partitioning.xml new file mode 100644 index 0000000..1aedfa1 --- /dev/null +++ b/docs/topics/impala_partitioning.xml @@ -0,0 +1,582 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="partitioning"> + + <title>Partitioning for Impala Tables</title> + + <titlealts audience="PDF"> + + <navtitle>Partitioning</navtitle> + + </titlealts> + + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">partitioning</indexterm> + By default, all the data files for a table are located in a single directory. Partitioning is a technique for physically dividing the + data during loading, based on values from one or more columns, to speed up queries that test those columns. For example, with a + <codeph>school_records</codeph> table partitioned on a <codeph>year</codeph> column, there is a separate data directory for each + different year value, and all the data for that year is stored in a data file in that directory. A query that includes a + <codeph>WHERE</codeph> condition such as <codeph>YEAR=1966</codeph>, <codeph>YEAR IN (1989,1999)</codeph>, or <codeph>YEAR BETWEEN + 1984 AND 1989</codeph> can examine only the data files from the appropriate directory or directories, greatly reducing the amount of + data to read and test. + </p> + + <p outputclass="toc inpage"/> + + <p> + See <xref href="impala_tutorial.xml#tut_external_partition_data"/> for an example that illustrates the syntax for creating partitioned + tables, the underlying directory structure in HDFS, and how to attach a partitioned Impala external table to data files stored + elsewhere in HDFS. + </p> + + <p> + Parquet is a popular format for partitioned Impala tables because it is well suited to handle huge data volumes. See + <xref href="impala_parquet.xml#parquet_performance"/> for performance considerations for partitioned Parquet tables. + </p> + + <p> + See <xref href="impala_literals.xml#null"/> for details about how <codeph>NULL</codeph> values are represented in partitioned tables. + </p> + + <p rev="2.2.0"> + See <xref href="impala_s3.xml#s3"/> for details about setting up tables where some or all partitions reside on the Amazon Simple + Storage Service (S3). + </p> + + </conbody> + + <concept id="partitioning_choosing"> + + <title>When to Use Partitioned Tables</title> + + <conbody> + + <p> + Partitioning is typically appropriate for: + </p> + + <ul> + <li> + Tables that are very large, where reading the entire data set takes an impractical amount of time. + </li> + + <li> + Tables that are always or almost always queried with conditions on the partitioning columns. In our example of a table partitioned + by year, <codeph>SELECT COUNT(*) FROM school_records WHERE year = 1985</codeph> is efficient, only examining a small fraction of + the data; but <codeph>SELECT COUNT(*) FROM school_records</codeph> has to process a separate data file for each year, resulting in + more overall work than in an unpartitioned table. 
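+        For example, here is a minimal sketch of the <codeph>school_records</codeph> layout described above (the
+        non-partition columns are hypothetical, shown only for illustration):
+<codeblock>-- Hypothetical columns; only the YEAR partition key comes from the example above.
+create table school_records (student_id bigint, last_name string, grade tinyint)
+  partitioned by (year smallint);
+
+-- Reads only the data directory for YEAR=1985.
+select count(*) from school_records where year = 1985;
+
+-- Must read the data files from every year's directory.
+select count(*) from school_records;
+</codeblock>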
You would probably not partition this way if you frequently queried the table + based on last name, student ID, and so on without testing the year. + </li> + + <li> + Columns that have reasonable cardinality (number of different values). If a column only has a small number of values, for example + <codeph>Male</codeph> or <codeph>Female</codeph>, you do not gain much efficiency by eliminating only about 50% of the data to + read for each query. If a column has only a few rows matching each value, the number of directories to process can become a + limiting factor, and the data file in each directory could be too small to take advantage of the Hadoop mechanism for transmitting + data in multi-megabyte blocks. For example, you might partition census data by year, store sales data by year and month, and web + traffic data by year, month, and day. (Some users with high volumes of incoming data might even partition down to the individual + hour and minute.) + </li> + + <li> + Data that already passes through an extract, transform, and load (ETL) pipeline. The values of the partitioning columns are + stripped from the original data files and represented by directory names, so loading data into a partitioned table involves some + sort of transformation or preprocessing. + </li> + </ul> + + </conbody> + + </concept> + + <concept id="partition_sql"> + + <title>SQL Statements for Partitioned Tables</title> + + <conbody> + + <p> + In terms of Impala SQL syntax, partitioning affects these statements: + </p> + + <ul> + <li> + <codeph><xref href="impala_create_table.xml#create_table">CREATE TABLE</xref></codeph>: you specify a <codeph>PARTITIONED + BY</codeph> clause when creating the table to identify names and data types of the partitioning columns. These columns are not + included in the main list of columns for the table. + </li> + + <li rev="2.5.0"> + In <keyword keyref="impala25_full"/> and higher, you can also use the <codeph>PARTITIONED BY</codeph> clause in a <codeph>CREATE TABLE AS + SELECT</codeph> statement. This syntax lets you use a single statement to create a partitioned table, copy data into it, and + create new partitions based on the values in the inserted data. + </li> + + <li> + <codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE</xref></codeph>: you can add or drop partitions, to work with + different portions of a huge data set. You can designate the HDFS directory that holds the data files for a specific partition. + With data partitioned by date values, you might <q>age out</q> data that is no longer relevant. + <note conref="../shared/impala_common.xml#common/add_partition_set_location"/> + </li> + + <li> + <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph>: When you insert data into a partitioned table, you identify + the partitioning columns. One or more values from each inserted row are not stored in data files, but instead determine the + directory where that row value is stored. You can also specify which partition to load a set of data into, with <codeph>INSERT + OVERWRITE</codeph> statements; you can replace the contents of a specific partition but you cannot append data to a specific + partition. 
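+        For example, a minimal sketch with hypothetical table and column names:
+<codeblock>-- Replace (rather than append to) whatever data is currently in the YEAR=2010 partition.
+insert overwrite sales partition (year=2010)
+  select id, amount from staging_sales where source_year = 2010;
+</codeblock>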
+ <p rev="1.3.1" conref="../shared/impala_common.xml#common/insert_inherit_permissions"/> + </li> + + <li> + Although the syntax of the <codeph><xref href="impala_select.xml#select">SELECT</xref></codeph> statement is the same whether or + not the table is partitioned, the way queries interact with partitioned tables can have a dramatic impact on performance and + scalability. The mechanism that lets queries skip certain partitions during a query is known as partition pruning; see + <xref href="impala_partitioning.xml#partition_pruning"/> for details. + </li> + + <li rev="1.4.0"> + In Impala 1.4 and later, there is a <codeph>SHOW PARTITIONS</codeph> statement that displays information about each partition in a + table. See <xref href="impala_show.xml#show"/> for details. + </li> + </ul> + + </conbody> + + </concept> + + <concept id="partition_static_dynamic"> + + <title>Static and Dynamic Partitioning Clauses</title> + + <conbody> + + <p> + Specifying all the partition columns in a SQL statement is called <term>static partitioning</term>, because the statement affects a + single predictable partition. For example, you use static partitioning with an <codeph>ALTER TABLE</codeph> statement that affects + only one partition, or with an <codeph>INSERT</codeph> statement that inserts all values into the same partition: + </p> + +<codeblock>insert into t1 <b>partition(x=10, y='a')</b> select c1 from some_other_table; +</codeblock> + + <p> + When you specify some partition key columns in an <codeph>INSERT</codeph> statement, but leave out the values, Impala determines + which partition to insert. This technique is called <term>dynamic partitioning</term>: + </p> + +<codeblock>insert into t1 <b>partition(x, y='b')</b> select c1, c2 from some_other_table; +-- Create new partition if necessary based on variable year, month, and day; insert a single value. +insert into weather <b>partition (year, month, day)</b> select 'cloudy',2014,4,21; +-- Create new partition if necessary for specified year and month but variable day; insert a single value. +insert into weather <b>partition (year=2014, month=04, day)</b> select 'sunny',22; +</codeblock> + + <p> + The more key columns you specify in the <codeph>PARTITION</codeph> clause, the fewer columns you need in the <codeph>SELECT</codeph> + list. The trailing columns in the <codeph>SELECT</codeph> list are substituted in order for the partition key columns with no + specified value. + </p> + + </conbody> + + </concept> + + <concept id="partition_refresh" rev="2.7.0 IMPALA-1683 CDH-43732"> + + <title>Refreshing a Single Partition</title> + + <conbody> + + <p> + The <codeph>REFRESH</codeph> statement is typically used with partitioned tables when new data files are loaded into a partition by + some non-Impala mechanism, such as a Hive or Spark job. The <codeph>REFRESH</codeph> statement makes Impala aware of the new data + files so that they can be used in Impala queries. Because partitioned tables typically contain a high volume of data, the + <codeph>REFRESH</codeph> operation for a full partitioned table can take significant time. + </p> + + <p> + In <keyword keyref="impala27_full"/> and higher, you can include a <codeph>PARTITION (<varname>partition_spec</varname>)</codeph> clause in the + <codeph>REFRESH</codeph> statement so that only a single partition is refreshed. For example, <codeph>REFRESH big_table PARTITION + (year=2017, month=9, day=30)</codeph>. The partition spec must include all the partition key columns. 
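+        For example, a minimal sketch that contrasts the two forms, reusing the <codeph>big_table</codeph> example above:
+<codeblock>-- Rescans the data files for every partition in the table.
+refresh big_table;
+
+-- Rescans the data files for a single partition; all three partition key columns must be listed.
+refresh big_table partition (year=2017, month=9, day=30);
+</codeblock>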
See + <xref href="impala_refresh.xml#refresh"/> for more details and examples of <codeph>REFRESH</codeph> syntax and usage. + </p> + + </conbody> + + </concept> + + <concept id="partition_permissions"> + + <title>Permissions for Partition Subdirectories</title> + + <conbody> + + <p rev="1.3.1" conref="../shared/impala_common.xml#common/insert_inherit_permissions"/> + + </conbody> + + </concept> + + <concept id="partition_pruning"> + + <title>Partition Pruning for Queries</title> + + <conbody> + + <p> + Partition pruning refers to the mechanism where a query can skip reading the data files corresponding to one or more partitions. If + you can arrange for queries to prune large numbers of unnecessary partitions from the query execution plan, the queries use fewer + resources and are thus proportionally faster and more scalable. + </p> + + <p> + For example, if a table is partitioned by columns <codeph>YEAR</codeph>, <codeph>MONTH</codeph>, and <codeph>DAY</codeph>, then + <codeph>WHERE</codeph> clauses such as <codeph>WHERE year = 2013</codeph>, <codeph>WHERE year < 2010</codeph>, or <codeph>WHERE + year BETWEEN 1995 AND 1998</codeph> allow Impala to skip the data files in all partitions outside the specified range. Likewise, + <codeph>WHERE year = 2013 AND month BETWEEN 1 AND 3</codeph> could prune even more partitions, reading the data files for only a + portion of one year. + </p> + + <p outputclass="toc inpage"/> + + </conbody> + + <concept id="partition_pruning_checking"> + + <title>Checking if Partition Pruning Happens for a Query</title> + + <conbody> + + <p> + To check the effectiveness of partition pruning for a query, check the <codeph>EXPLAIN</codeph> output for the query before + running it. For example, this example shows a table with 3 partitions, where the query only reads 1 of them. The notation + <codeph>#partitions=1/3</codeph> in the <codeph>EXPLAIN</codeph> plan confirms that Impala can do the appropriate partition + pruning. + </p> + +<codeblock>[localhost:21000] > insert into census partition (year=2010) values ('Smith'),('Jones'); +[localhost:21000] > insert into census partition (year=2011) values ('Smith'),('Jones'),('Doe'); +[localhost:21000] > insert into census partition (year=2012) values ('Smith'),('Doe'); +[localhost:21000] > select name from census where year=2010; ++-------+ +| name | ++-------+ +| Smith | +| Jones | ++-------+ +[localhost:21000] > explain select name from census <b>where year=2010</b>; ++------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| PARTITION: UNPARTITIONED | +| | +| 1:EXCHANGE | +| | +| PLAN FRAGMENT 1 | +| PARTITION: RANDOM | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 1 | +| UNPARTITIONED | +| | +| 0:SCAN HDFS | +| table=predicate_propagation.census <b>#partitions=1/3</b> size=12B | ++------------------------------------------------------------------+</codeblock> + + <p rev="1.4.0"> + For a report of the volume of data that was actually read and processed at each stage of the query, check the output of the + <codeph>SUMMARY</codeph> command immediately after running the query. For a more detailed analysis, look at the output of the + <codeph>PROFILE</codeph> command; it includes this same summary report near the start of the profile output. 
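+        For example, a minimal sketch of such a session in <cmdname>impala-shell</cmdname> (output omitted):
+<codeblock>[localhost:21000] > select name from census where year=2010;
+[localhost:21000] > summary;
+[localhost:21000] > profile;
+</codeblock>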
+ </p> + + </conbody> + + </concept> + + <concept id="partition_pruning_sql"> + + <title>What SQL Constructs Work with Partition Pruning</title> + + <conbody> + + <p rev="1.2.2"> + <indexterm audience="Cloudera">predicate propagation</indexterm> + Impala can even do partition pruning in cases where the partition key column is not directly compared to a constant, by applying + the transitive property to other parts of the <codeph>WHERE</codeph> clause. This technique is known as predicate propagation, and + is available in Impala 1.2.2 and later. In this example, the census table includes another column indicating when the data was + collected, which happens in 10-year intervals. Even though the query does not compare the partition key column + (<codeph>YEAR</codeph>) to a constant value, Impala can deduce that only the partition <codeph>YEAR=2010</codeph> is required, and + again only reads 1 out of 3 partitions. + </p> + +<codeblock rev="1.2.2">[localhost:21000] > drop table census; +[localhost:21000] > create table census (name string, census_year int) partitioned by (year int); +[localhost:21000] > insert into census partition (year=2010) values ('Smith',2010),('Jones',2010); +[localhost:21000] > insert into census partition (year=2011) values ('Smith',2020),('Jones',2020),('Doe',2020); +[localhost:21000] > insert into census partition (year=2012) values ('Smith',2020),('Doe',2020); +[localhost:21000] > select name from census where year = census_year and census_year=2010; ++-------+ +| name | ++-------+ +| Smith | +| Jones | ++-------+ +[localhost:21000] > explain select name from census <b>where year = census_year and census_year=2010</b>; ++------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| PARTITION: UNPARTITIONED | +| | +| 1:EXCHANGE | +| | +| PLAN FRAGMENT 1 | +| PARTITION: RANDOM | +| | +| STREAM DATA SINK | +| EXCHANGE ID: 1 | +| UNPARTITIONED | +| | +| 0:SCAN HDFS | +| table=predicate_propagation.census <b>#partitions=1/3</b> size=22B | +| predicates: census_year = 2010, year = census_year | ++------------------------------------------------------------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/partitions_and_views"/> + + <p conref="../shared/impala_common.xml#common/analytic_partition_pruning_caveat"/> + + </conbody> + + </concept> + + <concept id="dynamic_partition_pruning"> + + <title>Dynamic Partition Pruning</title> + + <conbody> + + <p> + The original mechanism uses to prune partitions is <term>static partition pruning</term>, in which the conditions in the + <codeph>WHERE</codeph> clause are analyzed to determine in advance which partitions can be safely skipped. In Impala 2.5 / CDH 5.7 + and higher, Impala can perform <term>dynamic partition pruning</term>, where information about the partitions is collected during + the query, and Impala prunes unnecessary partitions in ways that were impractical to predict in advance. + </p> + + <p> + For example, if partition key columns are compared to literal values in a <codeph>WHERE</codeph> clause, Impala can perform static + partition pruning during the planning phase to only read the relevant partitions: + </p> + +<codeblock> +-- The query only needs to read 3 partitions whose key values are known ahead of time. +-- That's static partition pruning. 
+SELECT COUNT(*) FROM sales_table WHERE year IN (2005, 2010, 2015); +</codeblock> + + <p> + Dynamic partition pruning involves using information only available at run time, such as the result of a subquery: + </p> + +<codeblock conref="../shared/impala_common.xml#common/simple_dpp_example"/> + +<!-- Former example. Not sure it really would trigger DPP. SELECT COUNT(*) FROM sales_table WHERE year = (SELECT MAX(year) FROM some_other_table); --> + + <p> + In this case, Impala evaluates the subquery, sends the subquery results to all Impala nodes participating in the query, and then + each <cmdname>impalad</cmdname> daemon uses the dynamic partition pruning optimization to read only the partitions with the + relevant key values. + </p> + + <p> + Dynamic partition pruning is especially effective for queries involving joins of several large partitioned tables. Evaluating the + <codeph>ON</codeph> clauses of the join predicates might normally require reading data from all partitions of certain tables. If + the <codeph>WHERE</codeph> clauses of the query refer to the partition key columns, Impala can now often skip reading many of the + partitions while evaluating the <codeph>ON</codeph> clauses. The dynamic partition pruning optimization reduces the amount of I/O + and the amount of intermediate data stored and transmitted across the network during the query. + </p> + + <p conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/> + + <p> + Dynamic partition pruning is part of the runtime filtering feature, which applies to other kinds of queries in addition to queries + against partitioned tables. See <xref href="impala_runtime_filtering.xml#runtime_filtering"/> for full details about this feature. + </p> + + </conbody> + + </concept> + + </concept> + + <concept id="partition_key_columns"> + + <title>Partition Key Columns</title> + + <conbody> + + <p> + The columns you choose as the partition keys should be ones that are frequently used to filter query results in important, + large-scale queries. Popular examples are some combination of year, month, and day when the data has associated time values, and + geographic region when the data is associated with some place. + </p> + + <ul> + <li> + <p> + For time-based data, split out the separate parts into their own columns, because Impala cannot partition based on a + <codeph>TIMESTAMP</codeph> column. + </p> + </li> + + <li> + <p> + The data type of the partition columns does not have a significant effect on the storage required, because the values from those + columns are not stored in the data files, rather they are represented as strings inside HDFS directory names. + </p> + </li> + + <li rev="IMPALA-2499"> + <p> + In <keyword keyref="impala25_full"/> and higher, you can enable the <codeph>OPTIMIZE_PARTITION_KEY_SCANS</codeph> query option to speed up + queries that only refer to partition key columns, such as <codeph>SELECT MAX(year)</codeph>. This setting is not enabled by + default because the query behavior is slightly different if the table contains partition directories without actual data inside. + See <xref href="impala_optimize_partition_key_scans.xml#optimize_partition_key_scans"/> for details. + </p> + </li> + + <li> + <p conref="../shared/impala_common.xml#common/complex_types_partitioning"/> + </li> + + <li> + <p> + Remember that when Impala queries data stored in HDFS, it is most efficient to use multi-megabyte files to take advantage of the + HDFS block size. 
For Parquet tables, the block size (and ideal size of the data files) is <ph rev="parquet_block_size">256 MB in
+          Impala 2.0 and later</ph>. Therefore, avoid specifying too many partition key columns, which could result in individual
+          partitions containing only small amounts of data. For example, if you receive 1 GB of data per day, you might partition by year,
+          month, and day; while if you receive 5 GB of data per minute, you might partition by year, month, day, hour, and minute. If you
+          have data with a geographic component, you might partition based on postal code if you have many megabytes of data for each
+          postal code, but if not, you might partition by some larger region such as city, state, or country.
+        </p>
+      </li>
+    </ul>
+
+    <p conref="../shared/impala_common.xml#common/partition_key_optimization"/>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="mixed_format_partitions">
+
+    <title>Setting Different File Formats for Partitions</title>
+
+    <conbody>
+
+      <p>
+        Partitioned tables have the flexibility to use different file formats for different partitions. (For background information about
+        the different file formats Impala supports, see <xref href="impala_file_formats.xml#file_formats"/>.) For example, if you originally
+        received data in text format, then received new data in RCFile format, and eventually began receiving data in Parquet format, all
+        that data could reside in the same table for queries. You just need to ensure that the table is structured so that the data files
+        that use different file formats reside in separate partitions.
+      </p>
+
+      <p>
+        For example, here is how you might switch from text to Parquet data as you receive data for different years:
+      </p>
+
+<codeblock>[localhost:21000] > create table census (name string) partitioned by (year smallint);
+[localhost:21000] > alter table census add partition (year=2012); -- Text format;
+
+[localhost:21000] > alter table census add partition (year=2013); -- Text format switches to Parquet before data loaded;
+[localhost:21000] > alter table census partition (year=2013) set fileformat parquet;
+
+[localhost:21000] > insert into census partition (year=2012) values ('Smith'),('Jones'),('Lee'),('Singh');
+[localhost:21000] > insert into census partition (year=2013) values ('Flores'),('Bogomolov'),('Cooper'),('Appiah');</codeblock>
+
+      <p>
+        At this point, the HDFS directory for <codeph>year=2012</codeph> contains a text-format data file, while the HDFS directory for
+        <codeph>year=2013</codeph> contains a Parquet data file. As always, when loading non-trivial data, you would use <codeph>INSERT ...
+        SELECT</codeph> or <codeph>LOAD DATA</codeph> to import data in large batches, rather than <codeph>INSERT ... VALUES</codeph> which
+        produces small files that are inefficient for real-world queries.
+      </p>
+
+      <p>
+        For other file types that Impala cannot create natively, you can switch into Hive and issue the <codeph>ALTER TABLE ... SET
+        FILEFORMAT</codeph> statements and <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> statements there. After switching back to
+        Impala, issue a <codeph>REFRESH <varname>table_name</varname></codeph> statement so that Impala recognizes any partitions or new
+        data added through Hive.
+      </p>
+
+    </conbody>
+
+  </concept>
+
+  <concept id="partition_management">
+
+    <title>Managing Partitions</title>
+
+    <conbody>
+
+      <p>
+        You can add, drop, set the expected file format, or set the HDFS location of the data files for individual partitions within an
+        Impala table.
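+        For example, a minimal sketch of these operations against the <codeph>census</codeph> table from the previous
+        section (the HDFS path is hypothetical):
+<codeblock>alter table census add partition (year=2014);
+alter table census partition (year=2014) set fileformat parquet;
+alter table census partition (year=2014) set location '/user/impala/census/year=2014';
+alter table census drop partition (year=2012);
+</codeblock>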
See <xref href="impala_alter_table.xml#alter_table"/> for syntax details, and + <xref href="impala_partitioning.xml#mixed_format_partitions"/> for tips on managing tables containing partitions with different file + formats. + </p> + + <note conref="../shared/impala_common.xml#common/add_partition_set_location"/> + + <p> + What happens to the data files when a partition is dropped depends on whether the partitioned table is designated as internal or + external. For an internal (managed) table, the data files are deleted. For example, if data in the partitioned table is a copy of + raw data files stored elsewhere, you might save disk space by dropping older partitions that are no longer required for reporting, + knowing that the original data is still available if needed later. For an external table, the data files are left alone. For + example, dropping a partition without deleting the associated files lets Impala consider a smaller set of partitions, improving + query efficiency and reducing overhead for DDL operations on the table; if the data is needed again later, you can add the partition + again. See <xref href="impala_tables.xml#tables" /> for details and examples. + </p> + + </conbody> + + </concept> + + <concept rev="kudu" id="partition_kudu" audience="Cloudera"> + + <title>Using Partitioning with Kudu Tables</title> + + <prolog> + <metadata> + <data name="Category" value="Kudu"/> + </metadata> + </prolog> + + <conbody> + + <p> + Kudu tables use a more fine-grained partitioning scheme than tables containing HDFS data files. You specify a <codeph>DISTRIBUTE + BY</codeph> clause with the <codeph>CREATE TABLE</codeph> statement to identify how to divide the values from the partition key + columns. + </p> + + </conbody> + + </concept> + +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_perf_benchmarking.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_perf_benchmarking.xml b/docs/topics/impala_perf_benchmarking.xml new file mode 100644 index 0000000..b2e058d --- /dev/null +++ b/docs/topics/impala_perf_benchmarking.xml @@ -0,0 +1,36 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="perf_benchmarks"> + + <title>Benchmarking Impala Queries</title> + <titlealts audience="PDF"><navtitle>Benchmarking</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Proof of Concept"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + Because Impala, like other Hadoop components, is designed to handle large data volumes in a distributed + environment, conduct any performance tests using realistic data and cluster configurations. Use a multi-node + cluster rather than a single node; run queries against tables containing terabytes of data rather than tens + of gigabytes. The parallel processing techniques used by Impala are most appropriate for workloads that are + beyond the capacity of a single server. + </p> + + <p> + When you run queries returning large numbers of rows, the CPU time to pretty-print the output can be + substantial, giving an inaccurate measurement of the actual query time. Consider using the + <codeph>-B</codeph> option on the <codeph>impala-shell</codeph> command to turn off the pretty-printing, and + optionally the <codeph>-o</codeph> option to store query results in a file rather than printing to the + screen. See <xref href="impala_shell_options.xml#shell_options"/> for details. + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_perf_cookbook.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_perf_cookbook.xml b/docs/topics/impala_perf_cookbook.xml new file mode 100644 index 0000000..a42f7c9 --- /dev/null +++ b/docs/topics/impala_perf_cookbook.xml @@ -0,0 +1,269 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="perf_cookbook"> + + <title>Impala Performance Guidelines and Best Practices</title> + <titlealts audience="PDF"><navtitle>Performance Best Practices</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Planning"/> + <data name="Category" value="Proof of Concept"/> + <data name="Category" value="Guidelines"/> + <data name="Category" value="Best Practices"/> + <data name="Category" value="Proof of Concept"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + Here are performance guidelines and best practices that you can use during planning, experimentation, and + performance tuning for an Impala-enabled CDH cluster. 
All of this information is also available in more + detail elsewhere in the Impala documentation; it is gathered together here to serve as a cookbook and + emphasize which performance techniques typically provide the highest return on investment + </p> + + <p outputclass="toc inpage"/> + + <section id="perf_cookbook_file_format"> + + <title>Choose the appropriate file format for the data.</title> + + <p> + Typically, for large volumes of data (multiple gigabytes per table or partition), the Parquet file format + performs best because of its combination of columnar storage layout, large I/O request size, and + compression and encoding. See <xref href="impala_file_formats.xml#file_formats"/> for comparisons of all + file formats supported by Impala, and <xref href="impala_parquet.xml#parquet"/> for details about the + Parquet file format. + </p> + + <note> + For smaller volumes of data, a few gigabytes or less for each table or partition, you might not see + significant performance differences between file formats. At small data volumes, reduced I/O from an + efficient compressed file format can be counterbalanced by reduced opportunity for parallel execution. When + planning for a production deployment or conducting benchmarks, always use realistic data volumes to get a + true picture of performance and scalability. + </note> + </section> + + <section id="perf_cookbook_small_files"> + + <title>Avoid data ingestion processes that produce many small files.</title> + + <p> + When producing data files outside of Impala, prefer either text format or Avro, where you can build up the + files row by row. Once the data is in Impala, you can convert it to the more efficient Parquet format and + split into multiple data files using a single <codeph>INSERT ... SELECT</codeph> statement. Or, if you have + the infrastructure to produce multi-megabyte Parquet files as part of your data preparation process, do + that and skip the conversion step inside Impala. + </p> + + <p> + Always use <codeph>INSERT ... SELECT</codeph> to copy significant volumes of data from table to table + within Impala. Avoid <codeph>INSERT ... VALUES</codeph> for any substantial volume of data or + performance-critical tables, because each such statement produces a separate tiny data file. See + <xref href="impala_insert.xml#insert"/> for examples of the <codeph>INSERT ... SELECT</codeph> syntax. + </p> + + <p> + For example, if you have thousands of partitions in a Parquet table, each with less than + <ph rev="parquet_block_size">256 MB</ph> of data, consider partitioning in a less granular way, such as by + year / month rather than year / month / day. If an inefficient data ingestion process produces thousands of + data files in the same table or partition, consider compacting the data by performing an <codeph>INSERT ... + SELECT</codeph> to copy all the data to a different table; the data will be reorganized into a smaller + number of larger files by this process. + </p> + </section> + + <section id="perf_cookbook_partitioning"> + + <title>Choose partitioning granularity based on actual data volume.</title> + + <p> + Partitioning is a technique that physically divides the data based on values of one or more columns, such + as by year, month, day, region, city, section of a web site, and so on. When you issue queries that request + a specific value or range of values for the partition key columns, Impala can avoid reading the irrelevant + data, potentially yielding a huge savings in disk I/O. 
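+        For example, a minimal sketch with a hypothetical web-traffic table:
+<codeblock>-- Partitioned by year and month, so each partition holds roughly a month of data.
+create table web_traffic (ip string, url string, ts timestamp)
+  partitioned by (year smallint, month tinyint) stored as parquet;
+
+-- Reads only the three partitions for the first quarter of 2016.
+select count(*) from web_traffic where year = 2016 and month between 1 and 3;
+</codeblock>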
+ </p> + + <p> + When deciding which column(s) to use for partitioning, choose the right level of granularity. For example, + should you partition by year, month, and day, or only by year and month? Choose a partitioning strategy + that puts at least <ph rev="parquet_block_size">256 MB</ph> of data in each partition, to take advantage of + HDFS bulk I/O and Impala distributed queries. + </p> + + <p> + Over-partitioning can also cause query planning to take longer than necessary, as Impala prunes the + unnecessary partitions. Ideally, keep the number of partitions in the table under 30 thousand. + </p> + + <p> + When preparing data files to go in a partition directory, create several large files rather than many small + ones. If you receive data in the form of many small files and have no control over the input format, + consider using the <codeph>INSERT ... SELECT</codeph> syntax to copy data from one table or partition to + another, which compacts the files into a relatively small number (based on the number of nodes in the + cluster). + </p> + + <p> + If you need to reduce the overall number of partitions and increase the amount of data in each partition, + first look for partition key columns that are rarely referenced or are referenced in non-critical queries + (not subject to an SLA). For example, your web site log data might be partitioned by year, month, day, and + hour, but if most queries roll up the results by day, perhaps you only need to partition by year, month, + and day. + </p> + + <p> + If you need to reduce the granularity even more, consider creating <q>buckets</q>, computed values + corresponding to different sets of partition key values. For example, you can use the + <codeph>TRUNC()</codeph> function with a <codeph>TIMESTAMP</codeph> column to group date and time values + based on intervals such as week or quarter. See + <xref href="impala_datetime_functions.xml#datetime_functions"/> for details. + </p> + + <p> + See <xref href="impala_partitioning.xml#partitioning"/> for full details and performance considerations for + partitioning. + </p> + </section> + + <section id="perf_cookbook_partition_keys"> + + <title>Use smallest appropriate integer types for partition key columns.</title> + + <p> + Although it is tempting to use strings for partition key columns, since those values are turned into HDFS + directory names anyway, you can minimize memory usage by using numeric values for common partition key + fields such as <codeph>YEAR</codeph>, <codeph>MONTH</codeph>, and <codeph>DAY</codeph>. Use the smallest + integer type that holds the appropriate range of values, typically <codeph>TINYINT</codeph> for + <codeph>MONTH</codeph> and <codeph>DAY</codeph>, and <codeph>SMALLINT</codeph> for <codeph>YEAR</codeph>. + Use the <codeph>EXTRACT()</codeph> function to pull out individual date and time fields from a + <codeph>TIMESTAMP</codeph> value, and <codeph>CAST()</codeph> the return value to the appropriate integer + type. + </p> + </section> + + <section id="perf_cookbook_parquet_block_size"> + + <title>Choose an appropriate Parquet block size.</title> + + <p rev="parquet_block_size"> + By default, the Impala <codeph>INSERT ... SELECT</codeph> statement creates Parquet files with a 256 MB + block size. (This default was changed in Impala 2.0. Formerly, the limit was 1 GB, but Impala made + conservative estimates about compression, resulting in files that were smaller than 1 GB.) 
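+        For example, a minimal sketch of overriding that default for one session; the
+        <codeph>PARQUET_FILE_SIZE</codeph> query option is discussed later in this section, and the table names are
+        hypothetical:
+<codeblock>-- Default behavior: Parquet files up to 256 MB each.
+insert into parquet_sales select * from text_sales;
+
+-- Produce smaller files, for example so a modest data volume spreads across more hosts.
+set parquet_file_size=128m;
+insert into parquet_sales select * from text_sales_2016;
+</codeblock>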
+ </p> + + <p> + Each Parquet file written by Impala is a single block, allowing the whole file to be processed as a unit by a single host. + As you copy Parquet files into HDFS or between HDFS filesystems, use <codeph>hdfs dfs -pb</codeph> to preserve the original + block size. + </p> + + <p> + If there is only one or a few data block in your Parquet table, or in a partition that is the only one + accessed by a query, then you might experience a slowdown for a different reason: not enough data to take + advantage of Impala's parallel distributed queries. Each data block is processed by a single core on one of + the DataNodes. In a 100-node cluster of 16-core machines, you could potentially process thousands of data + files simultaneously. You want to find a sweet spot between <q>many tiny files</q> and <q>single giant + file</q> that balances bulk I/O and parallel processing. You can set the <codeph>PARQUET_FILE_SIZE</codeph> + query option before doing an <codeph>INSERT ... SELECT</codeph> statement to reduce the size of each + generated Parquet file. <ph rev="2.0.0">(Specify the file size as an absolute number of bytes, or in Impala + 2.0 and later, in units ending with <codeph>m</codeph> for megabytes or <codeph>g</codeph> for + gigabytes.)</ph> Run benchmarks with different file sizes to find the right balance point for your + particular data volume. + </p> + </section> + + <section id="perf_cookbook_stats"> + + <title>Gather statistics for all tables used in performance-critical or high-volume join queries.</title> + + <p> + Gather the statistics with the <codeph>COMPUTE STATS</codeph> statement. See + <xref href="impala_perf_joins.xml#perf_joins"/> for details. + </p> + </section> + + <section id="perf_cookbook_network"> + + <title>Minimize the overhead of transmitting results back to the client.</title> + + <p> + Use techniques such as: + </p> + + <ul> + <li> + Aggregation. If you need to know how many rows match a condition, the total values of matching values + from some column, the lowest or highest matching value, and so on, call aggregate functions such as + <codeph>COUNT()</codeph>, <codeph>SUM()</codeph>, and <codeph>MAX()</codeph> in the query rather than + sending the result set to an application and doing those computations there. Remember that the size of an + unaggregated result set could be huge, requiring substantial time to transmit across the network. + </li> + + <li> + Filtering. Use all applicable tests in the <codeph>WHERE</codeph> clause of a query to eliminate rows + that are not relevant, rather than producing a big result set and filtering it using application logic. + </li> + + <li> + <codeph>LIMIT</codeph> clause. If you only need to see a few sample values from a result set, or the top + or bottom values from a query using <codeph>ORDER BY</codeph>, include the <codeph>LIMIT</codeph> clause + to reduce the size of the result set rather than asking for the full result set and then throwing most of + the rows away. + </li> + + <li> + Avoid overhead from pretty-printing the result set and displaying it on the screen. When you retrieve the + results through <cmdname>impala-shell</cmdname>, use <cmdname>impala-shell</cmdname> options such as + <codeph>-B</codeph> and <codeph>--output_delimiter</codeph> to produce results without special + formatting, and redirect output to a file rather than printing to the screen. Consider using + <codeph>INSERT ... SELECT</codeph> to write the results directly to new files in HDFS. 
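+        For example, a minimal sketch of such an invocation (the query and output file are hypothetical):
+<codeblock>impala-shell -B --output_delimiter=',' -o /tmp/top_customers.csv \
+  -q 'select customer_id, sum(amount) from sales group by customer_id order by 2 desc limit 100'
+</codeblock>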
See + <xref href="impala_shell_options.xml#shell_options"/> for details about the + <cmdname>impala-shell</cmdname> command-line options. + </li> + </ul> + </section> + + <section id="perf_cookbook_explain"> + + <title>Verify that your queries are planned in an efficient logical manner.</title> + + <p> + Examine the <codeph>EXPLAIN</codeph> plan for a query before actually running it. See + <xref href="impala_explain.xml#explain"/> and <xref href="impala_explain_plan.xml#perf_explain"/> for + details. + </p> + </section> + + <section id="perf_cookbook_profile"> + + <title>Verify performance characteristics of queries.</title> + + <p> + Verify that the low-level aspects of I/O, memory usage, network bandwidth, CPU utilization, and so on are + within expected ranges by examining the query profile for a query after running it. See + <xref href="impala_explain_plan.xml#perf_profile"/> for details. + </p> + </section> + + <section id="perf_cookbook_os"> + + <title>Use appropriate operating system settings.</title> + + <p> + See <xref href="http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_admin_performance.html" scope="external" format="html">Optimizing Performance in CDH</xref> + for recommendations about operating system + settings that you can change to influence Impala performance. In particular, you might find + that changing the <codeph>vm.swappiness</codeph> Linux kernel setting to a non-zero value improves + overall performance. + </p> + </section> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_perf_ddl.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_perf_ddl.xml b/docs/topics/impala_perf_ddl.xml new file mode 100644 index 0000000..d075cd2 --- /dev/null +++ b/docs/topics/impala_perf_ddl.xml @@ -0,0 +1,42 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="perf_ddl"> + + <title>Performance Considerations for DDL Statements</title> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + <data name="Category" value="Impala"/> + <data name="Category" value="DDL"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + These tips and guidelines apply to the Impala DDL statements, which are listed in + <xref href="impala_ddl.xml#ddl"/>. + </p> + + <p> + Because Impala DDL statements operate on the metastore database, the performance considerations for those + statements are totally different than for distributed queries that operate on HDFS + <ph rev="2.2.0">or S3</ph> data files, or on HBase tables. + </p> + + <p> + Each DDL statement makes a relatively small update to the metastore database. The overhead for each statement + is proportional to the overall number of Impala and Hive tables, and (for a partitioned table) to the overall + number of partitions in that table. Issuing large numbers of DDL statements (such as one for each table or + one for each partition) also has the potential to encounter a bottleneck with access to the metastore + database. Therefore, for efficient DDL, try to design your application logic and ETL pipeline to avoid a huge + number of tables and a huge number of partitions within each table. In this context, <q>huge</q> is in the + range of tens of thousands or hundreds of thousands. 
+ </p> + + <note conref="../shared/impala_common.xml#common/add_partition_set_location"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_perf_hdfs_caching.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_perf_hdfs_caching.xml b/docs/topics/impala_perf_hdfs_caching.xml new file mode 100644 index 0000000..883bc8d --- /dev/null +++ b/docs/topics/impala_perf_hdfs_caching.xml @@ -0,0 +1,609 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.4" id="hdfs_caching"> + + <title>Using HDFS Caching with Impala (<keyword keyref="impala21"/> or higher only)</title> + <titlealts audience="PDF"><navtitle>HDFS Caching</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Scalability"/> + <data name="Category" value="HDFS"/> + <data name="Category" value="HDFS Caching"/> + <data name="Category" value="Memory"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + HDFS caching provides performance and scalability benefits in production environments where Impala queries + and other Hadoop jobs operate on quantities of data much larger than the physical RAM on the DataNodes, + making it impractical to rely on the Linux OS cache, which only keeps the most recently used data in memory. + Data read from the HDFS cache avoids the overhead of checksumming and memory-to-memory copying involved when + using data from the Linux OS cache. + </p> + + <note> + <p> + On a small or lightly loaded cluster, HDFS caching might not produce any speedup. It might even lead to + slower queries, if I/O read operations that were performed in parallel across the entire cluster are replaced by in-memory + operations operating on a smaller number of hosts. The hosts where the HDFS blocks are cached can become + bottlenecks because they experience high CPU load while processing the cached data blocks, while other hosts remain idle. + Therefore, always compare performance with and without this feature enabled, using a realistic workload. + </p> + <p rev="2.2.0"> + In <keyword keyref="impala22_full"/> and higher, you can spread the CPU load more evenly by specifying the <codeph>WITH REPLICATION</codeph> + clause of the <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. + This clause lets you control the replication factor for + HDFS caching for a specific table or partition. By default, each cached block is + only present on a single host, which can lead to CPU contention if the same host + processes each cached block. Increasing the replication factor lets Impala choose + different hosts to process different cached blocks, to better distribute the CPU load. + Always use a <codeph>WITH REPLICATION</codeph> setting of at least 3, and adjust upward + if necessary to match the replication factor for the underlying HDFS data files. + </p> + <p rev="2.5.0"> + In <keyword keyref="impala25_full"/> and higher, Impala automatically randomizes which host processes + a cached HDFS block, to avoid CPU hotspots. For tables where HDFS caching is not applied, + Impala designates which host to process a data block using an algorithm that estimates + the load on each host. 
If CPU hotspots still arise during queries, + you can enable additional randomization for the scheduling algorithm for non-HDFS cached data + by setting the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option. + </p> + </note> + + <p outputclass="toc inpage"/> + +<!-- Could conref this background link; haven't decided yet the best place or if it's needed twice. --> + + <p> + For background information about how to set up and manage HDFS caching for a CDH cluster, see +<!-- Original URL: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh_ig_hdfs_caching.html --> + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hdfs_caching.html" scope="external" format="html">the + CDH documentation</xref>. + </p> + </conbody> + + <concept id="hdfs_caching_overview"> + + <title>Overview of HDFS Caching for Impala</title> + <prolog> + <metadata> + <data name="Category" value="Concepts"/> + </metadata> + </prolog> + + <conbody> + + <p> + On <ph rev="upstream">CDH 5.1</ph> and higher, Impala can use the HDFS caching feature to make more effective use of RAM, so that + repeated queries can take advantage of data <q>pinned</q> in memory regardless of how much data is + processed overall. The HDFS caching feature lets you designate a subset of frequently accessed data to be + pinned permanently in memory, remaining in the cache across multiple queries and never being evicted. This + technique is suitable for tables or partitions that are frequently accessed and are small enough to fit + entirely within the HDFS memory cache. For example, you might designate several dimension tables to be + pinned in the cache, to speed up many different join queries that reference them. Or in a partitioned + table, you might pin a partition holding data from the most recent time period because that data will be + queried intensively; then when the next set of data arrives, you could unpin the previous partition and pin + the partition holding the new data. + </p> + + <p> + Because this Impala performance feature relies on HDFS infrastructure, it only applies to Impala tables + that use HDFS data files. HDFS caching for Impala does not apply to HBase tables, S3 tables, + Kudu tables, + or Isilon tables. + </p> + + </conbody> + </concept> + + <concept id="hdfs_caching_prereqs"> + + <title>Setting Up HDFS Caching for Impala</title> + + <conbody> + + <p> + To use HDFS caching with Impala, first set up that feature for your CDH cluster: + </p> + + <ul> + <li> + <p> + Decide how much memory to devote to the HDFS cache on each host. Remember that the total memory available + for cached data is the sum of the cache sizes on all the hosts. By default, any data block is only cached on one + host, although you can cache a block across multiple hosts by increasing the replication factor. + <!-- Obsoleted in Impala 2.2 and higher by IMPALA-1587. + Once a data block is cached on one host, all requests to process that block are routed to that same host.) + --> + </p> + </li> + + <li> + <p> + Issue <cmdname>hdfs cacheadmin</cmdname> commands to set up one or more cache pools, owned by the same + user as the <cmdname>impalad</cmdname> daemon (typically <codeph>impala</codeph>). 
For example: +<codeblock>hdfs cacheadmin -addPool four_gig_pool -owner impala -limit 4000000000 +</codeblock> + For details about the <cmdname>hdfs cacheadmin</cmdname> command, see +<!-- Original URL: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh_ig_hdfs_caching.html --> + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hdfs_caching.html" scope="external" format="html">the + CDH documentation</xref>. + </p> + </li> + </ul> + + <p> + Once HDFS caching is enabled and one or more pools are available, see + <xref href="impala_perf_hdfs_caching.xml#hdfs_caching_ddl"/> for how to choose which Impala data to load + into the HDFS cache. On the Impala side, you specify the cache pool name defined by the <codeph>hdfs + cacheadmin</codeph> command in the Impala DDL statements that enable HDFS caching for a table or partition, + such as <codeph>CREATE TABLE ... CACHED IN <varname>pool</varname></codeph> or <codeph>ALTER TABLE ... SET + CACHED IN <varname>pool</varname></codeph>. + </p> + </conbody> + </concept> + + <concept id="hdfs_caching_ddl"> + + <title>Enabling HDFS Caching for Impala Tables and Partitions</title> + + <conbody> + + <p> + Begin by choosing which tables or partitions to cache. For example, these might be lookup tables that are + accessed by many different join queries, or partitions corresponding to the most recent time period that + are analyzed by different reports or ad hoc queries. + </p> + + <p> + In your SQL statements, you specify logical divisions such as tables and partitions to be cached. Impala + translates these requests into HDFS-level directives that apply to particular directories and files. For + example, given a partitioned table <codeph>CENSUS</codeph> with a partition key column + <codeph>YEAR</codeph>, you could choose to cache all or part of the data as follows: + </p> + + <p conref="../shared/impala_common.xml#common/impala_cache_replication_factor"/> + +<codeblock>-- Cache the entire table (all partitions). +alter table census set cached in '<varname>pool_name</varname>'; + +-- Remove the entire table from the cache. +alter table census set uncached; + +-- Cache a portion of the table (a single partition). +-- If the table is partitioned by multiple columns (such as year, month, day), +-- the ALTER TABLE command must specify values for all those columns. +alter table census partition (year=1960) set cached in '<varname>pool_name</varname>'; + +<ph rev="2.2.0">-- Cache the data from one partition on up to 4 hosts, to minimize CPU load on any +-- single host when the same data block is processed multiple times. +alter table census partition (year=1970) + set cached in '<varname>pool_name</varname>' with replication = 4;</ph> + +-- At each stage, check the volume of cached data. +-- For large tables or partitions, the background loading might take some time, +-- so you might have to wait and reissue the statement until all the data +-- has finished being loaded into the cache. 
+show table stats census; ++-------+-------+--------+------+--------------+--------+ +| year | #Rows | #Files | Size | Bytes Cached | Format | ++-------+-------+--------+------+--------------+--------+ +| 1900 | -1 | 1 | 11B | NOT CACHED | TEXT | +| 1940 | -1 | 1 | 11B | NOT CACHED | TEXT | +| 1960 | -1 | 1 | 11B | 11B | TEXT | +| 1970 | -1 | 1 | 11B | NOT CACHED | TEXT | +| Total | -1 | 4 | 44B | 11B | | ++-------+-------+--------+------+--------------+--------+ +</codeblock> + + <p> + <b>CREATE TABLE considerations:</b> + </p> + + <p> + The HDFS caching feature affects the Impala <codeph>CREATE TABLE</codeph> statement as follows: + </p> + + <ul> + <li> + <p> + You can put a <codeph>CACHED IN '<varname>pool_name</varname>'</codeph> clause + <ph rev="2.2.0">and optionally a <codeph>WITH REPLICATION = <varname>number_of_hosts</varname></codeph> clause</ph> + at the end of a + <codeph>CREATE TABLE</codeph> statement to automatically cache the entire contents of the table, + including any partitions added later. The <varname>pool_name</varname> is a pool that you previously set + up with the <cmdname>hdfs cacheadmin</cmdname> command. + </p> + </li> + + <li> + <p> + Once a table is designated for HDFS caching through the <codeph>CREATE TABLE</codeph> statement, if new + partitions are added later through <codeph>ALTER TABLE ... ADD PARTITION</codeph> statements, the data in + those new partitions is automatically cached in the same pool. + </p> + </li> + + <li> + <p> + If you want to perform repetitive queries on a subset of data from a large table, and it is not practical + to designate the entire table or specific partitions for HDFS caching, you can create a new cached table + with just a subset of the data by using <codeph>CREATE TABLE ... CACHED IN '<varname>pool_name</varname>' + AS SELECT ... WHERE ...</codeph>. When you are finished with generating reports from this subset of data, + drop the table and both the data files and the data cached in RAM are automatically deleted. + </p> + </li> + </ul> + + <p> + See <xref href="impala_create_table.xml#create_table"/> for the full syntax. + </p> + + <p> + <b>Other memory considerations:</b> + </p> + + <p> + Certain DDL operations, such as <codeph>ALTER TABLE ... SET LOCATION</codeph>, are blocked while the + underlying HDFS directories contain cached files. You must uncache the files first, before changing the + location, dropping the table, and so on. + </p> + + <p> + When data is requested to be pinned in memory, that process happens in the background without blocking + access to the data while the caching is in progress. Loading the data from disk could take some time. + Impala reads each HDFS data block from memory if it has been pinned already, or from disk if it has not + been pinned yet. When files are added to a table or partition whose contents are cached, Impala + automatically detects those changes and performs a <codeph>REFRESH</codeph> automatically once the relevant + data is cached. + </p> + + <p> + The amount of data that you can pin on each node through the HDFS caching mechanism is subject to a quota + that is enforced by the underlying HDFS service. Before requesting to pin an Impala table or partition in + memory, check that its size does not exceed this quota. + </p> + + <note> + Because the HDFS cache consists of combined memory from all the DataNodes in the cluster, cached tables or + partitions can be bigger than the amount of HDFS cache memory on any single host. 
+ </note> + </conbody> + </concept> + + <concept id="hdfs_caching_etl"> + + <title>Loading and Removing Data with HDFS Caching Enabled</title> + <prolog> + <metadata> + <data name="Category" value="ETL"/> + </metadata> + </prolog> + + <conbody> + + <p> + When HDFS caching is enabled, extra processing happens in the background when you add or remove data + through statements such as <codeph>INSERT</codeph> and <codeph>DROP TABLE</codeph>. + </p> + + <p> + <b>Inserting or loading data:</b> + </p> + + <ul> + <li> + When Impala performs an <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph> or + <codeph><xref href="impala_load_data.xml#load_data">LOAD DATA</xref></codeph> statement for a table or + partition that is cached, the new data files are automatically cached and Impala recognizes that fact + automatically. + </li> + + <li> + If you perform an <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> through Hive, as always, Impala + only recognizes the new data files after a <codeph>REFRESH <varname>table_name</varname></codeph> + statement in Impala. + </li> + + <li> + If the cache pool is entirely full, or becomes full before all the requested data can be cached, the + Impala DDL statement returns an error. This is to avoid situations where only some of the requested data + could be cached. + </li> + + <li> + When HDFS caching is enabled for a table or partition, new data files are cached automatically when they + are added to the appropriate directory in HDFS, without the need for a <codeph>REFRESH</codeph> statement + in Impala. Impala automatically performs a <codeph>REFRESH</codeph> once the new data is loaded into the + HDFS cache. + </li> + </ul> + + <p> + <b>Dropping tables, partitions, or cache pools:</b> + </p> + + <p> + The HDFS caching feature interacts with the Impala + <codeph><xref href="impala_drop_table.xml#drop_table">DROP TABLE</xref></codeph> and + <codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE ... DROP PARTITION</xref></codeph> + statements as follows: + </p> + + <ul> + <li> + When you issue a <codeph>DROP TABLE</codeph> for a table that is entirely cached, or has some partitions + cached, the <codeph>DROP TABLE</codeph> succeeds and all the cache directives Impala submitted for that + table are removed from the HDFS cache system. + </li> + + <li> + The same applies to <codeph>ALTER TABLE ... DROP PARTITION</codeph>. The operation succeeds and any cache + directives are removed. + </li> + + <li> + As always, the underlying data files are removed if the dropped table is an internal table, or the + dropped partition is in its default location underneath an internal table. The data files are left alone + if the dropped table is an external table, or if the dropped partition is in a non-default location. + </li> + + <li> + If you designated the data files as cached through the <cmdname>hdfs cacheadmin</cmdname> command, and + the data files are left behind as described in the previous item, the data files remain cached. Impala + only removes the cache directives submitted by Impala through the <codeph>CREATE TABLE</codeph> or + <codeph>ALTER TABLE</codeph> statements. It is OK to have multiple redundant cache directives pertaining + to the same files; the directives all have unique IDs and owners so that the system can tell them apart. + </li> + + <li> + If you drop an HDFS cache pool through the <cmdname>hdfs cacheadmin</cmdname> command, all the Impala + data files are preserved, just no longer cached. 
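+        For example, a minimal sketch reusing the pool name from the setup example earlier in this topic:
+<codeblock>hdfs cacheadmin -removePool four_gig_pool
+</codeblock>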
After a subsequent <codeph>REFRESH</codeph>, + <codeph>SHOW TABLE STATS</codeph> reports 0 bytes cached for each associated Impala table or partition. + </li> + </ul> + + <p> + <b>Relocating a table or partition:</b> + </p> + + <p> + The HDFS caching feature interacts with the Impala + <codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE ... SET LOCATION</xref></codeph> + statement as follows: + </p> + + <ul> + <li> + If you have designated a table or partition as cached through the <codeph>CREATE TABLE</codeph> or + <codeph>ALTER TABLE</codeph> statements, subsequent attempts to relocate the table or partition through + an <codeph>ALTER TABLE ... SET LOCATION</codeph> statement will fail. You must issue an <codeph>ALTER + TABLE ... SET UNCACHED</codeph> statement for the table or partition first. Otherwise, Impala would lose + track of some cached data files and have no way to uncache them later. + </li> + </ul> + </conbody> + </concept> + + <concept id="hdfs_caching_admin"> + + <title>Administration for HDFS Caching with Impala</title> + + <conbody> + + <p> + Here are the guidelines and steps to check or change the status of HDFS caching for Impala data: + </p> + + <p> + <b>hdfs cacheadmin command:</b> + </p> + + <ul> + <li> + If you drop a cache pool with the <cmdname>hdfs cacheadmin</cmdname> command, Impala queries against the + associated data files will still work, by falling back to reading the files from disk. After performing a + <codeph>REFRESH</codeph> on the table, Impala reports the number of bytes cached as 0 for all associated + tables and partitions. + </li> + + <li> + You might use <cmdname>hdfs cacheadmin</cmdname> to get a list of existing cache pools, or detailed + information about the pools, as follows: +<codeblock scale="60">hdfs cacheadmin -listDirectives # Basic info +Found 122 entries + ID POOL REPL EXPIRY PATH + 123 testPool 1 never /user/hive/warehouse/tpcds.store_sales + 124 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-01-15 + 125 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-02-01 +... + +hdfs cacheadmin -listDirectives -stats # More details +Found 122 entries + ID POOL REPL EXPIRY PATH BYTES_NEEDED BYTES_CACHED FILES_NEEDED FILES_CACHED + 123 testPool 1 never /user/hive/warehouse/tpcds.store_sales 0 0 0 0 + 124 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-01-15 143169 143169 1 1 + 125 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-02-01 112447 112447 1 1 +... +</codeblock> + </li> + </ul> + + <p> + <b>Impala SHOW statement:</b> + </p> + + <ul> + <li> + For each table or partition, the <codeph>SHOW TABLE STATS</codeph> or <codeph>SHOW PARTITIONS</codeph> + statement displays the number of bytes currently cached by the HDFS caching feature. If there are no + cache directives in place for that table or partition, the result set displays <codeph>NOT + CACHED</codeph>. A value of 0, or a smaller number than the overall size of the table or partition, + indicates that the cache request has been submitted but the data has not been entirely loaded into memory + yet. See <xref href="impala_show.xml#show"/> for details. + </li> + </ul> + + <p> + <b>Cloudera Manager:</b> + </p> + + <ul> + <li> + You can enable or disable HDFS caching through Cloudera Manager, using the configuration setting + <uicontrol>Maximum Memory Used for Caching</uicontrol> for the HDFS service. 
This control sets the HDFS + configuration parameter <codeph>dfs_datanode_max_locked_memory</codeph>, which specifies the upper limit + of HDFS cache size on each node. + </li> + + <li> + All the other manipulation of the HDFS caching settings, such as what files are cached, is done through + the command line, either Impala DDL statements or the Linux <cmdname>hdfs cacheadmin</cmdname> command. + </li> + </ul> + + <p> + <b>Impala memory limits:</b> + </p> + + <p> + The Impala HDFS caching feature interacts with the Impala memory limits as follows: + </p> + + <ul> + <li> + The maximum size of each HDFS cache pool is specified externally to Impala, through the <cmdname>hdfs + cacheadmin</cmdname> command. + </li> + + <li> + All the memory used for HDFS caching is separate from the <cmdname>impalad</cmdname> daemon address space + and does not count towards the limits of the <codeph>--mem_limit</codeph> startup option, + <codeph>MEM_LIMIT</codeph> query option, or further limits imposed through YARN resource management or + the Linux <codeph>cgroups</codeph> mechanism. + </li> + + <li> + Because accessing HDFS cached data avoids a memory-to-memory copy operation, queries involving cached + data require less memory on the Impala side than the equivalent queries on uncached data. In addition to + any performance benefits in a single-user environment, the reduced memory helps to improve scalability + under high-concurrency workloads. + </li> + </ul> + </conbody> + </concept> + + <concept id="hdfs_caching_performance"> + + <title>Performance Considerations for HDFS Caching with Impala</title> + + <conbody> + + <p> + In Impala 1.4.0 and higher, Impala supports efficient reads from data that is pinned in memory through HDFS + caching. Impala takes advantage of the HDFS API and reads the data from memory rather than from disk + whether the data files are pinned using Impala DDL statements, or using the command-line mechanism where + you specify HDFS paths. + </p> + + <p> + When you examine the output of the <cmdname>impala-shell</cmdname> <cmdname>SUMMARY</cmdname> command, or + look in the metrics report for the <cmdname>impalad</cmdname> daemon, you see how many bytes are read from + the HDFS cache. For example, this excerpt from a query profile illustrates that all the data read during a + particular phase of the query came from the HDFS cache, because the <codeph>BytesRead</codeph> and + <codeph>BytesReadDataNodeCache</codeph> values are identical. + </p> + +<codeblock>HDFS_SCAN_NODE (id=0):(Total: 11s114ms, non-child: 11s114ms, % non-child: 100.00%) + - AverageHdfsReadThreadConcurrency: 0.00 + - AverageScannerThreadConcurrency: 32.75 +<b> - BytesRead: 10.47 GB (11240756479) + - BytesReadDataNodeCache: 10.47 GB (11240756479)</b> + - BytesReadLocal: 10.47 GB (11240756479) + - BytesReadShortCircuit: 10.47 GB (11240756479) + - DecompressionTime: 27s572ms +</codeblock> + + <p> + For queries involving smaller amounts of data, or in single-user workloads, you might not notice a + significant difference in query response time with or without HDFS caching. Even with HDFS caching turned + off, the data for the query might still be in the Linux OS buffer cache. The benefits become clearer as + data volume increases, and especially as the system processes more concurrent queries. HDFS caching + improves the scalability of the overall system. That is, it prevents query performance from declining when + the workload outstrips the capacity of the Linux OS cache. 
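+        Because response time alone might not show whether a query actually read from the HDFS cache, checking
+        the profile counters is often a more reliable indicator; for example, from <cmdname>impala-shell</cmdname>
+        (a sketch that reuses the <codeph>census</codeph> table from the earlier examples):
+<codeblock>-- Run the query, then inspect the profile of the most recent query.
+SELECT COUNT(*) FROM census;
+PROFILE;
+-- Look for BytesRead and BytesReadDataNodeCache under HDFS_SCAN_NODE.
+</codeblock>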
+      </p>
+
+      <p conref="../shared/impala_common.xml#common/hdfs_caching_encryption_caveat"/>
+
+      <p>
+        <b>SELECT considerations:</b>
+      </p>
+
+      <p>
+        The Impala HDFS caching feature interacts with the
+        <codeph><xref href="impala_select.xml#select">SELECT</xref></codeph> statement and query performance as
+        follows:
+      </p>
+
+      <ul>
+        <li>
+          Impala automatically reads from memory any data that has been designated as cached and actually loaded
+          into the HDFS cache. (It could take some time after the initial request to fully populate the cache for a
+          table with large size or many partitions.) The speedup comes from two aspects: reading from RAM instead
+          of disk, and accessing the data straight from the cache area instead of copying from one RAM area to
+          another. This second aspect yields further performance improvement over the standard OS caching
+          mechanism, which still results in memory-to-memory copying of cached data.
+        </li>
+
+        <li>
+          For small amounts of data, the query speedup might not be noticeable in terms of wall clock time. The
+          performance might be roughly the same with HDFS caching turned on or off, due to recently used data being
+          held in the Linux OS cache. The difference is more pronounced with:
+          <ul>
+            <li>
+              Data volumes (for all queries running concurrently) that exceed the size of the Linux OS cache.
+            </li>
+
+            <li>
+              A busy cluster running many concurrent queries, where the reduction in memory-to-memory copying and
+              overall memory usage during queries results in greater scalability and throughput.
+            </li>
+          </ul>
+          <p>
+            Thus, to really exercise and benchmark this feature in a development environment, you might need to
+            simulate realistic workloads and concurrent queries that match your production environment.
+          </p>
+          <p>
+            One way to simulate a heavy workload on a lightly loaded system is to flush the OS buffer cache (on
+            each DataNode) between iterations of queries against the same tables or partitions:
+          </p>
+<codeblock>$ sync
+$ echo 1 > /proc/sys/vm/drop_caches
+</codeblock>
+        </li>
+
+        <li>
+          Impala queries take advantage of HDFS cached data regardless of whether the cache directive was issued by
+          Impala or externally through the <cmdname>hdfs cacheadmin</cmdname> command, for example for an external
+          table where the cached data files might be accessed by several different Hadoop components.
+        </li>
+
+        <li>
+          If your query returns a large result set, the time reported for the query could be dominated by the time
+          needed to print the results on the screen. To measure the time for the underlying query processing, query
+          the <codeph>COUNT()</codeph> of the big result set, which does all the same processing but only prints a
+          single line to the screen.
+        </li>
+      </ul>
+    </conbody>
+  </concept>
+</concept>
