http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_hbase.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_hbase.xml b/docs/topics/impala_hbase.xml new file mode 100644 index 0000000..8b8d013 --- /dev/null +++ b/docs/topics/impala_hbase.xml @@ -0,0 +1,904 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="impala_hbase"> + + <title id="hbase">Using Impala to Query HBase Tables</title> + <titlealts audience="PDF"><navtitle>HBase Tables</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="HBase"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Tables"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">HBase</indexterm> + You can use Impala to query HBase tables. This capability allows convenient access to a storage system that + is tuned for different kinds of workloads than the default with Impala. The default Impala tables use data + files stored on HDFS, which are ideal for bulk loads and queries using full-table scans. In contrast, HBase + can do efficient queries for data organized for OLTP-style workloads, with lookups of individual rows or + ranges of values. + </p> + + <p> + From the perspective of an Impala user, coming from an RDBMS background, HBase is a kind of key-value store + where the value consists of multiple fields. The key is mapped to one column in the Impala table, and the + various fields of the value are mapped to the other columns in the Impala table. 
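+    </p>
+
+    <p>
+      As an illustrative sketch (the table and column names here are hypothetical, not part of any
+      example in this topic), a single HBase row such as the following:
+    </p>
+
+<codeblock>row key 'alice' -> { info:age => '35', info:city => 'Sydney' }</codeblock>
+
+    <p>
+      would appear through a mapped Impala table as a single row, for example
+      <codeph>id='alice', age='35', city='Sydney'</codeph>: the row key becomes the
+      <codeph>id</codeph> column, and each field of the value becomes another column of the same row.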
+ </p> + + <p> + For background information on HBase, see the snapshot of the Apache HBase site (including documentation) for + the level of HBase that comes with + <xref href="https://archive.cloudera.com/cdh5/cdh/5/hbase/" scope="external" format="html">CDH 5</xref>. To + install HBase on a CDH cluster, see the installation instructions for + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hbase_installation.html" scope="external" format="html">CDH 5</xref>. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="hbase_using"> + + <title>Overview of Using HBase with Impala</title> + <prolog> + <metadata> + <data name="Category" value="Concepts"/> + </metadata> + </prolog> + + <conbody> + + <p> + When you use Impala with HBase: + </p> + + <ul> + <li> + You create the tables on the Impala side using the Hive shell, because the Impala <codeph>CREATE + TABLE</codeph> statement currently does not support custom SerDes and some other syntax needed for these + tables: + <ul> + <li> + You designate it as an HBase table using the <codeph>STORED BY + 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'</codeph> clause on the Hive <codeph>CREATE + TABLE</codeph> statement. + </li> + + <li> + You map these specially created tables to corresponding tables that exist in HBase, with the clause + <codeph>TBLPROPERTIES("hbase.table.name" = "<varname>table_name_in_hbase</varname>")</codeph> on the + Hive <codeph>CREATE TABLE</codeph> statement. + </li> + + <li> + See <xref href="#hbase_queries"/> for a full example. + </li> + </ul> + </li> + + <li> + You define the column corresponding to the HBase row key as a string with the <codeph>#string</codeph> + keyword, or map it to a <codeph>STRING</codeph> column. + </li> + + <li> + Because Impala and Hive share the same metastore database, once you create the table in Hive, you can + query or insert into it through Impala. 
(After creating a new table through Hive, issue the + <codeph>INVALIDATE METADATA</codeph> statement in <cmdname>impala-shell</cmdname> to make Impala aware of + the new table.) + </li> + + <li> + You issue queries against the Impala tables. For efficient queries, use <codeph>WHERE</codeph> clauses to + find a single key value or a range of key values wherever practical, by testing the Impala column + corresponding to the HBase row key. Avoid queries that do full-table scans, which are efficient for + regular Impala tables but inefficient in HBase. + </li> + </ul> + + <p> + To work with an HBase table from Impala, ensure that the <codeph>impala</codeph> user has read/write + privileges for the HBase table, using the <codeph>GRANT</codeph> command in the HBase shell. For details + about HBase security, see the + <xref href="http://hbase.apache.org/book/ch08s04.html" format="html" scope="external">Security chapter in + the HBase Reference Guide</xref>. + </p> + </conbody> + </concept> + + <concept id="hbase_config"> + + <title>Configuring HBase for Use with Impala</title> + <prolog> + <metadata> + <data name="Category" value="Configuring"/> + </metadata> + </prolog> + + <conbody> + + <p> + HBase works out of the box with Impala. There is no mandatory configuration needed to use these two + components together. 
+ </p> + + <p> + To avoid delays if HBase is unavailable during Impala startup or after an <codeph>INVALIDATE + METADATA</codeph> statement, <ph rev="upstream">Cloudera</ph> recommends setting timeout values as follows in + <filepath>/etc/impala/conf/hbase-site.xml</filepath> (for environments not managed by Cloudera Manager): + </p> + +<codeblock><property> + <name>hbase.client.retries.number</name> + <value>3</value> +</property> +<property> + <name>hbase.rpc.timeout</name> + <value>3000</value> +</property> +</codeblock> + + <p> + Currently, Cloudera Manager does not have an Impala-only override for HBase settings, so any HBase + configuration change you make through Cloudera Manager would take effect for all HBase applications. + Therefore, this change is not recommended on systems managed by Cloudera Manager. + </p> + </conbody> + </concept> + + <concept id="hbase_types"> + + <title>Supported Data Types for HBase Columns</title> + + <conbody> + + <p> + To understand how Impala column data types are mapped to fields in HBase, you should have some background + knowledge about HBase first. You set up the mapping by running the <codeph>CREATE TABLE</codeph> statement + in the Hive shell. See + <xref href="https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration" scope="external" format="html">the + Hive wiki</xref> for a starting point, and <xref href="#hbase_queries"/> for examples. + </p> + + <p> + HBase works as a kind of <q>bit bucket</q>, in the sense that HBase does not enforce any typing for the + key or value fields. All the type enforcement is done on the Impala side. + </p> + + <p> + For best performance of Impala queries against HBase tables, most queries will perform comparisons in the + <codeph>WHERE</codeph> clause against the column that corresponds to the HBase row key. When creating the table + through the Hive shell, use the <codeph>STRING</codeph> data type for the column that corresponds to the + HBase row key.
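+    </p>
+
+    <p>
+      For example, a Hive table definition along the following lines (a hypothetical sketch; the
+      table, column, and column family names are illustrative only) maps the HBase row key to a
+      <codeph>STRING</codeph> column named <codeph>id</codeph>:
+    </p>
+
+<codeblock>CREATE EXTERNAL TABLE hbase_metrics (
+  id string,
+  metric_value int)
+STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+WITH SERDEPROPERTIES (
+  "hbase.columns.mapping" = ":key,metricsCF:metric_value"
+)
+TBLPROPERTIES("hbase.table.name" = "hbase_metrics");
+</codeblock>
+
+    <p>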
Impala can translate conditional tests (through operators such as <codeph>=</codeph>, + <codeph><</codeph>, <codeph>BETWEEN</codeph>, and <codeph>IN</codeph>) against this column into fast + lookups in HBase, but this optimization (<q>predicate pushdown</q>) only works when that column is + defined as <codeph>STRING</codeph>. + </p> + + <p> + Starting in Impala 1.1, Impala also supports reading and writing to columns that are defined in the Hive + <codeph>CREATE TABLE</codeph> statement using binary data types, represented in the Hive table definition + using the <codeph>#binary</codeph> keyword, often abbreviated as <codeph>#b</codeph>. Defining numeric + columns as binary can reduce the overall data volume in the HBase tables. You should still define the + column that corresponds to the HBase row key as a <codeph>STRING</codeph>, to allow fast lookups using + that column. + </p> + </conbody> + </concept> + + <concept id="hbase_performance"> + + <title>Performance Considerations for the Impala-HBase Integration</title> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + </metadata> + </prolog> + + <conbody> + + <p> + To understand the performance characteristics of SQL queries against data stored in HBase, you should have + some background knowledge about how HBase interacts with SQL-oriented systems first. See + <xref href="https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration" scope="external" format="html">the + Hive wiki</xref> for a starting point; because Impala shares the same metastore database as Hive, the + information about mapping columns from Hive tables to HBase tables is generally applicable to Impala too. + </p> + + <p> + Impala uses the HBase client API via Java Native Interface (JNI) to query data stored in HBase. This + querying does not read HFiles directly.
The extra communication overhead makes it important to choose what + data to store in HBase or in HDFS, and construct efficient queries that can retrieve the HBase data + efficiently: + </p> + + <ul> + <li> + Use HBase tables for queries that return a single row or a range of rows, not queries that scan the entire + table. (If a query has no <codeph>WHERE</codeph> clause, that is a strong indicator that it is an + inefficient query for an HBase table.) + </li> + + <li> + If you have join queries that do aggregation operations on large fact tables and join the results against + small dimension tables, consider using Impala for the fact tables and HBase for the dimension tables. + (Because Impala does a full scan on the HBase table in this case, rather than doing single-row HBase + lookups based on the join column, only use this technique where the HBase table is small enough that + doing a full table scan does not cause a performance bottleneck for the query.) + </li> + </ul> + + <p> + Query predicates are applied to row keys as start and stop keys, thereby limiting the scope of a particular + lookup. If row keys are not mapped to string columns, then ordering is typically incorrect and comparison + operations do not work. For example, evaluating for greater + than (>) or less than (<) cannot be completed. + </p> + + <p> + Predicates on non-key columns can be sent to HBase to scan as <codeph>SingleColumnValueFilters</codeph>, + providing some performance gains. In such a case, HBase returns fewer rows than if those same predicates + were applied using Impala. While there is some improvement, it is not as great as when start and stop rows are + used. This is because the number of rows that HBase must examine is not limited as it is when start and + stop rows are used. As long as the row key predicate only applies to a single row, HBase will locate and + return that row.
Conversely, if a non-key predicate is used, even if it only applies to a single row, HBase + must still scan the entire table to find the correct result. + </p> + + <example> + + <title>Interpreting EXPLAIN Output for HBase Queries</title> + + <p> + For example, here are some queries against the following Impala table, which is mapped to an HBase table. + The examples show excerpts from the output of the <codeph>EXPLAIN</codeph> statement, demonstrating what + things to look for to indicate an efficient or inefficient query against an HBase table. + </p> + + <p> + The first column (<codeph>cust_id</codeph>) was specified as the key column in the <codeph>CREATE + EXTERNAL TABLE</codeph> statement; for performance, it is important to declare this column as + <codeph>STRING</codeph>. Other columns, such as <codeph>BIRTH_YEAR</codeph> and + <codeph>NEVER_LOGGED_ON</codeph>, are also declared as <codeph>STRING</codeph>, rather than their + <q>natural</q> types of <codeph>INT</codeph> or <codeph>BOOLEAN</codeph>, because Impala can optimize + those types more effectively in HBase tables. For comparison, we leave one column, + <codeph>YEAR_REGISTERED</codeph>, as <codeph>INT</codeph> to show that filtering on this column is + inefficient. 
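+      </p>
+
+      <p>
+        For reference, the Hive table definition behind this example might look something like the
+        following (a hypothetical sketch; the actual statement and HBase column family names are not
+        shown in this topic):
+      </p>
+
+<codeblock>CREATE EXTERNAL TABLE hbase_table (
+  cust_id string,
+  birth_year string,
+  never_logged_on string,
+  private_email_address string,
+  year_registered int)
+STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+WITH SERDEPROPERTIES (
+  "hbase.columns.mapping" =
+    ":key,infoCF:birth_year,infoCF:never_logged_on,infoCF:private_email_address,infoCF:year_registered"
+)
+TBLPROPERTIES("hbase.table.name" = "hbase_table");
+</codeblock>
+
+      <p>
+        The <codeph>DESCRIBE</codeph> output below shows the resulting column types.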
+ </p> + +<codeblock>describe hbase_table; +Query: describe hbase_table ++-----------------------+--------+---------+ +| name | type | comment | ++-----------------------+--------+---------+ +| cust_id | <b>string</b> | | +| birth_year | <b>string</b> | | +| never_logged_on | <b>string</b> | | +| private_email_address | string | | +| year_registered | <b>int</b> | | ++-----------------------+--------+---------+ +</codeblock> + + <p> + The best case for performance involves a single row lookup using an equality comparison on the column + defined as the row key: + </p> + +<codeblock>explain select count(*) from hbase_table where cust_id = '[email protected]'; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. | +| hbase.hbase_table | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | | +| 01:AGGREGATE | +| | output: count(*) | +| | | +<b>| 00:SCAN HBASE [hbase.hbase_table] |</b> +<b>| start key: [email protected] |</b> +<b>| stop key: [email protected]\0 |</b> ++------------------------------------------------------------------------------------+ +</codeblock> + + <p> + Another type of efficient query involves a range lookup on the row key column, using SQL operators such + as greater than (or equal), less than (or equal), or <codeph>BETWEEN</codeph>. This example also includes + an equality test on a non-key column; because that column is a <codeph>STRING</codeph>, Impala can let + HBase perform that test, indicated by the <codeph>hbase filters:</codeph> line in the + <codeph>EXPLAIN</codeph> output. 
Doing the filtering within HBase is more efficient than transmitting all + the data to Impala and doing the filtering on the Impala side. + </p> + +<codeblock>explain select count(*) from hbase_table where cust_id between 'a' and 'b' + and never_logged_on = 'true'; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... +<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. | +| hbase.hbase_table | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | |--> +| 01:AGGREGATE | +| | output: count(*) | +| | | +<b>| 00:SCAN HBASE [hbase.hbase_table] |</b> +<b>| start key: a |</b> +<b>| stop key: b\0 |</b> +<b>| hbase filters: cols:never_logged_on EQUAL 'true' |</b> ++------------------------------------------------------------------------------------+ +</codeblock> + + <p> + The query is less efficient if Impala has to evaluate any of the predicates, because Impala must scan the + entire HBase table. Impala can only push down predicates to HBase for columns declared as + <codeph>STRING</codeph>. This example tests a column declared as <codeph>INT</codeph>, and the + <codeph>predicates:</codeph> line in the <codeph>EXPLAIN</codeph> output indicates that the test is + performed after the data is transmitted to Impala. + </p> + +<codeblock>explain select count(*) from hbase_table where year_registered = 2010; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... +<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. 
| +| hbase.hbase_table | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | |--> +| 01:AGGREGATE | +| | output: count(*) | +| | | +<b>| 00:SCAN HBASE [hbase.hbase_table] |</b> +<b>| predicates: year_registered = 2010 |</b> ++------------------------------------------------------------------------------------+ +</codeblock> + + <p> + The same inefficiency applies if the key column is compared to any non-constant value. Here, even though + the key column is a <codeph>STRING</codeph>, and is tested using an equality operator, Impala must scan + the entire HBase table because the key column is compared to another column value rather than a constant. + </p> + +<codeblock>explain select count(*) from hbase_table where cust_id = private_email_address; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... +<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. | +| hbase.hbase_table | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | |--> +| 01:AGGREGATE | +| | output: count(*) | +| | | +<b>| 00:SCAN HBASE [hbase.hbase_table] |</b> +<b>| predicates: cust_id = private_email_address |</b> ++------------------------------------------------------------------------------------+ +</codeblock> + + <p> + Currently, tests on the row key using <codeph>OR</codeph> or <codeph>IN</codeph> clauses are not + optimized into direct lookups either. Such limitations might be lifted in the future, so always check the + <codeph>EXPLAIN</codeph> output to be sure whether a particular SQL construct results in an efficient + query or not for HBase tables. 
+ </p> + +<codeblock>explain select count(*) from hbase_table where + cust_id = '[email protected]' or cust_id = '[email protected]'; ++----------------------------------------------------------------------------------------+ +| Explain String | ++----------------------------------------------------------------------------------------+ +... +<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. | +| hbase.hbase_table | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | |--> +| 01:AGGREGATE | +| | output: count(*) | +| | | +<b>| 00:SCAN HBASE [hbase.hbase_table] |</b> +<b>| predicates: cust_id = '[email protected]' OR cust_id = '[email protected]' |</b> ++----------------------------------------------------------------------------------------+ + +explain select count(*) from hbase_table where + cust_id in ('[email protected]', '[email protected]'); ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... +<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. 
| +| hbase.hbase_table | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | |--> +| 01:AGGREGATE | +| | output: count(*) | +| | | +<b>| 00:SCAN HBASE [hbase.hbase_table] |</b> +<b>| predicates: cust_id IN ('[email protected]', '[email protected]') |</b> ++------------------------------------------------------------------------------------+ +</codeblock> + + <p> + Either rewrite into separate queries for each value and combine the results in the application, or + combine the single-row queries using <codeph>UNION ALL</codeph>: + </p> + +<codeblock>select count(*) from hbase_table where cust_id = '[email protected]'; +select count(*) from hbase_table where cust_id = '[email protected]'; + +explain + select count(*) from hbase_table where cust_id = '[email protected]' + union all + select count(*) from hbase_table where cust_id = '[email protected]'; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... +<!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. | +| hbase.hbase_table | +| | +| 09:EXCHANGE [PARTITION=UNPARTITIONED] | +| | | +| |--11:MERGE | +| | | | +| | 08:AGGREGATE [MERGE FINALIZE] | +| | | output: sum(count(*)) | +| | | | +| | 07:EXCHANGE [PARTITION=UNPARTITIONED] | +| | | |--> +| | 04:AGGREGATE | +| | | output: count(*) | +| | | | +<b>| | 03:SCAN HBASE [hbase.hbase_table] |</b> +<b>| | start key: [email protected] |</b> +<b>| | stop key: [email protected]\0 |</b> +| | | +| 10:MERGE | +...
+<!--| | | +| 06:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 05:EXCHANGE [PARTITION=UNPARTITIONED] | +| | |--> +| 02:AGGREGATE | +| | output: count(*) | +| | | +<b>| 01:SCAN HBASE [hbase.hbase_table] |</b> +<b>| start key: [email protected] |</b> +<b>| stop key: [email protected]\0 |</b> ++------------------------------------------------------------------------------------+ +</codeblock> + + </example> + + <example> + + <title>Configuration Options for Java HBase Applications</title> + + <p> If you have an HBase Java application that calls the + <codeph>setCacheBlocks</codeph> or <codeph>setCaching</codeph> + methods of the class <xref + href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" + scope="external" format="html" + >org.apache.hadoop.hbase.client.Scan</xref>, you can set these same + caching behaviors through Impala query options, to control the memory + pressure on the HBase RegionServer. For example, when doing queries in + HBase that result in full-table scans (which by default are + inefficient for HBase), you can reduce memory usage and speed up the + queries by turning off the <codeph>HBASE_CACHE_BLOCKS</codeph> setting + and specifying a large number for the <codeph>HBASE_CACHING</codeph> + setting. + <!-- That recommendation w.r.t. full-table scans comes from the Cloudera HBase forum: http://community.cloudera.com/t5/Realtime-Random-Access-Apache/How-to-optimise-Full-Table-Scan-FTS-in-HBase/td-p/97 --> + </p> + + <p> + To set these options, issue commands like the following in <cmdname>impala-shell</cmdname>: + </p> + +<codeblock>-- Same as calling setCacheBlocks(true) or setCacheBlocks(false). +set hbase_cache_blocks=true; +set hbase_cache_blocks=false; + +-- Same as calling setCaching(rows). 
+set hbase_caching=1000; +</codeblock> + + <p> + Or update the <cmdname>impalad</cmdname> defaults file <filepath>/etc/default/impala</filepath> and + include settings for <codeph>HBASE_CACHE_BLOCKS</codeph> and/or <codeph>HBASE_CACHING</codeph> in the + <codeph>-default_query_options</codeph> setting for <codeph>IMPALA_SERVER_ARGS</codeph>. See + <xref href="impala_config_options.xml#config_options"/> for details. + </p> + + <note> + In Impala 2.0 and later, these options are settable through the JDBC or ODBC interfaces using the + <codeph>SET</codeph> statement. + </note> + + </example> + </conbody> + </concept> + + <concept id="hbase_scenarios"> + + <title>Use Cases for Querying HBase through Impala</title> + <prolog> + <metadata> + <data name="Category" value="Use Cases"/> + </metadata> + </prolog> + + <conbody> + + <p> + The following are popular use cases for using Impala to query HBase tables: + </p> + + <ul> + <li> + Keeping large fact tables in Impala, and smaller dimension tables in HBase. The fact tables use Parquet + or other binary file format optimized for scan operations. Join queries scan through the large Impala + fact tables, and cross-reference the dimension tables using efficient single-row lookups in HBase. + </li> + + <li> + Using HBase to store rapidly incrementing counters, such as how many times a web page has been viewed, or + on a social network, how many connections a user has or how many votes a post received. HBase is + efficient for capturing such changeable data: the append-only storage mechanism is efficient for writing + each change to disk, and a query always returns the latest value. An application could query specific + totals like these from HBase, and combine the results with a broader set of data queried from Impala. + </li> + + <li> + <p> + Storing very wide tables in HBase. Wide tables have many columns, possibly thousands, typically + recording many attributes for an important subject such as a user of an online service. 
These tables + are also often sparse, that is, most of the column values are <codeph>NULL</codeph>, 0, + <codeph>false</codeph>, empty string, or some other blank or placeholder value. (For example, any particular + web site user might never have used some site feature, filled in a certain field in their profile, + visited a particular part of the site, and so on.) A typical query against this kind of table is to + look up a single row to retrieve all the information about a specific subject, rather than summing, + averaging, or filtering millions of rows as in typical Impala-managed tables. + </p> + <p> + Or the HBase table could be joined with a larger Impala-managed table. For example, analyze the large + Impala table representing web traffic for a site and pick out 50 users who view the most pages. Join + that result with the wide user table in HBase to look up attributes of those users. The HBase side of + the join would result in 50 efficient single-row lookups in HBase, rather than scanning the entire user + table. + </p> + </li> + </ul> + </conbody> + </concept> + + <concept audience="Cloudera" id="hbase_create_new"> + + <title>Creating a New HBase Table for Impala to Use</title> + + <conbody> + + <p> + You can create an HBase-backed table through a <codeph>CREATE TABLE</codeph> statement in the Hive shell, + without going into the HBase shell at all: + </p> + + <!-- To do: + Add example. (Not critical because this subtopic is currently hidden.) + --> + </conbody> + </concept> + + <concept audience="Cloudera" id="hbase_reuse_existing"> + + <title>Associate Impala with an Existing HBase Table</title> + + <conbody> + + <p> + If you already have some HBase tables created through the HBase shell, you can make them accessible to + Impala through a <codeph>CREATE TABLE</codeph> statement in the Hive shell: + </p> + + <!-- To do: + Add example. (Not critical because this subtopic is currently hidden.)
+ --> + </conbody> + </concept> + + <concept audience="Cloudera" id="hbase_column_families"> + + <title>Map HBase Columns and Column Families to Impala Columns</title> + + <conbody> + + <p/> + </conbody> + </concept> + + <concept id="hbase_loading"> + + <title>Loading Data into an HBase Table</title> + <prolog> + <metadata> + <data name="Category" value="ETL"/> + <data name="Category" value="Ingest"/> + </metadata> + </prolog> + + <conbody> + + <p> + The Impala <codeph>INSERT</codeph> statement works for HBase tables. The <codeph>INSERT ... VALUES</codeph> + syntax is ideally suited to HBase tables, because inserting a single row is an efficient operation for an + HBase table. (For regular Impala tables, with data files in HDFS, the tiny data files produced by + <codeph>INSERT ... VALUES</codeph> are extremely inefficient, so you would not use that technique with + tables containing any significant data volume.) + </p> + + <!-- To do: + Add examples throughout this section. + --> + + <p> + When you use the <codeph>INSERT ... SELECT</codeph> syntax, the result in the HBase table could be fewer + rows than you expect. HBase only stores the most recent version of each unique row key, so if an + <codeph>INSERT ... 
SELECT</codeph> statement copies over multiple rows containing the same value for the + key column, subsequent queries will only return one row with each key column value. + </p> + + <p> + Although Impala does not have an <codeph>UPDATE</codeph> statement, you can achieve the same effect by + doing successive <codeph>INSERT</codeph> statements using the same value for the key column each time. + </p> + + </conbody> + </concept> + + <concept id="hbase_limitations"> + + <title>Limitations and Restrictions of the Impala and HBase Integration</title> + + <conbody> + + <p> + The Impala integration with HBase has the following limitations and restrictions, some inherited from the + integration between HBase and Hive, and some unique to Impala: + </p> + + <ul> + <li> + <p> + If you issue a <codeph>DROP TABLE</codeph> for an internal (Impala-managed) table that is mapped to an + HBase table, the underlying table is not removed in HBase. In contrast, the Hive <codeph>DROP TABLE</codeph> + statement does remove the underlying HBase table in this case. + </p> + </li> + + <li> + <p> + The <codeph>INSERT OVERWRITE</codeph> statement is not available for HBase tables. You can insert new + data, or modify an existing row by inserting a new row with the same key value, but not replace the + entire contents of the table. You can do an <codeph>INSERT OVERWRITE</codeph> in Hive if you need this + capability. + </p> + </li> + + <li> + <p> + If you issue a <codeph>CREATE TABLE LIKE</codeph> statement for a table mapped to an HBase table, the + new table is also an HBase table, but inherits the same underlying HBase table name as the original. + The new table is effectively an alias for the old one, not a new table with identical column structure. + Avoid using <codeph>CREATE TABLE LIKE</codeph> for HBase tables, to avoid any confusion. + </p> + </li> + + <li> + <p> + Copying data into an HBase table using the Impala <codeph>INSERT ... 
SELECT</codeph> syntax might + produce fewer new rows than are in the query result set. If the result set contains multiple rows with + the same value for the key column, each row supersedes any previous rows with the same key value. + Because the order of the inserted rows is unpredictable, you cannot rely on this technique to preserve + the <q>latest</q> version of a particular key value. + </p> + </li> + <li rev="2.3.0"> + <p> + Because the complex data types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>) + available in <keyword keyref="impala23_full"/> and higher are currently only supported in Parquet tables, you cannot + use these types in HBase tables that are queried through Impala. + </p> + </li> + <li> + <p conref="../shared/impala_common.xml#common/hbase_no_load_data"/> + </li> + </ul> + </conbody> + </concept> + + <concept id="hbase_queries"> + + <title>Examples of Querying HBase Tables from Impala</title> + + <conbody> + + <p> + The following examples create an HBase table with four column families, + create a corresponding table through Hive, + then insert and query the table through Impala. + </p> + <p> + In the HBase shell, the table + name is quoted in <codeph>CREATE</codeph> and <codeph>DROP</codeph> statements. Tables created in HBase + begin in <q>enabled</q> state; before dropping them through the HBase shell, you must issue a + <codeph>disable '<varname>table_name</varname>'</codeph> statement. + </p> + +<codeblock>$ hbase shell +15/02/10 16:07:45 +HBase Shell; enter 'help<RETURN>' for list of supported commands. +Type "exit<RETURN>" to leave the HBase Shell +Version 0.94.2-cdh4.2.0, rUnknown, Fri Feb 15 11:51:18 PST 2013 + +hbase(main):001:0> create 'hbasealltypessmall', 'boolsCF', 'intsCF', 'floatsCF', 'stringsCF' +0 row(s) in 4.6520 seconds + +=> Hbase::Table - hbasealltypessmall +hbase(main):006:0> quit +</codeblock> + + <p> + Issue the following <codeph>CREATE TABLE</codeph> statement in the Hive shell.
(The Impala <codeph>CREATE + TABLE</codeph> statement currently does not support the <codeph>STORED BY</codeph> clause, so you switch into Hive to + create the table, then back to Impala and the <cmdname>impala-shell</cmdname> interpreter to issue the + queries.) + </p> + + <p> + This example creates an external table mapped to the HBase table, usable by both Impala and Hive. It is + defined as an external table so that when dropped by Impala or Hive, the original HBase table is not touched at all. + </p> + + <p> + The <codeph>WITH SERDEPROPERTIES</codeph> clause + specifies that the first column (<codeph>ID</codeph>) represents the row key, and maps the remaining + columns of the SQL table to HBase column families. The mapping relies on the ordinal order of the + columns in the table, not the column names in the <codeph>CREATE TABLE</codeph> statement. + The first column is defined to be the lookup key; the + <codeph>STRING</codeph> data type produces the fastest key-based lookups for HBase tables. + </p> + + <note> + For Impala with HBase tables, the most important aspect to ensure good performance is to use a + <codeph>STRING</codeph> column as the row key, as shown in this example. 
+ </note> + +<codeblock>$ hive +Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties +Hive history file=/tmp/cloudera/hive_job_log_cloudera_201502101610_1980712808.txt +hive> use hbase; +OK +Time taken: 4.095 seconds +hive> CREATE EXTERNAL TABLE hbasestringids ( + > id string, + > bool_col boolean, + > tinyint_col tinyint, + > smallint_col smallint, + > int_col int, + > bigint_col bigint, + > float_col float, + > double_col double, + > date_string_col string, + > string_col string, + > timestamp_col timestamp) + > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' + > WITH SERDEPROPERTIES ( + > "hbase.columns.mapping" = + > ":key,boolsCF:bool_col,intsCF:tinyint_col,intsCF:smallint_col,intsCF:int_col,intsCF:\ + > bigint_col,floatsCF:float_col,floatsCF:double_col,stringsCF:date_string_col,\ + > stringsCF:string_col,stringsCF:timestamp_col" + > ) + > TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall"); +OK +Time taken: 2.879 seconds +hive> quit; +</codeblock> + + <p> + Once you have established the mapping to an HBase table, you can issue DML statements and queries + from Impala. The following example shows a series of <codeph>INSERT</codeph> + statements followed by a query. + The ideal kind of query from a performance standpoint + retrieves a row from the table based on a row key + mapped to a string column. + An initial <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph> + statement makes the table created through Hive visible to Impala. + </p> + +<codeblock>$ impala-shell -i localhost -d hbase +Starting Impala Shell without Kerberos authentication +Connected to localhost:21000 +Server version: impalad version 2.1.0-cdh4 RELEASE (build d520a9cdea2fc97e8d5da9fbb0244e60ee416bfa) +Welcome to the Impala shell. Press TAB twice to see a list of available commands. + +Copyright (c) 2012 Cloudera, Inc. All rights reserved. 
+ +(Shell build version: Impala Shell v2.1.0-cdh4 (d520a9c) built on Mon Dec 8 21:41:17 PST 2014) +Query: use `hbase` +[localhost:21000] > invalidate metadata hbasestringids; +Fetched 0 row(s) in 0.09s +[localhost:21000] > desc hbasestringids; ++-----------------+-----------+---------+ +| name | type | comment | ++-----------------+-----------+---------+ +| id | string | | +| bool_col | boolean | | +| double_col | double | | +| float_col | float | | +| bigint_col | bigint | | +| int_col | int | | +| smallint_col | smallint | | +| tinyint_col | tinyint | | +| date_string_col | string | | +| string_col | string | | +| timestamp_col | timestamp | | ++-----------------+-----------+---------+ +Fetched 11 row(s) in 0.02s +[localhost:21000] > insert into hbasestringids values ('0001',true,3.141,9.94,1234567,32768,4000,76,'2014-12-31','Hello world',now()); +Inserted 1 row(s) in 0.26s +[localhost:21000] > insert into hbasestringids values ('0002',false,2.004,6.196,1500,8000,129,127,'2014-01-01','Foo bar',now()); +Inserted 1 row(s) in 0.12s +[localhost:21000] > select * from hbasestringids where id = '0001'; ++------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+ +| id | bool_col | double_col | float_col | bigint_col | int_col | smallint_col | tinyint_col | date_string_col | string_col | timestamp_col | ++------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+ +| 0001 | true | 3.141 | 9.939999580383301 | 1234567 | 32768 | 4000 | 76 | 2014-12-31 | Hello world | 2015-02-10 16:36:59.764838000 | ++------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+ +Fetched 1 row(s) in 0.54s +</codeblock> + + <note 
conref="../shared/impala_common.xml#common/invalidate_metadata_hbase"/> +<!-- </section> --> + </conbody> + </concept> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_hbase_cache_blocks.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_hbase_cache_blocks.xml b/docs/topics/impala_hbase_cache_blocks.xml new file mode 100644 index 0000000..480b31c --- /dev/null +++ b/docs/topics/impala_hbase_cache_blocks.xml @@ -0,0 +1,40 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="hbase_cache_blocks"> + + <title>HBASE_CACHE_BLOCKS Query Option</title> + <titlealts audience="PDF"><navtitle>HBASE_CACHE_BLOCKS</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="HBase"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">HBASE_CACHE_BLOCKS query option</indexterm> + Setting this option is equivalent to calling the + <codeph>setCacheBlocks</codeph> method of the class <xref + href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" + scope="external" format="html" + >org.apache.hadoop.hbase.client.Scan</xref>, in an HBase Java + application. Helps to control the memory pressure on the HBase + RegionServer, in conjunction with the <codeph>HBASE_CACHING</codeph> query + option. 
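</p>
+
+ <p>
+ For example, you can turn on HBase block caching for the scans issued by subsequent queries in an
+ <cmdname>impala-shell</cmdname> session. (This is only a usage sketch; whether caching the scanned
+ blocks actually helps depends on your workload.)
+ </p>
+
+<codeblock>[localhost:21000] > set hbase_cache_blocks=true;
+</codeblock>
+
+ <p>
+ Like other query options, the setting stays in effect for the remainder of the session.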
</p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false_0"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_hbase.xml#impala_hbase"/>, + <xref href="impala_hbase_caching.xml#hbase_caching"/> + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_hbase_caching.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_hbase_caching.xml b/docs/topics/impala_hbase_caching.xml new file mode 100644 index 0000000..03755fd --- /dev/null +++ b/docs/topics/impala_hbase_caching.xml @@ -0,0 +1,45 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="hbase_caching"> + + <title>HBASE_CACHING Query Option</title> + <titlealts audience="PDF"><navtitle>HBASE_CACHING</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="HBase"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">HBASE_CACHING query option</indexterm> + Setting this option is equivalent to calling the + <codeph>setCaching</codeph> method of the class <xref + href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" + scope="external" format="html" + >org.apache.hadoop.hbase.client.Scan</xref>, in an HBase Java + application. Helps to control the memory pressure on the HBase + RegionServer, in conjunction with the <codeph>HBASE_CACHE_BLOCKS</codeph> + query option. 
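</p>
+
+ <p>
+ For example, you can request that HBase transfer rows to Impala in larger batches for subsequent
+ scans. (The value <codeph>1000</codeph> here is an illustration, not a recommendation for every
+ workload.)
+ </p>
+
+<codeblock>[localhost:21000] > set hbase_caching=1000;
+</codeblock>
+
+ <p>
+ Larger values reduce the number of round trips to the HBase RegionServer, at the cost of higher
+ memory usage per scan.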
</p>
+
+ <p>
+ <b>Type:</b> <codeph>INT</codeph>
+ </p>
+
+ <p>
+ <b>Default:</b> 0
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+ <p>
+ <xref href="impala_hbase.xml#impala_hbase"/>,
+ <xref href="impala_hbase_cache_blocks.xml#hbase_cache_blocks"/>
+ </p>
+
+ </conbody>
+</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_hints.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_hints.xml b/docs/topics/impala_hints.xml new file mode 100644 index 0000000..4c5145f --- /dev/null +++ b/docs/topics/impala_hints.xml @@ -0,0 +1,249 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="hints">
+
+ <title>Query Hints in Impala SELECT Statements</title>
+ <titlealts audience="PDF"><navtitle>Hints</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Querying"/>
+ <data name="Category" value="Performance"/>
+ <data name="Category" value="Troubleshooting"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Data Analysts"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">hints</indexterm>
+ The Impala SQL dialect supports query hints, for fine-tuning the inner workings of queries. Specify hints as
+ a temporary workaround for expensive queries, where missing statistics or other factors cause inefficient
+ performance.
+ </p>
+
+ <p>
+ Hints are most often used for the most resource-intensive kinds of Impala queries:
+ </p>
+
+ <ul>
+ <li>
+ Join queries involving large tables, where intermediate result sets are transmitted across the network to
+ evaluate the join conditions.
+ </li> + + <li> + Inserting into partitioned Parquet tables, where many memory buffers could be allocated on each host to + hold intermediate results for each partition. + </li> + </ul> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p> + You can represent the hints as keywords surrounded by <codeph>[]</codeph> square brackets; include the + brackets in the text of the SQL statement. + </p> + +<codeblock>SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM +<varname>join_left_hand_table</varname> + JOIN [{BROADCAST|SHUFFLE}] +<varname>join_right_hand_table</varname> +<varname>remainder_of_query</varname>; + +INSERT <varname>insert_clauses</varname> + [{SHUFFLE|NOSHUFFLE}] + SELECT <varname>remainder_of_query</varname>; +</codeblock> + + <p rev="2.0.0"> + In <keyword keyref="impala20_full"/> and higher, you can also specify the hints inside comments that use + either the <codeph>/* */</codeph> or <codeph>--</codeph> notation. Specify a <codeph>+</codeph> symbol + immediately before the hint name. 
+ </p> + +<codeblock rev="2.0.0">SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM +<varname>join_left_hand_table</varname> + JOIN /* +BROADCAST|SHUFFLE */ +<varname>join_right_hand_table</varname> +<varname>remainder_of_query</varname>; + +SELECT <varname>select_list</varname> FROM +<varname>join_left_hand_table</varname> + JOIN -- +BROADCAST|SHUFFLE +<varname>join_right_hand_table</varname> +<varname>remainder_of_query</varname>; + +INSERT <varname>insert_clauses</varname> + /* +SHUFFLE|NOSHUFFLE */ + SELECT <varname>remainder_of_query</varname>; + +INSERT <varname>insert_clauses</varname> + -- +SHUFFLE|NOSHUFFLE + SELECT <varname>remainder_of_query</varname>; +</codeblock> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + With both forms of hint syntax, include the <codeph>STRAIGHT_JOIN</codeph> + keyword immediately after the <codeph>SELECT</codeph> keyword to prevent Impala from + reordering the tables in a way that makes the hint ineffective. + </p> + + <p> + To reduce the need to use hints, run the <codeph>COMPUTE STATS</codeph> statement against all tables involved + in joins, or used as the source tables for <codeph>INSERT ... SELECT</codeph> operations where the + destination is a partitioned Parquet table. Do this operation after loading data or making substantial + changes to the data within each table. Having up-to-date statistics helps Impala choose more efficient query + plans without the need for hinting. See <xref href="impala_perf_stats.xml#perf_stats"/> for details and + examples. + </p> + + <p> + To see which join strategy is used for a particular query, examine the <codeph>EXPLAIN</codeph> output for + that query. See <xref href="impala_explain_plan.xml#perf_explain"/> for details and examples. + </p> + + <p> + <b>Hints for join queries:</b> + </p> + + <p> + The <codeph>[BROADCAST]</codeph> and <codeph>[SHUFFLE]</codeph> hints control the execution strategy for join + queries. 
Specify one of the following constructs immediately after the <codeph>JOIN</codeph> keyword in a + query: + </p> + + <ul> + <li> + <codeph>[SHUFFLE]</codeph> - Makes that join operation use the <q>partitioned</q> technique, which divides + up corresponding rows from both tables using a hashing algorithm, sending subsets of the rows to other + nodes for processing. (The keyword <codeph>SHUFFLE</codeph> is used to indicate a <q>partitioned join</q>, + because that type of join is not related to <q>partitioned tables</q>.) Since the alternative + <q>broadcast</q> join mechanism is the default when table and index statistics are unavailable, you might + use this hint for queries where broadcast joins are unsuitable; typically, partitioned joins are more + efficient for joins between large tables of similar size. + </li> + + <li> + <codeph>[BROADCAST]</codeph> - Makes that join operation use the <q>broadcast</q> technique that sends the + entire contents of the right-hand table to all nodes involved in processing the join. This is the default + mode of operation when table and index statistics are unavailable, so you would typically only need it if + stale metadata caused Impala to mistakenly choose a partitioned join operation. Typically, broadcast joins + are more efficient in cases where one table is much smaller than the other. (Put the smaller table on the + right side of the <codeph>JOIN</codeph> operator.) + </li> + </ul> + + <p> + <b>Hints for INSERT ... SELECT queries:</b> + </p> + + <p conref="../shared/impala_common.xml#common/insert_hints"/> + + <p> + <b>Suggestions versus directives:</b> + </p> + + <p> + In early Impala releases, hints were always obeyed and so acted more like directives. Once Impala gained join + order optimizations, sometimes join queries were automatically reordered in a way that made a hint + irrelevant. Therefore, the hints act more like suggestions in Impala 1.2.2 and higher. 
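</p>
+
+ <p>
+ Because a hint might be overridden, verify the actual join strategy by examining the
+ <codeph>EXPLAIN</codeph> output, as described earlier. For example (the table and column names
+ here are hypothetical):
+ </p>
+
+<codeblock>explain select straight_join c.name, o.total
+ from customer c join <b>[shuffle]</b> orders o
+ on c.id = o.cust_id;</codeblock>
+
+ <p>
+ The join node in the resulting plan shows whether the broadcast or partitioned mechanism was actually chosen.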
</p>
+
+ <p>
+ To force Impala to follow the hinted execution mechanism for a join query, include the
+ <codeph>STRAIGHT_JOIN</codeph> keyword in the <codeph>SELECT</codeph> statement. See
+ <xref href="impala_perf_joins.xml#straight_join"/> for details. When you use this technique, Impala does not
+ reorder the joined tables at all, so you must be careful to arrange the join order to put the largest table
+ (or subquery result set) first, then the smallest, second smallest, third smallest, and so on. This ordering lets Impala do the
+ most I/O-intensive parts of the query using local reads on the DataNodes, and then reduce the size of the
+ intermediate result set as much as possible as each subsequent table or subquery result set is joined.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
+
+ <p>
+ Queries that include subqueries in the <codeph>WHERE</codeph> clause can be rewritten internally as join
+ queries. Currently, you cannot apply hints to the joins produced by these types of queries.
+ </p>
+
+ <p>
+ Because hints can prevent queries from taking advantage of new metadata or improvements in query planning,
+ use them only when required to work around performance issues, and be prepared to remove them when they are
+ no longer required, such as after a new Impala release or bug fix.
+ </p>
+
+ <p>
+ In particular, the <codeph>[BROADCAST]</codeph> and <codeph>[SHUFFLE]</codeph> hints are expected to be
+ needed much less frequently in Impala 1.2.2 and higher, because the join order optimization feature in
+ combination with the <codeph>COMPUTE STATS</codeph> statement now automatically chooses the join order and
+ join mechanism without the need to rewrite the query and add hints. See
+ <xref href="impala_perf_joins.xml#perf_joins"/> for details.
+ </p> + + <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> + + <p rev="2.0.0"> + The hints embedded within <codeph>--</codeph> comments are compatible with Hive queries. The hints embedded + within <codeph>/* */</codeph> comments or <codeph>[ ]</codeph> square brackets are not recognized by or not + compatible with Hive. For example, Hive raises an error for Impala hints within <codeph>/* */</codeph> + comments because it does not recognize the Impala hint names. + </p> + + <p conref="../shared/impala_common.xml#common/view_blurb"/> + + <p rev="2.0.0"> + If you use a hint in the query that defines a view, the hint is preserved when you query the view. Impala + internally rewrites all hints in views to use the <codeph>--</codeph> comment notation, so that Hive can + query such views without errors due to unrecognized hint names. + </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + For example, this query joins a large customer table with a small lookup table of less than 100 rows. The + right-hand table can be broadcast efficiently to all nodes involved in the join. Thus, you would use the + <codeph>[broadcast]</codeph> hint to force a broadcast join strategy: + </p> + +<codeblock>select straight_join customer.address, state_lookup.state_name + from customer join <b>[broadcast]</b> state_lookup + on customer.state_id = state_lookup.state_id;</codeblock> + + <p> + This query joins two large tables of unpredictable size. You might benchmark the query with both kinds of + hints and find that it is more efficient to transmit portions of each table to other nodes for processing. 
+ Thus, you would use the <codeph>[shuffle]</codeph> hint to force a partitioned join strategy: + </p> + +<codeblock>select straight_join weather.wind_velocity, geospatial.altitude + from weather join <b>[shuffle]</b> geospatial + on weather.lat = geospatial.lat and weather.long = geospatial.long;</codeblock> + + <p> + For joins involving three or more tables, the hint applies to the tables on either side of that specific + <codeph>JOIN</codeph> keyword. The <codeph>STRAIGHT_JOIN</codeph> keyword ensures that joins are processed + in a predictable order from left to right. For example, this query joins + <codeph>t1</codeph> and <codeph>t2</codeph> using a partitioned join, then joins that result set to + <codeph>t3</codeph> using a broadcast join: + </p> + +<codeblock>select straight_join t1.name, t2.id, t3.price + from t1 join <b>[shuffle]</b> t2 join <b>[broadcast]</b> t3 + on t1.id = t2.id and t2.id = t3.id;</codeblock> + + <!-- To do: This is a good place to add more sample output showing before and after EXPLAIN plans. --> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + For more background information about join queries, see <xref href="impala_joins.xml#joins"/>. For + performance considerations, see <xref href="impala_perf_joins.xml#perf_joins"/>. 
+ </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_howto_rm.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_howto_rm.xml b/docs/topics/impala_howto_rm.xml new file mode 100644 index 0000000..2a7494b --- /dev/null +++ b/docs/topics/impala_howto_rm.xml @@ -0,0 +1,420 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="howto_impala_rm"> + + <title>How to Configure Resource Management for Impala</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Admission Control"/> + <data name="Category" value="Resource Management"/> + </metadata> + </prolog> + <conbody> + <p>Impala includes features that balance and maximize resources in your CDH cluster. This topic describes + how you can enhance a CDH cluster using Impala to improve efficiency.</p> + <p outputclass="toc inpage">A typical deployment uses the following.</p> + <ul audience="HTML"> + <li audience="HTML">Creating Static Service Pools</li> + <li audience="HTML">Using Admission Control <ul> + <li audience="HTML">Setting Per-query Memory Limits</li> + <li audience="HTML">Creating Dynamic Resource Pools</li> + </ul></li> + + </ul> + </conbody> + <concept id="static_service_pools"> + <title>Creating Static Service Pools</title> + <conbody> + <p>Use Static Service Pools to allocate dedicated resources for Impala and other services to allow + for predictable resource availability. </p> + <p>Static service pools isolate services from one another, so that high load on one service has + bounded impact on other services. 
You can use Cloudera Manager to configure static service pools + that control memory, CPU and Disk I/O.</p> + <p >The following screenshot shows a sample configuration for Static Service Pools in + Cloudera Manager:</p> + <draft-comment author="ddawson"> + <p>Need accurate numbers - or can we remove services other than HDFS, Impala, and YARN? Matt + Jacobs is going to run these numbers by someone in the field.</p> + </draft-comment> + <p> + <image href="../images/howto_static_server_pools_config.png" placement="break" + id="PDF_33" align="center" scale="33" audience="HTML"/> + <image href="../images/howto_static_server_pools_config.png" placement="break" + id="HTML_SCALEFIT" align="center" scalefit="yes" audience="PDF"/> + </p> + <p> + <ul id="ul_tkw_4rs_pw"> + <li > + <p >HDFS always needs to have a minimum of 5-10% of the resources.</p> + </li> + <li > + Generally, YARN and Impala split the rest of the resources. + + <ul id="ul_ukw_4rs_pw"> + <li > + <p >For mostly batch workloads, you might allocate YARN 60%, Impala 30%, and HDFS + 10%. </p> + </li> + <li > + <p >For mostly ad hoc query workloads, you might allocate Impala 60%, YARN 30%, and + HDFS 10%.</p> + </li> + </ul></li> + </ul> + </p> + </conbody> + </concept> + <concept id="enable_admission_control"> + <title>Using Admission Control</title> + <conbody> + <p>Within the constraints of the static service pool, you can further subdivide Impala's + resources using Admission Control. You configure Impala Admission Control pools in the Cloudera + Manager Dynamic Resource Pools page.</p> + <p>You use Admission Control to divide usage between Dynamic Resource Pools in multitenant use + cases. Allocating resources judiciously allows your most important queries to run faster and + more reliably.</p> + <p> + <note>In this context, Impala Dynamic Resource Pools are different than the default YARN Dynamic + Resource Pools. 
You can turn on Dynamic Resource Pools that are exclusively for use by
+ Impala.</note>
+ </p>
+ <p>Admission Control is enabled by default.</p>
+ <p>A Dynamic Resource Pool has the following properties:<ul id="ul_blk_jjg_sw">
+ <li><b>Max Running Queries</b>: Maximum number of concurrently executing queries in the pool
+ before incoming queries are queued.</li>
+ <li><b>Max Memory Resources</b>: Maximum memory used by queries in the pool before incoming
+ queries are queued. This value is used at the time of admission and is not enforced at query
+ runtime.</li>
+ <li><b>Default Query Memory Limit</b>: Defines the maximum amount of memory a query can
+ allocate on each node. This is enforced at runtime. If the query attempts to use more memory,
+ it is forced to spill, if possible. Otherwise, it is cancelled. The total memory that can be
+ used by a query is the <codeph>MEM_LIMIT</codeph> times the number of nodes.</li>
+ <li><b>Max Queued Queries</b>: Maximum number of queries that can be queued in the pool before
+ additional queries are rejected.</li>
+ <li><b>Queue Timeout</b>: Specifies how long queries can wait in the queue before they are
+ cancelled with a timeout error.</li>
+ </ul></p>
+ </conbody>
+ </concept>
+ <concept id="set_per_query_memory_limits">
+ <title>Setting Per-query Memory Limits</title>
+ <conbody>
+ <p>Use per-query memory limits to prevent queries from consuming excessive memory resources that
+ impact other queries. <ph rev="upstream">Cloudera</ph> recommends that you set the query memory limits whenever
+ possible.</p>
+ <p>If you set the <b>Pool Max Mem Resources</b> for a resource pool, Impala attempts to throttle
+ queries if there is not enough memory to run them within the specified resources.</p>
+ <p>Only use admission control with maximum memory resources if you can ensure there are query
+ memory limits. Set the pool <b>Default Query Memory Limit</b> to be certain.
You can override
+ this setting with the <codeph>MEM_LIMIT</codeph> query option, if necessary.</p>
+ <p>Typically, you set query memory limits using the <codeph>set MEM_LIMIT=Xg;</codeph> query
+ option. When you find the right value for your business case, memory-based admission control
+ works well. The potential downside is that queries that attempt to use more memory might perform
+ poorly or even be cancelled.</p>
+ <p>To find a reasonable default query memory limit:<ol id="ol_ydt_xhy_pw">
+ <li>Run the workload.</li>
+ <li>In Cloudera Manager, go to <menucascade>
+ <uicontrol>Impala</uicontrol>
+ <uicontrol>Queries</uicontrol>
+ </menucascade>.</li>
+ <li>Click <uicontrol>Select Attributes</uicontrol>.</li>
+ <li>Select <uicontrol>Per Node Peak Memory Usage</uicontrol> and click
+ <uicontrol>Update</uicontrol>.</li>
+ <li>Allow the system time to gather information, then click the <uicontrol>Show
+ Histogram</uicontrol> icon to see the results.<image placement="break"
+ href="../images/howto_show_histogram.png" align="center" id="image_hmv_xky_pw"/></li>
+ <li>Use the histogram to find a value that accounts for most queries. Queries that require more
+ resources than this limit should explicitly set the memory limit to ensure they can run to
+ completion.<draft-comment author="ddawson">This chart uses bad sample data - we will change
+ the chart when we have real numbers from the sample use case.</draft-comment><image
+ placement="break" href="../images/howto_per_node_peak_memory_usage.png" align="center"
+ id="image_ehn_hly_pw" scalefit="yes"/></li>
+ </ol></p>
+ </conbody>
+ </concept>
+ <concept id="concept_en4_3sy_pw">
+ <title>Creating Dynamic Resource Pools</title>
+ <conbody>
+ <p>A dynamic resource pool is a named configuration of resources and a policy for scheduling the
+ resources among Impala queries running in the pool.
Dynamic resource pools allow you to schedule + and allocate resources to Impala queries based on a user's access to specific pools and the + resources available to those pools. </p> + <p>This example creates both production and development resource pools or queues. It assumes you + have 3 worker nodes with 24GiB of RAM each for an aggregate memory of 72000MiB. This pool + configuration allocates the Production queue twice the memory resources of the Development + queue, and a higher number of concurrent queries.</p> + <p>To create a Production dynamic resource pool for Impala:</p> + <ol> + <li>In Cloudera Manager, select <menucascade> + <uicontrol>Clusters</uicontrol> + <uicontrol>Dynamic Resource Pool Configuration</uicontrol> + </menucascade>.</li> + <li>Click the <uicontrol>Impala Admission Control</uicontrol> tab.</li> + <li>Click <b>Create Resource Pool</b>.</li> + <li>Specify a name and resource limits for the Production pool:<ul id="ul_rjt_wqv_2v"> + <li>In the <b>Resource Pool Name</b> field, enter <userinput>Production</userinput>.</li> + <li>In the <uicontrol>Max Memory</uicontrol> field, enter <userinput>48000</userinput>.</li> + <li>In the <uicontrol>Default Query Memory Limit</uicontrol> field, enter + <userinput>1600</userinput>.</li> + <li>In the <uicontrol>Max Running Queries</uicontrol> field, enter + <userinput>10</userinput>.</li> + <li>In the <uicontrol>Max Queued Queries</uicontrol> field, enter + <userinput>200</userinput>.</li> + </ul></li> + <li>Click <uicontrol>Create</uicontrol>.</li> + <li>Click <uicontrol>Refresh Dynamic Resource Pools</uicontrol>.</li> + </ol> + <p>The Production queue runs up to 10 queries at once. If the total memory requested + by these queries exceeds 48000 MiB, it holds the next query in the queue until the memory is + released. It also prevents a query from running if it needs more memory than is currently + available. 
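</p>
+ <p>Assuming the 3 worker nodes from this example, you can derive the effective concurrency of
+ the Production queue from the memory settings alone:</p>
+<codeblock>per-query reservation = Default Query Memory Limit x number of nodes
+                      = 1600 MiB x 3 = 4800 MiB
+effective concurrency = Max Memory / per-query reservation
+                      = 48000 MiB / 4800 MiB = 10 queries</codeblock>
+ <p>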
Admission Control holds the next query if either Max Running Queries is reached, or + the pool Max Memory limit is reached.</p> + <p>Here, Max Memory resources and Default Query Memory Limit throttle throughput to 10 queries, + so setting Max Running Queries might not be necessary, though it does not hurt to do so. Most + users set Max Running Queries when they cannot pick good numbers for memory. Since users can + override the query option <varname>mem_limit</varname>, setting the Max Running Queries property + might make sense.</p> + <p>To create a Development dynamic resource pool for Impala:</p> + + <ol> + <li>In Cloudera Manager, select <menucascade> + <uicontrol>Clusters</uicontrol> + <uicontrol>Dynamic Resource Pool Configuration</uicontrol> + </menucascade>.</li> + <li>Click the <uicontrol>Impala Admission Control</uicontrol> tab.</li> + <li>Click <b>Create Resource Pool</b>.</li> + <li>Specify a name and resource limits for the Development pool:<ul id="ul_j42_q3z_pw"> + <li>In the <b>Resource Pool Name</b> field, enter <userinput>Development</userinput>.</li> + <li>In the <uicontrol>Max Memory</uicontrol> field, enter <userinput>24000</userinput>.</li> + <li>In the <uicontrol>Default Query Memory Limit</uicontrol> field, enter + <userinput>8000</userinput>.</li> + <li>In the <uicontrol>Max Running Queries</uicontrol> field, enter 1.</li> + <li>In the <uicontrol>Max Queued Queries</uicontrol> field, enter 100.</li> + </ul></li> + <li>Click <uicontrol>Create</uicontrol>.</li> + <li>Click <uicontrol>Refresh Dynamic Resource Pools</uicontrol>.<p>The Development queue runs + one query at a time. If the total memory required by the query exceeds 24000 MiB, it holds the + query until memory is released.</p></li> + </ol> + </conbody> + <concept id="setting_placement_rules"> + <title>Understanding Placement Rules</title> + <conbody> + <p>Placement rules determine how queries are mapped to resource pools. 
The standard setting is
+ to run a query in the pool it specifies, if any; otherwise, the query runs in the default pool.</p>
+ <p>For example, you can use the SET statement to select the pool in which to run a
+ query.<codeblock>SET REQUEST_POOL=Production;</codeblock></p>
+ <p>If you do not use a <codeph>SET</codeph> statement, queries are run in the default pool.</p>
+ </conbody>
+ </concept>
+ <concept id="setting_access_control_on_pools">
+ <title>Setting Access Control on Pools</title>
+ <conbody>
+ <p>You can specify that only certain users and groups are allowed to use the pools you define.</p>
+ <p>To set access control on the Production pool:</p>
+ <ol>
+ <li>In Cloudera Manager, select <menucascade>
+ <uicontrol>Clusters</uicontrol>
+ <uicontrol>Dynamic Resource Pool Configuration</uicontrol>
+ </menucascade>.</li>
+ <li>Click the <uicontrol>Impala Admission Control</uicontrol> tab.</li>
+ <li>Click the <uicontrol>Edit</uicontrol> button for the Production pool.</li>
+ <li>Click the <uicontrol>Submission Access Control</uicontrol> tab.</li>
+ <li>Select <uicontrol>Allow these users and groups to submit to this pool</uicontrol>.</li>
+ <li>Enter a comma-separated list of users who can use the pool.
+ <image placement="break"
+ href="../images/howto_access_control.png" align="center" scalefit="yes"/>
+ </li>
+
+ <li>Click <uicontrol>Save</uicontrol>.</li>
+ </ol>
+ </conbody>
+ </concept>
+ </concept>
+<concept id="impala_resource_management_example">
+ <title>Impala Resource Management Example</title>
+ <conbody>
+ <p>Anne Chang is the administrator for an enterprise data hub that runs a number of workloads,
+ including Impala. </p>
+ <p>Anne has a 20-node cluster that uses Cloudera Manager static partitioning. Because of the
+ heavy Impala workload, Anne needs to make sure Impala gets enough resources. While the best
+ configuration values might not be known in advance, she decides to start by allocating 50% of
+ resources to Impala. Each node has 128 GiB dedicated to its impalad daemon.
Impala has 2560 GiB in
+      aggregate that can be shared across the resource pools she creates.</p>
+    <p>Next, Anne studies the workload in more detail. After some research, she might choose to
+      revisit these initial values for static partitioning.</p>
+    <p>To figure out how to further allocate Impala's resources, Anne needs to consider the workloads
+      and users, and determine their requirements. There are a few main sources of Impala queries: <ul id="ul_ml3_sf2_5w">
+        <li>Large reporting queries executed by an external process/tool. These are critical business
+          intelligence queries that are important for business decisions. It is important that they get
+          the resources they need to run. There typically are not many of these queries at a given
+          time.</li>
+        <li>Frequent, small queries generated by a web UI. These queries scan a limited amount of data
+          and do not require expensive joins or aggregations. These queries are important, but not as
+          critical; if one fails, the client can resend the query or the end user can refresh the
+          page.</li>
+        <li>Occasionally, expert users might run ad hoc queries. The queries can vary significantly in
+          their resource requirements. While Anne wants a good experience for these users, it is hard to
+          control what they do (for example, submitting inefficient or incorrect queries by mistake).
+          Anne restricts these queries by default and tells users to reach out to her if they need more
+          resources.</li>
+      </ul></p>
+    <p>To set up admission control for this workload, Anne first runs the workloads independently, so
+      that she can observe each workload's resource usage in Cloudera Manager. If the workloads cannot
+      easily be run manually but have run in the past, Anne uses the history information from Cloudera
+      Manager. It can be helpful to use other search criteria (for example, <i>user</i>) to isolate
+      queries by workload. 
Anne uses the Cloudera Manager chart for Per-Node Peak Memory usage to
+      identify the maximum memory requirements for the queries.</p>
+    <p>From this data, Anne observes the following about the queries in the groups above:<ul id="ul_amq_ng2_5w">
+        <li>Large reporting queries use up to 32 GiB per node. There are typically 1 or 2 queries
+          running at a time. On one occasion, she observed that 3 of these queries were running
+          concurrently. Queries can take 3 minutes to complete.</li>
+        <li>Web UI-generated queries use from 100 MiB to usually less than 4 GiB of memory per node,
+          but occasionally as much as 10 GiB per node. Queries take, on average, 5 seconds,
+          and there can be as many as 140 incoming queries per minute.</li>
+        <li>Anne has little data on ad hoc queries, but some are trivial (approximately 100 MiB per
+          node), others join several tables (requiring a few GiB per node), and one user submitted a
+          huge cross join of all tables that used all system resources (that was likely a mistake).</li>
+      </ul></p>
+    <p>Based on these observations, Anne creates the admission control configuration with the
+      following pools: </p>
+    <section id="section_yjc_h32_5w">
+      <title>XL_Reporting</title>
+      <p>
+        <table frame="all" rowsep="1" colsep="1" id="XL_Reporting_Table">
+          <tgroup cols="2">
+            <colspec colname="c1" colnum="1" colwidth="1.0*"/>
+            <colspec colname="c2" colnum="2" colwidth="1.0*"/>
+            <thead>
+              <row>
+                <entry>Property</entry>
+                <entry>Value</entry>
+              </row>
+            </thead>
+            <tbody>
+              <row>
+                <entry>Max Memory</entry>
+                <entry>1280 GiB</entry>
+              </row>
+              <row>
+                <entry>Default Query Memory Limit</entry>
+                <entry>32 GiB</entry>
+              </row>
+              <row>
+                <entry>Max Running Queries</entry>
+                <entry>2</entry>
+              </row>
+              <row>
+                <entry>Queue Timeout</entry>
+                <entry>5 minutes</entry>
+              </row>
+            </tbody>
+          </tgroup>
+        </table>
+      </p>
+      <p>This pool is for large reporting queries. 
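The Max Memory value in the table above follows from the observed workload; a rough sketch of the arithmetic, assuming the 2 concurrent queries, 32 GiB per-node limit, and 20 nodes described in this example:

```python
# Sketch of how the XL_Reporting pool's Max Memory is sized:
# concurrency * per-node query memory limit * node count.
# Assumptions: 2 concurrent queries, a 32 GiB per-node query memory
# limit, and a 20-node cluster (all taken from this example).
max_running_queries = 2
query_mem_limit_gib = 32
nodes = 20

pool_max_memory_gib = max_running_queries * query_mem_limit_gib * nodes
print(pool_max_memory_gib)  # 1280
```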
To support running 2 queries at a time,
+        the pool memory resources are set to 1280 GiB (aggregate cluster memory). This is for 2
+        queries, each with 32 GiB per node, across 20 nodes. Anne sets the pool's Default Query Memory
+        Limit to 32 GiB so that no query uses more than 32 GiB on any given node. She sets Max Running
+        Queries to 2 (though it is not strictly necessary to do so). She increases the pool's queue timeout to
+        5 minutes in case a third query comes in and has to wait. She does not expect more than 3
+        concurrent queries, and she does not want them to wait that long anyway, so she does not increase
+        the queue timeout any further. If the workload increases in the future, she might choose to adjust
+        the configuration or buy more hardware.</p>
+    </section>
+    <section id="section_xm3_j32_5w"><title>HighThroughput_UI</title>
+      <p>
+        <table frame="all" rowsep="1" colsep="1" id="High_Throughput_UI_Table">
+          <tgroup cols="2">
+            <colspec colname="c1" colnum="1" colwidth="1.0*"/>
+            <colspec colname="c2" colnum="2" colwidth="1.0*"/>
+            <thead>
+              <row>
+                <entry>Property</entry>
+                <entry>Value</entry>
+              </row>
+            </thead>
+            <tbody>
+              <row>
+                <entry>Max Memory</entry>
+                <entry>960 GiB (inferred)</entry>
+              </row>
+              <row>
+                <entry>Default Query Memory Limit</entry>
+                <entry>4 GiB</entry>
+              </row>
+              <row>
+                <entry>Max Running Queries</entry>
+                <entry>12</entry>
+              </row>
+              <row>
+                <entry>Queue Timeout</entry>
+                <entry>5 minutes</entry>
+              </row>
+            </tbody>
+          </tgroup>
+        </table>
+      </p>
+      <p>This pool is used for the small, high-throughput queries generated by the web tool. Anne sets
+        the Default Query Memory Limit to 4 GiB per node, and sets Max Running Queries to 12. This
+        implies a maximum amount of memory per node used by the queries in this pool: 48 GiB per node
+        (12 queries * 4 GiB per node memory limit).</p><p>Notice that Anne does not set the pool memory resources, but does set the pool's Default Query
+        Memory Limit. 
This is intentional: admission control processes queries faster when a pool uses
+        the Max Running Queries limit instead of the peak memory resources.</p><p>This should be enough memory for most queries, since only a few go over 4 GiB per node. Queries
+        that do require more can probably still complete with less memory by spilling to disk if
+        necessary. If, on occasion, a query cannot run with this much memory and it fails, Anne might
+        reconsider this configuration later, or perhaps she does not need to worry about a few rare
+        failures from this web UI.</p><p>With regard to throughput, since these queries take around 5 seconds and she is allowing 12
+        concurrent queries, the pool should be able to handle approximately 144 queries per minute,
+        which covers the expected peak of 140 queries per minute. In case there is a
+        large burst of queries, Anne wants them to queue. The default maximum size of the queue is
+        already 200, which should be more than large enough. Anne does not need to change it.</p></section>
+    <section id="section_asm_yj2_5w"><title>Default</title>
+      <p>
+        <table frame="all" rowsep="1" colsep="1" id="default_table">
+          <tgroup cols="2">
+            <colspec colname="c1" colnum="1" colwidth="1.0*"/>
+            <colspec colname="c2" colnum="2" colwidth="1.0*"/>
+            <thead>
+              <row>
+                <entry>Property</entry>
+                <entry>Value</entry>
+              </row>
+            </thead>
+            <tbody>
+              <row>
+                <entry>Max Memory</entry>
+                <entry>320 GiB</entry>
+              </row>
+              <row>
+                <entry>Default Query Memory Limit</entry>
+                <entry>4 GiB</entry>
+              </row>
+              <row>
+                <entry>Max Running Queries</entry>
+                <entry>Unlimited</entry>
+              </row>
+              <row>
+                <entry>Queue Timeout</entry>
+                <entry>60 seconds</entry>
+              </row>
+            </tbody>
+          </tgroup>
+        </table>
+      </p><p>The default pool (which already exists) is a catch-all for ad hoc queries. Anne wants to use the
+        remaining memory not used by the first two pools, 16 GiB per node (XL_Reporting uses 64 GiB per
+        node, HighThroughput_UI uses 48 GiB per node). 
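The per-node budget behind these figures can be sketched as follows; a rough check, assuming each pool's peak per-node use is approximated as its Max Running Queries times its Default Query Memory Limit:

```python
# Sketch of the per-node memory budget across the three pools.
# Assumptions: 128 GiB of Impala memory per node; a pool's peak
# per-node use approximated as max_running_queries * per-query limit.
mem_per_node_gib = 128
xl_reporting_gib = 2 * 32        # 64 GiB per node
high_throughput_ui_gib = 12 * 4  # 48 GiB per node

default_pool_gib = mem_per_node_gib - xl_reporting_gib - high_throughput_ui_gib
print(default_pool_gib)       # 16 GiB per node left for the default pool
print(default_pool_gib * 20)  # 320 GiB aggregate, the pool's Max Memory
```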
For the other pools to get the resources they
+        expect, she must still set the Max Memory resources and the Default Query Memory Limit. She
+        sets the Max Memory resources to 320 GiB (16 GiB per node * 20 nodes). She sets the Default Query Memory Limit to
+        4 GiB per node for now. That is somewhat arbitrary, but satisfies some of the ad hoc queries
+        she observed. If someone writes a bad query by mistake, she does not actually want it using all
+        the system resources. If an expert user has a large query to submit, that user can override the
+        Default Query Memory Limit (up to 16 GiB per node, since that is bound by the pool Max Memory
+        resources). If that is still insufficient for this user's workload, the user should work with
+        Anne to adjust the settings and perhaps create a dedicated pool for the workload.</p></section>
+  </conbody>
+</concept>
+</concept>
