http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_hbase.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_hbase.html b/docs/build/html/topics/impala_hbase.html new file mode 100644 index 0000000..7ee8bad --- /dev/null +++ b/docs/build/html/topics/impala_hbase.html @@ -0,0 +1,763 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2. 8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="impala_hbase"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using Impala to Query HBase Tables</title></head><body id="impala_hbase"><main role="main"><article role="article" aria-labelledby="impala_hbase__hbase"> + + <h1 class="title topictitle1" id="impala_hbase__hbase">Using Impala to Query HBase Tables</h1> + + + + <div class="body conbody"> + + <p class="p"> + + You can use Impala to query HBase tables. This capability allows convenient access to a storage system that + is tuned for different kinds of workloads than the default with Impala. The default Impala tables use data + files stored on HDFS, which are ideal for bulk loads and queries using full-table scans. In contrast, HBase + can do efficient queries for data organized for OLTP-style workloads, with lookups of individual rows or + ranges of values. + </p> + + <p class="p"> + From the perspective of an Impala user, coming from an RDBMS background, HBase is a kind of key-value store + where the value consists of multiple fields. The key is mapped to one column in the Impala table, and the + various fields of the value are mapped to the other columns in the Impala table. + </p> + + <p class="p"> + For background information on HBase, see <a class="xref" href="https://hbase.apache.org/book.html" target="_blank">the Apache HBase documentation</a>. + </p> + + <p class="p toc inpage"></p> + </div> + + <article class="topic concept nested1" aria-labelledby="ariaid-title2" id="impala_hbase__hbase_using"> + + <h2 class="title topictitle2" id="ariaid-title2">Overview of Using HBase with Impala</h2> + + + <div class="body conbody"> + + <p class="p"> + When you use Impala with HBase: + </p> + + <ul class="ul"> + <li class="li"> + You create the tables on the Impala side using the Hive shell, because the Impala <code class="ph codeph">CREATE + TABLE</code> statement currently does not support custom SerDes and some other syntax needed for these + tables: + <ul class="ul"> + <li class="li"> + You designate it as an HBase table using the <code class="ph codeph">STORED BY + 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'</code> clause on the Hive <code class="ph codeph">CREATE + TABLE</code> statement. + </li> + + <li class="li"> + You map these specially created tables to corresponding tables that exist in HBase, with the clause + <code class="ph codeph">TBLPROPERTIES("hbase.table.name" = "<var class="keyword varname">table_name_in_hbase</var>")</code> on the + Hive <code class="ph codeph">CREATE TABLE</code> statement. + </li> + + <li class="li"> + See <a class="xref" href="#hbase_queries">Examples of Querying HBase Tables from Impala</a> for a full example. + </li> + </ul> + </li> + + <li class="li"> + You define the column corresponding to the HBase row key as a string with the <code class="ph codeph">#string</code> + keyword, or map it to a <code class="ph codeph">STRING</code> column. + </li> + + <li class="li"> + Because Impala and Hive share the same metastore database, once you create the table in Hive, you can + query or insert into it through Impala. (After creating a new table through Hive, issue the + <code class="ph codeph">INVALIDATE METADATA</code> statement in <span class="keyword cmdname">impala-shell</span> to make Impala aware of + the new table.) + </li> + + <li class="li"> + You issue queries against the Impala tables. For efficient queries, use <code class="ph codeph">WHERE</code> clauses to + find a single key value or a range of key values wherever practical, by testing the Impala column + corresponding to the HBase row key. Avoid queries that do full-table scans, which are efficient for + regular Impala tables but inefficient in HBase. + </li> + </ul> + + <p class="p"> + To work with an HBase table from Impala, ensure that the <code class="ph codeph">impala</code> user has read/write + privileges for the HBase table, using the <code class="ph codeph">GRANT</code> command in the HBase shell. For details + about HBase security, see <a class="xref" href="https://hbase.apache.org/book.html#security" target="_blank">the Security chapter in the Apache HBase documentation</a>. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="impala_hbase__hbase_config"> + + <h2 class="title topictitle2" id="ariaid-title3">Configuring HBase for Use with Impala</h2> + + + <div class="body conbody"> + + <p class="p"> + HBase works out of the box with Impala. There is no mandatory configuration needed to use these two + components together. + </p> + + <p class="p"> + To avoid delays if HBase is unavailable during Impala startup or after an <code class="ph codeph">INVALIDATE + METADATA</code> statement, set timeout values similar to the following in + <span class="ph filepath">/etc/impala/conf/hbase-site.xml</span>: + </p> + +<pre class="pre codeblock"><code><property> + <name>hbase.client.retries.number</name> + <value>3</value> +</property> +<property> + <name>hbase.rpc.timeout</name> + <value>3000</value> +</property> +</code></pre> + + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="impala_hbase__hbase_types"> + + <h2 class="title topictitle2" id="ariaid-title4">Supported Data Types for HBase Columns</h2> + + <div class="body conbody"> + + <p class="p"> + To understand how Impala column data types are mapped to fields in HBase, you should have some background + knowledge about HBase first. You set up the mapping by running the <code class="ph codeph">CREATE TABLE</code> statement + in the Hive shell. See + <a class="xref" href="https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration" target="_blank">the + Hive wiki</a> for a starting point, and <a class="xref" href="#hbase_queries">Examples of Querying HBase Tables from Impala</a> for examples. + </p> + + <p class="p"> + HBase works as a kind of <span class="q">"bit bucket"</span>, in the sense that HBase does not enforce any typing for the + key or value fields. All the type enforcement is done on the Impala side. + </p> + + <p class="p"> + For best performance of Impala queries against HBase tables, most queries will perform comparisons in the + <code class="ph codeph">WHERE</code> against the column that corresponds to the HBase row key. When creating the table + through the Hive shell, use the <code class="ph codeph">STRING</code> data type for the column that corresponds to the + HBase row key. Impala can translate conditional tests (through operators such as <code class="ph codeph">=</code>, + <code class="ph codeph"><</code>, <code class="ph codeph">BETWEEN</code>, and <code class="ph codeph">IN</code>) against this column into fast + lookups in HBase, but this optimization (<span class="q">"predicate pushdown"</span>) only works when that column is + defined as <code class="ph codeph">STRING</code>. + </p> + + <p class="p"> + Starting in Impala 1.1, Impala also supports reading and writing to columns that are defined in the Hive + <code class="ph codeph">CREATE TABLE</code> statement using binary data types, represented in the Hive table definition + using the <code class="ph codeph">#binary</code> keyword, often abbreviated as <code class="ph codeph">#b</code>. Defining numeric + columns as binary can reduce the overall data volume in the HBase tables. You should still define the + column that corresponds to the HBase row key as a <code class="ph codeph">STRING</code>, to allow fast lookups using + those columns. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="impala_hbase__hbase_performance"> + + <h2 class="title topictitle2" id="ariaid-title5">Performance Considerations for the Impala-HBase Integration</h2> + + + <div class="body conbody"> + + <p class="p"> + To understand the performance characteristics of SQL queries against data stored in HBase, you should have + some background knowledge about how HBase interacts with SQL-oriented systems first. See + <a class="xref" href="https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration" target="_blank">the + Hive wiki</a> for a starting point; because Impala shares the same metastore database as Hive, the + information about mapping columns from Hive tables to HBase tables is generally applicable to Impala too. + </p> + + <p class="p"> + Impala uses the HBase client API via Java Native Interface (JNI) to query data stored in HBase. This + querying does not read HFiles directly. The extra communication overhead makes it important to choose what + data to store in HBase or in HDFS, and construct efficient queries that can retrieve the HBase data + efficiently: + </p> + + <ul class="ul"> + <li class="li"> + Use HBase table for queries that return a single row or a range of rows, not queries that scan the entire + table. (If a query has no <code class="ph codeph">WHERE</code> clause, that is a strong indicator that it is an + inefficient query for an HBase table.) + </li> + + <li class="li"> + If you have join queries that do aggregation operations on large fact tables and join the results against + small dimension tables, consider using Impala for the fact tables and HBase for the dimension tables. + (Because Impala does a full scan on the HBase table in this case, rather than doing single-row HBase + lookups based on the join column, only use this technique where the HBase table is small enough that + doing a full table scan does not cause a performance bottleneck for the query.) + </li> + </ul> + + <p class="p"> + Query predicates are applied to row keys as start and stop keys, thereby limiting the scope of a particular + lookup. If row keys are not mapped to string columns, then ordering is typically incorrect and comparison + operations do not work. For example, if row keys are not mapped to string columns, evaluating for greater + than (>) or less than (<) cannot be completed. + </p> + + <p class="p"> + Predicates on non-key columns can be sent to HBase to scan as <code class="ph codeph">SingleColumnValueFilters</code>, + providing some performance gains. In such a case, HBase returns fewer rows than if those same predicates + were applied using Impala. While there is some improvement, it is not as great when start and stop rows are + used. This is because the number of rows that HBase must examine is not limited as it is when start and + stop rows are used. As long as the row key predicate only applies to a single row, HBase will locate and + return that row. Conversely, if a non-key predicate is used, even if it only applies to a single row, HBase + must still scan the entire table to find the correct result. + </p> + + <div class="example"><h3 class="title sectiontitle">Interpreting EXPLAIN Output for HBase Queries</h3> + + + + <p class="p"> + For example, here are some queries against the following Impala table, which is mapped to an HBase table. + The examples show excerpts from the output of the <code class="ph codeph">EXPLAIN</code> statement, demonstrating what + things to look for to indicate an efficient or inefficient query against an HBase table. + </p> + + <p class="p"> + The first column (<code class="ph codeph">cust_id</code>) was specified as the key column in the <code class="ph codeph">CREATE + EXTERNAL TABLE</code> statement; for performance, it is important to declare this column as + <code class="ph codeph">STRING</code>. Other columns, such as <code class="ph codeph">BIRTH_YEAR</code> and + <code class="ph codeph">NEVER_LOGGED_ON</code>, are also declared as <code class="ph codeph">STRING</code>, rather than their + <span class="q">"natural"</span> types of <code class="ph codeph">INT</code> or <code class="ph codeph">BOOLEAN</code>, because Impala can optimize + those types more effectively in HBase tables. For comparison, we leave one column, + <code class="ph codeph">YEAR_REGISTERED</code>, as <code class="ph codeph">INT</code> to show that filtering on this column is + inefficient. + </p> + +<pre class="pre codeblock"><code>describe hbase_table; +Query: describe hbase_table ++-----------------------+--------+---------+ +| name | type | comment | ++-----------------------+--------+---------+ +| cust_id | <strong class="ph b">string</strong> | | +| birth_year | <strong class="ph b">string</strong> | | +| never_logged_on | <strong class="ph b">string</strong> | | +| private_email_address | string | | +| year_registered | <strong class="ph b">int</strong> | | ++-----------------------+--------+---------+ +</code></pre> + + <p class="p"> + The best case for performance involves a single row lookup using an equality comparison on the column + defined as the row key: + </p> + +<pre class="pre codeblock"><code>explain select count(*) from hbase_table where cust_id = '[email protected]'; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. | +| hbase.hbase_table | +| | +| 03:AGGREGATE [MERGE FINALIZE] | +| | output: sum(count(*)) | +| | | +| 02:EXCHANGE [PARTITION=UNPARTITIONED] | +| | | +| 01:AGGREGATE | +| | output: count(*) | +| | | +<strong class="ph b">| 00:SCAN HBASE [hbase.hbase_table] |</strong> +<strong class="ph b">| start key: [email protected] |</strong> +<strong class="ph b">| stop key: [email protected]\0 |</strong> ++------------------------------------------------------------------------------------+ +</code></pre> + + <p class="p"> + Another type of efficient query involves a range lookup on the row key column, using SQL operators such + as greater than (or equal), less than (or equal), or <code class="ph codeph">BETWEEN</code>. This example also includes + an equality test on a non-key column; because that column is a <code class="ph codeph">STRING</code>, Impala can let + HBase perform that test, indicated by the <code class="ph codeph">hbase filters:</code> line in the + <code class="ph codeph">EXPLAIN</code> output. Doing the filtering within HBase is more efficient than transmitting all + the data to Impala and doing the filtering on the Impala side. + </p> + +<pre class="pre codeblock"><code>explain select count(*) from hbase_table where cust_id between 'a' and 'b' + and never_logged_on = 'true'; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... + +| 01:AGGREGATE | +| | output: count(*) | +| | | +<strong class="ph b">| 00:SCAN HBASE [hbase.hbase_table] |</strong> +<strong class="ph b">| start key: a |</strong> +<strong class="ph b">| stop key: b\0 |</strong> +<strong class="ph b">| hbase filters: cols:never_logged_on EQUAL 'true' |</strong> ++------------------------------------------------------------------------------------+ +</code></pre> + + <p class="p"> + The query is less efficient if Impala has to evaluate any of the predicates, because Impala must scan the + entire HBase table. Impala can only push down predicates to HBase for columns declared as + <code class="ph codeph">STRING</code>. This example tests a column declared as <code class="ph codeph">INT</code>, and the + <code class="ph codeph">predicates:</code> line in the <code class="ph codeph">EXPLAIN</code> output indicates that the test is + performed after the data is transmitted to Impala. + </p> + +<pre class="pre codeblock"><code>explain select count(*) from hbase_table where year_registered = 2010; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... + +| 01:AGGREGATE | +| | output: count(*) | +| | | +<strong class="ph b">| 00:SCAN HBASE [hbase.hbase_table] |</strong> +<strong class="ph b">| predicates: year_registered = 2010 |</strong> ++------------------------------------------------------------------------------------+ +</code></pre> + + <p class="p"> + The same inefficiency applies if the key column is compared to any non-constant value. Here, even though + the key column is a <code class="ph codeph">STRING</code>, and is tested using an equality operator, Impala must scan + the entire HBase table because the key column is compared to another column value rather than a constant. + </p> + +<pre class="pre codeblock"><code>explain select count(*) from hbase_table where cust_id = private_email_address; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... + +| 01:AGGREGATE | +| | output: count(*) | +| | | +<strong class="ph b">| 00:SCAN HBASE [hbase.hbase_table] |</strong> +<strong class="ph b">| predicates: cust_id = private_email_address |</strong> ++------------------------------------------------------------------------------------+ +</code></pre> + + <p class="p"> + Currently, tests on the row key using <code class="ph codeph">OR</code> or <code class="ph codeph">IN</code> clauses are not + optimized into direct lookups either. Such limitations might be lifted in the future, so always check the + <code class="ph codeph">EXPLAIN</code> output to be sure whether a particular SQL construct results in an efficient + query or not for HBase tables. + </p> + +<pre class="pre codeblock"><code>explain select count(*) from hbase_table where + cust_id = '[email protected]' or cust_id = '[email protected]'; ++----------------------------------------------------------------------------------------+ +| Explain String | ++----------------------------------------------------------------------------------------+ +... + +| 01:AGGREGATE | +| | output: count(*) | +| | | +<strong class="ph b">| 00:SCAN HBASE [hbase.hbase_table] |</strong> +<strong class="ph b">| predicates: cust_id = '[email protected]' OR cust_id = '[email protected]' |</strong> ++----------------------------------------------------------------------------------------+ + +explain select count(*) from hbase_table where + cust_id in ('[email protected]', '[email protected]'); ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... + +| 01:AGGREGATE | +| | output: count(*) | +| | | +<strong class="ph b">| 00:SCAN HBASE [hbase.hbase_table] |</strong> +<strong class="ph b">| predicates: cust_id IN ('[email protected]', '[email protected]') |</strong> ++------------------------------------------------------------------------------------+ +</code></pre> + + <p class="p"> + Either rewrite into separate queries for each value and combine the results in the application, or + combine the single-row queries using UNION ALL: + </p> + +<pre class="pre codeblock"><code>select count(*) from hbase_table where cust_id = '[email protected]'; +select count(*) from hbase_table where cust_id = '[email protected]'; + +explain + select count(*) from hbase_table where cust_id = '[email protected]' + union all + select count(*) from hbase_table where cust_id = '[email protected]'; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +... + +| | 04:AGGREGATE | +| | | output: count(*) | +| | | | +<strong class="ph b">| | 03:SCAN HBASE [hbase.hbase_table] |</strong> +<strong class="ph b">| | start key: [email protected] |</strong> +<strong class="ph b">| | stop key: [email protected]\0 |</strong> +| | | +| 10:MERGE | +... + +| 02:AGGREGATE | +| | output: count(*) | +| | | +<strong class="ph b">| 01:SCAN HBASE [hbase.hbase_table] |</strong> +<strong class="ph b">| start key: [email protected] |</strong> +<strong class="ph b">| stop key: [email protected]\0 |</strong> ++------------------------------------------------------------------------------------+ +</code></pre> + + </div> + + <div class="example"><h3 class="title sectiontitle">Configuration Options for Java HBase Applications</h3> + + + + <p class="p"> If you have an HBase Java application that calls the + <code class="ph codeph">setCacheBlocks</code> or <code class="ph codeph">setCaching</code> + methods of the class <a class="xref" href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" target="_blank">org.apache.hadoop.hbase.client.Scan</a>, you can set these same + caching behaviors through Impala query options, to control the memory + pressure on the HBase RegionServer. For example, when doing queries in + HBase that result in full-table scans (which by default are + inefficient for HBase), you can reduce memory usage and speed up the + queries by turning off the <code class="ph codeph">HBASE_CACHE_BLOCKS</code> setting + and specifying a large number for the <code class="ph codeph">HBASE_CACHING</code> + setting. + </p> + + <p class="p"> + To set these options, issue commands like the following in <span class="keyword cmdname">impala-shell</span>: + </p> + +<pre class="pre codeblock"><code>-- Same as calling setCacheBlocks(true) or setCacheBlocks(false). +set hbase_cache_blocks=true; +set hbase_cache_blocks=false; + +-- Same as calling setCaching(rows). +set hbase_caching=1000; +</code></pre> + + <p class="p"> + Or update the <span class="keyword cmdname">impalad</span> defaults file <span class="ph filepath">/etc/default/impala</span> and + include settings for <code class="ph codeph">HBASE_CACHE_BLOCKS</code> and/or <code class="ph codeph">HBASE_CACHING</code> in the + <code class="ph codeph">-default_query_options</code> setting for <code class="ph codeph">IMPALA_SERVER_ARGS</code>. See + <a class="xref" href="impala_config_options.html#config_options">Modifying Impala Startup Options</a> for details. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + In Impala 2.0 and later, these options are settable through the JDBC or ODBC interfaces using the + <code class="ph codeph">SET</code> statement. + </div> + + </div> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="impala_hbase__hbase_scenarios"> + + <h2 class="title topictitle2" id="ariaid-title6">Use Cases for Querying HBase through Impala</h2> + + + <div class="body conbody"> + + <p class="p"> + The following are popular use cases for using Impala to query HBase tables: + </p> + + <ul class="ul"> + <li class="li"> + Keeping large fact tables in Impala, and smaller dimension tables in HBase. The fact tables use Parquet + or other binary file format optimized for scan operations. Join queries scan through the large Impala + fact tables, and cross-reference the dimension tables using efficient single-row lookups in HBase. + </li> + + <li class="li"> + Using HBase to store rapidly incrementing counters, such as how many times a web page has been viewed, or + on a social network, how many connections a user has or how many votes a post received. HBase is + efficient for capturing such changeable data: the append-only storage mechanism is efficient for writing + each change to disk, and a query always returns the latest value. An application could query specific + totals like these from HBase, and combine the results with a broader set of data queried from Impala. + </li> + + <li class="li"> + <p class="p"> + Storing very wide tables in HBase. Wide tables have many columns, possibly thousands, typically + recording many attributes for an important subject such as a user of an online service. These tables + are also often sparse, that is, most of the columns values are <code class="ph codeph">NULL</code>, 0, + <code class="ph codeph">false</code>, empty string, or other blank or placeholder value. (For example, any particular + web site user might have never used some site feature, filled in a certain field in their profile, + visited a particular part of the site, and so on.) A typical query against this kind of table is to + look up a single row to retrieve all the information about a specific subject, rather than summing, + averaging, or filtering millions of rows as in typical Impala-managed tables. + </p> + <p class="p"> + Or the HBase table could be joined with a larger Impala-managed table. For example, analyze the large + Impala table representing web traffic for a site and pick out 50 users who view the most pages. Join + that result with the wide user table in HBase to look up attributes of those users. The HBase side of + the join would result in 50 efficient single-row lookups in HBase, rather than scanning the entire user + table. + </p> + </li> + </ul> + </div> + </article> + + + + + + + + <article class="topic concept nested1" aria-labelledby="ariaid-title7" id="impala_hbase__hbase_loading"> + + <h2 class="title topictitle2" id="ariaid-title7">Loading Data into an HBase Table</h2> + + + <div class="body conbody"> + + <p class="p"> + The Impala <code class="ph codeph">INSERT</code> statement works for HBase tables. The <code class="ph codeph">INSERT ... VALUES</code> + syntax is ideally suited to HBase tables, because inserting a single row is an efficient operation for an + HBase table. (For regular Impala tables, with data files in HDFS, the tiny data files produced by + <code class="ph codeph">INSERT ... VALUES</code> are extremely inefficient, so you would not use that technique with + tables containing any significant data volume.) + </p> + + + + <p class="p"> + When you use the <code class="ph codeph">INSERT ... SELECT</code> syntax, the result in the HBase table could be fewer + rows than you expect. HBase only stores the most recent version of each unique row key, so if an + <code class="ph codeph">INSERT ... SELECT</code> statement copies over multiple rows containing the same value for the + key column, subsequent queries will only return one row with each key column value: + </p> + + <p class="p"> + Although Impala does not have an <code class="ph codeph">UPDATE</code> statement, you can achieve the same effect by + doing successive <code class="ph codeph">INSERT</code> statements using the same value for the key column each time: + </p> + + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title8" id="impala_hbase__hbase_limitations"> + + <h2 class="title topictitle2" id="ariaid-title8">Limitations and Restrictions of the Impala and HBase Integration</h2> + + <div class="body conbody"> + + <p class="p"> + The Impala integration with HBase has the following limitations and restrictions, some inherited from the + integration between HBase and Hive, and some unique to Impala: + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + If you issue a <code class="ph codeph">DROP TABLE</code> for an internal (Impala-managed) table that is mapped to an + HBase table, the underlying table is not removed in HBase. The Hive <code class="ph codeph">DROP TABLE</code> + statement also removes the HBase table in this case. + </p> + </li> + + <li class="li"> + <p class="p"> + The <code class="ph codeph">INSERT OVERWRITE</code> statement is not available for HBase tables. You can insert new + data, or modify an existing row by inserting a new row with the same key value, but not replace the + entire contents of the table. You can do an <code class="ph codeph">INSERT OVERWRITE</code> in Hive if you need this + capability. + </p> + </li> + + <li class="li"> + <p class="p"> + If you issue a <code class="ph codeph">CREATE TABLE LIKE</code> statement for a table mapped to an HBase table, the + new table is also an HBase table, but inherits the same underlying HBase table name as the original. + The new table is effectively an alias for the old one, not a new table with identical column structure. + Avoid using <code class="ph codeph">CREATE TABLE LIKE</code> for HBase tables, to avoid any confusion. + </p> + </li> + + <li class="li"> + <p class="p"> + Copying data into an HBase table using the Impala <code class="ph codeph">INSERT ... SELECT</code> syntax might + produce fewer new rows than are in the query result set. If the result set contains multiple rows with + the same value for the key column, each row supercedes any previous rows with the same key value. + Because the order of the inserted rows is unpredictable, you cannot rely on this technique to preserve + the <span class="q">"latest"</span> version of a particular key value. + </p> + </li> + <li class="li"> + <p class="p"> + Because the complex data types (<code class="ph codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, and <code class="ph codeph">MAP</code>) + available in <span class="keyword">Impala 2.3</span> and higher are currently only supported in Parquet tables, you cannot + use these types in HBase tables that are queried through Impala. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">LOAD DATA</code> statement cannot be used with HBase tables. + </p> + </li> + </ul> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title9" id="impala_hbase__hbase_queries"> + + <h2 class="title topictitle2" id="ariaid-title9">Examples of Querying HBase Tables from Impala</h2> + + <div class="body conbody"> + + <p class="p"> + The following examples create an HBase table with four column families, + create a corresponding table through Hive, + then insert and query the table through Impala. + </p> + <p class="p"> + In HBase shell, the table + name is quoted in <code class="ph codeph">CREATE</code> and <code class="ph codeph">DROP</code> statements. Tables created in HBase + begin in <span class="q">"enabled"</span> state; before dropping them through the HBase shell, you must issue a + <code class="ph codeph">disable '<var class="keyword varname">table_name</var>'</code> statement. + </p> + +<pre class="pre codeblock"><code>$ hbase shell +15/02/10 16:07:45 +HBase Shell; enter 'help<RETURN>' for list of supported commands. +Type "exit<RETURN>" to leave the HBase Shell +... + +hbase(main):001:0> create 'hbasealltypessmall', 'boolsCF', 'intsCF', 'floatsCF', 'stringsCF' +0 row(s) in 4.6520 seconds + +=> Hbase::Table - hbasealltypessmall +hbase(main):006:0> quit +</code></pre> + + <p class="p"> + Issue the following <code class="ph codeph">CREATE TABLE</code> statement in the Hive shell. (The Impala <code class="ph codeph">CREATE + TABLE</code> statement currently does not support the <code class="ph codeph">STORED BY</code> clause, so you switch into Hive to + create the table, then back to Impala and the <span class="keyword cmdname">impala-shell</span> interpreter to issue the + queries.) + </p> + + <p class="p"> + This example creates an external table mapped to the HBase table, usable by both Impala and Hive. It is + defined as an external table so that when dropped by Impala or Hive, the original HBase table is not touched at all. + </p> + + <p class="p"> + The <code class="ph codeph">WITH SERDEPROPERTIES</code> clause + specifies that the first column (<code class="ph codeph">ID</code>) represents the row key, and maps the remaining + columns of the SQL table to HBase column families. The mapping relies on the ordinal order of the + columns in the table, not the column names in the <code class="ph codeph">CREATE TABLE</code> statement. + The first column is defined to be the lookup key; the + <code class="ph codeph">STRING</code> data type produces the fastest key-based lookups for HBase tables. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + For Impala with HBase tables, the most important aspect to ensure good performance is to use a + <code class="ph codeph">STRING</code> column as the row key, as shown in this example. + </div> + +<pre class="pre codeblock"><code>$ hive +... +hive> use hbase; +OK +Time taken: 4.095 seconds +hive> CREATE EXTERNAL TABLE hbasestringids ( + > id string, + > bool_col boolean, + > tinyint_col tinyint, + > smallint_col smallint, + > int_col int, + > bigint_col bigint, + > float_col float, + > double_col double, + > date_string_col string, + > string_col string, + > timestamp_col timestamp) + > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' + > WITH SERDEPROPERTIES ( + > "hbase.columns.mapping" = + > ":key,boolsCF:bool_col,intsCF:tinyint_col,intsCF:smallint_col,intsCF:int_col,intsCF:\ + > bigint_col,floatsCF:float_col,floatsCF:double_col,stringsCF:date_string_col,\ + > stringsCF:string_col,stringsCF:timestamp_col" + > ) + > TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall"); +OK +Time taken: 2.879 seconds +hive> quit; +</code></pre> + + <p class="p"> + Once you have established the mapping to an HBase table, you can issue DML statements and queries + from Impala. The following example shows a series of <code class="ph codeph">INSERT</code> + statements followed by a query. + The ideal kind of query from a performance standpoint + retrieves a row from the table based on a row key + mapped to a string column. + An initial <code class="ph codeph">INVALIDATE METADATA <var class="keyword varname">table_name</var></code> + statement makes the table created through Hive visible to Impala. + </p> + +<pre class="pre codeblock"><code>$ impala-shell -i localhost -d hbase +Starting Impala Shell without Kerberos authentication +Connected to localhost:21000 +... +Query: use `hbase` +[localhost:21000] > invalidate metadata hbasestringids; +Fetched 0 row(s) in 0.09s +[localhost:21000] > desc hbasestringids; ++-----------------+-----------+---------+ +| name | type | comment | ++-----------------+-----------+---------+ +| id | string | | +| bool_col | boolean | | +| double_col | double | | +| float_col | float | | +| bigint_col | bigint | | +| int_col | int | | +| smallint_col | smallint | | +| tinyint_col | tinyint | | +| date_string_col | string | | +| string_col | string | | +| timestamp_col | timestamp | | ++-----------------+-----------+---------+ +Fetched 11 row(s) in 0.02s +[localhost:21000] > insert into hbasestringids values ('0001',true,3.141,9.94,1234567,32768,4000,76,'2014-12-31','Hello world',now()); +Inserted 1 row(s) in 0.26s +[localhost:21000] > insert into hbasestringids values ('0002',false,2.004,6.196,1500,8000,129,127,'2014-01-01','Foo bar',now()); +Inserted 1 row(s) in 0.12s +[localhost:21000] > select * from hbasestringids where id = '0001'; ++------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+ +| id | bool_col | double_col | float_col | bigint_col | int_col | smallint_col | tinyint_col | date_string_col | string_col | timestamp_col | ++------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+ +| 0001 | true | 3.141 | 9.939999580383301 | 1234567 | 32768 | 4000 | 76 | 2014-12-31 | Hello world | 2015-02-10 16:36:59.764838000 | ++------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+ +Fetched 1 row(s) in 0.54s +</code></pre> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + After you create a table in Hive, such as the HBase mapping table in this example, issue an + <code class="ph codeph">INVALIDATE METADATA <var class="keyword varname">table_name</var></code> statement the next time you connect to + Impala, make Impala aware of the new table. (Prior to Impala 1.2.4, you could not specify the table name if + Impala was not aware of the table yet; in Impala 1.2.4 and higher, specifying the table name avoids + reloading the metadata for other tables that are not changed.) + </div> + </div> + </article> +</article></main></body></html> \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_hbase_cache_blocks.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_hbase_cache_blocks.html b/docs/build/html/topics/impala_hbase_cache_blocks.html new file mode 100644 index 0000000..ac76539 --- /dev/null +++ b/docs/build/html/topics/impala_hbase_cache_blocks.html @@ -0,0 +1,36 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="hbase_cache_blocks"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>HBASE_CACHE_BLOCKS Query Option</title></head><body id="hbase_cache_blocks"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">HBASE_CACHE_BLOCKS Query Option</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Setting this option is equivalent to calling the + <code class="ph codeph">setCacheBlocks</code> method of the class <a class="xref" href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" target="_blank">org.apache.hadoop.hbase.client.Scan</a>, in an HBase Java + application. Helps to control the memory pressure on the HBase + RegionServer, in conjunction with the <code class="ph codeph">HBASE_CACHING</code> query + option. </p> + + <p class="p"> + <strong class="ph b">Type:</strong> Boolean; recognized values are 1 and 0, or <code class="ph codeph">true</code> and <code class="ph codeph">false</code>; + any other value interpreted as <code class="ph codeph">false</code> + </p> + <p class="p"> + <strong class="ph b">Default:</strong> <code class="ph codeph">false</code> (shown as 0 in output of <code class="ph codeph">SET</code> statement) + </p> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + <p class="p"> + <a class="xref" href="impala_hbase.html#impala_hbase">Using Impala to Query HBase Tables</a>, + <a class="xref" href="impala_hbase_caching.html#hbase_caching">HBASE_CACHING Query Option</a> + </p> + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_hbase_caching.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_hbase_caching.html b/docs/build/html/topics/impala_hbase_caching.html new file mode 100644 index 0000000..f78d0d5 --- /dev/null +++ b/docs/build/html/topics/impala_hbase_caching.html @@ -0,0 +1,36 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="hbase_caching"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>HBASE_CACHING Query Option</title></head><body id="hbase_caching"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">HBASE_CACHING Query Option</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Setting this option is equivalent to calling the + <code class="ph codeph">setCaching</code> method of the class <a class="xref" href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" target="_blank">org.apache.hadoop.hbase.client.Scan</a>, in an HBase Java + application. Helps to control the memory pressure on the HBase + RegionServer, in conjunction with the <code class="ph codeph">HBASE_CACHE_BLOCKS</code> + query option. </p> + + <p class="p"> + <strong class="ph b">Type:</strong> <code class="ph codeph">BOOLEAN</code> + </p> + + <p class="p"> + <strong class="ph b">Default:</strong> 0 + </p> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + <p class="p"> + <a class="xref" href="impala_hbase.html#impala_hbase">Using Impala to Query HBase Tables</a>, + <a class="xref" href="impala_hbase_cache_blocks.html#hbase_cache_blocks">HBASE_CACHE_BLOCKS Query Option</a> + </p> + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_hints.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_hints.html b/docs/build/html/topics/impala_hints.html new file mode 100644 index 0000000..3b0c81d --- /dev/null +++ b/docs/build/html/topics/impala_hints.html @@ -0,0 +1,306 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_select.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="hints"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Query Hints in Impala SELECT Statements</title></head><body id="hints"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Query Hints in Impala SELECT Statements</h1> + + + + <div class="body conbody"> + + <p class="p"> + + The Impala SQL dialect supports query hints, for fine-tuning the inner workings of queries. Specify hints as + a temporary workaround for expensive queries, where missing statistics or other factors cause inefficient + performance. + </p> + + <p class="p"> + Hints are most often used for the most resource-intensive kinds of Impala queries: + </p> + + <ul class="ul"> + <li class="li"> + Join queries involving large tables, where intermediate result sets are transmitted across the network to + evaluate the join conditions. + </li> + + <li class="li"> + Inserting into partitioned Parquet tables, where many memory buffers could be allocated on each host to + hold intermediate results for each partition. + </li> + </ul> + + <p class="p"> + <strong class="ph b">Syntax:</strong> + </p> + + <p class="p"> + You can represent the hints as keywords surrounded by <code class="ph codeph">[]</code> square brackets; include the + brackets in the text of the SQL statement. + </p> + +<pre class="pre codeblock"><code>SELECT STRAIGHT_JOIN <var class="keyword varname">select_list</var> FROM +<var class="keyword varname">join_left_hand_table</var> + JOIN [{BROADCAST|SHUFFLE}] +<var class="keyword varname">join_right_hand_table</var> +<var class="keyword varname">remainder_of_query</var>; + +INSERT <var class="keyword varname">insert_clauses</var> + [{SHUFFLE|NOSHUFFLE}] + SELECT <var class="keyword varname">remainder_of_query</var>; +</code></pre> + + <p class="p"> + In <span class="keyword">Impala 2.0</span> and higher, you can also specify the hints inside comments that use + either the <code class="ph codeph">/* */</code> or <code class="ph codeph">--</code> notation. Specify a <code class="ph codeph">+</code> symbol + immediately before the hint name. + </p> + +<pre class="pre codeblock"><code>SELECT STRAIGHT_JOIN <var class="keyword varname">select_list</var> FROM +<var class="keyword varname">join_left_hand_table</var> + JOIN /* +BROADCAST|SHUFFLE */ +<var class="keyword varname">join_right_hand_table</var> +<var class="keyword varname">remainder_of_query</var>; + +SELECT <var class="keyword varname">select_list</var> FROM +<var class="keyword varname">join_left_hand_table</var> + JOIN -- +BROADCAST|SHUFFLE +<var class="keyword varname">join_right_hand_table</var> +<var class="keyword varname">remainder_of_query</var>; + +INSERT <var class="keyword varname">insert_clauses</var> + /* +SHUFFLE|NOSHUFFLE */ + SELECT <var class="keyword varname">remainder_of_query</var>; + +INSERT <var class="keyword varname">insert_clauses</var> + -- +SHUFFLE|NOSHUFFLE + SELECT <var class="keyword varname">remainder_of_query</var>; +</code></pre> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + + <p class="p"> + With both forms of hint syntax, include the <code class="ph codeph">STRAIGHT_JOIN</code> + keyword immediately after the <code class="ph codeph">SELECT</code> keyword to prevent Impala from + reordering the tables in a way that makes the hint ineffective. + </p> + + <p class="p"> + To reduce the need to use hints, run the <code class="ph codeph">COMPUTE STATS</code> statement against all tables involved + in joins, or used as the source tables for <code class="ph codeph">INSERT ... SELECT</code> operations where the + destination is a partitioned Parquet table. Do this operation after loading data or making substantial + changes to the data within each table. Having up-to-date statistics helps Impala choose more efficient query + plans without the need for hinting. See <a class="xref" href="impala_perf_stats.html#perf_stats">Table and Column Statistics</a> for details and + examples. + </p> + + <p class="p"> + To see which join strategy is used for a particular query, examine the <code class="ph codeph">EXPLAIN</code> output for + that query. See <a class="xref" href="impala_explain_plan.html#perf_explain">Using the EXPLAIN Plan for Performance Tuning</a> for details and examples. + </p> + + <p class="p"> + <strong class="ph b">Hints for join queries:</strong> + </p> + + <p class="p"> + The <code class="ph codeph">[BROADCAST]</code> and <code class="ph codeph">[SHUFFLE]</code> hints control the execution strategy for join + queries. Specify one of the following constructs immediately after the <code class="ph codeph">JOIN</code> keyword in a + query: + </p> + + <ul class="ul"> + <li class="li"> + <code class="ph codeph">[SHUFFLE]</code> - Makes that join operation use the <span class="q">"partitioned"</span> technique, which divides + up corresponding rows from both tables using a hashing algorithm, sending subsets of the rows to other + nodes for processing. (The keyword <code class="ph codeph">SHUFFLE</code> is used to indicate a <span class="q">"partitioned join"</span>, + because that type of join is not related to <span class="q">"partitioned tables"</span>.) Since the alternative + <span class="q">"broadcast"</span> join mechanism is the default when table and index statistics are unavailable, you might + use this hint for queries where broadcast joins are unsuitable; typically, partitioned joins are more + efficient for joins between large tables of similar size. + </li> + + <li class="li"> + <code class="ph codeph">[BROADCAST]</code> - Makes that join operation use the <span class="q">"broadcast"</span> technique that sends the + entire contents of the right-hand table to all nodes involved in processing the join. This is the default + mode of operation when table and index statistics are unavailable, so you would typically only need it if + stale metadata caused Impala to mistakenly choose a partitioned join operation. Typically, broadcast joins + are more efficient in cases where one table is much smaller than the other. (Put the smaller table on the + right side of the <code class="ph codeph">JOIN</code> operator.) + </li> + </ul> + + <p class="p"> + <strong class="ph b">Hints for INSERT ... SELECT queries:</strong> + </p> + + <div class="p"> + When inserting into partitioned tables, especially using the Parquet file format, you can include a hint in + the <code class="ph codeph">INSERT</code> statement to fine-tune the overall performance of the operation and its + resource usage: + <ul class="ul"> + <li class="li"> + These hints are available in Impala 1.2.2 and higher. + </li> + + <li class="li"> + You would only use these hints if an <code class="ph codeph">INSERT</code> into a partitioned Parquet table was + failing due to capacity limits, or if such an <code class="ph codeph">INSERT</code> was succeeding but with + less-than-optimal performance. + </li> + + <li class="li"> + To use these hints, put the hint keyword <code class="ph codeph">[SHUFFLE]</code> or <code class="ph codeph">[NOSHUFFLE]</code> + (including the square brackets) after the <code class="ph codeph">PARTITION</code> clause, immediately before the + <code class="ph codeph">SELECT</code> keyword. + </li> + + <li class="li"> + <code class="ph codeph">[SHUFFLE]</code> selects an execution plan that minimizes the number of files being written + simultaneously to HDFS, and the number of memory buffers holding data for individual partitions. Thus + it reduces overall resource usage for the <code class="ph codeph">INSERT</code> operation, allowing some + <code class="ph codeph">INSERT</code> operations to succeed that otherwise would fail. It does involve some data + transfer between the nodes so that the data files for a particular partition are all constructed on the + same node. + </li> + + <li class="li"> + <code class="ph codeph">[NOSHUFFLE]</code> selects an execution plan that might be faster overall, but might also + produce a larger number of small data files or exceed capacity limits, causing the + <code class="ph codeph">INSERT</code> operation to fail. Use <code class="ph codeph">[SHUFFLE]</code> in cases where an + <code class="ph codeph">INSERT</code> statement fails or runs inefficiently due to all nodes attempting to construct + data for all partitions. + </li> + + <li class="li"> + Impala automatically uses the <code class="ph codeph">[SHUFFLE]</code> method if any partition key column in the + source table, mentioned in the <code class="ph codeph">INSERT ... SELECT</code> query, does not have column + statistics. In this case, only the <code class="ph codeph">[NOSHUFFLE]</code> hint would have any effect. + </li> + + <li class="li"> + If column statistics are available for all partition key columns in the source table mentioned in the + <code class="ph codeph">INSERT ... SELECT</code> query, Impala chooses whether to use the <code class="ph codeph">[SHUFFLE]</code> + or <code class="ph codeph">[NOSHUFFLE]</code> technique based on the estimated number of distinct values in those + columns and the number of nodes involved in the <code class="ph codeph">INSERT</code> operation. In this case, you + might need the <code class="ph codeph">[SHUFFLE]</code> or the <code class="ph codeph">[NOSHUFFLE]</code> hint to override the + execution plan selected by Impala. + </li> + </ul> + </div> + + <p class="p"> + <strong class="ph b">Suggestions versus directives:</strong> + </p> + + <p class="p"> + In early Impala releases, hints were always obeyed and so acted more like directives. Once Impala gained join + order optimizations, sometimes join queries were automatically reordered in a way that made a hint + irrelevant. Therefore, the hints act more like suggestions in Impala 1.2.2 and higher. + </p> + + <p class="p"> + To force Impala to follow the hinted execution mechanism for a join query, include the + <code class="ph codeph">STRAIGHT_JOIN</code> keyword in the <code class="ph codeph">SELECT</code> statement. See + <a class="xref" href="impala_perf_joins.html#straight_join">Overriding Join Reordering with STRAIGHT_JOIN</a> for details. When you use this technique, Impala does not + reorder the joined tables at all, so you must be careful to arrange the join order to put the largest table + (or subquery result set) first, then the smallest, second smallest, third smallest, and so on. This ordering lets Impala do the + most I/O-intensive parts of the query using local reads on the DataNodes, and then reduce the size of the + intermediate result set as much as possible as each subsequent table or subquery result set is joined. + </p> + + <p class="p"> + <strong class="ph b">Restrictions:</strong> + </p> + + <p class="p"> + Queries that include subqueries in the <code class="ph codeph">WHERE</code> clause can be rewritten internally as join + queries. Currently, you cannot apply hints to the joins produced by these types of queries. + </p> + + <p class="p"> + Because hints can prevent queries from taking advantage of new metadata or improvements in query planning, + use them only when required to work around performance issues, and be prepared to remove them when they are + no longer required, such as after a new Impala release or bug fix. + </p> + + <p class="p"> + In particular, the <code class="ph codeph">[BROADCAST]</code> and <code class="ph codeph">[SHUFFLE]</code> hints are expected to be + needed much less frequently in Impala 1.2.2 and higher, because the join order optimization feature in + combination with the <code class="ph codeph">COMPUTE STATS</code> statement now automatically choose join order and join + mechanism without the need to rewrite the query and add hints. See + <a class="xref" href="impala_perf_joins.html#perf_joins">Performance Considerations for Join Queries</a> for details. + </p> + + <p class="p"> + <strong class="ph b">Compatibility:</strong> + </p> + + <p class="p"> + The hints embedded within <code class="ph codeph">--</code> comments are compatible with Hive queries. The hints embedded + within <code class="ph codeph">/* */</code> comments or <code class="ph codeph">[ ]</code> square brackets are not recognized by or not + compatible with Hive. For example, Hive raises an error for Impala hints within <code class="ph codeph">/* */</code> + comments because it does not recognize the Impala hint names. + </p> + + <p class="p"> + <strong class="ph b">Considerations for views:</strong> + </p> + + <p class="p"> + If you use a hint in the query that defines a view, the hint is preserved when you query the view. Impala + internally rewrites all hints in views to use the <code class="ph codeph">--</code> comment notation, so that Hive can + query such views without errors due to unrecognized hint names. + </p> + + <p class="p"> + <strong class="ph b">Examples:</strong> + </p> + + <p class="p"> + For example, this query joins a large customer table with a small lookup table of less than 100 rows. The + right-hand table can be broadcast efficiently to all nodes involved in the join. Thus, you would use the + <code class="ph codeph">[broadcast]</code> hint to force a broadcast join strategy: + </p> + +<pre class="pre codeblock"><code>select straight_join customer.address, state_lookup.state_name + from customer join <strong class="ph b">[broadcast]</strong> state_lookup + on customer.state_id = state_lookup.state_id;</code></pre> + + <p class="p"> + This query joins two large tables of unpredictable size. You might benchmark the query with both kinds of + hints and find that it is more efficient to transmit portions of each table to other nodes for processing. + Thus, you would use the <code class="ph codeph">[shuffle]</code> hint to force a partitioned join strategy: + </p> + +<pre class="pre codeblock"><code>select straight_join weather.wind_velocity, geospatial.altitude + from weather join <strong class="ph b">[shuffle]</strong> geospatial + on weather.lat = geospatial.lat and weather.long = geospatial.long;</code></pre> + + <p class="p"> + For joins involving three or more tables, the hint applies to the tables on either side of that specific + <code class="ph codeph">JOIN</code> keyword. The <code class="ph codeph">STRAIGHT_JOIN</code> keyword ensures that joins are processed + in a predictable order from left to right. For example, this query joins + <code class="ph codeph">t1</code> and <code class="ph codeph">t2</code> using a partitioned join, then joins that result set to + <code class="ph codeph">t3</code> using a broadcast join: + </p> + +<pre class="pre codeblock"><code>select straight_join t1.name, t2.id, t3.price + from t1 join <strong class="ph b">[shuffle]</strong> t2 join <strong class="ph b">[broadcast]</strong> t3 + on t1.id = t2.id and t2.id = t3.id;</code></pre> + + + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + + <p class="p"> + For more background information about join queries, see <a class="xref" href="impala_joins.html#joins">Joins in Impala SELECT Statements</a>. For + performance considerations, see <a class="xref" href="impala_perf_joins.html#perf_joins">Performance Considerations for Join Queries</a>. + </p> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_select.html">SELECT Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_identifiers.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_identifiers.html b/docs/build/html/topics/impala_identifiers.html new file mode 100644 index 0000000..3a46f8e --- /dev/null +++ b/docs/build/html/topics/impala_identifiers.html @@ -0,0 +1,110 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_schema_objects.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="identifiers"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Overview of Impala Identifiers</title></head><body id="identifiers"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Overview of Impala Identifiers</h1> + + + + <div class="body conbody"> + + <p class="p"> + Identifiers are the names of databases, tables, or columns that you specify in a SQL statement. The rules for + identifiers govern what names you can give to things you create, the notation for referring to names + containing unusual characters, and other aspects such as case sensitivity. + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + The minimum length of an identifier is 1 character. + </p> + </li> + + <li class="li"> + <p class="p"> + The maximum length of an identifier is currently 128 characters, enforced by the metastore database. + </p> + </li> + + <li class="li"> + <p class="p"> + An identifier must start with an alphabetic character. The remainder can contain any combination of + alphanumeric characters and underscores. Quoting the identifier with backticks has no effect on the allowed + characters in the name. + </p> + </li> + + <li class="li"> + <p class="p"> + An identifier can contain only ASCII characters. + </p> + </li> + + <li class="li"> + <p class="p"> + To use an identifier name that matches one of the Impala reserved keywords (listed in + <a class="xref" href="impala_reserved_words.html#reserved_words">Impala Reserved Words</a>), surround the identifier with <code class="ph codeph">``</code> + characters (backticks). Quote the reserved word even if it is part of a fully qualified name. + The following example shows how a reserved word can be used as a column name if it is quoted + with backticks in the <code class="ph codeph">CREATE TABLE</code> statement, and how the column name + must also be quoted with backticks in a query: + </p> +<pre class="pre codeblock"><code>[localhost:21000] > create table reserved (`data` string); + +[localhost:21000] > select data from reserved; +ERROR: AnalysisException: Syntax error in line 1: +select data from reserved + ^ +Encountered: DATA +Expected: ALL, CASE, CAST, DISTINCT, EXISTS, FALSE, IF, INTERVAL, NOT, NULL, STRAIGHT_JOIN, TRUE, IDENTIFIER +CAUSED BY: Exception: Syntax error + +[localhost:21000] > select reserved.data from reserved; +ERROR: AnalysisException: Syntax error in line 1: +select reserved.data from reserved + ^ +Encountered: DATA +Expected: IDENTIFIER +CAUSED BY: Exception: Syntax error + +[localhost:21000] > select reserved.`data` from reserved; + +[localhost:21000] > +</code></pre> + + <div class="note important note_important"><span class="note__title importanttitle">Important:</span> + Because the list of reserved words grows over time as new SQL syntax is added, + consider adopting coding conventions (especially for any automated scripts + or in packaged applications) to always quote all identifiers with backticks. + Quoting all identifiers protects your SQL from compatibility issues if + new reserved words are added in later releases. + </div> + + </li> + + <li class="li"> + <p class="p"> + Impala identifiers are always case-insensitive. That is, tables named <code class="ph codeph">t1</code> and + <code class="ph codeph">T1</code> always refer to the same table, regardless of quote characters. Internally, Impala + always folds all specified table and column names to lowercase. This is why the column headers in query + output are always displayed in lowercase. + </p> + </li> + </ul> + + <p class="p"> + See <a class="xref" href="impala_aliases.html#aliases">Overview of Impala Aliases</a> for how to define shorter or easier-to-remember aliases if the + original names are long or cryptic identifiers. + <span class="ph"> Aliases follow the same rules as identifiers when it comes to case + insensitivity. Aliases can be longer than identifiers (up to the maximum length of a Java string) and can + include additional characters such as spaces and dashes when they are quoted using backtick characters. + </span> + </p> + + <p class="p"> + Another way to define different names for the same tables or columns is to create views. See + <a class="xref" href="../shared/../topics/impala_views.html#views">Overview of Impala Views</a> for details. + </p> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_schema_objects.html">Impala Schema Objects and Object Names</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_impala_shell.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_impala_shell.html b/docs/build/html/topics/impala_impala_shell.html new file mode 100644 index 0000000..020a588 --- /dev/null +++ b/docs/build/html/topics/impala_impala_shell.html @@ -0,0 +1,87 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_shell_options.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_connecting.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_shell_running_commands.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_shell_commands.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="impala_shell"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using the Impala Shell (impala-shell Command)</title></head><body id= "impala_shell"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Using the Impala Shell (impala-shell Command)</h1> + + + + <div class="body conbody"> + + <p class="p"> + + You can use the Impala shell tool (<code class="ph codeph">impala-shell</code>) to set up databases and tables, insert + data, and issue queries. For ad hoc queries and exploration, you can submit SQL statements in an interactive + session. To automate your work, you can specify command-line options to process a single statement or a + script file. The <span class="keyword cmdname">impala-shell</span> interpreter accepts all the same SQL statements listed in + <a class="xref" href="impala_langref_sql.html#langref_sql">Impala SQL Statements</a>, plus some shell-only commands that you can use for tuning + performance and diagnosing problems. + </p> + + <p class="p"> + The <code class="ph codeph">impala-shell</code> command fits into the familiar Unix toolchain: + </p> + + <ul class="ul"> + <li class="li"> + The <code class="ph codeph">-q</code> option lets you issue a single query from the command line, without starting the + interactive interpreter. You could use this option to run <code class="ph codeph">impala-shell</code> from inside a shell + script or with the command invocation syntax from a Python, Perl, or other kind of script. + </li> + + <li class="li"> + The <code class="ph codeph">-f</code> option lets you process a file containing multiple SQL statements, + such as a set of reports or DDL statements to create a group of tables and views. + </li> + + <li class="li"> + The <code class="ph codeph">--var</code> option lets you pass substitution variables to the statements that + are executed by that <span class="keyword cmdname">impala-shell</span> session, for example the statements + in a script file processed by the <code class="ph codeph">-f</code> option. You encode the substitution variable + on the command line using the notation + <code class="ph codeph">--var=<var class="keyword varname">variable_name</var>=<var class="keyword varname">value</var></code>. + Within a SQL statement, you substitute the value by using the notation <code class="ph codeph">${var:<var class="keyword varname">variable_name</var>}</code>. + This feature is available in <span class="keyword">Impala 2.5</span> and higher. + </li> + + <li class="li"> + The <code class="ph codeph">-o</code> option lets you save query output to a file. + </li> + + <li class="li"> + The <code class="ph codeph">-B</code> option turns off pretty-printing, so that you can produce comma-separated, + tab-separated, or other delimited text files as output. (Use the <code class="ph codeph">--output_delimiter</code> option + to choose the delimiter character; the default is the tab character.) + </li> + + <li class="li"> + In non-interactive mode, query output is printed to <code class="ph codeph">stdout</code> or to the file specified by the + <code class="ph codeph">-o</code> option, while incidental output is printed to <code class="ph codeph">stderr</code>, so that you can + process just the query output as part of a Unix pipeline. + </li> + + <li class="li"> + In interactive mode, <code class="ph codeph">impala-shell</code> uses the <code class="ph codeph">readline</code> facility to recall + and edit previous commands. + </li> + </ul> + + <p class="p"> + For information on installing the Impala shell, see <a class="xref" href="impala_install.html#install">Installing Impala</a>. + </p> + + <p class="p"> + For information about establishing a connection to a DataNode running the <code class="ph codeph">impalad</code> daemon + through the <code class="ph codeph">impala-shell</code> command, see <a class="xref" href="impala_connecting.html#connecting">Connecting to impalad through impala-shell</a>. + </p> + + <p class="p"> + For a list of the <code class="ph codeph">impala-shell</code> command-line options, see + <a class="xref" href="impala_shell_options.html#shell_options">impala-shell Configuration Options</a>. For reference information about the + <code class="ph codeph">impala-shell</code> interactive commands, see + <a class="xref" href="impala_shell_commands.html#shell_commands">impala-shell Command Reference</a>. + </p> + + <p class="p toc"></p> + </div> +<nav role="navigation" class="related-links"><ul class="ullinks"><li class="link ulchildlink"><strong><a href="../topics/impala_shell_options.html">impala-shell Configuration Options</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_connecting.html">Connecting to impalad through impala-shell</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_shell_running_commands.html">Running Commands and SQL Statements in impala-shell</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_shell_commands.html">impala-shell Command Reference</a></strong><br></li></ul></nav></article></main></body></html> \ No newline at end of file
