http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_porting.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_porting.html b/docs/build/html/topics/impala_porting.html new file mode 100644 index 0000000..cc4cb29 --- /dev/null +++ b/docs/build/html/topics/impala_porting.html @@ -0,0 +1,603 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_langref.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="porting"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Porting SQL from Other Database Systems to Impala</title></head><body id="porting"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Porting SQL from Other Database Systems to Impala</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Although Impala uses standard SQL for queries, you might need to modify SQL source when bringing applications + to Impala, due to variations in data types, built-in functions, vendor language extensions, and + Hadoop-specific syntax. Even when SQL is working correctly, you might make further minor modifications for + best performance. + </p> + + <p class="p toc inpage"></p> + </div> + + <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_langref.html">Impala SQL Language Reference</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="porting__porting_ddl_dml"> + + <h2 class="title topictitle2" id="ariaid-title2">Porting DDL and DML Statements</h2> + + <div class="body conbody"> + + <p class="p"> + When adapting SQL code from a traditional database system to Impala, expect to find a number of differences + in the DDL statements that you use to set up the schema. Clauses related to physical layout of files, + tablespaces, and indexes have no equivalent in Impala. You might restructure your schema considerably to + account for the Impala partitioning scheme and Hadoop file formats. + </p> + + <p class="p"> + Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor + extensions and features not yet supported in Impala, you might be able to run identical or almost-identical + query text on both systems. + </p> + + <p class="p"> + Therefore, consider separating out the DDL into a separate Impala-specific setup script. Focus your reuse + and ongoing tuning efforts on the code for SQL queries. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="porting__porting_data_types"> + + <h2 class="title topictitle2" id="ariaid-title3">Porting Data Types from Other Database Systems</h2> + + <div class="body conbody"> + + <ul class="ul"> + <li class="li"> + <p class="p"> + Change any <code class="ph codeph">VARCHAR</code>, <code class="ph codeph">VARCHAR2</code>, and <code class="ph codeph">CHAR</code> columns to + <code class="ph codeph">STRING</code>. Remove any length constraints from the column declarations; for example, + change <code class="ph codeph">VARCHAR(32)</code> or <code class="ph codeph">CHAR(1)</code> to <code class="ph codeph">STRING</code>. Impala is + very flexible about the length of string values; it does not impose any length constraints + or do any special processing (such as blank-padding) for <code class="ph codeph">STRING</code> columns. + (In Impala 2.0 and higher, there are data types <code class="ph codeph">VARCHAR</code> and <code class="ph codeph">CHAR</code>, + with length constraints for both types and blank-padding for <code class="ph codeph">CHAR</code>. + However, for performance reasons, it is still preferable to use <code class="ph codeph">STRING</code> + columns where practical.) + </p> + </li> + + <li class="li"> + <p class="p"> + For national language character types such as <code class="ph codeph">NCHAR</code>, <code class="ph codeph">NVARCHAR</code>, or + <code class="ph codeph">NCLOB</code>, be aware that while Impala can store and query UTF-8 character data, currently + some string manipulation operations only work correctly with ASCII data. See + <a class="xref" href="impala_string.html#string">STRING Data Type</a> for details. + </p> + </li> + + <li class="li"> + <p class="p"> + Change any <code class="ph codeph">DATE</code>, <code class="ph codeph">DATETIME</code>, or <code class="ph codeph">TIME</code> columns to + <code class="ph codeph">TIMESTAMP</code>. Remove any precision constraints. Remove any timezone clauses, and make + sure your application logic or ETL process accounts for the fact that Impala expects all + <code class="ph codeph">TIMESTAMP</code> values to be in + <a class="xref" href="http://en.wikipedia.org/wiki/Coordinated_Universal_Time" target="_blank">Coordinated + Universal Time (UTC)</a>. See <a class="xref" href="impala_timestamp.html#timestamp">TIMESTAMP Data Type</a> for information about + the <code class="ph codeph">TIMESTAMP</code> data type, and + <a class="xref" href="impala_datetime_functions.html#datetime_functions">Impala Date and Time Functions</a> for conversion functions for different + date and time formats. + </p> + <p class="p"> + You might also need to adapt date- and time-related literal values and format strings to use the + supported Impala date and time formats. If you have date and time literals with different separators or + different numbers of <code class="ph codeph">YY</code>, <code class="ph codeph">MM</code>, and so on placeholders than Impala + expects, consider using calls to <code class="ph codeph">regexp_replace()</code> to transform those values to the + Impala-compatible format. See <a class="xref" href="impala_timestamp.html#timestamp">TIMESTAMP Data Type</a> for information about the + allowed formats for date and time literals, and + <a class="xref" href="impala_string_functions.html#string_functions">Impala String Functions</a> for string conversion functions such as + <code class="ph codeph">regexp_replace()</code>. + </p> + <p class="p"> + Instead of <code class="ph codeph">SYSDATE</code>, call the function <code class="ph codeph">NOW()</code>. + </p> + <p class="p"> + Instead of adding or subtracting directly from a date value to produce a value <var class="keyword varname">N</var> + days in the past or future, use an <code class="ph codeph">INTERVAL</code> expression, for example <code class="ph codeph">NOW() + + INTERVAL 30 DAYS</code>. + </p> + </li> + + <li class="li"> + <p class="p"> + Although Impala supports <code class="ph codeph">INTERVAL</code> expressions for datetime arithmetic, as shown in + <a class="xref" href="impala_timestamp.html#timestamp">TIMESTAMP Data Type</a>, <code class="ph codeph">INTERVAL</code> is not available as a column + data type in Impala. For any <code class="ph codeph">INTERVAL</code> values stored in tables, convert them to numeric + values that you can add or subtract using the functions in + <a class="xref" href="impala_datetime_functions.html#datetime_functions">Impala Date and Time Functions</a>. For example, if you had a table + <code class="ph codeph">DEADLINES</code> with an <code class="ph codeph">INT</code> column <code class="ph codeph">TIME_PERIOD</code>, you could + construct dates N days in the future like so: + </p> +<pre class="pre codeblock"><code>SELECT NOW() + INTERVAL time_period DAYS from deadlines;</code></pre> + </li> + + <li class="li"> + <p class="p"> + For <code class="ph codeph">YEAR</code> columns, change to the smallest Impala integer type that has sufficient + range. See <a class="xref" href="impala_datatypes.html#datatypes">Data Types</a> for details about ranges, casting, and so on + for the various numeric data types. + </p> + </li> + + <li class="li"> + <p class="p"> + Change any <code class="ph codeph">DECIMAL</code> and <code class="ph codeph">NUMBER</code> types. If fixed-point precision is not + required, you can use <code class="ph codeph">FLOAT</code> or <code class="ph codeph">DOUBLE</code> on the Impala side depending on + the range of values. For applications that require precise decimal values, such as financial data, you + might need to make more extensive changes to table structure and application logic, such as using + separate integer columns for dollars and cents, or encoding numbers as string values and writing UDFs + to manipulate them. See <a class="xref" href="impala_datatypes.html#datatypes">Data Types</a> for details about ranges, + casting, and so on for the various numeric data types. + </p> + </li> + + <li class="li"> + <p class="p"> + <code class="ph codeph">FLOAT</code>, <code class="ph codeph">DOUBLE</code>, and <code class="ph codeph">REAL</code> types are supported in + Impala. Remove any precision and scale specifications. (In Impala, <code class="ph codeph">REAL</code> is just an + alias for <code class="ph codeph">DOUBLE</code>; columns declared as <code class="ph codeph">REAL</code> are turned into + <code class="ph codeph">DOUBLE</code> behind the scenes.) See <a class="xref" href="impala_datatypes.html#datatypes">Data Types</a> for + details about ranges, casting, and so on for the various numeric data types. + </p> + </li> + + <li class="li"> + <p class="p"> + Most integer types from other systems have equivalents in Impala, perhaps under different names such as + <code class="ph codeph">BIGINT</code> instead of <code class="ph codeph">INT8</code>. For any that are unavailable, for example + <code class="ph codeph">MEDIUMINT</code>, switch to the smallest Impala integer type that has sufficient range. + Remove any precision specifications. See <a class="xref" href="impala_datatypes.html#datatypes">Data Types</a> for details + about ranges, casting, and so on for the various numeric data types. + </p> + </li> + + <li class="li"> + <p class="p"> + Remove any <code class="ph codeph">UNSIGNED</code> constraints. All Impala numeric types are signed. See + <a class="xref" href="impala_datatypes.html#datatypes">Data Types</a> for details about ranges, casting, and so on for the + various numeric data types. + </p> + </li> + + <li class="li"> + <p class="p"> + For any types holding bitwise values, use an integer type with enough range to hold all the relevant + bits within a positive integer. See <a class="xref" href="impala_datatypes.html#datatypes">Data Types</a> for details about + ranges, casting, and so on for the various numeric data types. + </p> + <p class="p"> + For example, <code class="ph codeph">TINYINT</code> has a maximum positive value of 127, not 256, so to manipulate + 8-bit bitfields as positive numbers switch to the next largest type <code class="ph codeph">SMALLINT</code>. + </p> +<pre class="pre codeblock"><code>[localhost:21000] > select cast(127*2 as tinyint); ++--------------------------+ +| cast(127 * 2 as tinyint) | ++--------------------------+ +| -2 | ++--------------------------+ +[localhost:21000] > select cast(128 as tinyint); ++----------------------+ +| cast(128 as tinyint) | ++----------------------+ +| -128 | ++----------------------+ +[localhost:21000] > select cast(127*2 as smallint); ++---------------------------+ +| cast(127 * 2 as smallint) | ++---------------------------+ +| 254 | ++---------------------------+</code></pre> + <p class="p"> + Impala does not support notation such as <code class="ph codeph">b'0101'</code> for bit literals. + </p> + </li> + + <li class="li"> + <p class="p"> + For BLOB values, use <code class="ph codeph">STRING</code> to represent <code class="ph codeph">CLOB</code> or + <code class="ph codeph">TEXT</code> types (character based large objects) up to 32 KB in size. Binary large objects + such as <code class="ph codeph">BLOB</code>, <code class="ph codeph">RAW</code> <code class="ph codeph">BINARY</code>, and + <code class="ph codeph">VARBINARY</code> do not currently have an equivalent in Impala. + </p> + </li> + + <li class="li"> + <p class="p"> + For Boolean-like types such as <code class="ph codeph">BOOL</code>, use the Impala <code class="ph codeph">BOOLEAN</code> type. + </p> + </li> + + <li class="li"> + <p class="p"> + Because Impala currently does not support composite or nested types, any spatial data types in other + database systems do not have direct equivalents in Impala. You could represent spatial values in string + format and write UDFs to process them. See <a class="xref" href="impala_udf.html#udfs">Impala User-Defined Functions (UDFs)</a> for details. Where + practical, separate spatial types into separate tables so that Impala can still work with the + non-spatial data. + </p> + </li> + + <li class="li"> + <p class="p"> + Take out any <code class="ph codeph">DEFAULT</code> clauses. Impala can use data files produced from many different + sources, such as Pig, Hive, or MapReduce jobs. The fast import mechanisms of <code class="ph codeph">LOAD DATA</code> + and external tables mean that Impala is flexible about the format of data files, and Impala does not + necessarily validate or cleanse data before querying it. When copying data through Impala + <code class="ph codeph">INSERT</code> statements, you can use conditional functions such as <code class="ph codeph">CASE</code> or + <code class="ph codeph">NVL</code> to substitute some other value for <code class="ph codeph">NULL</code> fields; see + <a class="xref" href="impala_conditional_functions.html#conditional_functions">Impala Conditional Functions</a> for details. + </p> + </li> + + <li class="li"> + <p class="p"> + Take out any constraints from your <code class="ph codeph">CREATE TABLE</code> and <code class="ph codeph">ALTER TABLE</code> + statements, for example <code class="ph codeph">PRIMARY KEY</code>, <code class="ph codeph">FOREIGN KEY</code>, + <code class="ph codeph">UNIQUE</code>, <code class="ph codeph">NOT NULL</code>, <code class="ph codeph">UNSIGNED</code>, or + <code class="ph codeph">CHECK</code> constraints. Impala can use data files produced from many different sources, + such as Pig, Hive, or MapReduce jobs. Therefore, Impala expects initial data validation to happen + earlier during the ETL or ELT cycle. After data is loaded into Impala tables, you can perform queries + to test for <code class="ph codeph">NULL</code> values. When copying data through Impala <code class="ph codeph">INSERT</code> + statements, you can use conditional functions such as <code class="ph codeph">CASE</code> or <code class="ph codeph">NVL</code> to + substitute some other value for <code class="ph codeph">NULL</code> fields; see + <a class="xref" href="impala_conditional_functions.html#conditional_functions">Impala Conditional Functions</a> for details. + </p> + <p class="p"> + Do as much verification as practical before loading data into Impala. After data is loaded into Impala, + you can do further verification using SQL queries to check if values have expected ranges, if values + are <code class="ph codeph">NULL</code> or not, and so on. If there is a problem with the data, you will need to + re-run earlier stages of the ETL process, or do an <code class="ph codeph">INSERT ... SELECT</code> statement in + Impala to copy the faulty data to a new table and transform or filter out the bad values. + </p> + </li> + + <li class="li"> + <p class="p"> + Take out any <code class="ph codeph">CREATE INDEX</code>, <code class="ph codeph">DROP INDEX</code>, and <code class="ph codeph">ALTER + INDEX</code> statements, and equivalent <code class="ph codeph">ALTER TABLE</code> statements. Remove any + <code class="ph codeph">INDEX</code>, <code class="ph codeph">KEY</code>, or <code class="ph codeph">PRIMARY KEY</code> clauses from + <code class="ph codeph">CREATE TABLE</code> and <code class="ph codeph">ALTER TABLE</code> statements. Impala is optimized for bulk + read operations for data warehouse-style queries, and therefore does not support indexes for its + tables. + </p> + </li> + + <li class="li"> + <p class="p"> + Calls to built-in functions with out-of-range or otherwise incorrect arguments, return + <code class="ph codeph">NULL</code> in Impala as opposed to raising exceptions. (This rule applies even when the + <code class="ph codeph">ABORT_ON_ERROR=true</code> query option is in effect.) Run small-scale queries using + representative data to doublecheck that calls to built-in functions are returning expected values + rather than <code class="ph codeph">NULL</code>. For example, unsupported <code class="ph codeph">CAST</code> operations do not + raise an error in Impala: + </p> +<pre class="pre codeblock"><code>select cast('foo' as int); ++--------------------+ +| cast('foo' as int) | ++--------------------+ +| NULL | ++--------------------+</code></pre> + </li> + + <li class="li"> + <p class="p"> + For any other type not supported in Impala, you could represent their values in string format and write + UDFs to process them. See <a class="xref" href="impala_udf.html#udfs">Impala User-Defined Functions (UDFs)</a> for details. + </p> + </li> + + <li class="li"> + <p class="p"> + To detect the presence of unsupported or unconvertable data types in data files, do initial testing + with the <code class="ph codeph">ABORT_ON_ERROR=true</code> query option in effect. This option causes queries to + fail immediately if they encounter disallowed type conversions. See + <a class="xref" href="impala_abort_on_error.html#abort_on_error">ABORT_ON_ERROR Query Option</a> for details. For example: + </p> +<pre class="pre codeblock"><code>set abort_on_error=true; +select count(*) from (select * from t1); +-- The above query will fail if the data files for T1 contain any +-- values that can't be converted to the expected Impala data types. +-- For example, if T1.C1 is defined as INT but the column contains +-- floating-point values like 1.1, the query will return an error.</code></pre> + </li> + </ul> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="porting__porting_statements"> + + <h2 class="title topictitle2" id="ariaid-title4">SQL Statements to Remove or Adapt</h2> + + <div class="body conbody"> + + <p class="p"> + Some SQL statements or clauses that you might be familiar with are not currently supported in Impala: + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + Impala has no <code class="ph codeph">DELETE</code> statement. Impala is intended for data warehouse-style operations + where you do bulk moves and transforms of large quantities of data. Instead of using + <code class="ph codeph">DELETE</code>, use <code class="ph codeph">INSERT OVERWRITE</code> to entirely replace the contents of a + table or partition, or use <code class="ph codeph">INSERT ... SELECT</code> to copy a subset of data (everything but + the rows you intended to delete) from one table to another. See <a class="xref" href="impala_dml.html#dml">DML Statements</a> for + an overview of Impala DML statements. + </p> + </li> + + <li class="li"> + <p class="p"> + Impala has no <code class="ph codeph">UPDATE</code> statement. Impala is intended for data warehouse-style operations + where you do bulk moves and transforms of large quantities of data. Instead of using + <code class="ph codeph">UPDATE</code>, do all necessary transformations early in the ETL process, such as in the job + that generates the original data, or when copying from one table to another to convert to a particular + file format or partitioning scheme. See <a class="xref" href="impala_dml.html#dml">DML Statements</a> for an overview of Impala DML + statements. + </p> + </li> + + <li class="li"> + <p class="p"> + Impala has no transactional statements, such as <code class="ph codeph">COMMIT</code> or <code class="ph codeph">ROLLBACK</code>. + Impala effectively works like the <code class="ph codeph">AUTOCOMMIT</code> mode in some database systems, where + changes take effect as soon as they are made. + </p> + </li> + + <li class="li"> + <p class="p"> + If your database, table, column, or other names conflict with Impala reserved words, use different + names or quote the names with backticks. See <a class="xref" href="impala_reserved_words.html#reserved_words">Impala Reserved Words</a> + for the current list of Impala reserved words. + </p> + <p class="p"> + Conversely, if you use a keyword that Impala does not recognize, it might be interpreted as a table or + column alias. For example, in <code class="ph codeph">SELECT * FROM t1 NATURAL JOIN t2</code>, Impala does not + recognize the <code class="ph codeph">NATURAL</code> keyword and interprets it as an alias for the table + <code class="ph codeph">t1</code>. If you experience any unexpected behavior with queries, check the list of reserved + words to make sure all keywords in join and <code class="ph codeph">WHERE</code> clauses are recognized. + </p> + </li> + + <li class="li"> + <p class="p"> + Impala supports subqueries only in the <code class="ph codeph">FROM</code> clause of a query, not within the + <code class="ph codeph">WHERE</code> clauses. Therefore, you cannot use clauses such as <code class="ph codeph">WHERE + <var class="keyword varname">column</var> IN (<var class="keyword varname">subquery</var>)</code>. Also, Impala does not allow + <code class="ph codeph">EXISTS</code> or <code class="ph codeph">NOT EXISTS</code> clauses (although <code class="ph codeph">EXISTS</code> is a + reserved keyword). + </p> + </li> + + <li class="li"> + <p class="p"> + Impala supports <code class="ph codeph">UNION</code> and <code class="ph codeph">UNION ALL</code> set operators, but not + <code class="ph codeph">INTERSECT</code>. <span class="ph">Prefer <code class="ph codeph">UNION ALL</code> over <code class="ph codeph">UNION</code> when you know the + data sets are disjoint or duplicate values are not a problem; <code class="ph codeph">UNION ALL</code> is more efficient + because it avoids materializing and sorting the entire result set to eliminate duplicate values.</span> + </p> + </li> + + <li class="li"> + <p class="p"> + Within queries, Impala requires query aliases for any subqueries: + </p> +<pre class="pre codeblock"><code>-- Without the alias 'contents_of_t1' at the end, query gives syntax error. +select count(*) from (select * from t1) contents_of_t1;</code></pre> + </li> + + <li class="li"> + <p class="p"> + When an alias is declared for an expression in a query, that alias cannot be referenced again within + the same query block: + </p> +<pre class="pre codeblock"><code>-- Can't reference AVERAGE twice in the SELECT list where it's defined. +select avg(x) as average, average+1 from t1 group by x; +ERROR: AnalysisException: couldn't resolve column reference: 'average' + +-- Although it can be referenced again later in the same query. +select avg(x) as average from t1 group by x having average > 3;</code></pre> + <p class="p"> + For Impala, either repeat the expression again, or abstract the expression into a <code class="ph codeph">WITH</code> + clause, creating named columns that can be referenced multiple times anywhere in the base query: + </p> +<pre class="pre codeblock"><code>-- The following 2 query forms are equivalent. +select avg(x) as average, avg(x)+1 from t1 group by x; +with avg_t as (select avg(x) average from t1 group by x) select average, average+1 from avg_t;</code></pre> + + </li> + + <li class="li"> + <p class="p"> + Impala does not support certain rarely used join types that are less appropriate for high-volume tables + used for data warehousing. In some cases, Impala supports join types but requires explicit syntax to + ensure you do not do inefficient joins of huge tables by accident. For example, Impala does not support + natural joins or anti-joins, and requires the <code class="ph codeph">CROSS JOIN</code> operator for Cartesian + products. See <a class="xref" href="impala_joins.html#joins">Joins in Impala SELECT Statements</a> for details on the syntax for Impala join clauses. + </p> + </li> + + <li class="li"> + <p class="p"> + Impala has a limited choice of partitioning types. Partitions are defined based on each distinct + combination of values for one or more partition key columns. Impala does not redistribute or check data + to create evenly distributed partitions; you must choose partition key columns based on your knowledge + of the data volume and distribution. Adapt any tables that use range, list, hash, or key partitioning + to use the Impala partition syntax for <code class="ph codeph">CREATE TABLE</code> and <code class="ph codeph">ALTER TABLE</code> + statements. Impala partitioning is similar to range partitioning where every range has exactly one + value, or key partitioning where the hash function produces a separate bucket for every combination of + key values. See <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a> for usage details, and + <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a> and + <a class="xref" href="impala_alter_table.html#alter_table">ALTER TABLE Statement</a> for syntax. + </p> + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + Because the number of separate partitions is potentially higher than in other database systems, keep a + close eye on the number of partitions and the volume of data in each one; scale back the number of + partition key columns if you end up with too many partitions with a small volume of data in each one. + Remember, to distribute work for a query across a cluster, you need at least one HDFS block per node. + HDFS blocks are typically multiple megabytes, <span class="ph">especially</span> for Parquet + files. Therefore, if each partition holds only a few megabytes of data, you are unlikely to see much + parallelism in the query because such a small amount of data is typically processed by a single node. + </div> + </li> + + <li class="li"> + <p class="p"> + For <span class="q">"top-N"</span> queries, Impala uses the <code class="ph codeph">LIMIT</code> clause rather than comparing against a + pseudocolumn named <code class="ph codeph">ROWNUM</code> or <code class="ph codeph">ROW_NUM</code>. See + <a class="xref" href="impala_limit.html#limit">LIMIT Clause</a> for details. + </p> + </li> + </ul> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="porting__porting_antipatterns"> + + <h2 class="title topictitle2" id="ariaid-title5">SQL Constructs to Doublecheck</h2> + + <div class="body conbody"> + + <p class="p"> + Some SQL constructs that are supported have behavior or defaults more oriented towards convenience than + optimal performance. Also, sometimes machine-generated SQL, perhaps issued through JDBC or ODBC + applications, might have inefficiencies or exceed internal Impala limits. As you port SQL code, be alert + and change these things where appropriate: + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + A <code class="ph codeph">CREATE TABLE</code> statement with no <code class="ph codeph">STORED AS</code> clause creates data files + in plain text format, which is convenient for data interchange but not a good choice for high-volume + data with high-performance queries. See <a class="xref" href="impala_file_formats.html#file_formats">How Impala Works with Hadoop File Formats</a> for why and + how to use specific file formats for compact data and high-performance queries. Especially see + <a class="xref" href="impala_parquet.html#parquet">Using the Parquet File Format with Impala Tables</a>, for details about the file format most heavily optimized for + large-scale data warehouse queries. + </p> + </li> + + <li class="li"> + <p class="p"> + A <code class="ph codeph">CREATE TABLE</code> statement with no <code class="ph codeph">PARTITIONED BY</code> clause stores all the + data files in the same physical location, which can lead to scalability problems when the data volume + becomes large. + </p> + <p class="p"> + On the other hand, adapting tables that were already partitioned in a different database system could + produce an Impala table with a high number of partitions and not enough data in each one, leading to + underutilization of Impala's parallel query features. + </p> + <p class="p"> + See <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a> for details about setting up partitioning and + tuning the performance of queries on partitioned tables. + </p> + </li> + + <li class="li"> + <p class="p"> + The <code class="ph codeph">INSERT ... VALUES</code> syntax is suitable for setting up toy tables with a few rows for + functional testing, but because each such statement creates a separate tiny file in HDFS, it is not a + scalable technique for loading megabytes or gigabytes (let alone petabytes) of data. Consider revising + your data load process to produce raw data files outside of Impala, then setting up Impala external + tables or using the <code class="ph codeph">LOAD DATA</code> statement to use those data files instantly in Impala + tables, with no conversion or indexing stage. See <a class="xref" href="impala_tables.html#external_tables">External Tables</a> and + <a class="xref" href="impala_load_data.html#load_data">LOAD DATA Statement</a> for details about the Impala techniques for working with + data files produced outside of Impala; see <a class="xref" href="impala_tutorial.html#tutorial_etl">Data Loading and Querying Examples</a> for examples + of ETL workflow for Impala. + </p> + </li> + + <li class="li"> + <p class="p"> + If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data + files, or a single giant data file that cannot take advantage of distributed parallel queries or + partitioning. In this case, use an <code class="ph codeph">INSERT ... SELECT</code> statement to copy the data into a + new table and reorganize into a more efficient layout in the same operation. See + <a class="xref" href="impala_insert.html#insert">INSERT Statement</a> for details about the <code class="ph codeph">INSERT</code> statement. + </p> + <p class="p"> + You can do <code class="ph codeph">INSERT ... SELECT</code> into a table with a more efficient file format (see + <a class="xref" href="impala_file_formats.html#file_formats">How Impala Works with Hadoop File Formats</a>) or from an unpartitioned table into a partitioned + one (see <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a>). + </p> + </li> + + <li class="li"> + <p class="p"> + The number of expressions allowed in an Impala query might be smaller than for some other database + systems, causing failures for very complicated queries (typically produced by automated SQL + generators). Where practical, keep the number of expressions in the <code class="ph codeph">WHERE</code> clauses to + approximately 2000 or fewer. As a workaround, set the query option + <code class="ph codeph">DISABLE_CODEGEN=true</code> if queries fail for this reason. See + <a class="xref" href="impala_disable_codegen.html#disable_codegen">DISABLE_CODEGEN Query Option</a> for details. + </p> + </li> + + <li class="li"> + <p class="p"> + If practical, rewrite <code class="ph codeph">UNION</code> queries to use the <code class="ph codeph">UNION ALL</code> operator + instead. <span class="ph">Prefer <code class="ph codeph">UNION ALL</code> over <code class="ph codeph">UNION</code> when you know the + data sets are disjoint or duplicate values are not a problem; <code class="ph codeph">UNION ALL</code> is more efficient + because it avoids materializing and sorting the entire result set to eliminate duplicate values.</span> + </p> + </li> + </ul> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="porting__porting_next"> + + <h2 class="title topictitle2" id="ariaid-title6">Next Porting Steps after Verifying Syntax and Semantics</h2> + + <div class="body conbody"> + + <p class="p"> + Throughout this section, some of the decisions you make during the porting process also have a substantial + impact on performance. After your SQL code is ported and working correctly, doublecheck the + performance-related aspects of your schema design, physical layout, and queries to make sure that the + ported application is taking full advantage of Impala's parallelism, performance-related SQL features, and + integration with Hadoop components. + </p> + + <ul class="ul"> + <li class="li"> + Have you run the <code class="ph codeph">COMPUTE STATS</code> statement on each table involved in join queries? Have + you also run <code class="ph codeph">COMPUTE STATS</code> for each table used as the source table in an <code class="ph codeph">INSERT + ... SELECT</code> or <code class="ph codeph">CREATE TABLE AS SELECT</code> statement? + </li> + + <li class="li"> + Are you using the most efficient file format for your data volumes, table structure, and query + characteristics? + </li> + + <li class="li"> + Are you using partitioning effectively? That is, have you partitioned on columns that are often used for + filtering in <code class="ph codeph">WHERE</code> clauses? Have you partitioned at the right granularity so that there + is enough data in each partition to parallelize the work for each query? + </li> + + <li class="li"> + Does your ETL process produce a relatively small number of multi-megabyte data files (good) rather than a + huge number of small files (bad)? + </li> + </ul> + + <p class="p"> + See <a class="xref" href="impala_performance.html#performance">Tuning Impala for Performance</a> for details about the whole performance tuning + process. + </p> + </div> + </article> +</article></main></body></html> \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_ports.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_ports.html b/docs/build/html/topics/impala_ports.html new file mode 100644 index 0000000..7becb40 --- /dev/null +++ b/docs/build/html/topics/impala_ports.html @@ -0,0 +1,421 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="ports"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Ports Used by Impala</title></head><body id="ports"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Ports Used by Impala</h1> + + + <div class="body conbody" id="ports__conbody_ports"> + + <p class="p"> + + Impala uses the TCP ports listed in the following table. Before deploying Impala, ensure these ports are open + on each system. + </p> + + <table class="table"><caption></caption><colgroup><col style="width:18.181818181818183%"><col style="width:27.27272727272727%"><col style="width:9.090909090909092%"><col style="width:18.181818181818183%"><col style="width:27.27272727272727%"></colgroup><thead class="thead"> + <tr class="row"> + <th class="entry nocellnorowborder" id="ports__conbody_ports__entry__1"> + Component + </th> + <th class="entry nocellnorowborder" id="ports__conbody_ports__entry__2"> + Service + </th> + <th class="entry nocellnorowborder" id="ports__conbody_ports__entry__3"> + Port + </th> + <th class="entry nocellnorowborder" id="ports__conbody_ports__entry__4"> + Access Requirement + </th> + <th class="entry nocellnorowborder" id="ports__conbody_ports__entry__5"> + Comment + </th> + </tr> + </thead><tbody class="tbody"> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Impala Daemon + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + Impala Daemon Frontend Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 21000 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + External + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + Used to transmit commands and receive results by <code class="ph codeph">impala-shell</code> and + some ODBC drivers. + </p> + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Impala Daemon + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + Impala Daemon Frontend Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 21050 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + External + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + Used to transmit commands and receive results by applications, such as Business Intelligence tools, + using JDBC, the Beeswax query editor in Hue, and some ODBC drivers. + </p> + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Impala Daemon + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + Impala Daemon Backend Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 22000 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + Internal + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + Internal use only. Impala daemons use this port to communicate with each other. + </p> + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Impala Daemon + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + StateStoreSubscriber Service Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 23000 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + Internal + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + Internal use only. Impala daemons listen on this port for updates from the statestore daemon. + </p> + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Catalog Daemon + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + StateStoreSubscriber Service Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 23020 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + Internal + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + Internal use only. The catalog daemon listens on this port for updates from the statestore daemon. + </p> + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Impala Daemon + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + Impala Daemon HTTP Server Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 25000 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + External + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + Impala web interface for administrators to monitor and troubleshoot. + </p> + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Impala StateStore Daemon + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + StateStore HTTP Server Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 25010 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + External + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + StateStore web interface for administrators to monitor and troubleshoot. + </p> + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Impala Catalog Daemon + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + Catalog HTTP Server Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 25020 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + External + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + Catalog service web interface for administrators to monitor and troubleshoot. New in Impala 1.2 and + higher. + </p> + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Impala StateStore Daemon + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + StateStore Service Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 24000 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + Internal + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + Internal use only. The statestore daemon listens on this port for registration/unregistration + requests. + </p> + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Impala Catalog Daemon + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + StateStore Service Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 26000 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + Internal + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + Internal use only. The catalog service uses this port to communicate with the Impala daemons. New + in Impala 1.2 and higher. + </p> + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Impala Daemon + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + Llama Callback Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 28000 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + Internal + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + Internal use only. Impala daemons use to communicate with Llama. New in <span class="keyword">Impala 1.3</span> and higher. + </p> + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Impala Llama ApplicationMaster + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + Llama Thrift Admin Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 15002 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + Internal + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + Internal use only. New in <span class="keyword">Impala 1.3</span> and higher. + </p> + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Impala Llama ApplicationMaster + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + Llama Thrift Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 15000 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + Internal + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + Internal use only. New in <span class="keyword">Impala 1.3</span> and higher. + </p> + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__1 "> + <p class="p"> + Impala Llama ApplicationMaster + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__2 "> + <p class="p"> + Llama HTTP Port + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__3 "> + <p class="p"> + 15001 + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__4 "> + <p class="p"> + External + </p> + </td> + <td class="entry nocellnorowborder" headers="ports__conbody_ports__entry__5 "> + <p class="p"> + Llama service web interface for administrators to monitor and troubleshoot. + New in <span class="keyword">Impala 1.3</span> and higher. + </p> + </td> + </tr> + </tbody></table> + </div> +</article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_prefetch_mode.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_prefetch_mode.html b/docs/build/html/topics/impala_prefetch_mode.html new file mode 100644 index 0000000..ea13792 --- /dev/null +++ b/docs/build/html/topics/impala_prefetch_mode.html @@ -0,0 +1,47 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="prefetch_mode"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PREFETCH_MODE Query Option (Impala 2.6 or higher only)</title></head><body id="prefetch_mode"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">PREFETCH_MODE Query Option (<span class="keyword">Impala 2.6</span> or higher only)</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Determines whether the prefetching optimization is applied during + join query processing. + </p> + + <p class="p"> + <strong class="ph b">Type:</strong> numeric (0, 1) + or corresponding mnemonic strings (<code class="ph codeph">NONE</code>, <code class="ph codeph">HT_BUCKET</code>). + </p> + + <p class="p"> + <strong class="ph b">Default:</strong> 1 (equivalent to <code class="ph codeph">HT_BUCKET</code>) + </p> + + <p class="p"> + <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.6.0</span> + </p> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + <p class="p"> + The default mode is 1, which means that hash table buckets are + prefetched during join query processing. + </p> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + <p class="p"> + <a class="xref" href="impala_joins.html#joins">Joins in Impala SELECT Statements</a>, + <a class="xref" href="impala_perf_joins.html#perf_joins">Performance Considerations for Join Queries</a>. + </p> + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_prereqs.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_prereqs.html b/docs/build/html/topics/impala_prereqs.html new file mode 100644 index 0000000..e378343 --- /dev/null +++ b/docs/build/html/topics/impala_prereqs.html @@ -0,0 +1,275 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_planning.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="prereqs"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Impala Requirements</title></head><body id="prereqs"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Impala Requirements</h1> + + + + + <div class="body conbody"> + + <p class="p"> + + + To perform as expected, Impala depends on the availability of the software, hardware, and configurations + described in the following sections. + </p> + + <p class="p toc inpage"></p> + </div> + + <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_planning.html">Planning for Impala Deployment</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="prereqs__prereqs_os"> + + <h2 class="title topictitle2" id="ariaid-title2">Supported Operating Systems</h2> + + <div class="body conbody"> + + <p class="p"> + + + + + + + + + Apache Impala runs on Linux systems only. See the <span class="ph filepath">README.md</span> + file for more information. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="prereqs__prereqs_hive"> + + <h2 class="title topictitle2" id="ariaid-title3">Hive Metastore and Related Configuration</h2> + + + <div class="body conbody"> + + <p class="p"> + + + + Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking + metadata about schema objects such as tables and columns. The following components are prerequisites for + Impala: + </p> + + <ul class="ul"> + <li class="li"> + MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive. + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + <p class="p"> + Installing and configuring a Hive metastore is an Impala requirement. Impala does not work without + the metastore database. For the process of installing and configuring the metastore, see + <a class="xref" href="impala_install.html#install">Installing Impala</a>. + </p> + + <p class="p"> + Always configure a <strong class="ph b">Hive metastore service</strong> rather than connecting directly to the metastore + database. The Hive metastore service is required to interoperate between different levels of + metastore APIs if this is necessary for your environment, and using it avoids known issues with + connecting directly to the metastore database. + </p> + + <p class="p"> + A summary of the metastore installation process is as follows: + </p> + <ul class="ul"> + <li class="li"> + Install a MySQL or PostgreSQL database. Start the database if it is not started after installation. + </li> + + <li class="li"> + Download the + <a class="xref" href="http://www.mysql.com/products/connector/" target="_blank">MySQL + connector</a> or the + <a class="xref" href="http://jdbc.postgresql.org/download.html" target="_blank">PostgreSQL + connector</a> and place it in the <code class="ph codeph">/usr/share/java/</code> directory. + </li> + + <li class="li"> + Use the appropriate command line tool for your database to create the metastore database. + </li> + + <li class="li"> + Use the appropriate command line tool for your database to grant privileges for the metastore + database to the <code class="ph codeph">hive</code> user. + </li> + + <li class="li"> + Modify <code class="ph codeph">hive-site.xml</code> to include information matching your particular database: its + URL, username, and password. You will copy the <code class="ph codeph">hive-site.xml</code> file to the Impala + Configuration Directory later in the Impala installation process. + </li> + </ul> + </div> + </li> + + <li class="li"> + <strong class="ph b">Optional:</strong> Hive. Although only the Hive metastore database is required for Impala to function, you + might install Hive on some client machines to create and load data into tables that use certain file + formats. See <a class="xref" href="impala_file_formats.html#file_formats">How Impala Works with Hadoop File Formats</a> for details. Hive does not need to be + installed on the same DataNodes as Impala; it just needs access to the same metastore database. + </li> + </ul> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="prereqs__prereqs_java"> + + <h2 class="title topictitle2" id="ariaid-title4">Java Dependencies</h2> + + + <div class="body conbody"> + + <p class="p"> + + + Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop + components: + </p> + + <ul class="ul"> + <li class="li"> + The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause issues, typically + resulting in a failure at <span class="keyword cmdname">impalad</span> startup. In particular, the JamVM used by default on + certain levels of Ubuntu systems can cause <span class="keyword cmdname">impalad</span> to fail to start. + </li> + + <li class="li"> + Internally, the <span class="keyword cmdname">impalad</span> daemon relies on the <code class="ph codeph">JAVA_HOME</code> environment + variable to locate the system Java libraries. Make sure the <span class="keyword cmdname">impalad</span> service is not run + from an environment with an incorrect setting for this variable. + </li> + + <li class="li"> + All Java dependencies are packaged in the <code class="ph codeph">impala-dependencies.jar</code> file, which is located + at <code class="ph codeph">/usr/lib/impala/lib/</code>. These map to everything that is built under + <code class="ph codeph">fe/target/dependency</code>. + </li> + </ul> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="prereqs__prereqs_network"> + + <h2 class="title topictitle2" id="ariaid-title5">Networking Configuration Requirements</h2> + + + <div class="body conbody"> + + <p class="p"> + + As part of ensuring best performance, Impala attempts to complete tasks on local data, as opposed to using + network connections to work with remote data. To support this goal, Impala matches + the <strong class="ph b">hostname</strong> provided to each Impala daemon with the <strong class="ph b">IP address</strong> of each DataNode by + resolving the hostname flag to an IP address. For Impala to work with local data, use a single IP interface + for the DataNode and the Impala daemon on each machine. Ensure that the Impala daemon's hostname flag + resolves to the IP address of the DataNode. For single-homed machines, this is usually automatic, but for + multi-homed machines, ensure that the Impala daemon's hostname resolves to the correct interface. Impala + tries to detect the correct hostname at start-up, and prints the derived hostname at the start of the log + in a message of the form: + </p> + +<pre class="pre codeblock"><code>Using hostname: impala-daemon-1.example.com</code></pre> + + <p class="p"> + In the majority of cases, this automatic detection works correctly. If you need to explicitly set the + hostname, do so by setting the <code class="ph codeph">--hostname</code> flag. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="prereqs__prereqs_hardware"> + + <h2 class="title topictitle2" id="ariaid-title6">Hardware Requirements</h2> + + <div class="body conbody"> + + <p class="p"> + + + + + + + + + During join operations, portions of data from each joined table are loaded into memory. Data sets can be + very large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate + completing. + </p> + + <p class="p"> + While requirements vary according to data set size, the following is generally recommended: + </p> + + <ul class="ul"> + <li class="li"> + CPU - Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in newer processors. + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + This required level of processor is the same as in Impala version 1.x. The Impala 2.0 and 2.1 releases + had a stricter requirement for the SSE4.1 instruction set, which has now been relaxed. + </div> + + </li> + + <li class="li"> + Memory - 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query + processing on a particular node exceed the amount of memory available to Impala on that node, the query + writes temporary work data to disk, which can lead to long query times. Note that because the work is + parallelized, and intermediate results for aggregate queries are typically smaller than the original + data, Impala can query and join tables that are much larger than the memory available on an individual + node. + </li> + + <li class="li"> + Storage - DataNodes with 12 or more disks each. I/O speeds are often the limiting factor for disk + performance with Impala. Ensure that you have sufficient disk space to store the data Impala will be + querying. + </li> + </ul> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title7" id="prereqs__prereqs_account"> + + <h2 class="title topictitle2" id="ariaid-title7">User Account Requirements</h2> + + + <div class="body conbody"> + + <p class="p"> + + + + Impala creates and uses a user and group named <code class="ph codeph">impala</code>. Do not delete this account or group + and do not modify the account's or group's permissions and rights. Ensure no existing systems obstruct the + functioning of these accounts and groups. For example, if you have scripts that delete user accounts not in + a white-list, add these accounts to the list of permitted accounts. + </p> + + <p class="p"> + For correct file deletion during <code class="ph codeph">DROP TABLE</code> operations, Impala must be able to move files + to the HDFS trashcan. You might need to create an HDFS directory <span class="ph filepath">/user/impala</span>, + writeable by the <code class="ph codeph">impala</code> user, so that the trashcan can be created. Otherwise, data files + might remain behind after a <code class="ph codeph">DROP TABLE</code> statement. + </p> + + <p class="p"> + Impala should not run as root. Best Impala performance is achieved using direct reads, but root is not + permitted to use direct reads. Therefore, running Impala as root negatively affects performance. + </p> + + <p class="p"> + By default, any user can connect to Impala and access all the associated databases and tables. You can + enable authorization and authentication based on the Linux OS user who connects to the Impala server, and + the associated groups for that user. <a class="xref" href="impala_security.html#security">Impala Security</a> for details. These + security features do not change the underlying file permission requirements; the <code class="ph codeph">impala</code> + user still needs to be able to access the data files. + </p> + </div> + </article> +</article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_processes.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_processes.html b/docs/build/html/topics/impala_processes.html new file mode 100644 index 0000000..60bc3c4 --- /dev/null +++ b/docs/build/html/topics/impala_processes.html @@ -0,0 +1,115 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_config_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="processes"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Starting Impala</title></head><body id="processes"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Starting Impala</h1> + + + <div class="body conbody"> + + <p class="p"> + + + + To activate Impala if it is installed but not yet started: + </p> + + <ol class="ol"> + <li class="li"> + Set any necessary configuration options for the Impala services. See + <a class="xref" href="impala_config_options.html#config_options">Modifying Impala Startup Options</a> for details. + </li> + + <li class="li"> + Start one instance of the Impala statestore. The statestore helps Impala to distribute work efficiently, + and to continue running in the event of availability problems for other Impala nodes. If the statestore + becomes unavailable, Impala continues to function. + </li> + + <li class="li"> + Start one instance of the Impala catalog service. + </li> + + <li class="li"> + Start the main Impala service on one or more DataNodes, ideally on all DataNodes to maximize local + processing and avoid network traffic due to remote reads. + </li> + </ol> + + <p class="p"> + Once Impala is running, you can conduct interactive experiments using the instructions in + <a class="xref" href="impala_tutorial.html#tutorial">Impala Tutorials</a> and try <a class="xref" href="impala_impala_shell.html#impala_shell">Using the Impala Shell (impala-shell Command)</a>. + </p> + + <p class="p toc inpage"></p> + </div> + + <nav role="navigation" class="related-links"><ul class="ullinks"><li class="link ulchildlink"><strong><a href="../topics/impala_config_options.html">Modifying Impala Startup Options</a></strong><br></li></ul></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="processes__starting_via_cmdline"> + + <h2 class="title topictitle2" id="ariaid-title2">Starting Impala from the Command Line</h2> + + <div class="body conbody"> + + <p class="p"> + To start the Impala state store and Impala from the command line or a script, you can either use the + <span class="keyword cmdname">service</span> command or you can start the daemons directly through the + <span class="keyword cmdname">impalad</span>, <code class="ph codeph">statestored</code>, and <span class="keyword cmdname">catalogd</span> executables. + </p> + + <p class="p"> + Start the Impala statestore and then start <code class="ph codeph">impalad</code> instances. You can modify the values + the service initialization scripts use when starting the statestore and Impala by editing + <code class="ph codeph">/etc/default/impala</code>. + </p> + + <p class="p"> + Start the statestore service using a command similar to the following: + </p> + + <div class="p"> +<pre class="pre codeblock"><code>$ sudo service impala-state-store start</code></pre> + </div> + + <p class="p"> + Start the catalog service using a command similar to the following: + </p> + +<pre class="pre codeblock"><code>$ sudo service impala-catalog start</code></pre> + + <p class="p"> + Start the Impala service on each DataNode using a command similar to the following: + </p> + + <div class="p"> +<pre class="pre codeblock"><code>$ sudo service impala-server start</code></pre> + </div> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + <p class="p"> + In <span class="keyword">Impala 2.5</span> and higher, Impala UDFs and UDAs written in C++ are persisted in the metastore database. + Java UDFs are also persisted, if they were created with the new <code class="ph codeph">CREATE FUNCTION</code> syntax for Java UDFs, + where the Java function argument and return types are omitted. + Java-based UDFs created with the old <code class="ph codeph">CREATE FUNCTION</code> syntax do not persist across restarts + because they are held in the memory of the <span class="keyword cmdname">catalogd</span> daemon. + Until you re-create such Java UDFs using the new <code class="ph codeph">CREATE FUNCTION</code> syntax, + you must reload those Java-based UDFs by running the original <code class="ph codeph">CREATE FUNCTION</code> statements again each time + you restart the <span class="keyword cmdname">catalogd</span> daemon. + Prior to <span class="keyword">Impala 2.5</span> the requirement to reload functions after a restart applied to both C++ and Java functions. + </p> + </div> + + <div class="p"> + If any of the services fail to start, review: + <ul class="ul"> + <li class="li"> + <a class="xref" href="impala_logging.html#logs_debug">Reviewing Impala Logs</a> + </li> + + <li class="li"> + <a class="xref" href="impala_troubleshooting.html#troubleshooting">Troubleshooting Impala</a> + </li> + </ul> + </div> + </div> + </article> +</article></main></body></html> \ No newline at end of file
