http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_porting.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_porting.xml b/docs/topics/impala_porting.xml new file mode 100644 index 0000000..3800713 --- /dev/null +++ b/docs/topics/impala_porting.xml @@ -0,0 +1,623 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="porting"> + + <title>Porting SQL from Other Database Systems to Impala</title> + <titlealts audience="PDF"><navtitle>Porting SQL</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Databases"/> + <data name="Category" value="Hive"/> + <data name="Category" value="Oracle"/> + <data name="Category" value="MySQL"/> + <data name="Category" value="PostgreSQL"/> + <data name="Category" value="Troubleshooting"/> + <data name="Category" value="Porting"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">porting</indexterm> + Although Impala uses standard SQL for queries, you might need to modify SQL source when bringing applications + to Impala, due to variations in data types, built-in functions, vendor language extensions, and + Hadoop-specific syntax. Even when SQL is working correctly, you might make further minor modifications for + best performance. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="porting_ddl_dml"> + + <title>Porting DDL and DML Statements</title> + + <conbody> + + <p> + When adapting SQL code from a traditional database system to Impala, expect to find a number of differences + in the DDL statements that you use to set up the schema. Clauses related to physical layout of files, + tablespaces, and indexes have no equivalent in Impala. You might restructure your schema considerably to + account for the Impala partitioning scheme and Hadoop file formats. + </p> + + <p> + Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor + extensions and features not yet supported in Impala, you might be able to run identical or almost-identical + query text on both systems. + </p> + + <p> + Therefore, consider separating out the DDL into a separate Impala-specific setup script. Focus your reuse + and ongoing tuning efforts on the code for SQL queries. + </p> + </conbody> + </concept> + + <concept id="porting_data_types"> + + <title>Porting Data Types from Other Database Systems</title> + + <conbody> + + <ul> + <li> + <p> + Change any <codeph>VARCHAR</codeph>, <codeph>VARCHAR2</codeph>, and <codeph>CHAR</codeph> columns to + <codeph>STRING</codeph>. Remove any length constraints from the column declarations; for example, + change <codeph>VARCHAR(32)</codeph> or <codeph>CHAR(1)</codeph> to <codeph>STRING</codeph>. Impala is + very flexible about the length of string values; it does not impose any length constraints + or do any special processing (such as blank-padding) for <codeph>STRING</codeph> columns. + (In Impala 2.0 and higher, there are data types <codeph>VARCHAR</codeph> and <codeph>CHAR</codeph>, + with length constraints for both types and blank-padding for <codeph>CHAR</codeph>. + However, for performance reasons, it is still preferable to use <codeph>STRING</codeph> + columns where practical.) 
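+ A before-and-after sketch, using a hypothetical <codeph>customers</codeph> table:
+ </p>
+<codeblock>-- DDL from another database system (hypothetical):
+--   CREATE TABLE customers (id NUMBER(10), name VARCHAR2(64), status CHAR(1));
+-- Ported to Impala:
+CREATE TABLE customers (id BIGINT, name STRING, status STRING);</codeblock>
+ <p>
+ The ported columns accept strings of any length, with no blank-padding or length enforcement.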
+ </p> + </li> + + <li> + <p> + For national language character types such as <codeph>NCHAR</codeph>, <codeph>NVARCHAR</codeph>, or + <codeph>NCLOB</codeph>, be aware that while Impala can store and query UTF-8 character data, currently + some string manipulation operations only work correctly with ASCII data. See + <xref href="impala_string.xml#string"/> for details. + </p> + </li> + + <li> + <p> + Change any <codeph>DATE</codeph>, <codeph>DATETIME</codeph>, or <codeph>TIME</codeph> columns to + <codeph>TIMESTAMP</codeph>. Remove any precision constraints. Remove any timezone clauses, and make + sure your application logic or ETL process accounts for the fact that Impala expects all + <codeph>TIMESTAMP</codeph> values to be in + <xref href="http://en.wikipedia.org/wiki/Coordinated_Universal_Time" scope="external" format="html">Coordinated + Universal Time (UTC)</xref>. See <xref href="impala_timestamp.xml#timestamp"/> for information about + the <codeph>TIMESTAMP</codeph> data type, and + <xref href="impala_datetime_functions.xml#datetime_functions"/> for conversion functions for different + date and time formats. + </p> + <p> + You might also need to adapt date- and time-related literal values and format strings to use the + supported Impala date and time formats. If you have date and time literals with different separators or + different numbers of <codeph>YY</codeph>, <codeph>MM</codeph>, and so on placeholders than Impala + expects, consider using calls to <codeph>regexp_replace()</codeph> to transform those values to the + Impala-compatible format. See <xref href="impala_timestamp.xml#timestamp"/> for information about the + allowed formats for date and time literals, and + <xref href="impala_string_functions.xml#string_functions"/> for string conversion functions such as + <codeph>regexp_replace()</codeph>. + </p> + <p> + Instead of <codeph>SYSDATE</codeph>, call the function <codeph>NOW()</codeph>. + </p> + <p> + Instead of adding or subtracting directly from a date value to produce a value <varname>N</varname> + days in the past or future, use an <codeph>INTERVAL</codeph> expression, for example <codeph>NOW() + + INTERVAL 30 DAYS</codeph>. + </p> + </li> + + <li> + <p> + Although Impala supports <codeph>INTERVAL</codeph> expressions for datetime arithmetic, as shown in + <xref href="impala_timestamp.xml#timestamp"/>, <codeph>INTERVAL</codeph> is not available as a column + data type in Impala. For any <codeph>INTERVAL</codeph> values stored in tables, convert them to numeric + values that you can add or subtract using the functions in + <xref href="impala_datetime_functions.xml#datetime_functions"/>. For example, if you had a table + <codeph>DEADLINES</codeph> with an <codeph>INT</codeph> column <codeph>TIME_PERIOD</codeph>, you could + construct dates N days in the future like so: + </p> +<codeblock>SELECT NOW() + INTERVAL time_period DAYS from deadlines;</codeblock> + </li> + + <li> + <p> + For <codeph>YEAR</codeph> columns, change to the smallest Impala integer type that has sufficient + range. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on + for the various numeric data types. + </p> + </li> + + <li> + <p> + Change any <codeph>DECIMAL</codeph> and <codeph>NUMBER</codeph> types. If fixed-point precision is not + required, you can use <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> on the Impala side depending on + the range of values. 
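+ Keep in mind that <codeph>DOUBLE</codeph> is an approximate binary type. A quick check such as the
+ following sketch shows the kind of rounding to expect:
+ </p>
+<codeblock>-- Binary floating-point cannot represent 0.1 or 0.2 exactly,
+-- so the result is close to, but not exactly, 0.3.
+SELECT CAST(0.1 AS DOUBLE) + CAST(0.2 AS DOUBLE);</codeblock>
+ <p>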
For applications that require precise decimal values, such as financial data, you + might need to make more extensive changes to table structure and application logic, such as using + separate integer columns for dollars and cents, or encoding numbers as string values and writing UDFs + to manipulate them. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, + casting, and so on for the various numeric data types. + </p> + </li> + + <li> + <p> + <codeph>FLOAT</codeph>, <codeph>DOUBLE</codeph>, and <codeph>REAL</codeph> types are supported in + Impala. Remove any precision and scale specifications. (In Impala, <codeph>REAL</codeph> is just an + alias for <codeph>DOUBLE</codeph>; columns declared as <codeph>REAL</codeph> are turned into + <codeph>DOUBLE</codeph> behind the scenes.) See <xref href="impala_datatypes.xml#datatypes"/> for + details about ranges, casting, and so on for the various numeric data types. + </p> + </li> + + <li> + <p> + Most integer types from other systems have equivalents in Impala, perhaps under different names such as + <codeph>BIGINT</codeph> instead of <codeph>INT8</codeph>. For any that are unavailable, for example + <codeph>MEDIUMINT</codeph>, switch to the smallest Impala integer type that has sufficient range. + Remove any precision specifications. See <xref href="impala_datatypes.xml#datatypes"/> for details + about ranges, casting, and so on for the various numeric data types. + </p> + </li> + + <li> + <p> + Remove any <codeph>UNSIGNED</codeph> constraints. All Impala numeric types are signed. See + <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on for the + various numeric data types. + </p> + </li> + + <li> + <p> + For any types holding bitwise values, use an integer type with enough range to hold all the relevant + bits within a positive integer. See <xref href="impala_datatypes.xml#datatypes"/> for details about + ranges, casting, and so on for the various numeric data types. + </p> + <p> + For example, <codeph>TINYINT</codeph> has a maximum positive value of 127, not 255, so to manipulate + 8-bit bitfields as positive numbers, switch to the next larger type, <codeph>SMALLINT</codeph>. + </p> +<codeblock>[localhost:21000] > select cast(127*2 as tinyint); ++--------------------------+ +| cast(127 * 2 as tinyint) | ++--------------------------+ +| -2 | ++--------------------------+ +[localhost:21000] > select cast(128 as tinyint); ++----------------------+ +| cast(128 as tinyint) | ++----------------------+ +| -128 | ++----------------------+ +[localhost:21000] > select cast(127*2 as smallint); ++---------------------------+ +| cast(127 * 2 as smallint) | ++---------------------------+ +| 254 | ++---------------------------+</codeblock> + <p> + Impala does not support notation such as <codeph>b'0101'</codeph> for bit literals. + </p> + </li> + + <li> + <p> + For large object types, use <codeph>STRING</codeph> to represent <codeph>CLOB</codeph> or + <codeph>TEXT</codeph> types (character-based large objects) up to 32 KB in size. Binary large objects + such as <codeph>BLOB</codeph>, <codeph>RAW</codeph>, <codeph>BINARY</codeph>, and + <codeph>VARBINARY</codeph> do not currently have an equivalent in Impala. + </p> + </li> + + <li> + <p> + For Boolean-like types such as <codeph>BOOL</codeph>, use the Impala <codeph>BOOLEAN</codeph> type.
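+ Integer-style flag values can then be converted with a cast; a minimal sketch:
+ </p>
+<codeblock>-- Nonzero integer values cast to true, zero casts to false.
+SELECT CAST(1 AS BOOLEAN), CAST(0 AS BOOLEAN);</codeblock>
+ <p>
+ A <codeph>BOOLEAN</codeph> column can then be tested directly in a <codeph>WHERE</codeph> clause.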
+ </p> + </li> + + <li> + <p> + Because Impala currently does not support composite or nested types, any spatial data types in other + database systems do not have direct equivalents in Impala. You could represent spatial values in string + format and write UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details. Where + practical, separate spatial types into separate tables so that Impala can still work with the + non-spatial data. + </p> + </li> + + <li> + <p> + Take out any <codeph>DEFAULT</codeph> clauses. Impala can use data files produced from many different + sources, such as Pig, Hive, or MapReduce jobs. The fast import mechanisms of <codeph>LOAD DATA</codeph> + and external tables mean that Impala is flexible about the format of data files, and Impala does not + necessarily validate or cleanse data before querying it. When copying data through Impala + <codeph>INSERT</codeph> statements, you can use conditional functions such as <codeph>CASE</codeph> or + <codeph>NVL</codeph> to substitute some other value for <codeph>NULL</codeph> fields; see + <xref href="impala_conditional_functions.xml#conditional_functions"/> for details. + </p> + </li> + + <li> + <p> + Take out any constraints from your <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> + statements, for example <codeph>PRIMARY KEY</codeph>, <codeph>FOREIGN KEY</codeph>, + <codeph>UNIQUE</codeph>, <codeph>NOT NULL</codeph>, <codeph>UNSIGNED</codeph>, or + <codeph>CHECK</codeph> constraints. Impala can use data files produced from many different sources, + such as Pig, Hive, or MapReduce jobs. Therefore, Impala expects initial data validation to happen + earlier during the ETL or ELT cycle. After data is loaded into Impala tables, you can perform queries + to test for <codeph>NULL</codeph> values. When copying data through Impala <codeph>INSERT</codeph> + statements, you can use conditional functions such as <codeph>CASE</codeph> or <codeph>NVL</codeph> to + substitute some other value for <codeph>NULL</codeph> fields; see + <xref href="impala_conditional_functions.xml#conditional_functions"/> for details. + </p> + <p> + Do as much verification as practical before loading data into Impala. After data is loaded into Impala, + you can do further verification using SQL queries to check if values have expected ranges, if values + are <codeph>NULL</codeph> or not, and so on. If there is a problem with the data, you will need to + re-run earlier stages of the ETL process, or do an <codeph>INSERT ... SELECT</codeph> statement in + Impala to copy the faulty data to a new table and transform or filter out the bad values. + </p> + </li> + + <li> + <p> + Take out any <codeph>CREATE INDEX</codeph>, <codeph>DROP INDEX</codeph>, and <codeph>ALTER + INDEX</codeph> statements, and equivalent <codeph>ALTER TABLE</codeph> statements. Remove any + <codeph>INDEX</codeph>, <codeph>KEY</codeph>, or <codeph>PRIMARY KEY</codeph> clauses from + <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. Impala is optimized for bulk + read operations for data warehouse-style queries, and therefore does not support indexes for its + tables. + </p> + </li> + + <li> + <p> + Calls to built-in functions with out-of-range or otherwise incorrect arguments return + <codeph>NULL</codeph> in Impala, as opposed to raising exceptions. (This rule applies even when the + <codeph>ABORT_ON_ERROR=true</codeph> query option is in effect.)
Run small-scale queries using + representative data to doublecheck that calls to built-in functions are returning expected values + rather than <codeph>NULL</codeph>. For example, unsupported <codeph>CAST</codeph> operations do not + raise an error in Impala: + </p> +<codeblock>select cast('foo' as int); ++--------------------+ +| cast('foo' as int) | ++--------------------+ +| NULL | ++--------------------+</codeblock> + </li> + + <li> + <p> + For any other type not supported in Impala, you could represent their values in string format and write + UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details. + </p> + </li> + + <li> + <p> + To detect the presence of unsupported or unconvertable data types in data files, do initial testing + with the <codeph>ABORT_ON_ERROR=true</codeph> query option in effect. This option causes queries to + fail immediately if they encounter disallowed type conversions. See + <xref href="impala_abort_on_error.xml#abort_on_error"/> for details. For example: + </p> +<codeblock>set abort_on_error=true; +select count(*) from (select * from t1); +-- The above query will fail if the data files for T1 contain any +-- values that can't be converted to the expected Impala data types. +-- For example, if T1.C1 is defined as INT but the column contains +-- floating-point values like 1.1, the query will return an error.</codeblock> + </li> + </ul> + </conbody> + </concept> + + <concept id="porting_statements"> + + <title>SQL Statements to Remove or Adapt</title> + + <conbody> + + <p> + Some SQL statements or clauses that you might be familiar with are not currently supported in Impala: + </p> + + <ul> + <li> + <p> + Impala has no <codeph>DELETE</codeph> statement. Impala is intended for data warehouse-style operations + where you do bulk moves and transforms of large quantities of data. Instead of using + <codeph>DELETE</codeph>, use <codeph>INSERT OVERWRITE</codeph> to entirely replace the contents of a + table or partition, or use <codeph>INSERT ... SELECT</codeph> to copy a subset of data (everything but + the rows you intended to delete) from one table to another. See <xref href="impala_dml.xml#dml"/> for + an overview of Impala DML statements. + </p> + </li> + + <li> + <p> + Impala has no <codeph>UPDATE</codeph> statement. Impala is intended for data warehouse-style operations + where you do bulk moves and transforms of large quantities of data. Instead of using + <codeph>UPDATE</codeph>, do all necessary transformations early in the ETL process, such as in the job + that generates the original data, or when copying from one table to another to convert to a particular + file format or partitioning scheme. See <xref href="impala_dml.xml#dml"/> for an overview of Impala DML + statements. + </p> + </li> + + <li> + <p> + Impala has no transactional statements, such as <codeph>COMMIT</codeph> or <codeph>ROLLBACK</codeph>. + Impala effectively works like the <codeph>AUTOCOMMIT</codeph> mode in some database systems, where + changes take effect as soon as they are made. + </p> + </li> + + <li> + <p> + If your database, table, column, or other names conflict with Impala reserved words, use different + names or quote the names with backticks. See <xref href="impala_reserved_words.xml#reserved_words"/> + for the current list of Impala reserved words. + </p> + <p> + Conversely, if you use a keyword that Impala does not recognize, it might be interpreted as a table or + column alias. 
For example, in <codeph>SELECT * FROM t1 NATURAL JOIN t2</codeph>, Impala does not + recognize the <codeph>NATURAL</codeph> keyword and interprets it as an alias for the table + <codeph>t1</codeph>. If you experience any unexpected behavior with queries, check the list of reserved + words to make sure all keywords in join and <codeph>WHERE</codeph> clauses are recognized. + </p> + </li> + + <li> + <p> + Impala supports subqueries only in the <codeph>FROM</codeph> clause of a query, not within the + <codeph>WHERE</codeph> clauses. Therefore, you cannot use clauses such as <codeph>WHERE + <varname>column</varname> IN (<varname>subquery</varname>)</codeph>. Also, Impala does not allow + <codeph>EXISTS</codeph> or <codeph>NOT EXISTS</codeph> clauses (although <codeph>EXISTS</codeph> is a + reserved keyword). + </p> + </li> + + <li> + <p> + Impala supports <codeph>UNION</codeph> and <codeph>UNION ALL</codeph> set operators, but not + <codeph>INTERSECT</codeph>. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/> + </p> + </li> + + <li> + <p> + Within queries, Impala requires query aliases for any subqueries: + </p> +<codeblock>-- Without the alias 'contents_of_t1' at the end, query gives syntax error. +select count(*) from (select * from t1) contents_of_t1;</codeblock> + </li> + + <li> + <p> + When an alias is declared for an expression in a query, that alias cannot be referenced again within + the same query block: + </p> +<codeblock>-- Can't reference AVERAGE twice in the SELECT list where it's defined. +select avg(x) as average, average+1 from t1 group by x; +ERROR: AnalysisException: couldn't resolve column reference: 'average' + +-- Although it can be referenced again later in the same query. +select avg(x) as average from t1 group by x having average > 3;</codeblock> + <p> + For Impala, either repeat the expression again, or abstract the expression into a <codeph>WITH</codeph> + clause, creating named columns that can be referenced multiple times anywhere in the base query: + </p> +<codeblock>-- The following 2 query forms are equivalent. +select avg(x) as average, avg(x)+1 from t1 group by x; +with avg_t as (select avg(x) average from t1 group by x) select average, average+1 from avg_t;</codeblock> +<!-- An alternative bunch of queries to use in the example above. +[localhost:21000] > select x*x as x_squared from t1; + +[localhost:21000] > select x*x as x_squared from t1 where x_squared < 100; +ERROR: AnalysisException: couldn't resolve column reference: 'x_squared' +[localhost:21000] > select x*x as x_squared, x_squared * pi() as pi_x_squared from t1; +ERROR: AnalysisException: couldn't resolve column reference: 'x_squared' +[localhost:21000] > select x*x as x_squared from t1 group by x_squared; + +[localhost:21000] > select x*x as x_squared from t1 group by x_squared having x_squared < 100; +--> + </li> + + <li> + <p> + Impala does not support certain rarely used join types that are less appropriate for high-volume tables + used for data warehousing. In some cases, Impala supports join types but requires explicit syntax to + ensure you do not do inefficient joins of huge tables by accident. For example, Impala does not support + natural joins or anti-joins, and requires the <codeph>CROSS JOIN</codeph> operator for Cartesian + products. See <xref href="impala_joins.xml#joins"/> for details on the syntax for Impala join clauses. + </p> + </li> + + <li> + <p> + Impala has a limited choice of partitioning types. 
Partitions are defined based on each distinct + combination of values for one or more partition key columns. Impala does not redistribute or check data + to create evenly distributed partitions; you must choose partition key columns based on your knowledge + of the data volume and distribution. Adapt any tables that use range, list, hash, or key partitioning + to use the Impala partition syntax for <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> + statements. Impala partitioning is similar to range partitioning where every range has exactly one + value, or key partitioning where the hash function produces a separate bucket for every combination of + key values. See <xref href="impala_partitioning.xml#partitioning"/> for usage details, and + <xref href="impala_create_table.xml#create_table"/> and + <xref href="impala_alter_table.xml#alter_table"/> for syntax. + </p> + <note> + Because the number of separate partitions is potentially higher than in other database systems, keep a + close eye on the number of partitions and the volume of data in each one; scale back the number of + partition key columns if you end up with too many partitions with a small volume of data in each one. + Remember, to distribute work for a query across a cluster, you need at least one HDFS block per node. + HDFS blocks are typically multiple megabytes, <ph rev="parquet_block_size">especially</ph> for Parquet + files. Therefore, if each partition holds only a few megabytes of data, you are unlikely to see much + parallelism in the query because such a small amount of data is typically processed by a single node. + </note> + </li> + + <li> + <p> + For <q>top-N</q> queries, Impala uses the <codeph>LIMIT</codeph> clause rather than comparing against a + pseudocolumn named <codeph>ROWNUM</codeph> or <codeph>ROW_NUM</codeph>. See + <xref href="impala_limit.xml#limit"/> for details. + </p> + </li> + </ul> + </conbody> + </concept> + + <concept id="porting_antipatterns"> + + <title>SQL Constructs to Doublecheck</title> + + <conbody> + + <p> + Some SQL constructs that are supported have behavior or defaults more oriented towards convenience than + optimal performance. Also, sometimes machine-generated SQL, perhaps issued through JDBC or ODBC + applications, might have inefficiencies or exceed internal Impala limits. As you port SQL code, be alert + and change these things where appropriate: + </p> + + <ul> + <li> + <p> + A <codeph>CREATE TABLE</codeph> statement with no <codeph>STORED AS</codeph> clause creates data files + in plain text format, which is convenient for data interchange but not a good choice for high-volume + data with high-performance queries. See <xref href="impala_file_formats.xml#file_formats"/> for why and + how to use specific file formats for compact data and high-performance queries. Especially see + <xref href="impala_parquet.xml#parquet"/>, for details about the file format most heavily optimized for + large-scale data warehouse queries. + </p> + </li> + + <li> + <p> + A <codeph>CREATE TABLE</codeph> statement with no <codeph>PARTITIONED BY</codeph> clause stores all the + data files in the same physical location, which can lead to scalability problems when the data volume + becomes large. + </p> + <p> + On the other hand, adapting tables that were already partitioned in a different database system could + produce an Impala table with a high number of partitions and not enough data in each one, leading to + underutilization of Impala's parallel query features. 
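+ As a sketch (hypothetical schema), a coarse-grained scheme that keeps each partition reasonably large
+ might look like:
+ </p>
+<codeblock>-- One partition per year and month, rather than one per day,
+-- so each partition holds a meaningful volume of data.
+CREATE TABLE sales (id BIGINT, amount DOUBLE)
+  PARTITIONED BY (year SMALLINT, month TINYINT)
+  STORED AS PARQUET;</codeblock>
+ <p>
+ Compare the expected data volume per partition against the HDFS block size when choosing partition key
+ columns.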
+ </p> + <p> + See <xref href="impala_partitioning.xml#partitioning"/> for details about setting up partitioning and + tuning the performance of queries on partitioned tables. + </p> + </li> + + <li> + <p> + The <codeph>INSERT ... VALUES</codeph> syntax is suitable for setting up toy tables with a few rows for + functional testing, but because each such statement creates a separate tiny file in HDFS, it is not a + scalable technique for loading megabytes or gigabytes (let alone petabytes) of data. Consider revising + your data load process to produce raw data files outside of Impala, then setting up Impala external + tables or using the <codeph>LOAD DATA</codeph> statement to use those data files instantly in Impala + tables, with no conversion or indexing stage. See <xref href="impala_tables.xml#external_tables"/> and + <xref href="impala_load_data.xml#load_data"/> for details about the Impala techniques for working with + data files produced outside of Impala; see <xref href="impala_tutorial.xml#tutorial_etl"/> for examples + of ETL workflow for Impala. + </p> + </li> + + <li> + <p> + If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data + files, or a single giant data file that cannot take advantage of distributed parallel queries or + partitioning. In this case, use an <codeph>INSERT ... SELECT</codeph> statement to copy the data into a + new table and reorganize into a more efficient layout in the same operation. See + <xref href="impala_insert.xml#insert"/> for details about the <codeph>INSERT</codeph> statement. + </p> + <p> + You can do <codeph>INSERT ... SELECT</codeph> into a table with a more efficient file format (see + <xref href="impala_file_formats.xml#file_formats"/>) or from an unpartitioned table into a partitioned + one (see <xref href="impala_partitioning.xml#partitioning"/>). + </p> + </li> + + <li> + <p> + The number of expressions allowed in an Impala query might be smaller than for some other database + systems, causing failures for very complicated queries (typically produced by automated SQL + generators). Where practical, keep the number of expressions in the <codeph>WHERE</codeph> clauses to + approximately 2000 or fewer. As a workaround, set the query option + <codeph>DISABLE_CODEGEN=true</codeph> if queries fail for this reason. See + <xref href="impala_disable_codegen.xml#disable_codegen"/> for details. + </p> + </li> + + <li> + <p> + If practical, rewrite <codeph>UNION</codeph> queries to use the <codeph>UNION ALL</codeph> operator + instead. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/> + </p> + </li> + </ul> + </conbody> + </concept> + + <concept id="porting_next"> + + <title>Next Porting Steps after Verifying Syntax and Semantics</title> + + <conbody> + + <p> + Throughout this section, some of the decisions you make during the porting process also have a substantial + impact on performance. After your SQL code is ported and working correctly, doublecheck the + performance-related aspects of your schema design, physical layout, and queries to make sure that the + ported application is taking full advantage of Impala's parallelism, performance-related SQL features, and + integration with Hadoop components. + </p> + + <ul> + <li> + Have you run the <codeph>COMPUTE STATS</codeph> statement on each table involved in join queries? Have + you also run <codeph>COMPUTE STATS</codeph> for each table used as the source table in an <codeph>INSERT + ... 
SELECT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> statement? + </li> + + <li> + Are you using the most efficient file format for your data volumes, table structure, and query + characteristics? + </li> + + <li> + Are you using partitioning effectively? That is, have you partitioned on columns that are often used for + filtering in <codeph>WHERE</codeph> clauses? Have you partitioned at the right granularity so that there + is enough data in each partition to parallelize the work for each query? + </li> + + <li> + Does your ETL process produce a relatively small number of multi-megabyte data files (good) rather than a + huge number of small files (bad)? + </li> + </ul> + + <p> + See <xref href="impala_performance.xml#performance"/> for details about the whole performance tuning + process. + </p> + </conbody> + </concept> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_ports.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_ports.xml b/docs/topics/impala_ports.xml new file mode 100644 index 0000000..80f217f --- /dev/null +++ b/docs/topics/impala_ports.xml @@ -0,0 +1,440 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="ports"> + + <title>Ports Used by Impala</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Ports"/> + <data name="Category" value="Network"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody id="conbody_ports"> + + <p> + <indexterm audience="Cloudera">ports</indexterm> + Impala uses the TCP ports listed in the following table. Before deploying Impala, ensure these ports are open + on each system. + </p> + + <table> + <tgroup cols="5"> + <colspec colname="1" colwidth="20*"/> + <colspec colname="2" colwidth="30*"/> + <colspec colname="3" colwidth="10*"/> + <colspec colname="4" colwidth="20*"/> + <colspec colname="5" colwidth="30*"/> + <thead> + <row> + <entry> + Component + </entry> + <entry> + Service + </entry> + <entry> + Port + </entry> + <entry> + Access Requirement + </entry> + <entry> + Comment + </entry> + </row> + </thead> + <tbody> + <row> + <entry> + <p> + Impala Daemon + </p> + </entry> + <entry> + <p> + Impala Daemon Frontend Port + </p> + </entry> + <entry> + <p> + 21000 + </p> + </entry> + <entry> + <p> + External + </p> + </entry> + <entry> + <p> + Used to transmit commands and receive results by <codeph>impala-shell</codeph> and + version 1.2 of the Cloudera ODBC driver. + </p> + </entry> + </row> + <row> + <entry> + <p> + Impala Daemon + </p> + </entry> + <entry> + <p> + Impala Daemon Frontend Port + </p> + </entry> + <entry> + <p> + 21050 + </p> + </entry> + <entry> + <p> + External + </p> + </entry> + <entry> + <p> + Used to transmit commands and receive results by applications, such as Business Intelligence tools, + using JDBC, the Beeswax query editor in Hue, and version 2.0 or higher of the Cloudera ODBC driver. + </p> + </entry> + </row> + <row> + <entry> + <p> + Impala Daemon + </p> + </entry> + <entry> + <p> + Impala Daemon Backend Port + </p> + </entry> + <entry> + <p> + 22000 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. Impala daemons use this port to communicate with each other. + </p> + </entry> + </row> + <row> + <entry> + <p> + Impala Daemon + </p> + </entry> + <entry> + <p> + StateStoreSubscriber Service Port + </p> + </entry> + <entry> + <p> + 23000 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. Impala daemons listen on this port for updates from the statestore daemon. + </p> + </entry> + </row> + <row rev="2.1.0"> + <entry> + <p> + Catalog Daemon + </p> + </entry> + <entry> + <p> + StateStoreSubscriber Service Port + </p> + </entry> + <entry> + <p> + 23020 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. The catalog daemon listens on this port for updates from the statestore daemon. 
+ </p> + </entry> + </row> + <row> + <entry> + <p> + Impala Daemon + </p> + </entry> + <entry> + <p> + Impala Daemon HTTP Server Port + </p> + </entry> + <entry> + <p> + 25000 + </p> + </entry> + <entry> + <p> + External + </p> + </entry> + <entry> + <p> + Impala web interface for administrators to monitor and troubleshoot. + </p> + </entry> + </row> + <row> + <entry> + <p> + Impala StateStore Daemon + </p> + </entry> + <entry> + <p> + StateStore HTTP Server Port + </p> + </entry> + <entry> + <p> + 25010 + </p> + </entry> + <entry> + <p> + External + </p> + </entry> + <entry> + <p> + StateStore web interface for administrators to monitor and troubleshoot. + </p> + </entry> + </row> + <row rev="1.2"> + <entry> + <p> + Impala Catalog Daemon + </p> + </entry> + <entry> + <p> + Catalog HTTP Server Port + </p> + </entry> + <entry> + <p> + 25020 + </p> + </entry> + <entry> + <p> + External + </p> + </entry> + <entry> + <p> + Catalog service web interface for administrators to monitor and troubleshoot. New in Impala 1.2 and + higher. + </p> + </entry> + </row> + <row> + <entry> + <p> + Impala StateStore Daemon + </p> + </entry> + <entry> + <p> + StateStore Service Port + </p> + </entry> + <entry> + <p> + 24000 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. The statestore daemon listens on this port for registration/unregistration + requests. + </p> + </entry> + </row> + <row rev="1.2"> + <entry> + <p> + Impala Catalog Daemon + </p> + </entry> + <entry> + <p> + StateStore Service Port + </p> + </entry> + <entry> + <p> + 26000 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. The catalog service uses this port to communicate with the Impala daemons. New + in Impala 1.2 and higher. + </p> + </entry> + </row> + <row rev="1.3.0"> + <entry> + <p> + Impala Daemon + </p> + </entry> + <entry> + <p> + Llama Callback Port + </p> + </entry> + <entry> + <p> + 28000 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. Impala daemons use this port to communicate with Llama. New in <ph rev="upstream">CDH 5.0.0</ph> and higher. + </p> + </entry> + </row> + <row rev="1.3.0"> + <entry> + <p> + Impala Llama ApplicationMaster + </p> + </entry> + <entry> + <p> + Llama Thrift Admin Port + </p> + </entry> + <entry> + <p> + 15002 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. New in <ph rev="upstream">CDH 5.0.0</ph> and higher. + </p> + </entry> + </row> + <row rev="1.3.0"> + <entry> + <p> + Impala Llama ApplicationMaster + </p> + </entry> + <entry> + <p> + Llama Thrift Port + </p> + </entry> + <entry> + <p> + 15000 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. New in <ph rev="upstream">CDH 5.0.0</ph> and higher. + </p> + </entry> + </row> + <row rev="1.3.0"> + <entry> + <p> + Impala Llama ApplicationMaster + </p> + </entry> + <entry> + <p> + Llama HTTP Port + </p> + </entry> + <entry> + <p> + 15001 + </p> + </entry> + <entry> + <p> + External + </p> + </entry> + <entry> + <p> + Llama service web interface for administrators to monitor and troubleshoot. New in CDH 5.0.0 and + higher.
+ </p> + </entry> + </row> + </tbody> + </tgroup> + </table> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_prefetch_mode.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_prefetch_mode.xml b/docs/topics/impala_prefetch_mode.xml new file mode 100644 index 0000000..fc85c11 --- /dev/null +++ b/docs/topics/impala_prefetch_mode.xml @@ -0,0 +1,49 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="prefetch_mode" rev="2.6.0 IMPALA-3286"> + + <title>PREFETCH_MODE Query Option (<keyword keyref="impala26"/> or higher only)</title> + <titlealts audience="PDF"><navtitle>PREFETCH_MODE</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p rev="2.6.0 IMPALA-3286"> + <indexterm audience="Cloudera">PREFETCH_MODE query option</indexterm> + Determines whether the prefetching optimization is applied during + join query processing. + </p> + + <p> + <b>Type:</b> numeric (0, 1) + or corresponding mnemonic strings (<codeph>NONE</codeph>, <codeph>HT_BUCKET</codeph>). + </p> + + <p> + <b>Default:</b> 1 (equivalent to <codeph>HT_BUCKET</codeph>) + </p> + + <p conref="../shared/impala_common.xml#common/added_in_260"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + <p> + The default mode is 1, which means that hash table buckets are + prefetched during join query processing. + </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_joins.xml#joins"/>, + <xref href="impala_perf_joins.xml#perf_joins"/>. + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_prereqs.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_prereqs.xml b/docs/topics/impala_prereqs.xml new file mode 100644 index 0000000..8572738 --- /dev/null +++ b/docs/topics/impala_prereqs.xml @@ -0,0 +1,357 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="prereqs"> + + <title>Impala Requirements</title> + <titlealts audience="PDF"><navtitle>Requirements</navtitle></titlealts> + + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Requirements"/> + <data name="Category" value="Planning"/> + <data name="Category" value="Installing"/> + <data name="Category" value="Upgrading"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + <!-- Another instance of a topic pulled into the map twice, resulting in a second HTML page with a *1.html filename. --> + <data name="Category" value="Duplicate Topics"/> + <!-- Using a separate category, 'Multimap', to flag those pages that are duplicate because of multiple DITA map references. 
--> + <data name="Category" value="Multimap"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">prerequisites</indexterm> + <indexterm audience="Cloudera">requirements</indexterm> + To perform as expected, Impala depends on the availability of the software, hardware, and configurations + described in the following sections. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="product_compatibility_matrix"> + + <title>Product Compatibility Matrix</title> + + <conbody> + + <p> The ultimate source of truth about compatibility between various + versions of CDH, Cloudera Manager, and various CDH components is the <ph + audience="integrated"><xref + href="rn_consolidated_pcm.xml" + >Product Compatibility Matrix for CDH and Cloudera + Manager</xref></ph><ph audience="standalone">online <xref + href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html" + format="html" scope="external">Product Compatibility + Matrix</xref></ph>. </p> + + <p> + For Impala, see the + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/pcm_impala.html" scope="external" format="html">Impala + compatibility matrix page</xref>. + </p> + </conbody> + </concept> + + <concept id="prereqs_os"> + + <title>Supported Operating Systems</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">software requirements</indexterm> + <indexterm audience="Cloudera">Red Hat Enterprise Linux</indexterm> + <indexterm audience="Cloudera">RHEL</indexterm> + <indexterm audience="Cloudera">CentOS</indexterm> + <indexterm audience="Cloudera">SLES</indexterm> + <indexterm audience="Cloudera">Ubuntu</indexterm> + <indexterm audience="Cloudera">SUSE</indexterm> + <indexterm audience="Cloudera">Debian</indexterm> The relevant supported operating systems + and versions for Impala are the same as for the corresponding CDH 5 platforms. For + details, see the <cite>Supported Operating Systems</cite> page for + <ph audience="integrated"><xref href="rn_consolidated_pcm.xml#cdh_cm_supported_os">CDH + 5</xref></ph><ph audience="standalone"><xref + href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html#cdh_cm_supported_os" + scope="external" format="html">CDH 5</xref></ph>. </p> + </conbody> + </concept> + + <concept id="prereqs_hive"> + + <title>Hive Metastore and Related Configuration</title> + <prolog> + <metadata> + <data name="Category" value="Metastore"/> + <data name="Category" value="Hive"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">Hive</indexterm> + <indexterm audience="Cloudera">MySQL</indexterm> + <indexterm audience="Cloudera">PostgreSQL</indexterm> + Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking + metadata about schema objects such as tables and columns. The following components are prerequisites for + Impala: + </p> + + <ul> + <li> + MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive. + <note> + <p> + Installing and configuring a Hive metastore is an Impala requirement. Impala does not work without + the metastore database. For the process of installing and configuring the metastore, see + <xref href="impala_install.xml#install"/>. + </p> + <p> + Always configure a <b>Hive metastore service</b> rather than connecting directly to the metastore + database. 
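+ For example, clients typically locate the metastore service through a setting along these lines in
+ <codeph>hive-site.xml</codeph> (the host name here is hypothetical; 9083 is the conventional
+ metastore port):
+ </p>
+<codeblock>&lt;property&gt;
+  &lt;name&gt;hive.metastore.uris&lt;/name&gt;
+  &lt;value&gt;thrift://metastore-host.example.com:9083&lt;/value&gt;
+&lt;/property&gt;</codeblock>
+ <p>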
The Hive metastore service is required to interoperate between possibly different levels of + metastore APIs used by CDH and Impala, and avoids known issues with connecting directly to the + metastore database. The Hive metastore service is set up for you by default if you install through + Cloudera Manager 4.5 or higher. + </p> + <p> + A summary of the metastore installation process is as follows: + </p> + <ul> + <li> + Install a MySQL or PostgreSQL database. Start the database if it is not started after installation. + </li> + + <li> + Download the + <xref href="http://www.mysql.com/products/connector/" scope="external" format="html">MySQL + connector</xref> or the + <xref href="http://jdbc.postgresql.org/download.html" scope="external" format="html">PostgreSQL + connector</xref> and place it in the <codeph>/usr/share/java/</codeph> directory. + </li> + + <li> + Use the appropriate command line tool for your database to create the metastore database. + </li> + + <li> + Use the appropriate command line tool for your database to grant privileges for the metastore + database to the <codeph>hive</codeph> user. + </li> + + <li> + Modify <codeph>hive-site.xml</codeph> to include information matching your particular database: its + URL, username, and password. You will copy the <codeph>hive-site.xml</codeph> file to the Impala + Configuration Directory later in the Impala installation process. + </li> + </ul> + </note> + </li> + + <li> + <b>Optional:</b> Hive. Although only the Hive metastore database is required for Impala to function, you + might install Hive on some client machines to create and load data into tables that use certain file + formats. See <xref href="impala_file_formats.xml#file_formats"/> for details. Hive does not need to be + installed on the same DataNodes as Impala; it just needs access to the same metastore database. + </li> + </ul> + </conbody> + </concept> + + <concept id="prereqs_java"> + + <title>Java Dependencies</title> + <prolog> + <metadata> + <data name="Category" value="Java"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">Java</indexterm> + <indexterm audience="Cloudera">impala-dependencies.jar</indexterm> + Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop + components: + </p> + + <ul> + <li> + The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause issues, typically + resulting in a failure at <cmdname>impalad</cmdname> startup. In particular, the JamVM used by default on + certain levels of Ubuntu systems can cause <cmdname>impalad</cmdname> to fail to start. + <!-- To do: + Could say something here about JDK 6 vs. JDK 7 in CDH 5. Since we didn't specify the JDK version before, + don't know the impact from the user perspective so not calling it out at the moment. + --> + </li> + + <li> + Internally, the <cmdname>impalad</cmdname> daemon relies on the <codeph>JAVA_HOME</codeph> environment + variable to locate the system Java libraries. Make sure the <cmdname>impalad</cmdname> service is not run + from an environment with an incorrect setting for this variable. + </li> + + <li> + All Java dependencies are packaged in the <codeph>impala-dependencies.jar</codeph> file, which is located + at <codeph>/usr/lib/impala/lib/</codeph>. These map to everything that is built under + <codeph>fe/target/dependency</codeph>. 
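+ For example, to inspect what the jar bundles (a sketch; the exact contents vary by release):
+<codeblock>$ jar tf /usr/lib/impala/lib/impala-dependencies.jar</codeblock>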
+ </li> + </ul> + </conbody> + </concept> + + <concept id="prereqs_network"> + + <title>Networking Configuration Requirements</title> + <prolog> + <metadata> + <data name="Category" value="Network"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">network configuration</indexterm> + As part of ensuring best performance, Impala attempts to complete tasks on local data, as opposed to using + network connections to work with remote data. To support this goal, Impala matches + the <b>hostname</b> provided to each Impala daemon with the <b>IP address</b> of each DataNode by + resolving the hostname flag to an IP address. For Impala to work with local data, use a single IP interface + for the DataNode and the Impala daemon on each machine. Ensure that the Impala daemon's hostname flag + resolves to the IP address of the DataNode. For single-homed machines, this is usually automatic, but for + multi-homed machines, ensure that the Impala daemon's hostname resolves to the correct interface. Impala + tries to detect the correct hostname at start-up, and prints the derived hostname at the start of the log + in a message of the form: + </p> + +<codeblock>Using hostname: impala-daemon-1.example.com</codeblock> + + <p> + In the majority of cases, this automatic detection works correctly. If you need to explicitly set the + hostname, do so by setting the <codeph>--hostname</codeph> flag. + </p> + </conbody> + </concept> + + <concept id="prereqs_hardware"> + + <title>Hardware Requirements</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">hardware requirements</indexterm> + <indexterm audience="Cloudera">capacity</indexterm> + <indexterm audience="Cloudera">RAM</indexterm> + <indexterm audience="Cloudera">memory</indexterm> + <indexterm audience="Cloudera">CPU</indexterm> + <indexterm audience="Cloudera">processor</indexterm> + <indexterm audience="Cloudera">Intel</indexterm> + <indexterm audience="Cloudera">AMD</indexterm> + During join operations, portions of data from each joined table are loaded into memory. Data sets can be + very large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate + completing. + </p> + + <p> + While requirements vary according to data set size, the following is generally recommended: + </p> + + <ul> + <li rev="2.0.0"> + CPU - Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in newer processors. + <note> + This required level of processor is the same as in Impala version 1.x. The Impala 2.0 and 2.1 releases + had a stricter requirement for the SSE4.1 instruction set, which has now been relaxed. + </note> +<!-- + For best performance use: + <ul> + <li> + Intel - Nehalem (released 2008) or later processors. + </li> + + <li> + AMD - Bulldozer (released 2011) or later processors. + </li> + </ul> +--> + </li> + + <li rev="1.2"> + Memory - 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query + processing on a particular node exceed the amount of memory available to Impala on that node, the query + writes temporary work data to disk, which can lead to long query times. Note that because the work is + parallelized, and intermediate results for aggregate queries are typically smaller than the original + data, Impala can query and join tables that are much larger than the memory available on an individual + node. + </li> + + <li> + Storage - DataNodes with 12 or more disks each. 
I/O speeds are often the limiting factor for disk + performance with Impala. Ensure that you have sufficient disk space to store the data Impala will be + querying. + </li> + </ul> + </conbody> + </concept> + + <concept id="prereqs_account"> + + <title>User Account Requirements</title> + <prolog> + <metadata> + <data name="Category" value="Users"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">impala user</indexterm> + <indexterm audience="Cloudera">impala group</indexterm> + <indexterm audience="Cloudera">root user</indexterm> + Impala creates and uses a user and group named <codeph>impala</codeph>. Do not delete this account or group + and do not modify the account's or group's permissions and rights. Ensure no existing systems obstruct the + functioning of these accounts and groups. For example, if you have scripts that delete user accounts not in + a white-list, add these accounts to the list of permitted accounts. + </p> + +<!-- Taking out because no longer applicable in CDH 5.5 and up. --> + <p id="impala_hdfs_group" rev="1.2" audience="Cloudera"> + For the resource management feature to work (in combination with CDH 5 and the YARN and Llama components), + the <codeph>impala</codeph> user must be a member of the <codeph>hdfs</codeph> group. This setup is + performed automatically during a new install, but not when upgrading from earlier Impala releases to Impala + 1.2. If you are upgrading a node to CDH 5 that already had Impala 1.1 or 1.0 installed, manually add the + <codeph>impala</codeph> user to the <codeph>hdfs</codeph> group. + </p> + + <p> + For correct file deletion during <codeph>DROP TABLE</codeph> operations, Impala must be able to move files + to the HDFS trashcan. You might need to create an HDFS directory <filepath>/user/impala</filepath>, + writeable by the <codeph>impala</codeph> user, so that the trashcan can be created. Otherwise, data files + might remain behind after a <codeph>DROP TABLE</codeph> statement. + </p> + + <p> + Impala should not run as root. Best Impala performance is achieved using direct reads, but root is not + permitted to use direct reads. Therefore, running Impala as root negatively affects performance. + </p> + + <p> + By default, any user can connect to Impala and access all the associated databases and tables. You can + enable authorization and authentication based on the Linux OS user who connects to the Impala server, and + the associated groups for that user. See <xref href="impala_security.xml#security"/> for details. These + security features do not change the underlying file permission requirements; the <codeph>impala</codeph> + user still needs to be able to access the data files.
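+ </p>
+ <p>
+ A sketch of creating the HDFS directory mentioned above so that the trashcan works for
+ <codeph>DROP TABLE</codeph> (run as a user with HDFS superuser privileges; adjust the path and
+ ownership to match your deployment):
+ </p>
+<codeblock>$ sudo -u hdfs hdfs dfs -mkdir -p /user/impala
+$ sudo -u hdfs hdfs dfs -chown impala /user/impala</codeblock>
+ <p>
+ Verify ownership with <codeph>hdfs dfs -ls /user</codeph> afterward.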
+ </p> + </conbody> + </concept> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_processes.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_processes.xml b/docs/topics/impala_processes.xml new file mode 100644 index 0000000..05f2274 --- /dev/null +++ b/docs/topics/impala_processes.xml @@ -0,0 +1,134 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="processes"> + + <title>Starting Impala</title> + <prolog> + <metadata> + <data name="Category" value="Starting and Stopping"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Operators"/> + </metadata> + </prolog> + + <conbody> + + <p rev="1.2"> + <indexterm audience="Cloudera">state store</indexterm> + <indexterm audience="Cloudera">starting services</indexterm> + <indexterm audience="Cloudera">services</indexterm> + To activate Impala if it is installed but not yet started: + </p> + + <ol> + <li> + Set any necessary configuration options for the Impala services. See + <xref href="impala_config_options.xml#config_options"/> for details. + </li> + + <li> + Start one instance of the Impala statestore. The statestore helps Impala to distribute work efficiently, + and to continue running in the event of availability problems for other Impala nodes. If the statestore + becomes unavailable, Impala continues to function. + </li> + + <li> + Start one instance of the Impala catalog service. + </li> + + <li> + Start the main Impala service on one or more DataNodes, ideally on all DataNodes to maximize local + processing and avoid network traffic due to remote reads. + </li> + </ol> + + <p> + Once Impala is running, you can conduct interactive experiments using the instructions in + <xref href="impala_tutorial.xml#tutorial"/> and try <xref href="impala_impala_shell.xml#impala_shell"/>. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="starting_via_cm"> + + <title>Starting Impala through Cloudera Manager</title> + + <conbody> + + <p> + If you installed Impala with Cloudera Manager, use Cloudera Manager to start and stop services. The + Cloudera Manager GUI is a convenient way to check that all services are running, to set configuration + options using form fields in a browser, and to spot potential issues such as low disk space before they + become serious. Cloudera Manager automatically starts all the Impala-related services as a group, in the + correct order. See + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_start_stop_service.html" scope="external" format="html">the + Cloudera Manager Documentation</xref> for details. + </p> + + <note> + <p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/> + </note> + </conbody> + </concept> + + <concept id="starting_via_cmdline"> + + <title>Starting Impala from the Command Line</title> + + <conbody> + + <p> + To start the Impala state store and Impala from the command line or a script, you can either use the + <cmdname>service</cmdname> command or you can start the daemons directly through the + <cmdname>impalad</cmdname>, <codeph>statestored</codeph>, and <cmdname>catalogd</cmdname> executables. + </p> + + <p> + Start the Impala statestore and then start <codeph>impalad</codeph> instances. 
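+ If you start the daemons directly instead of through the service scripts, a minimal sketch is as
+ follows (assuming the binaries are on the PATH; real deployments pass additional startup flags, and
+ each daemon runs in the foreground unless you arrange otherwise):
+ </p>
+<codeblock># Start the statestore first, then the catalog service, then the Impala daemons.
+$ sudo -u impala statestored -log_dir=/var/log/impala
+$ sudo -u impala catalogd -log_dir=/var/log/impala
+$ sudo -u impala impalad -log_dir=/var/log/impala</codeblock>
+ <p>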
You can modify the values + the service initialization scripts use when starting the statestore and Impala by editing + <codeph>/etc/default/impala</codeph>. + </p> + + <p> + Start the statestore service using a command similar to the following: + </p> + + <p> +<codeblock>$ sudo service impala-state-store start</codeblock> + </p> + + <p rev="1.2"> + Start the catalog service using a command similar to the following: + </p> + +<codeblock rev="1.2">$ sudo service impala-catalog start</codeblock> + + <p> + Start the Impala service on each DataNode using a command similar to the following: + </p> + + <p> +<codeblock>$ sudo service impala-server start</codeblock> + </p> + + <note> + <p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/> + </note> + + <p> + If any of the services fail to start, review: + <ul> + <li> + <xref href="impala_logging.xml#logs_debug"/> + </li> + + <li> + <xref href="impala_troubleshooting.xml#troubleshooting"/> + </li> + </ul> + </p> + </conbody> + </concept> +</concept>