http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_porting.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_porting.xml b/docs/topics/impala_porting.xml new file mode 100644 index 0000000..3800713 --- /dev/null +++ b/docs/topics/impala_porting.xml @@ -0,0 +1,623 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="porting"> + + <title>Porting SQL from Other Database Systems to Impala</title> + <titlealts audience="PDF"><navtitle>Porting SQL</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Databases"/> + <data name="Category" value="Hive"/> + <data name="Category" value="Oracle"/> + <data name="Category" value="MySQL"/> + <data name="Category" value="PostgreSQL"/> + <data name="Category" value="Troubleshooting"/> + <data name="Category" value="Porting"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">porting</indexterm> + Although Impala uses standard SQL for queries, you might need to modify SQL source when bringing applications + to Impala, due to variations in data types, built-in functions, vendor language extensions, and + Hadoop-specific syntax. Even when SQL is working correctly, you might make further minor modifications for + best performance. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="porting_ddl_dml"> + + <title>Porting DDL and DML Statements</title> + + <conbody> + + <p> + When adapting SQL code from a traditional database system to Impala, expect to find a number of differences + in the DDL statements that you use to set up the schema. Clauses related to physical layout of files, + tablespaces, and indexes have no equivalent in Impala. You might restructure your schema considerably to + account for the Impala partitioning scheme and Hadoop file formats. + </p> + + <p> + Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor + extensions and features not yet supported in Impala, you might be able to run identical or almost-identical + query text on both systems. + </p> + + <p> + Therefore, consider separating out the DDL into a separate Impala-specific setup script. Focus your reuse + and ongoing tuning efforts on the code for SQL queries. + </p> + </conbody> + </concept> + + <concept id="porting_data_types"> + + <title>Porting Data Types from Other Database Systems</title> + + <conbody> + + <ul> + <li> + <p> + Change any <codeph>VARCHAR</codeph>, <codeph>VARCHAR2</codeph>, and <codeph>CHAR</codeph> columns to + <codeph>STRING</codeph>. Remove any length constraints from the column declarations; for example, + change <codeph>VARCHAR(32)</codeph> or <codeph>CHAR(1)</codeph> to <codeph>STRING</codeph>. Impala is + very flexible about the length of string values; it does not impose any length constraints + or do any special processing (such as blank-padding) for <codeph>STRING</codeph> columns. + (In Impala 2.0 and higher, there are data types <codeph>VARCHAR</codeph> and <codeph>CHAR</codeph>, + with length constraints for both types and blank-padding for <codeph>CHAR</codeph>. + However, for performance reasons, it is still preferable to use <codeph>STRING</codeph> + columns where practical.) 
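+ A before-and-after sketch, using a hypothetical <codeph>customers</codeph> table:
+ </p>
+<codeblock>-- DDL from another database system (hypothetical):
+--   CREATE TABLE customers (id NUMBER(10), name VARCHAR2(64), status CHAR(1));
+-- Ported to Impala:
+CREATE TABLE customers (id BIGINT, name STRING, status STRING);</codeblock>
+ <p>
+ The ported columns accept strings of any length, with no blank-padding or length enforcement.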
+ </p> + </li> + + <li> + <p> + For national language character types such as <codeph>NCHAR</codeph>, <codeph>NVARCHAR</codeph>, or + <codeph>NCLOB</codeph>, be aware that while Impala can store and query UTF-8 character data, currently + some string manipulation operations only work correctly with ASCII data. See + <xref href="impala_string.xml#string"/> for details. + </p> + </li> + + <li> + <p> + Change any <codeph>DATE</codeph>, <codeph>DATETIME</codeph>, or <codeph>TIME</codeph> columns to + <codeph>TIMESTAMP</codeph>. Remove any precision constraints. Remove any timezone clauses, and make + sure your application logic or ETL process accounts for the fact that Impala expects all + <codeph>TIMESTAMP</codeph> values to be in + <xref href="http://en.wikipedia.org/wiki/Coordinated_Universal_Time" scope="external" format="html">Coordinated + Universal Time (UTC)</xref>. See <xref href="impala_timestamp.xml#timestamp"/> for information about + the <codeph>TIMESTAMP</codeph> data type, and + <xref href="impala_datetime_functions.xml#datetime_functions"/> for conversion functions for different + date and time formats. + </p> + <p> + You might also need to adapt date- and time-related literal values and format strings to use the + supported Impala date and time formats. If you have date and time literals with different separators or + different numbers of <codeph>YY</codeph>, <codeph>MM</codeph>, and so on placeholders than Impala + expects, consider using calls to <codeph>regexp_replace()</codeph> to transform those values to the + Impala-compatible format. See <xref href="impala_timestamp.xml#timestamp"/> for information about the + allowed formats for date and time literals, and + <xref href="impala_string_functions.xml#string_functions"/> for string conversion functions such as + <codeph>regexp_replace()</codeph>. + </p> + <p> + Instead of <codeph>SYSDATE</codeph>, call the function <codeph>NOW()</codeph>. + </p> + <p> + Instead of adding or subtracting directly from a date value to produce a value <varname>N</varname> + days in the past or future, use an <codeph>INTERVAL</codeph> expression, for example <codeph>NOW() + + INTERVAL 30 DAYS</codeph>. + </p> + </li> + + <li> + <p> + Although Impala supports <codeph>INTERVAL</codeph> expressions for datetime arithmetic, as shown in + <xref href="impala_timestamp.xml#timestamp"/>, <codeph>INTERVAL</codeph> is not available as a column + data type in Impala. For any <codeph>INTERVAL</codeph> values stored in tables, convert them to numeric + values that you can add or subtract using the functions in + <xref href="impala_datetime_functions.xml#datetime_functions"/>. For example, if you had a table + <codeph>DEADLINES</codeph> with an <codeph>INT</codeph> column <codeph>TIME_PERIOD</codeph>, you could + construct dates N days in the future like so: + </p> +<codeblock>SELECT NOW() + INTERVAL time_period DAYS from deadlines;</codeblock> + </li> + + <li> + <p> + For <codeph>YEAR</codeph> columns, change to the smallest Impala integer type that has sufficient + range. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on + for the various numeric data types. + </p> + </li> + + <li> + <p> + Change any <codeph>DECIMAL</codeph> and <codeph>NUMBER</codeph> types. If fixed-point precision is not + required, you can use <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> on the Impala side depending on + the range of values. 
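+ Keep in mind that <codeph>DOUBLE</codeph> is an approximate binary type. A quick check such as the
+ following sketch shows the kind of rounding to expect:
+ </p>
+<codeblock>-- Binary floating-point cannot represent 0.1 or 0.2 exactly,
+-- so the result is close to, but not exactly, 0.3.
+SELECT CAST(0.1 AS DOUBLE) + CAST(0.2 AS DOUBLE);</codeblock>
+ <p>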
For applications that require precise decimal values, such as financial data, you + might need to make more extensive changes to table structure and application logic, such as using + separate integer columns for dollars and cents, or encoding numbers as string values and writing UDFs + to manipulate them. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, + casting, and so on for the various numeric data types. + </p> + </li> + + <li> + <p> + <codeph>FLOAT</codeph>, <codeph>DOUBLE</codeph>, and <codeph>REAL</codeph> types are supported in + Impala. Remove any precision and scale specifications. (In Impala, <codeph>REAL</codeph> is just an + alias for <codeph>DOUBLE</codeph>; columns declared as <codeph>REAL</codeph> are turned into + <codeph>DOUBLE</codeph> behind the scenes.) See <xref href="impala_datatypes.xml#datatypes"/> for + details about ranges, casting, and so on for the various numeric data types. + </p> + </li> + + <li> + <p> + Most integer types from other systems have equivalents in Impala, perhaps under different names such as + <codeph>BIGINT</codeph> instead of <codeph>INT8</codeph>. For any that are unavailable, for example + <codeph>MEDIUMINT</codeph>, switch to the smallest Impala integer type that has sufficient range. + Remove any precision specifications. See <xref href="impala_datatypes.xml#datatypes"/> for details + about ranges, casting, and so on for the various numeric data types. + </p> + </li> + + <li> + <p> + Remove any <codeph>UNSIGNED</codeph> constraints. All Impala numeric types are signed. See + <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on for the + various numeric data types. + </p> + </li> + + <li> + <p> + For any types holding bitwise values, use an integer type with enough range to hold all the relevant + bits within a positive integer. See <xref href="impala_datatypes.xml#datatypes"/> for details about + ranges, casting, and so on for the various numeric data types. + </p> + <p> + For example, <codeph>TINYINT</codeph> has a maximum positive value of 127, not 255, so to manipulate + 8-bit bitfields as positive numbers, switch to the next larger type, <codeph>SMALLINT</codeph>. + </p> +<codeblock>[localhost:21000] > select cast(127*2 as tinyint); ++--------------------------+ +| cast(127 * 2 as tinyint) | ++--------------------------+ +| -2 | ++--------------------------+ +[localhost:21000] > select cast(128 as tinyint); ++----------------------+ +| cast(128 as tinyint) | ++----------------------+ +| -128 | ++----------------------+ +[localhost:21000] > select cast(127*2 as smallint); ++---------------------------+ +| cast(127 * 2 as smallint) | ++---------------------------+ +| 254 | ++---------------------------+</codeblock> + <p> + Impala does not support notation such as <codeph>b'0101'</codeph> for bit literals. + </p> + </li> + + <li> + <p> + For large object types, use <codeph>STRING</codeph> to represent <codeph>CLOB</codeph> or + <codeph>TEXT</codeph> types (character-based large objects) up to 32 KB in size. Binary large objects + such as <codeph>BLOB</codeph>, <codeph>RAW</codeph>, <codeph>BINARY</codeph>, and + <codeph>VARBINARY</codeph> do not currently have an equivalent in Impala. + </p> + </li> + + <li> + <p> + For Boolean-like types such as <codeph>BOOL</codeph>, use the Impala <codeph>BOOLEAN</codeph> type.
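+ Integer-style flag values can then be converted with a cast; a minimal sketch:
+ </p>
+<codeblock>-- Nonzero integer values cast to true, zero casts to false.
+SELECT CAST(1 AS BOOLEAN), CAST(0 AS BOOLEAN);</codeblock>
+ <p>
+ A <codeph>BOOLEAN</codeph> column can then be tested directly in a <codeph>WHERE</codeph> clause.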
+ </p> + </li> + + <li> + <p> + Because Impala currently does not support composite or nested types, any spatial data types in other + database systems do not have direct equivalents in Impala. You could represent spatial values in string + format and write UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details. Where + practical, separate spatial types into separate tables so that Impala can still work with the + non-spatial data. + </p> + </li> + + <li> + <p> + Take out any <codeph>DEFAULT</codeph> clauses. Impala can use data files produced from many different + sources, such as Pig, Hive, or MapReduce jobs. The fast import mechanisms of <codeph>LOAD DATA</codeph> + and external tables mean that Impala is flexible about the format of data files, and Impala does not + necessarily validate or cleanse data before querying it. When copying data through Impala + <codeph>INSERT</codeph> statements, you can use conditional functions such as <codeph>CASE</codeph> or + <codeph>NVL</codeph> to substitute some other value for <codeph>NULL</codeph> fields; see + <xref href="impala_conditional_functions.xml#conditional_functions"/> for details. + </p> + </li> + + <li> + <p> + Take out any constraints from your <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> + statements, for example <codeph>PRIMARY KEY</codeph>, <codeph>FOREIGN KEY</codeph>, + <codeph>UNIQUE</codeph>, <codeph>NOT NULL</codeph>, <codeph>UNSIGNED</codeph>, or + <codeph>CHECK</codeph> constraints. Impala can use data files produced from many different sources, + such as Pig, Hive, or MapReduce jobs. Therefore, Impala expects initial data validation to happen + earlier during the ETL or ELT cycle. After data is loaded into Impala tables, you can perform queries + to test for <codeph>NULL</codeph> values. When copying data through Impala <codeph>INSERT</codeph> + statements, you can use conditional functions such as <codeph>CASE</codeph> or <codeph>NVL</codeph> to + substitute some other value for <codeph>NULL</codeph> fields; see + <xref href="impala_conditional_functions.xml#conditional_functions"/> for details. + </p> + <p> + Do as much verification as practical before loading data into Impala. After data is loaded into Impala, + you can do further verification using SQL queries to check if values have expected ranges, if values + are <codeph>NULL</codeph> or not, and so on. If there is a problem with the data, you will need to + re-run earlier stages of the ETL process, or do an <codeph>INSERT ... SELECT</codeph> statement in + Impala to copy the faulty data to a new table and transform or filter out the bad values. + </p> + </li> + + <li> + <p> + Take out any <codeph>CREATE INDEX</codeph>, <codeph>DROP INDEX</codeph>, and <codeph>ALTER + INDEX</codeph> statements, and equivalent <codeph>ALTER TABLE</codeph> statements. Remove any + <codeph>INDEX</codeph>, <codeph>KEY</codeph>, or <codeph>PRIMARY KEY</codeph> clauses from + <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. Impala is optimized for bulk + read operations for data warehouse-style queries, and therefore does not support indexes for its + tables. + </p> + </li> + + <li> + <p> + Calls to built-in functions with out-of-range or otherwise incorrect arguments return + <codeph>NULL</codeph> in Impala, as opposed to raising exceptions. (This rule applies even when the + <codeph>ABORT_ON_ERROR=true</codeph> query option is in effect.)
Run small-scale queries using + representative data to doublecheck that calls to built-in functions are returning expected values + rather than <codeph>NULL</codeph>. For example, unsupported <codeph>CAST</codeph> operations do not + raise an error in Impala: + </p> +<codeblock>select cast('foo' as int); ++--------------------+ +| cast('foo' as int) | ++--------------------+ +| NULL | ++--------------------+</codeblock> + </li> + + <li> + <p> + For any other type not supported in Impala, you could represent their values in string format and write + UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details. + </p> + </li> + + <li> + <p> + To detect the presence of unsupported or unconvertable data types in data files, do initial testing + with the <codeph>ABORT_ON_ERROR=true</codeph> query option in effect. This option causes queries to + fail immediately if they encounter disallowed type conversions. See + <xref href="impala_abort_on_error.xml#abort_on_error"/> for details. For example: + </p> +<codeblock>set abort_on_error=true; +select count(*) from (select * from t1); +-- The above query will fail if the data files for T1 contain any +-- values that can't be converted to the expected Impala data types. +-- For example, if T1.C1 is defined as INT but the column contains +-- floating-point values like 1.1, the query will return an error.</codeblock> + </li> + </ul> + </conbody> + </concept> + + <concept id="porting_statements"> + + <title>SQL Statements to Remove or Adapt</title> + + <conbody> + + <p> + Some SQL statements or clauses that you might be familiar with are not currently supported in Impala: + </p> + + <ul> + <li> + <p> + Impala has no <codeph>DELETE</codeph> statement. Impala is intended for data warehouse-style operations + where you do bulk moves and transforms of large quantities of data. Instead of using + <codeph>DELETE</codeph>, use <codeph>INSERT OVERWRITE</codeph> to entirely replace the contents of a + table or partition, or use <codeph>INSERT ... SELECT</codeph> to copy a subset of data (everything but + the rows you intended to delete) from one table to another. See <xref href="impala_dml.xml#dml"/> for + an overview of Impala DML statements. + </p> + </li> + + <li> + <p> + Impala has no <codeph>UPDATE</codeph> statement. Impala is intended for data warehouse-style operations + where you do bulk moves and transforms of large quantities of data. Instead of using + <codeph>UPDATE</codeph>, do all necessary transformations early in the ETL process, such as in the job + that generates the original data, or when copying from one table to another to convert to a particular + file format or partitioning scheme. See <xref href="impala_dml.xml#dml"/> for an overview of Impala DML + statements. + </p> + </li> + + <li> + <p> + Impala has no transactional statements, such as <codeph>COMMIT</codeph> or <codeph>ROLLBACK</codeph>. + Impala effectively works like the <codeph>AUTOCOMMIT</codeph> mode in some database systems, where + changes take effect as soon as they are made. + </p> + </li> + + <li> + <p> + If your database, table, column, or other names conflict with Impala reserved words, use different + names or quote the names with backticks. See <xref href="impala_reserved_words.xml#reserved_words"/> + for the current list of Impala reserved words. + </p> + <p> + Conversely, if you use a keyword that Impala does not recognize, it might be interpreted as a table or + column alias. 
For example, in <codeph>SELECT * FROM t1 NATURAL JOIN t2</codeph>, Impala does not + recognize the <codeph>NATURAL</codeph> keyword and interprets it as an alias for the table + <codeph>t1</codeph>. If you experience any unexpected behavior with queries, check the list of reserved + words to make sure all keywords in join and <codeph>WHERE</codeph> clauses are recognized. + </p> + </li> + + <li> + <p> + Impala supports subqueries only in the <codeph>FROM</codeph> clause of a query, not within the + <codeph>WHERE</codeph> clauses. Therefore, you cannot use clauses such as <codeph>WHERE + <varname>column</varname> IN (<varname>subquery</varname>)</codeph>. Also, Impala does not allow + <codeph>EXISTS</codeph> or <codeph>NOT EXISTS</codeph> clauses (although <codeph>EXISTS</codeph> is a + reserved keyword). + </p> + </li> + + <li> + <p> + Impala supports <codeph>UNION</codeph> and <codeph>UNION ALL</codeph> set operators, but not + <codeph>INTERSECT</codeph>. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/> + </p> + </li> + + <li> + <p> + Within queries, Impala requires query aliases for any subqueries: + </p> +<codeblock>-- Without the alias 'contents_of_t1' at the end, query gives syntax error. +select count(*) from (select * from t1) contents_of_t1;</codeblock> + </li> + + <li> + <p> + When an alias is declared for an expression in a query, that alias cannot be referenced again within + the same query block: + </p> +<codeblock>-- Can't reference AVERAGE twice in the SELECT list where it's defined. +select avg(x) as average, average+1 from t1 group by x; +ERROR: AnalysisException: couldn't resolve column reference: 'average' + +-- Although it can be referenced again later in the same query. +select avg(x) as average from t1 group by x having average > 3;</codeblock> + <p> + For Impala, either repeat the expression again, or abstract the expression into a <codeph>WITH</codeph> + clause, creating named columns that can be referenced multiple times anywhere in the base query: + </p> +<codeblock>-- The following 2 query forms are equivalent. +select avg(x) as average, avg(x)+1 from t1 group by x; +with avg_t as (select avg(x) average from t1 group by x) select average, average+1 from avg_t;</codeblock> +<!-- An alternative bunch of queries to use in the example above. +[localhost:21000] > select x*x as x_squared from t1; + +[localhost:21000] > select x*x as x_squared from t1 where x_squared < 100; +ERROR: AnalysisException: couldn't resolve column reference: 'x_squared' +[localhost:21000] > select x*x as x_squared, x_squared * pi() as pi_x_squared from t1; +ERROR: AnalysisException: couldn't resolve column reference: 'x_squared' +[localhost:21000] > select x*x as x_squared from t1 group by x_squared; + +[localhost:21000] > select x*x as x_squared from t1 group by x_squared having x_squared < 100; +--> + </li> + + <li> + <p> + Impala does not support certain rarely used join types that are less appropriate for high-volume tables + used for data warehousing. In some cases, Impala supports join types but requires explicit syntax to + ensure you do not do inefficient joins of huge tables by accident. For example, Impala does not support + natural joins or anti-joins, and requires the <codeph>CROSS JOIN</codeph> operator for Cartesian + products. See <xref href="impala_joins.xml#joins"/> for details on the syntax for Impala join clauses. + </p> + </li> + + <li> + <p> + Impala has a limited choice of partitioning types. 
Partitions are defined based on each distinct + combination of values for one or more partition key columns. Impala does not redistribute or check data + to create evenly distributed partitions; you must choose partition key columns based on your knowledge + of the data volume and distribution. Adapt any tables that use range, list, hash, or key partitioning + to use the Impala partition syntax for <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> + statements. Impala partitioning is similar to range partitioning where every range has exactly one + value, or key partitioning where the hash function produces a separate bucket for every combination of + key values. See <xref href="impala_partitioning.xml#partitioning"/> for usage details, and + <xref href="impala_create_table.xml#create_table"/> and + <xref href="impala_alter_table.xml#alter_table"/> for syntax. + </p> + <note> + Because the number of separate partitions is potentially higher than in other database systems, keep a + close eye on the number of partitions and the volume of data in each one; scale back the number of + partition key columns if you end up with too many partitions with a small volume of data in each one. + Remember, to distribute work for a query across a cluster, you need at least one HDFS block per node. + HDFS blocks are typically multiple megabytes, <ph rev="parquet_block_size">especially</ph> for Parquet + files. Therefore, if each partition holds only a few megabytes of data, you are unlikely to see much + parallelism in the query because such a small amount of data is typically processed by a single node. + </note> + </li> + + <li> + <p> + For <q>top-N</q> queries, Impala uses the <codeph>LIMIT</codeph> clause rather than comparing against a + pseudocolumn named <codeph>ROWNUM</codeph> or <codeph>ROW_NUM</codeph>. See + <xref href="impala_limit.xml#limit"/> for details. + </p> + </li> + </ul> + </conbody> + </concept> + + <concept id="porting_antipatterns"> + + <title>SQL Constructs to Doublecheck</title> + + <conbody> + + <p> + Some SQL constructs that are supported have behavior or defaults more oriented towards convenience than + optimal performance. Also, sometimes machine-generated SQL, perhaps issued through JDBC or ODBC + applications, might have inefficiencies or exceed internal Impala limits. As you port SQL code, be alert + and change these things where appropriate: + </p> + + <ul> + <li> + <p> + A <codeph>CREATE TABLE</codeph> statement with no <codeph>STORED AS</codeph> clause creates data files + in plain text format, which is convenient for data interchange but not a good choice for high-volume + data with high-performance queries. See <xref href="impala_file_formats.xml#file_formats"/> for why and + how to use specific file formats for compact data and high-performance queries. Especially see + <xref href="impala_parquet.xml#parquet"/>, for details about the file format most heavily optimized for + large-scale data warehouse queries. + </p> + </li> + + <li> + <p> + A <codeph>CREATE TABLE</codeph> statement with no <codeph>PARTITIONED BY</codeph> clause stores all the + data files in the same physical location, which can lead to scalability problems when the data volume + becomes large. + </p> + <p> + On the other hand, adapting tables that were already partitioned in a different database system could + produce an Impala table with a high number of partitions and not enough data in each one, leading to + underutilization of Impala's parallel query features. 
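+ As a sketch (hypothetical schema), a coarse-grained scheme that keeps each partition reasonably large
+ might look like:
+ </p>
+<codeblock>-- One partition per year and month, rather than one per day,
+-- so each partition holds a meaningful volume of data.
+CREATE TABLE sales (id BIGINT, amount DOUBLE)
+  PARTITIONED BY (year SMALLINT, month TINYINT)
+  STORED AS PARQUET;</codeblock>
+ <p>
+ Compare the expected data volume per partition against the HDFS block size when choosing partition key
+ columns.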
+ </p> + <p> + See <xref href="impala_partitioning.xml#partitioning"/> for details about setting up partitioning and + tuning the performance of queries on partitioned tables. + </p> + </li> + + <li> + <p> + The <codeph>INSERT ... VALUES</codeph> syntax is suitable for setting up toy tables with a few rows for + functional testing, but because each such statement creates a separate tiny file in HDFS, it is not a + scalable technique for loading megabytes or gigabytes (let alone petabytes) of data. Consider revising + your data load process to produce raw data files outside of Impala, then setting up Impala external + tables or using the <codeph>LOAD DATA</codeph> statement to use those data files instantly in Impala + tables, with no conversion or indexing stage. See <xref href="impala_tables.xml#external_tables"/> and + <xref href="impala_load_data.xml#load_data"/> for details about the Impala techniques for working with + data files produced outside of Impala; see <xref href="impala_tutorial.xml#tutorial_etl"/> for examples + of ETL workflow for Impala. + </p> + </li> + + <li> + <p> + If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data + files, or a single giant data file that cannot take advantage of distributed parallel queries or + partitioning. In this case, use an <codeph>INSERT ... SELECT</codeph> statement to copy the data into a + new table and reorganize into a more efficient layout in the same operation. See + <xref href="impala_insert.xml#insert"/> for details about the <codeph>INSERT</codeph> statement. + </p> + <p> + You can do <codeph>INSERT ... SELECT</codeph> into a table with a more efficient file format (see + <xref href="impala_file_formats.xml#file_formats"/>) or from an unpartitioned table into a partitioned + one (see <xref href="impala_partitioning.xml#partitioning"/>). + </p> + </li> + + <li> + <p> + The number of expressions allowed in an Impala query might be smaller than for some other database + systems, causing failures for very complicated queries (typically produced by automated SQL + generators). Where practical, keep the number of expressions in the <codeph>WHERE</codeph> clauses to + approximately 2000 or fewer. As a workaround, set the query option + <codeph>DISABLE_CODEGEN=true</codeph> if queries fail for this reason. See + <xref href="impala_disable_codegen.xml#disable_codegen"/> for details. + </p> + </li> + + <li> + <p> + If practical, rewrite <codeph>UNION</codeph> queries to use the <codeph>UNION ALL</codeph> operator + instead. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/> + </p> + </li> + </ul> + </conbody> + </concept> + + <concept id="porting_next"> + + <title>Next Porting Steps after Verifying Syntax and Semantics</title> + + <conbody> + + <p> + Throughout this section, some of the decisions you make during the porting process also have a substantial + impact on performance. After your SQL code is ported and working correctly, doublecheck the + performance-related aspects of your schema design, physical layout, and queries to make sure that the + ported application is taking full advantage of Impala's parallelism, performance-related SQL features, and + integration with Hadoop components. + </p> + + <ul> + <li> + Have you run the <codeph>COMPUTE STATS</codeph> statement on each table involved in join queries? Have + you also run <codeph>COMPUTE STATS</codeph> for each table used as the source table in an <codeph>INSERT + ... 
SELECT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> statement? + </li> + + <li> + Are you using the most efficient file format for your data volumes, table structure, and query + characteristics? + </li> + + <li> + Are you using partitioning effectively? That is, have you partitioned on columns that are often used for + filtering in <codeph>WHERE</codeph> clauses? Have you partitioned at the right granularity so that there + is enough data in each partition to parallelize the work for each query? + </li> + + <li> + Does your ETL process produce a relatively small number of multi-megabyte data files (good) rather than a + huge number of small files (bad)? + </li> + </ul> + + <p> + See <xref href="impala_performance.xml#performance"/> for details about the whole performance tuning + process. + </p> + </conbody> + </concept> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_ports.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_ports.xml b/docs/topics/impala_ports.xml new file mode 100644 index 0000000..80f217f --- /dev/null +++ b/docs/topics/impala_ports.xml @@ -0,0 +1,440 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="ports"> + + <title>Ports Used by Impala</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Ports"/> + <data name="Category" value="Network"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody id="conbody_ports"> + + <p> + <indexterm audience="Cloudera">ports</indexterm> + Impala uses the TCP ports listed in the following table. Before deploying Impala, ensure these ports are open + on each system. + </p> + + <table> + <tgroup cols="5"> + <colspec colname="1" colwidth="20*"/> + <colspec colname="2" colwidth="30*"/> + <colspec colname="3" colwidth="10*"/> + <colspec colname="4" colwidth="20*"/> + <colspec colname="5" colwidth="30*"/> + <thead> + <row> + <entry> + Component + </entry> + <entry> + Service + </entry> + <entry> + Port + </entry> + <entry> + Access Requirement + </entry> + <entry> + Comment + </entry> + </row> + </thead> + <tbody> + <row> + <entry> + <p> + Impala Daemon + </p> + </entry> + <entry> + <p> + Impala Daemon Frontend Port + </p> + </entry> + <entry> + <p> + 21000 + </p> + </entry> + <entry> + <p> + External + </p> + </entry> + <entry> + <p> + Used to transmit commands and receive results by <codeph>impala-shell</codeph> and + version 1.2 of the Cloudera ODBC driver. + </p> + </entry> + </row> + <row> + <entry> + <p> + Impala Daemon + </p> + </entry> + <entry> + <p> + Impala Daemon Frontend Port + </p> + </entry> + <entry> + <p> + 21050 + </p> + </entry> + <entry> + <p> + External + </p> + </entry> + <entry> + <p> + Used to transmit commands and receive results by applications, such as Business Intelligence tools, + using JDBC, the Beeswax query editor in Hue, and version 2.0 or higher of the Cloudera ODBC driver. + </p> + </entry> + </row> + <row> + <entry> + <p> + Impala Daemon + </p> + </entry> + <entry> + <p> + Impala Daemon Backend Port + </p> + </entry> + <entry> + <p> + 22000 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. Impala daemons use this port to communicate with each other. + </p> + </entry> + </row> + <row> + <entry> + <p> + Impala Daemon + </p> + </entry> + <entry> + <p> + StateStoreSubscriber Service Port + </p> + </entry> + <entry> + <p> + 23000 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. Impala daemons listen on this port for updates from the statestore daemon. + </p> + </entry> + </row> + <row rev="2.1.0"> + <entry> + <p> + Catalog Daemon + </p> + </entry> + <entry> + <p> + StateStoreSubscriber Service Port + </p> + </entry> + <entry> + <p> + 23020 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. The catalog daemon listens on this port for updates from the statestore daemon. 
+ </p> + </entry> + </row> + <row> + <entry> + <p> + Impala Daemon + </p> + </entry> + <entry> + <p> + Impala Daemon HTTP Server Port + </p> + </entry> + <entry> + <p> + 25000 + </p> + </entry> + <entry> + <p> + External + </p> + </entry> + <entry> + <p> + Impala web interface for administrators to monitor and troubleshoot. + </p> + </entry> + </row> + <row> + <entry> + <p> + Impala StateStore Daemon + </p> + </entry> + <entry> + <p> + StateStore HTTP Server Port + </p> + </entry> + <entry> + <p> + 25010 + </p> + </entry> + <entry> + <p> + External + </p> + </entry> + <entry> + <p> + StateStore web interface for administrators to monitor and troubleshoot. + </p> + </entry> + </row> + <row rev="1.2"> + <entry> + <p> + Impala Catalog Daemon + </p> + </entry> + <entry> + <p> + Catalog HTTP Server Port + </p> + </entry> + <entry> + <p> + 25020 + </p> + </entry> + <entry> + <p> + External + </p> + </entry> + <entry> + <p> + Catalog service web interface for administrators to monitor and troubleshoot. New in Impala 1.2 and + higher. + </p> + </entry> + </row> + <row> + <entry> + <p> + Impala StateStore Daemon + </p> + </entry> + <entry> + <p> + StateStore Service Port + </p> + </entry> + <entry> + <p> + 24000 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. The statestore daemon listens on this port for registration/unregistration + requests. + </p> + </entry> + </row> + <row rev="1.2"> + <entry> + <p> + Impala Catalog Daemon + </p> + </entry> + <entry> + <p> + StateStore Service Port + </p> + </entry> + <entry> + <p> + 26000 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. The catalog service uses this port to communicate with the Impala daemons. New + in Impala 1.2 and higher. + </p> + </entry> + </row> + <row rev="1.3.0"> + <entry> + <p> + Impala Daemon + </p> + </entry> + <entry> + <p> + Llama Callback Port + </p> + </entry> + <entry> + <p> + 28000 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. Impala daemons use this port to communicate with Llama. New in <ph rev="upstream">CDH 5.0.0</ph> and higher. + </p> + </entry> + </row> + <row rev="1.3.0"> + <entry> + <p> + Impala Llama ApplicationMaster + </p> + </entry> + <entry> + <p> + Llama Thrift Admin Port + </p> + </entry> + <entry> + <p> + 15002 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. New in <ph rev="upstream">CDH 5.0.0</ph> and higher. + </p> + </entry> + </row> + <row rev="1.3.0"> + <entry> + <p> + Impala Llama ApplicationMaster + </p> + </entry> + <entry> + <p> + Llama Thrift Port + </p> + </entry> + <entry> + <p> + 15000 + </p> + </entry> + <entry> + <p> + Internal + </p> + </entry> + <entry> + <p> + Internal use only. New in <ph rev="upstream">CDH 5.0.0</ph> and higher. + </p> + </entry> + </row> + <row rev="1.3.0"> + <entry> + <p> + Impala Llama ApplicationMaster + </p> + </entry> + <entry> + <p> + Llama HTTP Port + </p> + </entry> + <entry> + <p> + 15001 + </p> + </entry> + <entry> + <p> + External + </p> + </entry> + <entry> + <p> + Llama service web interface for administrators to monitor and troubleshoot. New in CDH 5.0.0 and + higher.
+ </p> + </entry> + </row> + </tbody> + </tgroup> + </table> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_prefetch_mode.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_prefetch_mode.xml b/docs/topics/impala_prefetch_mode.xml new file mode 100644 index 0000000..fc85c11 --- /dev/null +++ b/docs/topics/impala_prefetch_mode.xml @@ -0,0 +1,49 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="prefetch_mode" rev="2.6.0 IMPALA-3286"> + + <title>PREFETCH_MODE Query Option (<keyword keyref="impala26"/> or higher only)</title> + <titlealts audience="PDF"><navtitle>PREFETCH_MODE</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p rev="2.6.0 IMPALA-3286"> + <indexterm audience="Cloudera">PREFETCH_MODE query option</indexterm> + Determines whether the prefetching optimization is applied during + join query processing. + </p> + + <p> + <b>Type:</b> numeric (0, 1) + or corresponding mnemonic strings (<codeph>NONE</codeph>, <codeph>HT_BUCKET</codeph>). + </p> + + <p> + <b>Default:</b> 1 (equivalent to <codeph>HT_BUCKET</codeph>) + </p> + + <p conref="../shared/impala_common.xml#common/added_in_260"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + <p> + The default mode is 1, which means that hash table buckets are + prefetched during join query processing. + </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_joins.xml#joins"/>, + <xref href="impala_perf_joins.xml#perf_joins"/>. + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_prereqs.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_prereqs.xml b/docs/topics/impala_prereqs.xml new file mode 100644 index 0000000..8572738 --- /dev/null +++ b/docs/topics/impala_prereqs.xml @@ -0,0 +1,357 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="prereqs"> + + <title>Impala Requirements</title> + <titlealts audience="PDF"><navtitle>Requirements</navtitle></titlealts> + + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Requirements"/> + <data name="Category" value="Planning"/> + <data name="Category" value="Installing"/> + <data name="Category" value="Upgrading"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + <!-- Another instance of a topic pulled into the map twice, resulting in a second HTML page with a *1.html filename. --> + <data name="Category" value="Duplicate Topics"/> + <!-- Using a separate category, 'Multimap', to flag those pages that are duplicate because of multiple DITA map references. 
--> + <data name="Category" value="Multimap"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">prerequisites</indexterm> + <indexterm audience="Cloudera">requirements</indexterm> + To perform as expected, Impala depends on the availability of the software, hardware, and configurations + described in the following sections. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="product_compatibility_matrix"> + + <title>Product Compatibility Matrix</title> + + <conbody> + + <p> The ultimate source of truth about compatibility between various + versions of CDH, Cloudera Manager, and various CDH components is the <ph + audience="integrated"><xref + href="rn_consolidated_pcm.xml" + >Product Compatibility Matrix for CDH and Cloudera + Manager</xref></ph><ph audience="standalone">online <xref + href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html" + format="html" scope="external">Product Compatibility + Matrix</xref></ph>. </p> + + <p> + For Impala, see the + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/pcm_impala.html" scope="external" format="html">Impala + compatibility matrix page</xref>. + </p> + </conbody> + </concept> + + <concept id="prereqs_os"> + + <title>Supported Operating Systems</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">software requirements</indexterm> + <indexterm audience="Cloudera">Red Hat Enterprise Linux</indexterm> + <indexterm audience="Cloudera">RHEL</indexterm> + <indexterm audience="Cloudera">CentOS</indexterm> + <indexterm audience="Cloudera">SLES</indexterm> + <indexterm audience="Cloudera">Ubuntu</indexterm> + <indexterm audience="Cloudera">SUSE</indexterm> + <indexterm audience="Cloudera">Debian</indexterm> The relevant supported operating systems + and versions for Impala are the same as for the corresponding CDH 5 platforms. For + details, see the <cite>Supported Operating Systems</cite> page for + <ph audience="integrated"><xref href="rn_consolidated_pcm.xml#cdh_cm_supported_os">CDH + 5</xref></ph><ph audience="standalone"><xref + href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html#cdh_cm_supported_os" + scope="external" format="html">CDH 5</xref></ph>. </p> + </conbody> + </concept> + + <concept id="prereqs_hive"> + + <title>Hive Metastore and Related Configuration</title> + <prolog> + <metadata> + <data name="Category" value="Metastore"/> + <data name="Category" value="Hive"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">Hive</indexterm> + <indexterm audience="Cloudera">MySQL</indexterm> + <indexterm audience="Cloudera">PostgreSQL</indexterm> + Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking + metadata about schema objects such as tables and columns. The following components are prerequisites for + Impala: + </p> + + <ul> + <li> + MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive. + <note> + <p> + Installing and configuring a Hive metastore is an Impala requirement. Impala does not work without + the metastore database. For the process of installing and configuring the metastore, see + <xref href="impala_install.xml#install"/>. + </p> + <p> + Always configure a <b>Hive metastore service</b> rather than connecting directly to the metastore + database. 
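+ For example, clients typically locate the metastore service through a setting along these lines in
+ <codeph>hive-site.xml</codeph> (the host name here is hypothetical; 9083 is the conventional
+ metastore port):
+ </p>
+<codeblock>&lt;property&gt;
+  &lt;name&gt;hive.metastore.uris&lt;/name&gt;
+  &lt;value&gt;thrift://metastore-host.example.com:9083&lt;/value&gt;
+&lt;/property&gt;</codeblock>
+ <p>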
The Hive metastore service is required to interoperate between possibly different levels of + metastore APIs used by CDH and Impala, and avoids known issues with connecting directly to the + metastore database. The Hive metastore service is set up for you by default if you install through + Cloudera Manager 4.5 or higher. + </p> + <p> + A summary of the metastore installation process is as follows: + </p> + <ul> + <li> + Install a MySQL or PostgreSQL database. Start the database if it is not started after installation. + </li> + + <li> + Download the + <xref href="http://www.mysql.com/products/connector/" scope="external" format="html">MySQL + connector</xref> or the + <xref href="http://jdbc.postgresql.org/download.html" scope="external" format="html">PostgreSQL + connector</xref> and place it in the <codeph>/usr/share/java/</codeph> directory. + </li> + + <li> + Use the appropriate command line tool for your database to create the metastore database. + </li> + + <li> + Use the appropriate command line tool for your database to grant privileges for the metastore + database to the <codeph>hive</codeph> user. + </li> + + <li> + Modify <codeph>hive-site.xml</codeph> to include information matching your particular database: its + URL, username, and password. You will copy the <codeph>hive-site.xml</codeph> file to the Impala + Configuration Directory later in the Impala installation process. + </li> + </ul> + </note> + </li> + + <li> + <b>Optional:</b> Hive. Although only the Hive metastore database is required for Impala to function, you + might install Hive on some client machines to create and load data into tables that use certain file + formats. See <xref href="impala_file_formats.xml#file_formats"/> for details. Hive does not need to be + installed on the same DataNodes as Impala; it just needs access to the same metastore database. + </li> + </ul> + </conbody> + </concept> + + <concept id="prereqs_java"> + + <title>Java Dependencies</title> + <prolog> + <metadata> + <data name="Category" value="Java"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">Java</indexterm> + <indexterm audience="Cloudera">impala-dependencies.jar</indexterm> + Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop + components: + </p> + + <ul> + <li> + The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause issues, typically + resulting in a failure at <cmdname>impalad</cmdname> startup. In particular, the JamVM used by default on + certain levels of Ubuntu systems can cause <cmdname>impalad</cmdname> to fail to start. + <!-- To do: + Could say something here about JDK 6 vs. JDK 7 in CDH 5. Since we didn't specify the JDK version before, + don't know the impact from the user perspective so not calling it out at the moment. + --> + </li> + + <li> + Internally, the <cmdname>impalad</cmdname> daemon relies on the <codeph>JAVA_HOME</codeph> environment + variable to locate the system Java libraries. Make sure the <cmdname>impalad</cmdname> service is not run + from an environment with an incorrect setting for this variable. + </li> + + <li> + All Java dependencies are packaged in the <codeph>impala-dependencies.jar</codeph> file, which is located + at <codeph>/usr/lib/impala/lib/</codeph>. These map to everything that is built under + <codeph>fe/target/dependency</codeph>. 
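+ For example, to inspect what the jar bundles (a sketch; the exact contents vary by release):
+<codeblock>$ jar tf /usr/lib/impala/lib/impala-dependencies.jar</codeblock>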
+ </li> + </ul> + </conbody> + </concept> + + <concept id="prereqs_network"> + + <title>Networking Configuration Requirements</title> + <prolog> + <metadata> + <data name="Category" value="Network"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">network configuration</indexterm> + As part of ensuring best performance, Impala attempts to complete tasks on local data, as opposed to using + network connections to work with remote data. To support this goal, Impala matches + the <b>hostname</b> provided to each Impala daemon with the <b>IP address</b> of each DataNode by + resolving the hostname flag to an IP address. For Impala to work with local data, use a single IP interface + for the DataNode and the Impala daemon on each machine. Ensure that the Impala daemon's hostname flag + resolves to the IP address of the DataNode. For single-homed machines, this is usually automatic, but for + multi-homed machines, ensure that the Impala daemon's hostname resolves to the correct interface. Impala + tries to detect the correct hostname at start-up, and prints the derived hostname at the start of the log + in a message of the form: + </p> + +<codeblock>Using hostname: impala-daemon-1.example.com</codeblock> + + <p> + In the majority of cases, this automatic detection works correctly. If you need to explicitly set the + hostname, do so by setting the <codeph>--hostname</codeph> flag. + </p> + </conbody> + </concept> + + <concept id="prereqs_hardware"> + + <title>Hardware Requirements</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">hardware requirements</indexterm> + <indexterm audience="Cloudera">capacity</indexterm> + <indexterm audience="Cloudera">RAM</indexterm> + <indexterm audience="Cloudera">memory</indexterm> + <indexterm audience="Cloudera">CPU</indexterm> + <indexterm audience="Cloudera">processor</indexterm> + <indexterm audience="Cloudera">Intel</indexterm> + <indexterm audience="Cloudera">AMD</indexterm> + During join operations, portions of data from each joined table are loaded into memory. Data sets can be + very large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate + completing. + </p> + + <p> + While requirements vary according to data set size, the following is generally recommended: + </p> + + <ul> + <li rev="2.0.0"> + CPU - Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in newer processors. + <note> + This required level of processor is the same as in Impala version 1.x. The Impala 2.0 and 2.1 releases + had a stricter requirement for the SSE4.1 instruction set, which has now been relaxed. + </note> +<!-- + For best performance use: + <ul> + <li> + Intel - Nehalem (released 2008) or later processors. + </li> + + <li> + AMD - Bulldozer (released 2011) or later processors. + </li> + </ul> +--> + </li> + + <li rev="1.2"> + Memory - 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query + processing on a particular node exceed the amount of memory available to Impala on that node, the query + writes temporary work data to disk, which can lead to long query times. Note that because the work is + parallelized, and intermediate results for aggregate queries are typically smaller than the original + data, Impala can query and join tables that are much larger than the memory available on an individual + node. + </li> + + <li> + Storage - DataNodes with 12 or more disks each. 
I/O speeds are often the limiting factor for disk + performance with Impala. Ensure that you have sufficient disk space to store the data Impala will be + querying. + </li> + </ul> + </conbody> + </concept> + + <concept id="prereqs_account"> + + <title>User Account Requirements</title> + <prolog> + <metadata> + <data name="Category" value="Users"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">impala user</indexterm> + <indexterm audience="Cloudera">impala group</indexterm> + <indexterm audience="Cloudera">root user</indexterm> + Impala creates and uses a user and group named <codeph>impala</codeph>. Do not delete this account or group + and do not modify the account's or group's permissions and rights. Ensure no existing systems obstruct the + functioning of these accounts and groups. For example, if you have scripts that delete user accounts not in + a white-list, add these accounts to the list of permitted accounts. + </p> + +<!-- Taking out because no longer applicable in CDH 5.5 and up. --> + <p id="impala_hdfs_group" rev="1.2" audience="Cloudera"> + For the resource management feature to work (in combination with CDH 5 and the YARN and Llama components), + the <codeph>impala</codeph> user must be a member of the <codeph>hdfs</codeph> group. This setup is + performed automatically during a new install, but not when upgrading from earlier Impala releases to Impala + 1.2. If you are upgrading a node to CDH 5 that already had Impala 1.1 or 1.0 installed, manually add the + <codeph>impala</codeph> user to the <codeph>hdfs</codeph> group. + </p> + + <p> + For correct file deletion during <codeph>DROP TABLE</codeph> operations, Impala must be able to move files + to the HDFS trashcan. You might need to create an HDFS directory <filepath>/user/impala</filepath>, + writeable by the <codeph>impala</codeph> user, so that the trashcan can be created. Otherwise, data files + might remain behind after a <codeph>DROP TABLE</codeph> statement. + </p> + + <p> + Impala should not run as root. Best Impala performance is achieved using direct reads, but root is not + permitted to use direct reads. Therefore, running Impala as root negatively affects performance. + </p> + + <p> + By default, any user can connect to Impala and access all the associated databases and tables. You can + enable authorization and authentication based on the Linux OS user who connects to the Impala server, and + the associated groups for that user. See <xref href="impala_security.xml#security"/> for details. These + security features do not change the underlying file permission requirements; the <codeph>impala</codeph> + user still needs to be able to access the data files.
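+ </p>
+ <p>
+ A sketch of creating the HDFS directory mentioned above so that the trashcan works for
+ <codeph>DROP TABLE</codeph> (run as a user with HDFS superuser privileges; adjust the path and
+ ownership to match your deployment):
+ </p>
+<codeblock>$ sudo -u hdfs hdfs dfs -mkdir -p /user/impala
+$ sudo -u hdfs hdfs dfs -chown impala /user/impala</codeblock>
+ <p>
+ Verify ownership with <codeph>hdfs dfs -ls /user</codeph> afterward.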
+ </p> + </conbody> + </concept> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_processes.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_processes.xml b/docs/topics/impala_processes.xml new file mode 100644 index 0000000..05f2274 --- /dev/null +++ b/docs/topics/impala_processes.xml @@ -0,0 +1,134 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="processes"> + + <title>Starting Impala</title> + <prolog> + <metadata> + <data name="Category" value="Starting and Stopping"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Operators"/> + </metadata> + </prolog> + + <conbody> + + <p rev="1.2"> + <indexterm audience="Cloudera">state store</indexterm> + <indexterm audience="Cloudera">starting services</indexterm> + <indexterm audience="Cloudera">services</indexterm> + To activate Impala if it is installed but not yet started: + </p> + + <ol> + <li> + Set any necessary configuration options for the Impala services. See + <xref href="impala_config_options.xml#config_options"/> for details. + </li> + + <li> + Start one instance of the Impala statestore. The statestore helps Impala to distribute work efficiently, + and to continue running in the event of availability problems for other Impala nodes. If the statestore + becomes unavailable, Impala continues to function. + </li> + + <li> + Start one instance of the Impala catalog service. + </li> + + <li> + Start the main Impala service on one or more DataNodes, ideally on all DataNodes to maximize local + processing and avoid network traffic due to remote reads. + </li> + </ol> + + <p> + Once Impala is running, you can conduct interactive experiments using the instructions in + <xref href="impala_tutorial.xml#tutorial"/> and try <xref href="impala_impala_shell.xml#impala_shell"/>. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="starting_via_cm"> + + <title>Starting Impala through Cloudera Manager</title> + + <conbody> + + <p> + If you installed Impala with Cloudera Manager, use Cloudera Manager to start and stop services. The + Cloudera Manager GUI is a convenient way to check that all services are running, to set configuration + options using form fields in a browser, and to spot potential issues such as low disk space before they + become serious. Cloudera Manager automatically starts all the Impala-related services as a group, in the + correct order. See + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_start_stop_service.html" scope="external" format="html">the + Cloudera Manager Documentation</xref> for details. + </p> + + <note> + <p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/> + </note> + </conbody> + </concept> + + <concept id="starting_via_cmdline"> + + <title>Starting Impala from the Command Line</title> + + <conbody> + + <p> + To start the Impala state store and Impala from the command line or a script, you can either use the + <cmdname>service</cmdname> command or you can start the daemons directly through the + <cmdname>impalad</cmdname>, <codeph>statestored</codeph>, and <cmdname>catalogd</cmdname> executables. + </p> + + <p> + Start the Impala statestore and then start <codeph>impalad</codeph> instances. 
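+ If you start the daemons directly instead of through the service scripts, a minimal sketch is as
+ follows (assuming the binaries are on the PATH; real deployments pass additional startup flags, and
+ each daemon runs in the foreground unless you arrange otherwise):
+ </p>
+<codeblock># Start the statestore first, then the catalog service, then the Impala daemons.
+$ sudo -u impala statestored -log_dir=/var/log/impala
+$ sudo -u impala catalogd -log_dir=/var/log/impala
+$ sudo -u impala impalad -log_dir=/var/log/impala</codeblock>
+ <p>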
You can modify the values + the service initialization scripts use when starting the statestore and Impala by editing + <codeph>/etc/default/impala</codeph>. + </p> + + <p> + Start the statestore service using a command similar to the following: + </p> + + <p> +<codeblock>$ sudo service impala-state-store start</codeblock> + </p> + + <p rev="1.2"> + Start the catalog service using a command similar to the following: + </p> + +<codeblock rev="1.2">$ sudo service impala-catalog start</codeblock> + + <p> + Start the Impala service on each DataNode using a command similar to the following: + </p> + + <p> +<codeblock>$ sudo service impala-server start</codeblock> + </p> + + <note> + <p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/> + </note> + + <p> + If any of the services fail to start, review: + <ul> + <li> + <xref href="impala_logging.xml#logs_debug"/> + </li> + + <li> + <xref href="impala_troubleshooting.xml#troubleshooting"/> + </li> + </ul> + </p> + </conbody> + </concept> +</concept>