http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_insert.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_insert.xml b/docs/topics/impala_insert.xml new file mode 100644 index 0000000..02ad8c6 --- /dev/null +++ b/docs/topics/impala_insert.xml @@ -0,0 +1,689 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="insert"> + + <title>INSERT Statement</title> + <titlealts audience="PDF"><navtitle>INSERT</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="ETL"/> + <data name="Category" value="Ingest"/> + <data name="Category" value="DML"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Tables"/> + <data name="Category" value="S3"/> + <!-- <data name="Category" value="Kudu"/> --> + <!-- This is such an important statement, think if there are more applicable categories. --> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">INSERT statement</indexterm> + Impala supports inserting into tables and partitions that you create with the Impala <codeph>CREATE + TABLE</codeph> statement, or pre-defined tables and partitions created through Hive. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>[<varname>with_clause</varname>] +INSERT { INTO | OVERWRITE } [TABLE] <varname>table_name</varname> + [(<varname>column_list</varname>)] + [ PARTITION (<varname>partition_clause</varname>)] +{ + [<varname>hint_clause</varname>] <varname>select_statement</varname> + | VALUES (<varname>value</varname> [, <varname>value</varname> ...]) [, (<varname>value</varname> [, <varname>value</varname> ...]) ...] +} + +partition_clause ::= <varname>col_name</varname> [= <varname>constant</varname>] [, <varname>col_name</varname> [= <varname>constant</varname>] ...] + +hint_clause ::= [SHUFFLE] | [NOSHUFFLE] (Note: the square brackets are part of the syntax.) +</codeblock> + + <p> + <b>Appending or replacing (INTO and OVERWRITE clauses):</b> + </p> + + <p> + The <codeph>INSERT INTO</codeph> syntax appends data to a table. The existing data files are left as-is, and + the inserted data is put into one or more new data files. + </p> + + <p> + The <codeph>INSERT OVERWRITE</codeph> syntax replaces the data in a table. +<!-- What happens with INSERT OVERWRITE if the target is a single partition or multiple partitions? --> +<!-- If that gets too detailed, cover later under "Partitioning Considerations". --> + Currently, the overwritten data files are deleted immediately; they do not go through the HDFS trash + mechanism. + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p rev="2.3.0"> + The <codeph>INSERT</codeph> statement currently does not support writing data files + containing complex types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>). + To prepare Parquet data for such tables, you generate the data files outside Impala and then + use <codeph>LOAD DATA</codeph> or <codeph>CREATE EXTERNAL TABLE</codeph> to associate those + data files with the table. Currently, such tables must use the Parquet file format. + See <xref href="impala_complex_types.xml#complex_types"/> for details about working with complex types. 
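+      For example, a possible workflow (the table, column, and path names here are purely
+      illustrative) is to create the table in Impala and then attach Parquet files prepared by an
+      external tool:
+<codeblock>-- Hypothetical table and HDFS paths, shown only to illustrate the workflow.
+CREATE TABLE contacts (id BIGINT, phones ARRAY&lt;STRING&gt;) STORED AS PARQUET;
+-- Move already-prepared Parquet files under the table directory...
+LOAD DATA INPATH '/user/etl/contacts_staging' INTO TABLE contacts;
+-- ...or point an external table at a directory that already holds the prepared files.
+CREATE EXTERNAL TABLE contacts_ext LIKE contacts LOCATION '/user/etl/contacts_parquet';</codeblock>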
+ </p> + + <p rev="kudu"> + <b>Ignoring duplicate partition keys for Kudu tables (IGNORE clause)</b> + </p> + + <p rev="kudu"> + Normally, an <codeph>INSERT</codeph> operation into a Kudu table fails if + it would result in duplicate partition key columns for any rows. + Specify <codeph>INSERT IGNORE <varname>rest_of_statement</varname></codeph> to + make the <codeph>INSERT</codeph> continue in this case. The rows that would + have duplicate partition key columns are not inserted. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Impala currently supports: + </p> + + <ul> + <li> + Copy data from another table using <codeph>SELECT</codeph> query. In Impala 1.2.1 and higher, you can + combine <codeph>CREATE TABLE</codeph> and <codeph>INSERT</codeph> operations into a single step with the + <codeph>CREATE TABLE AS SELECT</codeph> syntax, which bypasses the actual <codeph>INSERT</codeph> keyword. + </li> + + <li> + An optional <xref href="impala_with.xml#with"><codeph>WITH</codeph> clause</xref> before the + <codeph>INSERT</codeph> keyword, to define a subquery referenced in the <codeph>SELECT</codeph> portion. + </li> + + <li> + Create one or more new rows using constant expressions through <codeph>VALUES</codeph> clause. (The + <codeph>VALUES</codeph> clause was added in Impala 1.0.1.) + </li> + + <li rev="1.1"> + <p> + By default, the first column of each newly inserted row goes into the first column of the table, the + second column into the second column, and so on. + </p> + <p> + You can also specify the columns to be inserted, an arbitrarily ordered subset of the columns in the + destination table, by specifying a column list immediately after the name of the destination table. This + feature lets you adjust the inserted columns to match the layout of a <codeph>SELECT</codeph> statement, + rather than the other way around. (This feature was added in Impala 1.1.) + </p> + <p> + The number of columns mentioned in the column list (known as the <q>column permutation</q>) must match + the number of columns in the <codeph>SELECT</codeph> list or the <codeph>VALUES</codeph> tuples. The + order of columns in the column permutation can be different than in the underlying table, and the columns + of each input row are reordered to match. If the number of columns in the column permutation is less than + in the destination table, all unmentioned columns are set to <codeph>NULL</codeph>. + </p> + </li> + + <li> + <p> + For a partitioned table, the optional <codeph>PARTITION</codeph> clause identifies which partition or + partitions the new values go into. If a partition key column is given a constant value such as + <codeph>PARTITION (year=2012)</codeph> or <codeph>PARTITION (year=2012, month=2)</codeph>, all the + inserted rows use those same values for those partition key columns and you omit any corresponding + columns in the source table from the <codeph>SELECT</codeph> list. This form is known as <q>static + partitioning</q>. + </p> + <p> + If a partition key column is mentioned but not assigned a value, such as in <codeph>PARTITION (year, + region)</codeph> (both columns unassigned) or <codeph>PARTITION(year, region='CA')</codeph> + (<codeph>year</codeph> column unassigned), the unassigned columns are filled in with the final columns of + the <codeph>SELECT</codeph> list. 
In this case, the number of columns in the <codeph>SELECT</codeph> list + must equal the number of columns in the column permutation plus the number of partition key columns not + assigned a constant value. This form is known as <q>dynamic partitioning</q>. + </p> + <p> + See <xref href="impala_partitioning.xml#partition_static_dynamic"/> for examples and performance + characteristics of static and dynamic partitioned inserts. + </p> + </li> + + <li rev="1.2.2"> + An optional hint clause immediately before the <codeph>SELECT</codeph> keyword, to fine-tune the behavior + when doing an <codeph>INSERT ... SELECT</codeph> operation into partitioned Parquet tables. The hint + keywords are <codeph>[SHUFFLE]</codeph> and <codeph>[NOSHUFFLE]</codeph>, including the square brackets. + Inserting into partitioned Parquet tables can be a resource-intensive operation because it potentially + involves many files being written to HDFS simultaneously, and separate + <ph rev="parquet_block_size">large</ph> memory buffers being allocated to buffer the data for each + partition. For usage details, see <xref href="impala_parquet.xml#parquet_etl"/>. + </li> + </ul> + + <note> + <ul> + <li> + Insert commands that partition or add files result in changes to Hive metadata. Because Impala uses Hive + metadata, such changes may necessitate a metadata refresh. For more information, see the + <xref href="impala_refresh.xml#refresh">REFRESH</xref> function. + </li> + + <li> + Currently, Impala can only insert data into tables that use the text and Parquet formats. For other file + formats, insert the data using Hive and use Impala to query it. + </li> + + <li> + As an alternative to the <codeph>INSERT</codeph> statement, if you have existing data files elsewhere in + HDFS, the <codeph>LOAD DATA</codeph> statement can move those files into a table. This statement works + with tables of any file format. + </li> + </ul> + </note> + + <p conref="../shared/impala_common.xml#common/dml_blurb"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + When you insert the results of an expression, particularly of a built-in function call, into a small numeric + column such as <codeph>INT</codeph>, <codeph>SMALLINT</codeph>, <codeph>TINYINT</codeph>, or + <codeph>FLOAT</codeph>, you might need to use a <codeph>CAST()</codeph> expression to coerce values into the + appropriate type. Impala does not automatically convert from a larger type to a smaller one. For example, to + insert cosine values into a <codeph>FLOAT</codeph> column, write <codeph>CAST(COS(angle) AS FLOAT)</codeph> + in the <codeph>INSERT</codeph> statement to make the conversion explicit. + </p> + + <p conref="../shared/impala_common.xml#common/file_format_blurb"/> + + <p rev="DOCS-1523"> + Because Impala can read certain file formats that it cannot write, + the <codeph>INSERT</codeph> statement does not work for all kinds of + Impala tables. See <xref href="impala_file_formats.xml#file_formats"/> + for details about what file formats are supported by the + <codeph>INSERT</codeph> statement. 
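+      As a small illustration of the explicit casts described earlier under the usage notes
+      (the table name and values are hypothetical):
+<codeblock>-- Hypothetical table; COS() returns DOUBLE, and Impala does not automatically narrow it to FLOAT.
+CREATE TABLE angles (angle INT, cosine FLOAT);
+INSERT INTO angles VALUES (45, CAST(COS(45) AS FLOAT));</codeblock>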
+ </p> + + <p conref="../shared/impala_common.xml#common/insert_parquet_blocksize"/> + + <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/> + + <note conref="../shared/impala_common.xml#common/compute_stats_next"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example sets up new tables with the same definition as the <codeph>TAB1</codeph> table from the + <xref href="impala_tutorial.xml#tutorial">Tutorial</xref> section, using different file + formats, and demonstrates inserting data into the tables created with the <codeph>STORED AS TEXTFILE</codeph> + and <codeph>STORED AS PARQUET</codeph> clauses: + </p> + +<codeblock>CREATE DATABASE IF NOT EXISTS file_formats; +USE file_formats; + +DROP TABLE IF EXISTS text_table; +CREATE TABLE text_table +( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) +STORED AS TEXTFILE; + +DROP TABLE IF EXISTS parquet_table; +CREATE TABLE parquet_table +( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) +STORED AS PARQUET;</codeblock> + + <p> + With the <codeph>INSERT INTO TABLE</codeph> syntax, each new set of inserted rows is appended to any existing + data in the table. This is how you would record small amounts of data that arrive continuously, or ingest new + batches of data alongside the existing data. For example, after running 2 <codeph>INSERT INTO TABLE</codeph> + statements with 5 rows each, the table contains 10 rows total: + </p> + +<codeblock>[localhost:21000] > insert into table text_table select * from default.tab1; +Inserted 5 rows in 0.41s + +[localhost:21000] > insert into table text_table select * from default.tab1; +Inserted 5 rows in 0.46s + +[localhost:21000] > select count(*) from text_table; ++----------+ +| count(*) | ++----------+ +| 10 | ++----------+ +Returned 1 row(s) in 0.26s</codeblock> + + <p> + With the <codeph>INSERT OVERWRITE TABLE</codeph> syntax, each new set of inserted rows replaces any existing + data in the table. This is how you load data to query in a data warehousing scenario where you analyze just + the data for a particular day, quarter, and so on, discarding the previous data each time. You might keep the + entire set of data in one raw table, and transfer and transform certain rows into a more compact and + efficient form to perform intensive analysis on that subset. + </p> + + <p> + For example, here we insert 5 rows into a table using the <codeph>INSERT INTO</codeph> clause, then replace + the data by inserting 3 rows with the <codeph>INSERT OVERWRITE</codeph> clause. Afterward, the table only + contains the 3 rows from the final <codeph>INSERT</codeph> statement. + </p> + +<codeblock>[localhost:21000] > insert into table parquet_table select * from default.tab1; +Inserted 5 rows in 0.35s + +[localhost:21000] > insert overwrite table parquet_table select * from default.tab1 limit 3; +Inserted 3 rows in 0.43s +[localhost:21000] > select count(*) from parquet_table; ++----------+ +| count(*) | ++----------+ +| 3 | ++----------+ +Returned 1 row(s) in 0.43s</codeblock> + + <p> + The <codeph><xref href="impala_insert.xml#values">VALUES</xref></codeph> clause lets you insert one or more + rows by specifying constant values for all the columns. The number, types, and order of the expressions must + match the table definition. + </p> + + <note id="insert_values_warning"> + The <codeph>INSERT ... 
VALUES</codeph> technique is not suitable for loading large quantities of data into + HDFS-based tables, because the insert operations cannot be parallelized, and each one produces a separate + data file. Use it for setting up small dimension tables or tiny amounts of data for experimenting with SQL + syntax, or with HBase tables. Do not use it for large ETL jobs or benchmark tests for load operations. Do not + run scripts with thousands of <codeph>INSERT ... VALUES</codeph> statements that insert a single row each + time. If you do run <codeph>INSERT ... VALUES</codeph> operations to load data into a staging table as one + stage in an ETL pipeline, include multiple row values if possible within each <codeph>VALUES</codeph> clause, + and use a separate database to make cleanup easier if the operation does produce many tiny files. + </note> + + <p> + The following example shows how to insert one row or multiple rows, with expressions of different types, + using literal values, expressions, and function return values: + </p> + +<codeblock>create table val_test_1 (c1 int, c2 float, c3 string, c4 boolean, c5 timestamp); +insert into val_test_1 values (100, 99.9/10, 'abc', true, now()); +create table val_test_2 (id int, token string); +insert overwrite val_test_2 values (1, 'a'), (2, 'b'), (-1,'xyzzy');</codeblock> + + <p> + These examples show the type of <q>not implemented</q> error that you see when attempting to insert data into + a table with a file format that Impala currently does not write to: + </p> + +<codeblock>DROP TABLE IF EXISTS sequence_table; +CREATE TABLE sequence_table +( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) +STORED AS SEQUENCEFILE; + +DROP TABLE IF EXISTS rc_table; +CREATE TABLE rc_table +( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) +STORED AS RCFILE; + +[localhost:21000] > insert into table rc_table select * from default.tab1; +Remote error +Backend 0:RC_FILE not implemented. + +[localhost:21000] > insert into table sequence_table select * from default.tab1; +Remote error +Backend 0:SEQUENCE_FILE not implemented. </codeblock> + + <p> + Inserting data into partitioned tables requires slightly different syntax that divides the partitioning + columns from the others: + </p> + +<codeblock>create table t1 (i int) <b>partitioned by (x int, y string)</b>; +-- Select an INT column from another table. +-- All inserted rows will have the same x and y values, as specified in the INSERT statement. +-- This technique of specifying all the partition key values is known as static partitioning. +insert into t1 <b>partition(x=10, y='a')</b> select c1 from some_other_table; +-- Select two INT columns from another table. +-- All inserted rows will have the same y value, as specified in the INSERT statement. +-- Values from c2 go into t1.x. +-- Any partitioning columns whose value is not specified are filled in +-- from the columns specified last in the SELECT list. +-- This technique of omitting some partition key values is known as dynamic partitioning. +insert into t1 <b>partition(x, y='b')</b> select c1, c2 from some_other_table; +-- Select an INT and a STRING column from another table. +-- All inserted rows will have the same x value, as specified in the INSERT statement. +-- Values from c3 go into t1.y. 
+insert into t1 <b>partition(x=20, y)</b> select c1, c3 from some_other_table;</codeblock>
+
+ <p rev="1.1">
+ The following examples show how you can copy the data in all the columns from one table to another, copy the
+ data from only some columns, or specify the columns in the select list in a different order than they
+ actually appear in the table:
+ </p>
+
+<codeblock>-- Start with 2 identical tables.
+create table t1 (c1 int, c2 int);
+create table t2 like t1;
+
+-- If there is no () part after the destination table name,
+-- all columns must be specified, either as * or by name.
+insert into t2 select * from t1;
+insert into t2 select c1, c2 from t1;
+
+-- With the () notation following the destination table name,
+-- you can omit columns (all values for that column are NULL
+-- in the destination table), and/or reorder the values
+-- selected from the source table. This is the "column permutation" feature.
+insert into t2 (c1) select c1 from t1;
+insert into t2 (c2, c1) select c1, c2 from t1;
+
+-- The column names can be entirely different in the source and destination tables.
+-- You can copy any columns, not just the corresponding ones, from the source table.
+-- But the number and type of selected columns must match the columns mentioned in the () part.
+alter table t2 replace columns (x int, y int);
+insert into t2 (y) select c1 from t1;
+
+-- For partitioned tables, all the partitioning columns must be mentioned in the () column list
+-- or a PARTITION clause; these columns cannot be defaulted to NULL.
+create table pt1 (x int, y int) partitioned by (z int);
+-- The values from c1 are copied into the column x in the new table,
+-- all in the same partition based on a constant value for z.
+-- The values of y in the new table are all NULL.
+insert into pt1 (x) partition (z=5) select c1 from t1;
+-- Again we omit the values for column y so they are all NULL.
+-- The inserted x values can go into different partitions, based on
+-- the different values inserted into the partitioning column z.
+insert into pt1 (x,z) select x, y from t2;
+</codeblock>
+
+ <p>
+ <codeph>SELECT *</codeph> for a partitioned table requires that all partition key columns in the source table
+ be declared as the last columns in the <codeph>CREATE TABLE</codeph> statement. You still include a
+ <codeph>PARTITION</codeph> clause listing all the partition key columns. These partition columns are
+ automatically mapped to the last columns from the <codeph>SELECT *</codeph> list.
+ </p>
+
+<codeblock>create table source (x int, y int, year int, month int, day int);
+create table destination (x int, y int) partitioned by (year int, month int, day int);
+...load some data into the unpartitioned source table...
+-- Insert a single partition of data.
+-- The SELECT * means you cannot specify partition (year=2014, month, day).
+insert overwrite destination partition (year, month, day) select * from source where year=2014;
+-- Insert the data for all year/month/day combinations.
+insert overwrite destination partition (year, month, day) select * from source;
+
+-- If one of the partition columns is omitted from the source table,
+-- then you can specify a specific value for that column in the PARTITION clause.
+-- Here the source table holds only data from 2014, and so does not include a year column.
+create table source_2014 (x int, y int, month int, day int);
+...load some data into the unpartitioned source_2014 table...
+insert overwrite destination partition (year=2014, month, day) select * from source_2014; +</codeblock> + + <p conref="../shared/impala_common.xml#common/insert_sort_blurb"/> + + <p> + <b>Concurrency considerations:</b> Each <codeph>INSERT</codeph> operation creates new data files with unique + names, so you can run multiple <codeph>INSERT INTO</codeph> statements simultaneously without filename + conflicts. +<!-- +If data is inserted into a table by a statement issued to a different +<cmdname>impalad</cmdname> node, +issue a <codeph>REFRESH <varname>table_name</varname></codeph> +statement to make the node you are connected to aware of this new data. +--> + While data is being inserted into an Impala table, the data is staged temporarily in a subdirectory inside + the data directory; during this period, you cannot issue queries against that table in Hive. If an + <codeph>INSERT</codeph> operation fails, the temporary data file and the subdirectory could be left behind in + the data directory. If so, remove the relevant subdirectory and any data files it contains manually, by + issuing an <codeph>hdfs dfs -rm -r</codeph> command, specifying the full path of the work subdirectory, whose + name ends in <codeph>_dir</codeph>. + </p> + </conbody> + + <concept id="values"> + + <title>VALUES Clause</title> + + <conbody> + + <p> + The <codeph>VALUES</codeph> clause is a general-purpose way to specify the columns of one or more rows, + typically within an <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph> statement. + </p> + + <note conref="../shared/impala_common.xml#common/insert_values_warning"> + <p/> + </note> + + <p> + The following examples illustrate: + </p> + + <ul> + <li> + How to insert a single row using a <codeph>VALUES</codeph> clause. + </li> + + <li> + How to insert multiple rows using a <codeph>VALUES</codeph> clause. + </li> + + <li> + How the row or rows from a <codeph>VALUES</codeph> clause can be appended to a table through + <codeph>INSERT INTO</codeph>, or replace the contents of the table through <codeph>INSERT + OVERWRITE</codeph>. + </li> + + <li> + How the entries in a <codeph>VALUES</codeph> clause can be literals, function results, or any other kind + of expression. See <xref href="impala_literals.xml#literals"/> for the notation to use for literal + values, especially <xref href="impala_literals.xml#string_literals"/> for quoting and escaping + conventions for strings. See <xref href="impala_operators.xml#operators"/> and + <xref href="impala_functions.xml#builtins"/> for other things you can include in expressions with the + <codeph>VALUES</codeph> clause. + </li> + </ul> + +<codeblock>[localhost:21000] > describe val_example; +Query: describe val_example +Query finished, fetching results ... 
++-------+---------+---------+ +| name | type | comment | ++-------+---------+---------+ +| id | int | | +| col_1 | boolean | | +| col_2 | double | | ++-------+---------+---------+ + +[localhost:21000] > insert into val_example values (1,true,100.0); +Inserted 1 rows in 0.30s +[localhost:21000] > select * from val_example; ++----+-------+-------+ +| id | col_1 | col_2 | ++----+-------+-------+ +| 1 | true | 100 | ++----+-------+-------+ + +[localhost:21000] > insert overwrite val_example values (10,false,pow(2,5)), (50,true,10/3); +Inserted 2 rows in 0.16s +[localhost:21000] > select * from val_example; ++----+-------+-------------------+ +| id | col_1 | col_2 | ++----+-------+-------------------+ +| 10 | false | 32 | +| 50 | true | 3.333333333333333 | ++----+-------+-------------------+</codeblock> + + <p> + When used in an <codeph>INSERT</codeph> statement, the Impala <codeph>VALUES</codeph> clause can specify + some or all of the columns in the destination table, and the columns can be specified in a different order + than they actually appear in the table. To specify a different set or order of columns than in the table, + use the syntax: + </p> + +<codeblock>INSERT INTO <varname>destination</varname> + (<varname>col_x</varname>, <varname>col_y</varname>, <varname>col_z</varname>) + VALUES + (<varname>val_x</varname>, <varname>val_y</varname>, <varname>val_z</varname>); +</codeblock> + + <p> + Any columns in the table that are not listed in the <codeph>INSERT</codeph> statement are set to + <codeph>NULL</codeph>. + </p> + +<!-- + <p> + does not support specifying a subset of the + columns in the table or specifying the columns in a different order. Use a + <codeph>VALUES</codeph> clause with all the column values in the same order as + the table definition, using <codeph>NULL</codeph> values for any columns you + want to omit from the <codeph>INSERT</codeph> operation. + </p> +--> + + <p> + To use a <codeph>VALUES</codeph> clause like a table in other statements, wrap it in parentheses and use + <codeph>AS</codeph> clauses to specify aliases for the entire object and any columns you need to refer to: + </p> + +<codeblock>[localhost:21000] > select * from (values(4,5,6),(7,8,9)) as t; ++---+---+---+ +| 4 | 5 | 6 | ++---+---+---+ +| 4 | 5 | 6 | +| 7 | 8 | 9 | ++---+---+---+ +[localhost:21000] > select * from (values(1 as c1, true as c2, 'abc' as c3),(100,false,'xyz')) as t; ++-----+-------+-----+ +| c1 | c2 | c3 | ++-----+-------+-----+ +| 1 | true | abc | +| 100 | false | xyz | ++-----+-------+-----+</codeblock> + + <p> + For example, you might use a tiny table constructed like this from constant literals or function return + values as part of a longer statement involving joins or <codeph>UNION ALL</codeph>. + </p> + + <p conref="../shared/impala_common.xml#common/hdfs_blurb"/> + + <p> + Impala physically writes all inserted files under the ownership of its default user, typically + <codeph>impala</codeph>. Therefore, this user must have HDFS write permission in the corresponding table + directory. + </p> + + <p> + The permission requirement is independent of the authorization performed by the Sentry framework. (If the + connected user is not authorized to insert into a table, Sentry blocks that operation immediately, + regardless of the privileges available to the <codeph>impala</codeph> user.) Files created by Impala are + not owned by and do not inherit permissions from the connected user. 
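+        For example, the following commands (the warehouse path shown is only an illustration;
+        substitute the actual location of your table directory) check and, if necessary, open up
+        the HDFS permissions for the <codeph>impala</codeph> user:
+<codeblock># Example paths only; use the actual HDFS location of your table.
+hdfs dfs -ls /user/hive/warehouse/file_formats.db
+hdfs dfs -chown -R impala /user/hive/warehouse/file_formats.db/text_table
+hdfs dfs -chmod -R 775 /user/hive/warehouse/file_formats.db/text_table</codeblock>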
+ </p> + + <p> + The number of data files produced by an <codeph>INSERT</codeph> statement depends on the size of the + cluster, the number of data blocks that are processed, the partition key columns in a partitioned table, + and the mechanism Impala uses for dividing the work in parallel. Do not assume that an + <codeph>INSERT</codeph> statement will produce some particular number of output files. In case of + performance issues with data written by Impala, check that the output files do not suffer from issues such + as many tiny files or many tiny partitions. (In the Hadoop context, even files or partitions of a few tens + of megabytes are considered <q>tiny</q>.) + </p> + + <p conref="../shared/impala_common.xml#common/insert_hidden_work_directory"/> + + <p conref="../shared/impala_common.xml#common/hbase_blurb"/> + + <p> + You can use the <codeph>INSERT</codeph> statement with HBase tables as follows: + </p> + + <ul> + <li> + <p> + You can insert a single row or a small set of rows into an HBase table with the <codeph>INSERT ... + VALUES</codeph> syntax. This is a good use case for HBase tables with Impala, because HBase tables are + not subject to the same kind of fragmentation from many small insert operations as HDFS tables are. + </p> + </li> + + <li> + <p> + You can insert any number of rows at once into an HBase table using the <codeph>INSERT ... + SELECT</codeph> syntax. + </p> + </li> + + <li> + <p> + If more than one inserted row has the same value for the HBase key column, only the last inserted row + with that value is visible to Impala queries. You can take advantage of this fact with <codeph>INSERT + ... VALUES</codeph> statements to effectively update rows one at a time, by inserting new rows with the + same key values as existing rows. Be aware that after an <codeph>INSERT ... SELECT</codeph> operation + copying from an HDFS table, the HBase table might contain fewer rows than were inserted, if the key + column in the source table contained duplicate values. + </p> + </li> + + <li> + <p> + You cannot <codeph>INSERT OVERWRITE</codeph> into an HBase table. New rows are always appended. + </p> + </li> + + <li> + <p> + When you create an Impala or Hive table that maps to an HBase table, the column order you specify with + the <codeph>INSERT</codeph> statement might be different than the order you declare with the + <codeph>CREATE TABLE</codeph> statement. Behind the scenes, HBase arranges the columns based on how + they are divided into column families. This might cause a mismatch during insert operations, especially + if you use the syntax <codeph>INSERT INTO <varname>hbase_table</varname> SELECT * FROM + <varname>hdfs_table</varname></codeph>. Before inserting data, verify the column order by issuing a + <codeph>DESCRIBE</codeph> statement for the table, and adjust the order of the select list in the + <codeph>INSERT</codeph> statement. + </p> + </li> + </ul> + + <p> + See <xref href="impala_hbase.xml#impala_hbase"/> for more details about using Impala with HBase. 
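+        The row-at-a-time update technique described above might look like the following sketch
+        (the table and column names are hypothetical, with the <codeph>id</codeph> column mapped
+        to the HBase row key):
+<codeblock>-- Hypothetical Impala table mapped to an HBase table; id is the row key column.
+INSERT INTO hbase_users VALUES ('alice', 'Alice', 'alice@example.com');
+-- Inserting the same key value again effectively replaces the earlier row.
+INSERT INTO hbase_users VALUES ('alice', 'Alice', 'alice@example.org');
+-- Only the most recently inserted row for that key is visible to queries.
+SELECT * FROM hbase_users WHERE id = 'alice';</codeblock>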
+ </p> + + <p conref="../shared/impala_common.xml#common/s3_blurb"/> + <p conref="../shared/impala_common.xml#common/s3_dml"/> + <p conref="../shared/impala_common.xml#common/s3_dml_performance"/> + <p>See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala.</p> + + <p conref="../shared/impala_common.xml#common/security_blurb"/> + <p conref="../shared/impala_common.xml#common/redaction_yes"/> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_yes"/> + + <p conref="../shared/impala_common.xml#common/permissions_blurb"/> + <p rev="CDH-19187"> + The user ID that the <cmdname>impalad</cmdname> daemon runs under, + typically the <codeph>impala</codeph> user, must have read + permission for the files in the source directory of an <codeph>INSERT ... SELECT</codeph> + operation, and write permission for all affected directories in the destination table. + (An <codeph>INSERT</codeph> operation could write files to multiple different HDFS directories + if the destination table is partitioned.) + This user must also have write permission to create a temporary work directory + in the top-level HDFS directory of the destination table. + An <codeph>INSERT OVERWRITE</codeph> operation does not require write permission on + the original data files in the table, only on the table directories themselves. + </p> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p conref="../shared/impala_common.xml#common/char_varchar_cast_from_string"/> + + <p conref="../shared/impala_common.xml#common/related_options"/> + + <p rev="1.3.1" conref="../shared/impala_common.xml#common/insert_inherit_permissions"/> + </conbody> + </concept> + +<!-- Values clause --> +</concept> +<!-- INSERT statement -->
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_install.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_install.xml b/docs/topics/impala_install.xml new file mode 100644 index 0000000..ab68314 --- /dev/null +++ b/docs/topics/impala_install.xml @@ -0,0 +1,106 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="install"> + + <title><ph audience="standalone">Installing Impala</ph><ph audience="integrated">Impala Installation</ph></title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Installing"/> + <data name="Category" value="Administrators"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">installation</indexterm> + <indexterm audience="Cloudera">pseudo-distributed cluster</indexterm> + <indexterm audience="Cloudera">cluster</indexterm> + <indexterm audience="Cloudera">DataNodes</indexterm> + <indexterm audience="Cloudera">NameNode</indexterm> + <indexterm audience="Cloudera">Cloudera Manager</indexterm> + <indexterm audience="Cloudera">impalad</indexterm> + <indexterm audience="Cloudera">impala-shell</indexterm> + <indexterm audience="Cloudera">statestored</indexterm> + Impala is an open-source add-on to the Cloudera Enterprise Core that returns rapid responses to + queries. + </p> + + <note> + <p> + Under CDH 5, Impala is included as part of the CDH installation and no separate steps are needed. + <ph audience="standalone">Therefore, the instruction steps in this section apply to CDH 4 only.</ph> + </p> + </note> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="install_details"> + + <title>What is Included in an Impala Installation</title> + + <conbody> + + <p> + Impala is made up of a set of components that can be installed on multiple nodes throughout your cluster. + The key installation step for performance is to install the <cmdname>impalad</cmdname> daemon (which does + most of the query processing work) on <i>all</i> DataNodes in the cluster. + </p> + + <p> + The Impala package installs these binaries: + </p> + + <ul> + <li> + <p> + <cmdname>impalad</cmdname> - The Impala daemon. Plans and executes queries against HDFS, HBase, <ph rev="2.2.0">and Amazon S3 data</ph>. + <xref href="impala_processes.xml#processes">Run one impalad process</xref> on each node in the cluster + that has a DataNode. + </p> + </li> + + <li> + <p> + <cmdname>statestored</cmdname> - Name service that tracks location and status of all + <codeph>impalad</codeph> instances in the cluster. <xref href="impala_processes.xml#processes">Run one + instance of this daemon</xref> on a node in your cluster. Most production deployments run this daemon + on the namenode. + </p> + </li> + + <li rev="1.2"> + <p> + <cmdname>catalogd</cmdname> - Metadata coordination service that broadcasts changes from Impala DDL and + DML statements to all affected Impala nodes, so that new tables, newly loaded data, and so on are + immediately visible to queries submitted through any Impala node. +<!-- Consider removing this when 1.2 gets far in the past. --> + (Prior to Impala 1.2, you had to run the <codeph>REFRESH</codeph> or <codeph>INVALIDATE + METADATA</codeph> statement on each node to synchronize changed metadata. 
Now those statements are only + required if you perform the DDL or DML through an external mechanism such as Hive <ph rev="2.2.0">or by uploading + data to the Amazon S3 filesystem</ph>.) + <xref href="impala_processes.xml#processes">Run one instance of this daemon</xref> on a node in your cluster, + preferably on the same host as the <codeph>statestored</codeph> daemon. + </p> + </li> + + <li> + <p> + <cmdname>impala-shell</cmdname> - <xref href="impala_impala_shell.xml#impala_shell">Command-line + interface</xref> for issuing queries to the Impala daemon. You install this on one or more hosts + anywhere on your network, not necessarily DataNodes or even within the same cluster as Impala. It can + connect remotely to any instance of the Impala daemon. + </p> + </li> + </ul> + + <p> + Before doing the installation, ensure that you have all necessary prerequisites. See + <xref href="impala_prereqs.xml#prereqs"/> for details. + </p> + </conbody> + </concept> + +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_int.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_int.xml b/docs/topics/impala_int.xml new file mode 100644 index 0000000..aeead5b --- /dev/null +++ b/docs/topics/impala_int.xml @@ -0,0 +1,95 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="int"> + + <title>INT Data Type</title> + <titlealts audience="PDF"><navtitle>INT</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Data Types"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Schemas"/> + </metadata> + </prolog> + + <conbody> + + <p> + A 4-byte integer data type used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p> + In the column definition of a <codeph>CREATE TABLE</codeph> statement: + </p> + +<codeblock><varname>column_name</varname> INT</codeblock> + + <p> + <b>Range:</b> -2147483648 .. 2147483647. There is no <codeph>UNSIGNED</codeph> subtype. + </p> + + <p> + <b>Conversions:</b> Impala automatically converts to a larger integer type (<codeph>BIGINT</codeph>) or a + floating-point type (<codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph>) automatically. Use + <codeph>CAST()</codeph> to convert to <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, + <codeph>STRING</codeph>, or <codeph>TIMESTAMP</codeph>. + <ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/> + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + The data type <codeph>INTEGER</codeph> is an alias for <codeph>INT</codeph>. + </p> + + <p> + For a convenient and automated way to check the bounds of the <codeph>INT</codeph> type, call the functions + <codeph>MIN_INT()</codeph> and <codeph>MAX_INT()</codeph>. + </p> + + <p> + If an integer value is too large to be represented as a <codeph>INT</codeph>, use a <codeph>BIGINT</codeph> + instead. 
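+      For example, a quick way to confirm those bounds from <cmdname>impala-shell</cmdname>:
+<codeblock>SELECT MIN_INT(), MAX_INT();
+-- Returns -2147483648 and 2147483647.</codeblock>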
+ </p> + + <p conref="../shared/impala_common.xml#common/null_bad_numeric_cast"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + +<codeblock>CREATE TABLE t1 (x INT); +SELECT CAST(1000 AS INT); +</codeblock> + + <p conref="../shared/impala_common.xml#common/partitioning_good"/> + + <p conref="../shared/impala_common.xml#common/hbase_ok"/> + + <p conref="../shared/impala_common.xml#common/parquet_blurb"/> + + <p conref="../shared/impala_common.xml#common/text_bulky"/> + +<!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> --> + + <p conref="../shared/impala_common.xml#common/internals_4_bytes"/> + + <p conref="../shared/impala_common.xml#common/added_forever"/> + + <p conref="../shared/impala_common.xml#common/column_stats_constant"/> + +<!-- <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> --> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_literals.xml#numeric_literals"/>, <xref href="impala_tinyint.xml#tinyint"/>, + <xref href="impala_smallint.xml#smallint"/>, <xref href="impala_int.xml#int"/>, + <xref href="impala_bigint.xml#bigint"/>, <xref href="impala_decimal.xml#decimal"/>, + <xref href="impala_math_functions.xml#math_functions"/> + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_intro.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_intro.xml b/docs/topics/impala_intro.xml new file mode 100644 index 0000000..c599bc5 --- /dev/null +++ b/docs/topics/impala_intro.xml @@ -0,0 +1,81 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="intro"> + + <title id="impala"><ph audience="standalone">Introducing Apache Impala (incubating)</ph><ph audience="integrated">Apache Impala (incubating) Overview</ph></title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Getting Started"/> + <data name="Category" value="Concepts"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> + + <conbody id="intro_body"> + + <p> + Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS, + HBase, <ph rev="2.2.0">or the Amazon Simple Storage Service (S3)</ph>. + In addition to using the same unified storage platform, + Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface + (Impala query UI in Hue) as Apache Hive. This + provides a familiar and unified platform for real-time or batch-oriented queries. + </p> + + <p> + Impala is an addition to tools available for querying big data. Impala does not replace the batch + processing frameworks built on MapReduce such as Hive. Hive and other frameworks built on MapReduce are + best suited for long running batch jobs, such as those involving batch processing of Extract, Transform, + and Load (ETL) type jobs. + </p> + + <note> + Impala was accepted into the Apache incubator on December 2, 2015. + In places where the documentation formerly referred to <q>Cloudera Impala</q>, + now the official name is <q>Apache Impala (incubating)</q>. 
+ </note> + + </conbody> + + <concept id="benefits"> + + <title>Impala Benefits</title> + + <conbody> + + <p conref="../shared/impala_common.xml#common/impala_benefits"/> + + </conbody> + </concept> + + <concept id="impala_cdh"> + + <title>How Impala Works with CDH</title> + <prolog> + <metadata> + <data name="Category" value="Concepts"/> + </metadata> + </prolog> + + <conbody> + + <p conref="../shared/impala_common.xml#common/impala_overview_diagram"/> + + <p conref="../shared/impala_common.xml#common/component_list"/> + + <p conref="../shared/impala_common.xml#common/query_overview"/> + </conbody> + </concept> + + <concept id="features"> + + <title>Primary Impala Features</title> + + <conbody> + + <p conref="../shared/impala_common.xml#common/feature_list"/> + </conbody> + </concept> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_invalidate_metadata.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_invalidate_metadata.xml b/docs/topics/impala_invalidate_metadata.xml new file mode 100644 index 0000000..c41a996 --- /dev/null +++ b/docs/topics/impala_invalidate_metadata.xml @@ -0,0 +1,233 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.1" id="invalidate_metadata"> + + <title>INVALIDATE METADATA Statement</title> + <titlealts audience="PDF"><navtitle>INVALIDATE METADATA</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="DDL"/> + <data name="Category" value="ETL"/> + <data name="Category" value="Ingest"/> + <data name="Category" value="Metastore"/> + <data name="Category" value="Schemas"/> + <data name="Category" value="Tables"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">INVALIDATE METADATA statement</indexterm> + Marks the metadata for one or all tables as stale. Required after a table is created through the Hive shell, + before the table is available for Impala queries. The next time the current Impala node performs a query + against a table whose metadata is invalidated, Impala reloads the associated metadata before the query + proceeds. This is a relatively expensive operation compared to the incremental metadata update done by the + <codeph>REFRESH</codeph> statement, so in the common scenario of adding new data files to an existing table, + prefer <codeph>REFRESH</codeph> rather than <codeph>INVALIDATE METADATA</codeph>. If you are not familiar + with the way Impala uses metadata and how it shares the same metastore database as Hive, see + <xref href="impala_hadoop.xml#intro_metastore"/> for background information. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>INVALIDATE METADATA [[<varname>db_name</varname>.]<varname>table_name</varname>]</codeblock> + + <p> + By default, the cached metadata for all tables is flushed. If you specify a table name, only the metadata for + that one table is flushed. Even for a single table, <codeph>INVALIDATE METADATA</codeph> is more expensive + than <codeph>REFRESH</codeph>, so prefer <codeph>REFRESH</codeph> in the common case where you add new data + files for an existing table. 
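+      For example (the database and table names here are only placeholders):
+<codeblock>-- Flush cached metadata for all tables; relatively expensive.
+INVALIDATE METADATA;
+-- Flush cached metadata for one table (placeholder name), for example after creating it through Hive.
+INVALIDATE METADATA sales_db.web_logs;
+-- Cheaper operation when new data files were added to a table Impala already knows about.
+REFRESH sales_db.web_logs;</codeblock>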
+ </p> + + <p conref="../shared/impala_common.xml#common/internals_blurb"/> + + <p> + To accurately respond to queries, Impala must have current metadata about those databases and tables that + clients query directly. Therefore, if some other entity modifies information used by Impala in the metastore + that Impala and Hive share, the information cached by Impala must be updated. However, this does not mean + that all metadata updates require an Impala update. + </p> + + <note> + <p conref="../shared/impala_common.xml#common/catalog_server_124"/> + <p rev="1.2"> + In Impala 1.2 and higher, a dedicated daemon (<cmdname>catalogd</cmdname>) broadcasts DDL changes made + through Impala to all Impala nodes. Formerly, after you created a database or table while connected to one + Impala node, you needed to issue an <codeph>INVALIDATE METADATA</codeph> statement on another Impala node + before accessing the new database or table from the other node. Now, newly created or altered objects are + picked up automatically by all Impala nodes. You must still use the <codeph>INVALIDATE METADATA</codeph> + technique after creating or altering objects through Hive. See + <xref href="impala_components.xml#intro_catalogd"/> for more information on the catalog service. + </p> + <p> + The <codeph>INVALIDATE METADATA</codeph> statement is new in Impala 1.1 and higher, and takes over some of + the use cases of the Impala 1.0 <codeph>REFRESH</codeph> statement. Because <codeph>REFRESH</codeph> now + requires a table name parameter, to flush the metadata for all tables at once, use the <codeph>INVALIDATE + METADATA</codeph> statement. + </p> + <p conref="../shared/impala_common.xml#common/invalidate_then_refresh"/> + </note> + + <p conref="../shared/impala_common.xml#common/refresh_vs_invalidate"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + A metadata update for an <codeph>impalad</codeph> instance <b>is</b> required if: + </p> + + <ul> + <li> + A metadata change occurs. + </li> + + <li> + <b>and</b> the change is made from another <codeph>impalad</codeph> instance in your cluster, or through + Hive. + </li> + + <li> + <b>and</b> the change is made to a metastore database to which clients such as the Impala shell or ODBC directly + connect. + </li> + </ul> + + <p> + A metadata update for an Impala node is <b>not</b> required when you issue queries from the same Impala node + where you ran <codeph>ALTER TABLE</codeph>, <codeph>INSERT</codeph>, or other table-modifying statement. + </p> + + <p> + Database and table metadata is typically modified by: + </p> + + <ul> + <li> + Hive - via <codeph>ALTER</codeph>, <codeph>CREATE</codeph>, <codeph>DROP</codeph> or + <codeph>INSERT</codeph> operations. + </li> + + <li> + Impalad - via <codeph>CREATE TABLE</codeph>, <codeph>ALTER TABLE</codeph>, and <codeph>INSERT</codeph> + operations. + </li> + </ul> + + <p> + <codeph>INVALIDATE METADATA</codeph> causes the metadata for that table to be marked as stale, and reloaded + the next time the table is referenced. For a huge table, that process could take a noticeable amount of time; + thus you might prefer to use <codeph>REFRESH</codeph> where practical, to avoid an unpredictable delay later, + for example if the next reference to the table is during a benchmark test. 
+ </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example shows how you might use the <codeph>INVALIDATE METADATA</codeph> statement after + creating new tables (such as SequenceFile or HBase tables) through the Hive shell. Before the + <codeph>INVALIDATE METADATA</codeph> statement was issued, Impala would give a <q>table not found</q> error + if you tried to refer to those table names. The <codeph>DESCRIBE</codeph> statements cause the latest + metadata to be immediately loaded for the tables, avoiding a delay the next time those tables are queried. + </p> + +<codeblock>[impalad-host:21000] > invalidate metadata; +[impalad-host:21000] > describe t1; +... +[impalad-host:21000] > describe t2; +... </codeblock> + + <p> + For more examples of using <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> with a + combination of Impala and Hive operations, see <xref href="impala_tutorial.xml#tutorial_impala_hive"/>. + </p> + + <p> + If you need to ensure that the metadata is up-to-date when you start an <cmdname>impala-shell</cmdname> + session, run <cmdname>impala-shell</cmdname> with the <codeph>-r</codeph> or + <codeph>--refresh_after_connect</codeph> command-line option. Because this operation adds a delay to the next + query against each table, potentially expensive for large tables with many partitions, try to avoid using + this option for day-to-day operations in a production environment. + </p> + + <p conref="../shared/impala_common.xml#common/permissions_blurb"/> + <p rev="CDH-19187"> + The user ID that the <cmdname>impalad</cmdname> daemon runs under, + typically the <codeph>impala</codeph> user, must have execute + permissions for all the relevant directories holding table data. + (A table could have data spread across multiple directories, + or in unexpected paths, if it uses partitioning or + specifies a <codeph>LOCATION</codeph> attribute for + individual partitions or the entire table.) + Issues with permissions might not cause an immediate error for this statement, + but subsequent statements such as <codeph>SELECT</codeph> + or <codeph>SHOW TABLE STATS</codeph> could fail. + </p> + + <p conref="../shared/impala_common.xml#common/hdfs_blurb"/> + + <p> + By default, the <codeph>INVALIDATE METADATA</codeph> command checks HDFS permissions of the underlying data + files and directories, caching this information so that a statement can be cancelled immediately if for + example the <codeph>impala</codeph> user does not have permission to write to the data directory for the + table. (This checking does not apply if you have set the <cmdname>catalogd</cmdname> configuration option + <codeph>--load_catalog_in_background=false</codeph>.) Impala reports any lack of write permissions as an + <codeph>INFO</codeph> message in the log file, in case that represents an oversight. If you change HDFS + permissions to make data readable or writeable by the Impala user, issue another <codeph>INVALIDATE + METADATA</codeph> to make Impala aware of the change. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p rev="1.2.4"> + This example illustrates creating a new database and new table in Hive, then doing an <codeph>INVALIDATE + METADATA</codeph> statement in Impala using the fully qualified table name, after which both the new table + and the new database are visible to Impala. 
The ability to specify <codeph>INVALIDATE METADATA + <varname>table_name</varname></codeph> for a table created in Hive is a new capability in Impala 1.2.4. In + earlier releases, that statement would have returned an error indicating an unknown table, requiring you to + do <codeph>INVALIDATE METADATA</codeph> with no table name, a more expensive operation that reloaded metadata + for all tables and databases. + </p> + +<codeblock rev="1.2.4">$ hive +hive> create database new_db_from_hive; +OK +Time taken: 4.118 seconds +hive> create table new_db_from_hive.new_table_from_hive (x int); +OK +Time taken: 0.618 seconds +hive> quit; +$ impala-shell +[localhost:21000] > show databases like 'new*'; +[localhost:21000] > refresh new_db_from_hive.new_table_from_hive; +ERROR: AnalysisException: Database does not exist: new_db_from_hive +[localhost:21000] > invalidate metadata new_db_from_hive.new_table_from_hive; +[localhost:21000] > show databases like 'new*'; ++--------------------+ +| name | ++--------------------+ +| new_db_from_hive | ++--------------------+ +[localhost:21000] > show tables in new_db_from_hive; ++---------------------+ +| name | ++---------------------+ +| new_table_from_hive | ++---------------------+</codeblock> + + <p conref="../shared/impala_common.xml#common/s3_blurb"/> + <p conref="../shared/impala_common.xml#common/s3_metadata"/> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_hadoop.xml#intro_metastore"/>, + <xref href="impala_refresh.xml#refresh"/> + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_isilon.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_isilon.xml b/docs/topics/impala_isilon.xml new file mode 100644 index 0000000..f631268 --- /dev/null +++ b/docs/topics/impala_isilon.xml @@ -0,0 +1,114 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="5.4.3" id="impala_isilon"> + + <title>Using Impala with Isilon Storage</title> + <titlealts audience="PDF"><navtitle>Isilon Storage</navtitle></titlealts> + + <prolog> + <metadata> + <data name="Category" value="CDH"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Isilon"/> + <data name="Category" value="Disk Storage"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">Isilon</indexterm> + You can use Impala to query data files that reside on EMC Isilon storage devices, rather than in HDFS. + This capability allows convenient query access to a storage system where you might already be + managing large volumes of data. The combination of the Impala query engine and Isilon storage is + certified on CDH 5.4.4 or higher. + </p> + + <p conref="../shared/impala_common.xml#common/isilon_block_size_caveat"/> + + <p> + The typical use case for Impala and Isilon together is to use Isilon for the + default filesystem, replacing HDFS entirely. In this configuration, + when you create a database, table, or partition, the data always resides on + Isilon storage and you do not need to specify any special <codeph>LOCATION</codeph> + attribute. 
If you do specify a <codeph>LOCATION</codeph> attribute, its value refers + to a path within the Isilon filesystem. + For example: + </p> +<codeblock>-- If the default filesystem is Isilon, all Impala data resides there +-- and all Impala databases and tables are located there. +CREATE TABLE t1 (x INT, s STRING); + +-- You can specify LOCATION for database, table, or partition, +-- using values from the Isilon filesystem. +CREATE DATABASE d1 LOCATION '/some/path/on/isilon/server/d1.db'; +CREATE TABLE d1.t2 (a TINYINT, b BOOLEAN); +</codeblock> + + <p> + Impala can write to, delete, and rename data files and database, table, + and partition directories on Isilon storage. Therefore, Impala statements such + as + <codeph>CREATE TABLE</codeph>, <codeph>DROP TABLE</codeph>, + <codeph>CREATE DATABASE</codeph>, <codeph>DROP DATABASE</codeph>, + <codeph>ALTER TABLE</codeph>, + and + <codeph>INSERT</codeph> work the same with Isilon storage as with HDFS. + </p> + + <p> + When the Impala spill-to-disk feature is activated by a query that approaches + the memory limit, Impala writes all the temporary data to a local (not Isilon) + storage device. Because the I/O bandwidth for the temporary data depends on + the number of local disks, and clusters using Isilon storage might not have + as many local disks attached, pay special attention on Isilon-enabled clusters + to any queries that use the spill-to-disk feature. Where practical, tune the + queries or allocate extra memory for Impala to avoid spilling. + Although you can specify an Isilon storage device as the destination for + the temporary data for the spill-to-disk feature, that configuration is + not recommended due to the need to transfer the data both ways using remote I/O. + </p> + + <p> + When tuning Impala queries on HDFS, you typically try to avoid any remote reads. + When the data resides on Isilon storage, all the I/O consists of remote reads. + Do not be alarmed when you see non-zero numbers for remote read measurements + in query profile output. The benefit of the Impala and Isilon integration is + primarily convenience of not having to move or copy large volumes of data to HDFS, + rather than raw query performance. You can increase the performance of Impala + I/O for Isilon systems by increasing the value for the + <codeph>num_remote_hdfs_io_threads</codeph> configuration parameter, + in the Cloudera Manager user interface for clusters using Cloudera Manager, + or through the <codeph>--num_remote_hdfs_io_threads</codeph> startup option + for the <cmdname>impalad</cmdname> daemon on clusters not using Cloudera Manager. + </p> + + <p> +<!-- + For information about tasks performed on + Isilon OneFS, see the information hub for Cloudera on the EMC Community Network: + <xref href="https://community.emc.com/docs/DOC-39522" format="html" scope="external">https://community.emc.com/docs/DOC-39522</xref>. +--> + <!-- This is a little bit of a circular loop when this topic is conrefed into the main Isilon page, + consider if there's a way to conditionalize it out in that case. --> + For information about managing Isilon storage devices through Cloudera Manager, see + <xref audience="integrated" href="cm_mc_isilon_service.xml"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_isilon_service.html" scope="external" format="html"/>. 
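+      For clusters not managed by Cloudera Manager, one possible way to apply the remote I/O
+      thread setting mentioned above is through the <cmdname>impalad</cmdname> startup flags.
+      This is only a sketch; the thread count and the location of the environment file depend
+      on your installation:
+<codeblock># Example only; adjust the thread count and file location for your installation
+# (for package-based installs, the environment file is often /etc/default/impala).
+IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} --num_remote_hdfs_io_threads=64"</codeblock>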
+ </p> + + <!-- <p outputclass="toc inpage"/> --> + </conbody> +<concept id="isilon_cm_configs"> +<title>Required Configurations</title> +<conbody> +<p>Specify the following configurations in Cloudera Manager on the <menucascade><uicontrol>Clusters</uicontrol><uicontrol><varname>Isilon Service</varname></uicontrol><uicontrol>Configuration</uicontrol></menucascade> tab:<ul id="ul_vpx_bw5_vv"> +<li>In <uicontrol>HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml</uicontrol> <codeph>hdfs-site.xml</codeph> and the <uicontrol>Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml</uicontrol> properties for the Isilon service, set the value of the <codeph>dfs.client.file-block-storage-locations.timeout.millis</codeph> property to <codeph>10000</codeph>.</li> +<li>In the Isilon <uicontrol>Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml</uicontrol> property for the Isilon service, set the value of the <codeph>hadoop.security.token.service.use_ip</codeph> property to <codeph>FALSE</codeph>. </li> +<li>If you see errors that reference the <codeph>.Trash</codeph> directory, make sure that the <uicontrol>Use Trash</uicontrol> property is selected.</li> +</ul></p> + +</conbody> +</concept> + +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_jdbc.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_jdbc.xml b/docs/topics/impala_jdbc.xml new file mode 100644 index 0000000..d667170 --- /dev/null +++ b/docs/topics/impala_jdbc.xml @@ -0,0 +1,387 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="impala_jdbc"> + + <title id="jdbc">Configuring Impala to Work with JDBC</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="JDBC"/> + <data name="Category" value="Java"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Configuring"/> + <data name="Category" value="Starting and Stopping"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">JDBC</indexterm> + Impala supports the standard JDBC interface, allowing access from commercial Business Intelligence tools and + custom software written in Java or other programming languages. The JDBC driver allows you to access Impala + from a Java program that you write, or a Business Intelligence or similar tool that uses JDBC to communicate + with various database products. + </p> + + <p> + Setting up a JDBC connection to Impala involves the following steps: + </p> + + <ul> + <li> + Verifying the communication port where the Impala daemons in your cluster are listening for incoming JDBC + requests. + </li> + + <li> + Installing the JDBC driver on every system that runs the JDBC-enabled application. + </li> + + <li> + Specifying a connection string for the JDBC application to access one of the servers running the + <cmdname>impalad</cmdname> daemon, with the appropriate security settings. + </li> + </ul> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="jdbc_port"> + + <title>Configuring the JDBC Port</title> + + <conbody> + + <p> + The default port used by JDBC 2.0 and later (as well as ODBC 2.x) is 21050. Impala server accepts JDBC + connections through this same port 21050 by default. 
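+        From a client system, you can do a quick sanity check that an <cmdname>impalad</cmdname> host is
+        listening on this port before you configure the driver. The following is a minimal sketch in Java,
+        assuming a host named <codeph>myhost.example.com</codeph> and the default port:
+<codeblock>import java.net.InetSocketAddress;
+import java.net.Socket;
+
+public class CheckImpalaPort {
+  public static void main(String[] args) throws Exception {
+    // Attempt a plain TCP connection to the port that impalad uses for JDBC (HiveServer2 protocol).
+    // A timeout or "connection refused" usually points to a firewall or a non-default --hs2_port setting.
+    try (Socket socket = new Socket()) {
+      socket.connect(new InetSocketAddress("myhost.example.com", 21050), 5000);
+      System.out.println("Port 21050 is reachable.");
+    }
+  }
+}</codeblock>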
Make sure this port is available for communication + with other hosts on your network, for example, that it is not blocked by firewall software. If your JDBC + client software connects to a different port, specify that alternative port number with the + <codeph>--hs2_port</codeph> option when starting <codeph>impalad</codeph>. See + <xref href="impala_processes.xml#processes"/> for details about Impala startup options. See + <xref href="impala_ports.xml#ports"/> for information about all ports used for communication between Impala + and clients or between Impala components. + </p> + </conbody> + </concept> + + <concept id="jdbc_driver_choice"> + + <title>Choosing the JDBC Driver</title> + <prolog> + <metadata> + <data name="Category" value="Planning"/> + </metadata> + </prolog> + + <conbody> + + <p> + In Impala 2.0 and later, you have the choice between the Cloudera JDBC Connector and the Hive 0.13 JDBC driver. + <ph rev="upstream">Cloudera</ph> recommends using the Cloudera JDBC Connector where practical. + </p> + + <p> + If you are already using JDBC applications with an earlier Impala release, you must update your JDBC driver + to one of these choices, because the Hive 0.12 driver that was formerly the only choice is not compatible + with Impala 2.0 and later. + </p> + + <p> + Both the Cloudera JDBC 2.5 Connector and the Hive JDBC driver provide a substantial speed increase for JDBC + applications with Impala 2.0 and higher, for queries that return large result sets. + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/jdbc_odbc_complex_types"/> + <p conref="../shared/impala_common.xml#common/jdbc_odbc_complex_types_views"/> + + </conbody> + </concept> + + <concept id="jdbc_setup"> + + <title>Enabling Impala JDBC Support on Client Systems</title> + <prolog> + <metadata> + <data name="Category" value="Installing"/> + </metadata> + </prolog> + + <conbody> + + <section id="install_jdbc_connector"> + <title>Using the Cloudera JDBC Connector (recommended)</title> + + <p> + You download and install the Cloudera JDBC 2.5 connector on any Linux, Windows, or Mac system where you + intend to run JDBC-enabled applications. From the + <xref href="http://go.cloudera.com/odbc-driver-hive-impala.html" scope="external" format="html">Cloudera + Connectors download page</xref>, you choose the appropriate protocol (JDBC or ODBC) and target product + (Impala or Hive). The ease of downloading and installing on non-CDH systems makes this connector a + convenient choice for organizations with heterogeneous environments. + </p> + + </section> + + <section id="install_hive_driver"> + <title>Using the Hive JDBC Driver</title> + <p> + You install the Hive JDBC driver (<codeph>hive-jdbc</codeph> package) through the Linux package manager, on + hosts within the CDH cluster. The driver consists of several Java JAR files. The same driver can be used by Impala and Hive. + </p> + + <p> + To get the JAR files, install the Hive JDBC driver on each CDH-enabled host in the cluster that will run + JDBC applications. Follow the instructions for + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hive_jdbc_install.html" scope="external" format="html">CDH 5</xref>. + </p> + + <note> + The latest JDBC driver, corresponding to Hive 0.13, provides substantial performance improvements for + Impala queries that return large result sets. Impala 2.0 and later are compatible with the Hive 0.13 + driver. 
If you already have an older JDBC driver installed, and are running Impala 2.0 or higher, consider + upgrading to the latest Hive JDBC driver for best performance with JDBC applications. + </note> + + <p> + If you are using JDBC-enabled applications on hosts outside the CDH cluster, you cannot use the CDH install + procedure on the non-CDH hosts. Install the JDBC driver on at least one CDH host using the preceding + procedure. Then download the JAR files to each client machine that will use JDBC with Impala: + </p> + + <codeblock>commons-logging-X.X.X.jar + hadoop-common.jar + hive-common-X.XX.X-cdhX.X.X.jar + hive-jdbc-X.XX.X-cdhX.X.X.jar + hive-metastore-X.XX.X-cdhX.X.X.jar + hive-service-X.XX.X-cdhX.X.X.jar + httpclient-X.X.X.jar + httpcore-X.X.X.jar + libfb303-X.X.X.jar + libthrift-X.X.X.jar + log4j-X.X.XX.jar + slf4j-api-X.X.X.jar + slf4j-logXjXX-X.X.X.jar + </codeblock> + + <p> + <b>To enable JDBC support for Impala on the system where you run the JDBC application:</b> + </p> + + <ol> + <li> + Download the JAR files listed above to each client machine. + <!-- + Download the + <xref href="https://downloads.cloudera.com/impala-jdbc/impala-jdbc-0.5-2.zip" scope="external" format="zip">Impala + JDBC zip file</xref> to the client machine that you will use to connect to Impala servers. + --> + <note> + For Maven users, see + <xref href="https://github.com/onefoursix/Cloudera-Impala-JDBC-Example" scope="external" format="html">this + sample github page</xref> for an example of the dependencies you could add to a <codeph>pom</codeph> + file instead of downloading the individual JARs. + </note> + </li> + + <li> + Store the JAR files in a location of your choosing, ideally a directory already referenced in your + <codeph>CLASSPATH</codeph> setting. For example: + <ul> + <li> + On Linux, you might use a location such as + <codeph>/</codeph><codeph>opt</codeph><codeph>/jars/</codeph>. + </li> + + <li> + On Windows, you might use a subdirectory underneath <filepath>C:\Program Files</filepath>. + </li> + </ul> + </li> + + <li> + To successfully load the Impala JDBC driver, client programs must be able to locate the associated JAR + files. This often means setting the <codeph>CLASSPATH</codeph> for the client process to include the + JARs. Consult the documentation for your JDBC client for more details on how to install new JDBC drivers, + but some examples of how to set <codeph>CLASSPATH</codeph> variables include: + <ul> + <li> + On Linux, if you extracted the JARs to <codeph>/opt/jars/</codeph>, you might issue the following + command to prepend the JAR files path to an existing classpath: + <codeblock>export CLASSPATH=/opt/jars/*.jar:$CLASSPATH</codeblock> + </li> + + <li> + On Windows, use the <b>System Properties</b> control panel item to modify the <b>Environment + Variables</b> for your system. Modify the environment variables to include the path to which you + extracted the files. + <note> + If the existing <codeph>CLASSPATH</codeph> on your client machine refers to some older version of + the Hive JARs, ensure that the new JARs are the first ones listed. Either put the new JAR files + earlier in the listings, or delete the other references to Hive JAR files. + </note> + </li> + </ul> + </li> + </ol> + </section> + + </conbody> + </concept> + + <concept id="jdbc_connect"> + + <title>Establishing JDBC Connections</title> + + <conbody> + + <p> + The JDBC driver class depends on which driver you select. 
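+        Whichever driver you choose, the basic connection sequence from a Java program is the same: make the
+        driver JARs available on the <codeph>CLASSPATH</codeph>, open a connection with the appropriate
+        connection string, and issue queries through the standard <codeph>java.sql</codeph> API. The following
+        is a minimal sketch, assuming the Cloudera JDBC Connector class names and the
+        <codeph>jdbc:impala://</codeph> connection string format described below, a host named
+        <codeph>myhost.example.com</codeph>, the default port 21050, and a cluster that does not require
+        authentication:
+<codeblock>import java.sql.Connection;
+import java.sql.DriverManager;
+import java.sql.ResultSet;
+import java.sql.Statement;
+
+public class ImpalaJdbcExample {
+  public static void main(String[] args) throws Exception {
+    // Explicit driver loading is not needed with JDBC 4.0 drivers, but is harmless
+    // and required for the older JDBC 3 driver class.
+    Class.forName("com.cloudera.impala.jdbc41.Driver");
+    // Adjust the host, port, schema, and any connection properties for your cluster.
+    String url = "jdbc:impala://myhost.example.com:21050/default";
+    try (Connection conn = DriverManager.getConnection(url);
+         Statement stmt = conn.createStatement();
+         ResultSet rs = stmt.executeQuery("SELECT 1")) {
+      while (rs.next()) {
+        System.out.println(rs.getInt(1));
+      }
+    }
+  }
+}</codeblock>
+        For the Hive JDBC driver, substitute the <codeph>org.apache.hive.jdbc.HiveDriver</codeph> class and a
+        <codeph>jdbc:hive2://</codeph> connection string, as described in the sections that follow.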
+ </p> + + <note conref="../shared/impala_common.xml#common/proxy_jdbc_caveat"/> + + <section id="class_jdbc_connector"> + + <title>Using the Cloudera JDBC Connector (recommended)</title> + + <p> + Depending on the level of the JDBC API your application is targeting, you can use + the following fully-qualified class names (FQCNs): + </p> + + <ul> + <li><codeph>com.cloudera.impala.jdbc41.Driver</codeph></li> + <li><codeph>com.cloudera.impala.jdbc41.DataSource</codeph></li> + </ul> + + <ul> + <li><codeph>com.cloudera.impala.jdbc4.Driver</codeph></li> + <li><codeph>com.cloudera.impala.jdbc4.DataSource</codeph></li> + </ul> + + <ul> + <li><codeph>com.cloudera.impala.jdbc3.Driver</codeph></li> + <li><codeph>com.cloudera.impala.jdbc3.DataSource</codeph></li> + </ul> + + <p> + The connection string has the following format: + </p> + +<codeblock>jdbc:impala://<varname>Host</varname>:<varname>Port</varname>[/<varname>Schema</varname>];<varname>Property1</varname>=<varname>Value</varname>;<varname>Property2</varname>=<varname>Value</varname>;...</codeblock> + + <p> + The <codeph>port</codeph> value is typically 21050 for Impala. + </p> + + <p> + For full details about the classes and the connection string (especially the property values available + for the connection string), download the appropriate driver documentation for your platform from + <xref href="http://www.cloudera.com/content/cloudera/en/downloads/connectors/impala/jdbc/impala-jdbc-v2-5-5.html" scope="external" format="html">the Impala JDBC Connector download page</xref>. + </p> + + </section> + + <section id="class_hive_driver"> + <title>Using the Hive JDBC Driver</title> + + <p> + For example, with the Hive JDBC driver, the class name is <codeph>org.apache.hive.jdbc.HiveDriver</codeph>. + Once you have configured Impala to work with JDBC, you can establish connections between the two. + To do so for a cluster that does not use + Kerberos authentication, use a connection string of the form + <codeph>jdbc:hive2://<varname>host</varname>:<varname>port</varname>/;auth=noSasl</codeph>. +<!-- + Include the <codeph>auth=noSasl</codeph> argument + only when connecting to a non-Kerberos cluster; if Kerberos is enabled, omit the <codeph>auth</codeph> argument. +--> + For example, you might use: + </p> + +<codeblock>jdbc:hive2://myhost.example.com:21050/;auth=noSasl</codeblock> + + <p> + To connect to an instance of Impala that requires Kerberos authentication, use a connection string of the + form + <codeph>jdbc:hive2://<varname>host</varname>:<varname>port</varname>/;principal=<varname>principal_name</varname></codeph>. + The principal must be the same user principal you used when starting Impala. For example, you might use: + </p> + +<codeblock>jdbc:hive2://myhost.example.com:21050/;principal=impala/[email protected]</codeblock> + + <p> + To connect to an instance of Impala that requires LDAP authentication, use a connection string of the form + <codeph>jdbc:hive2://<varname>host</varname>:<varname>port</varname>/<varname>db_name</varname>;user=<varname>ldap_userid</varname>;password=<varname>ldap_password</varname></codeph>. 
For example, you might use:
+      </p>
+
+<codeblock>jdbc:hive2://myhost.example.com:21050/test_db;user=fred;password=xyz123</codeblock>
+
+      <note>
+        <p conref="../shared/impala_common.xml#common/hive_jdbc_ssl_kerberos_caveat"/>
+      </note>
+
+      </section>
+
+    </conbody>
+  </concept>
+
+  <concept rev="2.3.0" id="jdbc_odbc_notes">
+    <title>Notes about JDBC and ODBC Interaction with Impala SQL Features</title>
+    <conbody>
+      <p>
+        Most Impala SQL features work equivalently through the <cmdname>impala-shell</cmdname> interpreter
+        or the JDBC or ODBC APIs. The following are some exceptions to keep in mind when switching between
+        the interactive shell and applications using the APIs:
+      </p>
+      <ul>
+        <li>
+          <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+          <ul>
+            <li>
+              <p>
+                Queries involving the complex types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>)
+                require notation that might not be available in all levels of JDBC and ODBC drivers.
+                If you have trouble querying such a table because of the driver level or an
+                inability to edit the queries used by the application, you can create a view that exposes
+                a <q>flattened</q> version of the complex columns and point the application at the view.
+                See <xref href="impala_complex_types.xml#complex_types"/> for details.
+              </p>
+            </li>
+            <li>
+              <p>
+                The complex types available in <keyword keyref="impala23_full"/> and higher are supported by the
+                JDBC <codeph>getColumns()</codeph> API.
+                Both <codeph>MAP</codeph> and <codeph>ARRAY</codeph> are reported as the JDBC SQL Type <codeph>ARRAY</codeph>,
+                because this is the closest matching Java SQL type. This behavior is consistent with Hive.
+                <codeph>STRUCT</codeph> types are reported as the JDBC SQL Type <codeph>STRUCT</codeph>.
+              </p>
+              <p>
+                To be consistent with Hive's behavior, the <codeph>TYPE_NAME</codeph> field is populated
+                with the primitive type name for scalar types, and with the full <codeph>toSql()</codeph>
+                output for complex types. The resulting type names are somewhat inconsistent,
+                because nested types are printed differently from top-level types. For example,
+                the following list shows how <codeph>toSql()</codeph> output for Impala types is
+                translated to <codeph>TYPE_NAME</codeph> values:
+<codeblock><![CDATA[DECIMAL(10,10) becomes DECIMAL
+CHAR(10) becomes CHAR
+VARCHAR(10) becomes VARCHAR
+ARRAY<DECIMAL(10,10)> becomes ARRAY<DECIMAL(10,10)>
+ARRAY<CHAR(10)> becomes ARRAY<CHAR(10)>
+ARRAY<VARCHAR(10)> becomes ARRAY<VARCHAR(10)>
+]]>
+</codeblock>
+              </p>
+            </li>
+          </ul>
+        </li>
+      </ul>
+    </conbody>
+  </concept>
+
+</concept>
